[HADOOP] Google Cloud Search indexer "Indexer: java.io.IOException: Job failed!"
I am a young developer, relatively new to Google Cloud Platform products and to Google Cloud Search in particular. I tried to follow the tutorial at https://developers.google.com/cloud-search/docs/guides/apache-nutch-connector.
To reproduce the tutorial, all I did was simply modify the nutch-site.xml file like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|more|metadata)|indexer-google-cloud-search|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
<property>
<name>gcs.config.file</name>
<value>/home/joys/Downloads/apache-nutch-1.14/sdk-configuration.properties</value>
<description>Location of GCS Connector SDK configuration file.</description>
</property>
<property>
<name>gcs.uploadFormat</name>
<value>text</value>
<description></description>
</property>
<property>
<name>fetcher.parse</name>
<value>true</value>
<description></description>
</property>
<property>
<name>http.agent.name</name>
<value>Joy Spider</value>
<description></description>
</property>
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>false</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
<property>
<name>fetcher.store.content</name>
<value>true</value>
<description>If true, fetcher will store content.</description>
</property>
<property>
<name>metatags.names</name>
<value>metatag.*</value>
<description>Location of GCS Connector SDK configuration file.</description>
</property>
<property>
<name>index.parse.md</name>
<value>metatag.*</value>
<description>Location of GCS Connector SDK configuration file.</description>
</property>
<property>
<name>index.metadata</name>
<value>metatag.*</value>
<description>Location of GCS Connector SDK configuration file.</description>
</property>
<property>
<name>http.robot.rules.whitelist</name>
<value>*</value>
<description>Location of GCS Connector SDK configuration file.</description>
</property>
</configuration>
And the sdk-configuration.properties like this:
# Required properties for accessing data source
# (These values are created by the admin before running the connector)
api.sourceId=id
# Path to service account credentials
api.serviceAccountPrivateKeyFile=/path/to/.json
#connector.runOnce=true
defaultAcl.mode=FALLBACK
defaultAcl.public=true
api.rootUrl=https://cloudsearch.googleapis.com
# The schema name is read from the data source and used for repository structured data. The default is an empty string.
structuredData.localSchema=schema.json
#The metadata attribute that contains the value corresponding to the document title. The default value is an empty string.
itemMetadata.title.field=title
#The metadata attribute that contains the value for the document URL for search results.
itemMetadata.sourceRepositoryUrl.field=url
#The content language for documents being indexed
itemMetadata.contentLanguage.field=languageCode
#The object type used by the site, as defined in the data source schema object definitions. The connector won't index any structured data if this property is not specified.
# Note: This configuration property points to a value rather than a metadata attribute, and the .field and .defaultValue suffixes are not supported.
itemMetadata.objectType=file
#The metadata attribute that contains the value for the last modification timestamp for the document.
itemMetadata.updatetime.field=updateAt
#The metadata attribute that contains the value for the document creation timestamp.
itemMetadata.createTime.field=updateAt
contentTemplate.templateName.title=filetitle
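For context, structuredData.localSchema above points at a local schema.json, and my understanding is that itemMetadata.objectType=file requires the data source schema to contain an object definition named file. I am not pasting my real schema here, but a minimal sketch of what such a schema.json could look like is shown below (the property names title, url and updateAt are only illustrations taken from the metadata mapping above, not necessarily the actual file):

{
  "objectDefinitions": [
    {
      "name": "file",
      "propertyDefinitions": [
        {
          "name": "title",
          "isReturnable": true,
          "textPropertyOptions": {
            "retrievalImportance": { "importance": "HIGHEST" }
          }
        },
        {
          "name": "url",
          "isReturnable": true,
          "textPropertyOptions": {}
        },
        {
          "name": "updateAt",
          "isReturnable": true,
          "datePropertyOptions": {}
        }
      ]
    }
  ]
}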
Also, I did not add the -addBinaryContent -base64 options to crawl.sh. The tutorial says to add them only if the gcs.uploadFormat parameter is missing or set to "raw", and I set it to "text".
Everything goes fine right up to the point where GCS starts indexing, and there I always get this error:
2018-12-01 21:39:53,368 INFO gcs.GoogleCloudSearchIndexWriter - Starting up!
2018-12-01 21:40:01,002 WARN mapred.LocalJobRunner - job_local1304604211_0001
java.lang.Exception: java.lang.NullPointerException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.NullPointerException
at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.getValueExtractor(StructuredData.java:375)
at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.lambda$new$3(StructuredData.java:294)
at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.<init>(StructuredData.java:294)
at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.lambda$init$1(StructuredData.java:234)
at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.init(StructuredData.java:231)
at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.initFromConfiguration(StructuredData.java:199)
at org.apache.nutch.indexwriter.gcs.GoogleCloudSearchIndexWriter.open(GoogleCloudSearchIndexWriter.java:104)
at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:77)
at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-12-01 21:40:01,414 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
The error occurs at this line of the crawl.sh file:
apache-nutch-1.14/bin/nutch index crawl-test//crawldb -linkdb crawl-test//linkdb crawl-test//segments/20181201213917
Failed with exit value 255.
It is the index command that fails. I am running out of ideas and have no clue how I can fix it.
Surfing the web I found that in the Hadoop folder I should find the file mapred-site.xml and put:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
and that in the Hadoop yarn-site.xml file I should put:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
But it does not work for me. Any ideas?
Solution
from https://stackoverflow.com/questions/53575201/google-cloud-search-indexer-indexer-java-io-ioexception-job-failed by cc-by-sa and MIT license