[HADOOP] Nutch on Windows: ERROR crawl.Injector

I am trying to install nutch 1.12 on a Windows Server 2012 machine under cygwin64 2.874. Because my Java and Linux skills are limited, I followed the step-by-step introduction at https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Seeding_the_crawldb_with_a_list_of_URLs. The command

 bin/nutch inject crawl/crawldb urls

fails because winutils.exe cannot be found. The hadoop log shows:

2016-07-01 09:22:25,660 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:432)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:478)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
    at org.apache.nutch.crawl.Injector.main(Injector.java:441)
2016-07-01 09:22:25,832 INFO  crawl.Injector - Injector: starting at 2016-07-01 09:22:25
2016-07-01 09:22:25,832 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2016-07-01 09:22:25,832 INFO  crawl.Injector - Injector: urlDir: urls
2016-07-01 09:22:25,832 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2016-07-01 09:22:26,050 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-01 09:22:26,394 ERROR crawl.Injector - Injector: java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
    at org.apache.hadoop.util.Shell.run(Shell.java:418)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
    at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
    at org.apache.nutch.crawl.Injector.run(Injector.java:467)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:441)

Following some hints I found, I downloaded winutils.exe from https://codeload.github.com/srccodes/hadoop-common-2.2.0-bin/zip/master. I unzipped the folder on the server and set the environment variable NUTCH_OPTS=-Dhadoop.home.dir=[winutils_folder]. The winutils error is now gone, but the nutch call fails with a different error:

2016-07-01 13:24:09,697 INFO  crawl.Injector - Injector: starting at 2016-07-01 13:24:09
2016-07-01 13:24:09,697 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2016-07-01 13:24:09,697 INFO  crawl.Injector - Injector: urlDir: urls
2016-07-01 13:24:09,697 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2016-07-01 13:24:09,885 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-01 13:24:10,307 ERROR crawl.Injector - Injector: org.apache.hadoop.util.Shell$ExitCodeException: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
    at org.apache.hadoop.util.Shell.run(Shell.java:418)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
    at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
    at org.apache.nutch.crawl.Injector.run(Injector.java:467)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:441)
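
The NUTCH_OPTS setting described above can be sketched as follows. This is only an illustration under assumptions: the helper name hadoop_home_opt and the example folder are hypothetical, and the check relies on the layout implied by the first log, where Hadoop looks for bin\winutils.exe below hadoop.home.dir.

```shell
# Print a -Dhadoop.home.dir option for the given folder, but only if
# <folder>/bin/winutils.exe exists (the path Hadoop's Shell class looks up).
hadoop_home_opt() {
  local dir=$1
  if [ ! -f "$dir/bin/winutils.exe" ]; then
    echo "winutils.exe not found under $dir/bin" >&2
    return 1
  fi
  # The JVM is a native Windows program, so pass it a Windows-style path
  # when cygpath is available (it is under cygwin).
  if command -v cygpath >/dev/null 2>&1; then
    dir=$(cygpath -w "$dir")
  fi
  printf '%s\n' "-Dhadoop.home.dir=$dir"
}

# Example (hypothetical folder, adjust to where the archive was unpacked):
#   export NUTCH_OPTS="$(hadoop_home_opt /cygdrive/c/hadoop-common-2.2.0-bin-master)"
```

If the helper reports that winutils.exe is missing, the folder passed in is probably one level too high or too low in the unpacked archive.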

After updating .bashrc (adding the following lines), the hadoop log shows only warnings:

export JAVA_HOME='/cygdrive/c/Program Files/Java/jre1.8.0_92'
export PATH=$PATH:$JAVA_HOME/bin

However, nutch still throws an error:

Injector: starting at 2016-07-01 17:25:22
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:570)
        at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
        at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:173)
        at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:160)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:94)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
        at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:131)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:376)
        at org.apache.nutch.crawl.Injector.run(Injector.java:467)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:441)

What is misconfigured? Or is it simply not possible to run nutch on Windows / cygwin? Any hints would be appreciated.

Solution

  1. ==============================

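    I spent about a week tracking down the cause of the UnsatisfiedLinkError.

    I think the main reason is that java cannot find its dependencies.

    I solved the problem by commenting out part of the script in the nutch file.

    (It is located in {nutch_folder}\runtime\local\bin.)

    More specifically, I commented out the part of the script that sets -Djava.library.path.

    Instead, I let the Java program use its original java.library.path.

    Use the command below to check the original java.library.path in your environment.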

    java -XshowSettings:properties -version
    

    It prints many properties, including java.library.path = ...
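
    Since the property list is long and java.library.path can span several lines, a small filter may help. The sketch below assumes the usual output layout (one "name = value" property per line, with further path entries on more deeply indented lines); print_library_path is a hypothetical helper name.

```shell
# Read `java -XshowSettings:properties` output on stdin and print only the
# java.library.path entries, one per line.
print_library_path() {
  awk '
    # The property line itself: strip the name and print the first entry.
    /^ *java\.library\.path = / {
      sub(/^ *java\.library\.path = /, ""); print; in_path = 1; next
    }
    # The next "name = value" property ends the list.
    in_path && /^ +[^ =]+ = / { in_path = 0 }
    # So does any line that is not indented.
    in_path && !/^ +/ { in_path = 0 }
    # Indented continuation lines are further path entries.
    in_path { sub(/^ +/, ""); print }
  '
}

# Usage (the settings are written to stderr, hence 2>&1):
#   java -XshowSettings:properties -version 2>&1 | print_library_path
```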

    Comment out the code below.

    # setup 'java.library.path' for native-hadoop code if necessary
    # used only in local mode
    JAVA_LIBRARY_PATH=''
    if [ -d "${NUTCH_HOME}/lib/native" ]; then
    
      JAVA_PLATFORM=`"${JAVA}" -classpath "$CLASSPATH" org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
    
      if [ -d "${NUTCH_HOME}/lib/native" ]; then
        if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
          JAVA_LIBRARY_PATH="${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}"
        else
          JAVA_LIBRARY_PATH="${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}"
        fi
      fi
    fi
    
    if [ $cygwin = true -a "X${JAVA_LIBRARY_PATH}" != "X" ]; then
       JAVA_LIBRARY_PATH="`cygpath -p -w "$JAVA_LIBRARY_PATH"`"
    fi
    

    Also comment out the code below.

    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
    fi
    

    After this change, NUTCH_OPTS no longer contains -Djava.library.path, so java uses its original setting.

    The nutch command now runs as if it were in (pseudo-)distributed mode. I don't know why.

    I hope this helps.
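
    One way to double-check the edit is to look for lines in the script that still mention java.library.path and are not commented out. This is a minimal sketch; active_library_path_lines is a hypothetical helper, and the script path in the usage line assumes you are in the runtime/local folder.

```shell
# List (with line numbers) every line of the given script that mentions
# java.library.path and does not start with a comment marker.
# The exit status is non-zero when no such line remains.
active_library_path_lines() {
  grep -n -F 'java.library.path' "$1" | grep -v '^[0-9]*:[[:space:]]*#'
}

# Usage:
#   if active_library_path_lines bin/nutch; then
#     echo "java.library.path is still being set by the script" >&2
#   fi
```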

  2. from https://stackoverflow.com/questions/38147431/nutch-on-windows-error-crawl-injector by cc-by-sa and MIT license