복붙노트

[HADOOP] Elephantbird에서는 HDFS에서 데이터로드가 작동하지 않습니다.

HADOOP

Elephantbird에서는 HDFS에서 데이터로드가 작동하지 않습니다.

돼지에서 elephantbird로 데이터를 처리하려고하는데 데이터를로드하는 데 성공하지 못합니다. 여기 내 돼지 스크립트입니다 :

register 'lib/elephant-bird-core-3.0.9.jar';
register 'lib/elephant-bird-pig-3.0.9.jar';
register 'lib/google-collections-1.0.jar';
register 'lib/json-simple-1.1.jar';

twitter = LOAD 'statuses.log.2013-04-01-00' 
          USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

DUMP twitter;

내가 얻는 결과는

[main] INFO  org.apache.pig.Main - Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21
[main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop1/twitter_test/pig_1374834826168.log
[main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop1/.pigbootup not found
[main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master.hadoop:8020
[main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master.hadoop:8021
[main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
[main] WARN  org.apache.pig.backend.hadoop23.PigJobControl - falling back to default JobControl (not using hadoop 0.23 ?)
java.lang.NoSuchFieldException: jobsInProgress
    at java.lang.Class.getDeclaredField(Class.java:1938)
    at org.apache.pig.backend.hadoop23.PigJobControl.<clinit>(PigJobControl.java:58)
    at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.newJobControl(HadoopShims.java:102)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:285)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
    at org.apache.pig.PigServer.launchPlan(PigServer.java:1266)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1251)
    at org.apache.pig.PigServer.storeEx(PigServer.java:933)
    at org.apache.pig.PigServer.store(PigServer.java:900)
    at org.apache.pig.PigServer.openIterator(PigServer.java:813)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:604)
    at org.apache.pig.Main.main(Main.java:157)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
[main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=656085089
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job6015425922938886053.jar
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job6015425922938886053.jar created
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
[JobControl] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
[JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
[JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 5
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201307261031_0050
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases twitter
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: twitter[10,10] C:  R: 
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://master.hadoop:50030/jobdetails.jsp?jobid=job_201307261031_0050
[main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201307261031_0050 has failed! Stop running all dependent jobs
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
[main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
[main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
2.0.0-cdh4.3.0  0.11.0-cdh4.3.0 hadoop1 2013-07-26 12:33:48 2013-07-26 12:34:23 UNKNOWN

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201307261031_0050   twitter MAP_ONLY    Message: Job failed!    hdfs://master.hadoop:8020/tmp/temp971280905/tmp1376631504,

Input(s):
Failed to read data from "hdfs://master.hadoop:8020/user/hadoop1/statuses.log.2013-04-01-00"

Output(s):
Failed to produce result in "hdfs://master.hadoop:8020/tmp/temp971280905/tmp1376631504"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201307261031_0050


[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
Details at logfile: /home/hadoop1/twitter_test/pig_1374834826168.log

파일이 존재하며 액세스 할 수 있습니다.

$ hdfs dfs -ls /user/hadoop1/statuses.log.2013-04-01-00
Found 1 items
-rw-r--r--   3 hadoop1 supergroup  656085089 2013-07-26 11:53 /user/hadoop1/statuses.log.2013-04-01-00

Cloudera 4.6.0과 함께 제공되는 돼지 버전의 일반적인 문제점 인 것 같습니다. 문제는

[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected

데이터로드를 위해 다른 사용자 정의 함수를 실행할 때 비슷한 오류가 발생합니다.

[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

돼지를 강제로 로컬 모드 ( '-x local' ')로 만들면 더 명백한 오류가 발생합니다.

Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

그래서 Hadoop 돼지 사용 버전은 Cloudera와 함께 제공되는 버전과 호환되지 않는 것으로 보입니다.

해결법

  1. ==============================

    1.이것은 실제로 버전 관리 문제입니다. 일부 라이브러리는 새로운 MapReduce API와 아직 호환되지 않습니다 (예 : # 56, # 247 및 # 308 문제 참조). ElephantBird의 경우이 문제는 최근 버전에서 해결되었습니다. 위의 코드에서 ElephantBird 4.1을 사용하고 Hadoop 호환 모듈을 추가합니다.

    이것은 실제로 버전 관리 문제입니다. 일부 라이브러리는 새로운 MapReduce API와 아직 호환되지 않습니다 (예 : # 56, # 247 및 # 308 문제 참조). ElephantBird의 경우이 문제는 최근 버전에서 해결되었습니다. 위의 코드에서 ElephantBird 4.1을 사용하고 Hadoop 호환 모듈을 추가합니다.

    register 'lib/elephant-bird-core-4.1.jar';
    register 'lib/elephant-bird-pig-4.1.jar';
    register 'lib/elephant-bird-hadoop-compat-4.1.jar';
    register 'lib/google-collections-1.0.jar';
    register 'lib/json-simple-1.1.jar';
    

    문제를 해결했습니다! :-)

  2. from https://stackoverflow.com/questions/17879259/loading-data-from-hdfs-does-not-work-with-elephantbird by cc-by-sa and MIT license