[HADOOP] Apache Pig: not able to read the bag


I am trying to read comma-separated data with Pig, as shown below.

grunt> cat script/pig/emp_tuple1.txt
1,kirti,250000,{(100),(200)}
2,kk,240000,{(100),(300)}
3,kumar,200000,{(200),(400)}
4,shinde,290000,{(200),(500),(300),(100)}
5,shinde k y,260000,{(100),(300),(200)}
6,amol,255000,{(300)}
grunt> emp_t1 = load 'script/pig/emp_tuple1.txt' using PigStorage(',') as (empno:int, ename:chararray, salary:int, dlist:bag{});
grunt> dump emp_t1;
2015-11-23 12:26:44,450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,kirti,250000,)   
(2,kk,240000,)
(3,kumar,200000,)
(4,shinde,290000,)
(5,shinde k y,260000,)
(6,amol,255000,{(300)})

Along the way, the following warning is displayed.

2015-11-23 12:26:44,173 [LocalJobRunner Map Task Executor #0] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.Utf8StorageConverter(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to interpret value [123, 40, 49, 48, 48, 41] in field being converted to type bag, caught ParseException <Unexpect end of bag> field discarded

It looks like the warning appears whenever the bag contains a comma (,).

Here is what I tried next: if I change the column delimiter from a comma to a tab (or another delimiter, | in the example below), it works.

grunt> cat script/pig/emp_tuple2.txt;
1|kirti|250000|{(100),(200)}
2|kk|240000|{(100),(300)}
3|kumar|200000|{(200),(400)}
4|shinde|290000|{(200),(500),(300),(100)}
5|shinde k y|260000|{(100),(300),(200)}
6|amol|255000|{(300)}
grunt> emp_t2 = load 'script/pig/emp_tuple2.txt' using PigStorage('|') as (empno:int, ename:chararray, salary:int, dlist:bag{});
grunt> dump emp_t2;
2015-11-23 12:31:33,408 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,kirti,250000,{(100),(200)})
(2,kk,240000,{(100),(300)})
(3,kumar,200000,{(200),(400)})
(4,shinde,290000,{(200),(500),(300),(100)})
(5,shinde k y,260000,{(100),(300),(200)})
(6,amol,255000,{(300)})

I am wondering whether comma-separated data can contain a comma-separated bag. Does that not work?

Solution

    Let's go into the details:
     1. The data is read using TextInputFormat.
     2. A line record reader is used to read the lines.
     3. "," is used to separate the columns.
    
    Because "," occurs inside the bag and is also the delimiter between columns, the bag is split across multiple columns.
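
The effect can be reproduced outside Pig. The following is a minimal Python sketch, not Pig's actual loader code; it only assumes that the loader splits each line on every comma, which is what the warning in the question suggests. The variable names are made up for illustration. It shows that the fourth piece of row 1 is an unterminated bag, and that the byte values printed in the warning decode to exactly that piece.

```python
# Illustrative only: mimics a loader that splits a line on every comma.
line = "1,kirti,250000,{(100),(200)}"

# The bag's inner comma is treated like a column delimiter.
fields = line.split(",")

# Only this piece lands in the fourth schema column (dlist).
bag_field = fields[3]

# The warning prints the rejected value as byte codes; decode them to compare.
warned_bytes = bytes([123, 40, 49, 48, 48, 41])
warned_text = warned_bytes.decode("ascii")

print(fields)
print(bag_field, warned_text)
```

Row 6 loads correctly in the question because its bag holds a single element, so there is no comma inside it to split on.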
    
    There are various ways to overcome this.
    
     1. Pre-process the input and replace the first three "," in each row with some other delimiter.
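
One way to do that pre-processing is sketched below in Python. It is only an illustration: the function names, the choice of "|" as the new delimiter, and the file handling are assumptions, and it presumes that the first three fields never contain a comma or a "|" themselves.

```python
def protect_bag_commas(line: str, n_scalar_fields: int = 3, new_delim: str = "|") -> str:
    """Replace only the first n_scalar_fields commas in line with new_delim."""
    # maxsplit keeps everything after the third comma, bag included, in one piece.
    parts = line.rstrip("\n").split(",", n_scalar_fields)
    return new_delim.join(parts)


def convert_file(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path whose column delimiter is '|' (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(protect_bag_commas(line) + "\n")


if __name__ == "__main__":
    print(protect_bag_commas("4,shinde,290000,{(200),(500),(300),(100)}"))
```

The converted file has the same layout as emp_tuple2.txt in the question, so it can be loaded with PigStorage('|') in the same way.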
    
Source: https://stackoverflow.com/questions/33865424/apache-pig-not-able-to-read-the-bag (cc-by-sa and MIT license)