Equivalent of Linux 'diff' in Apache Pig

I want to be able to do a standard diff on two large files. I have something that works, but it is not nearly as fast as diff on the command line.

A = load 'A' as (line);
B = load 'B' as (line);
JOINED = join A by line full outer, B by line;
DIFF = FILTER JOINED by A::line is null or B::line is null;
DIFF2 = FOREACH DIFF GENERATE (A::line is null?B::line : A::line), (A::line is null?'REMOVED':'ADDED');
STORE DIFF2 into 'diff';

Does anyone have a better way to do this?

Solution

==============================

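1. I use the following approaches. (My JOIN approach is very similar, but that method does not reproduce the behavior of diff on duplicated lines.) Since this was asked some time ago, perhaps you were using only one reducer; Pig gained an algorithm for adjusting the number of reducers in 0.8. The first script below is the JOIN version; the second is built on UNION and GROUP and counts how many times each line occurs in each file.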

SET job.name 'Diff(1) Via Join'

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS First: chararray;
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Second: chararray;

-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;

-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
                    second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
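
To see what the join-based script reports when lines repeat, the following is a small Python model of the same logic. It is only an illustrative sketch, not part of the Pig job: the function name `join_diff` and the sample lists are made up for this example, and NULL handling in Pig joins is ignored.

```python
def join_diff(first, second):
    """Model of the FULL OUTER JOIN diff: keep rows with no partner on the other side."""
    first_keys = set(first)
    second_keys = set(second)
    # A row that finds no join partner is emitted once per occurrence.
    first_only = [line for line in first if line not in second_keys]
    second_only = [line for line in second if line not in first_keys]
    return first_only, second_only


first = ["a", "b", "b", "c"]
second = ["b", "c", "d", "d"]
first_only, second_only = join_diff(first, second)
print(first_only)   # ['a']
print(second_only)  # ['d', 'd']
```

Note that the extra "b" in the first list is not reported: as soon as a line appears at least once on both sides, every copy of it finds a join partner, which is where this approach departs from diff on duplicated lines.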

SET job.name 'Diff(1)'

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;

a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;

-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;

-- Find Unique Lines
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'

counts = FOREACH c_group {
             firsts = FILTER combined BY File == 1;
             seconds = FILTER combined BY File == 2;
             GENERATE
                FLATTEN(
                        (COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
                            (COUNT(firsts) - COUNT(seconds) > 0 ?
                                TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
                                TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds))
                        )
                ) AS (Row, File); };

-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
                  second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
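
The counting logic in the UNION/GROUP script can be modeled in a few lines of Python. This is a hedged sketch for illustration only: `multiset_diff` and the sample data are hypothetical names, and `collections.Counter` stands in for the per-row COUNT and TOP calls in the Pig script.

```python
from collections import Counter


def multiset_diff(first, second):
    """For each distinct line, emit the surplus copies held by whichever file has more."""
    first_counts = Counter(first)
    second_counts = Counter(second)
    # Counter subtraction drops zero and negative counts, so only surpluses remain.
    first_only = sorted((first_counts - second_counts).elements())
    second_only = sorted((second_counts - first_counts).elements())
    return first_only, second_only


first = ["a", "b", "b", "c"]
second = ["b", "c", "d", "d"]
first_only, second_only = multiset_diff(first, second)
print(first_only)   # ['a', 'b']
print(second_only)  # ['d', 'd']
```

Here the extra "b" in the first list is reported, because the line occurs twice in one input and once in the other. As in the Pig script, the original line order plays no part in the comparison.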

Performance

from https://stackoverflow.com/questions/5907950/equivalent-of-linux-diff-in-apache-pig by cc-by-sa and MIT license
