Splitting reducer output in Hadoop

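The output files produced by my reduce job are too large (1 GB after gzipping). I want the output split into smaller files of about 200 MB each. Is there a property or Java class that splits reducer output by size or by number of lines? I cannot increase the number of reducers, because that hurts the performance of the Hadoop job.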

Answers

==============================

1. I'm curious why you can't just use more reducers, but I'll take you at your word.

One option is to use MultipleOutputs and write to several files from a single reducer. For example, say each reducer's output file is 1 GB and you want 256 MB files instead. That means each reducer has to write 4 files rather than one.
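The file count in that example is just a ceiling division. The sketch below is a hypothetical helper (the class and method names are not part of Hadoop) that works out the value to pass as "outputs.per.reducer", assuming 1 GB means 1024 MB. Since records are spread over the files by hash, the resulting files will only be roughly equal in size.

```java
// Sketch: how many files each reducer must write to stay under a target size.
// OutputsPerReducer and filesNeeded are illustrative names, not Hadoop APIs.
final class OutputsPerReducer {

    /** Ceiling division: the smallest file count such that each file is <= targetBytes on average. */
    static int filesNeeded(long reducerOutputBytes, long targetBytes) {
        if (reducerOutputBytes < 0 || targetBytes <= 0) {
            throw new IllegalArgumentException("output size must be >= 0 and target size > 0");
        }
        if (reducerOutputBytes == 0) {
            return 1; // still write one (empty) file
        }
        return (int) ((reducerOutputBytes + targetBytes - 1) / targetBytes);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(filesNeeded(1024 * mb, 256 * mb)); // 4, as in the example above
        System.out.println(filesNeeded(1024 * mb, 200 * mb)); // 6, for the 200 MB target in the question
    }
}
```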

In your job driver, do the following:

JobConf conf = ...;

// You should probably pass this in as parameter rather than hardcoding 4.
conf.setInt("outputs.per.reducer", 4);

// This sets up the infrastructure to write multiple files per reducer.
MultipleOutputs.addMultiNamedOutput(conf, "multi", YourOutputFormat.class, YourKey.class, YourValue.class);

In your reducer, do the following:

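// Fields on the reducer class. MultipleOutputs here is the old-API class
// org.apache.hadoop.mapred.lib.MultipleOutputs.
private int numFiles;
private MultipleOutputs multipleOutputs;

@Override
public void configure(JobConf conf) {
  numFiles = conf.getInt("outputs.per.reducer", 1);
  multipleOutputs = new MultipleOutputs(conf);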

  // other init stuff
  ...
}

@Override
public void reduce(YourKey key,
                   Iterator<YourValue> valuesIter,
                   OutputCollector<OutKey, OutVal> ignoreThis,
                   Reporter reporter) throws IOException {
    // Do your business logic just as you're doing currently.
    OutKey outputKey = ...;
    OutVal outputVal = ...;

    // Now this is where it gets interesting. Hash the value to find
    // which output file the data should be written to. Don't use the
    // key since all the data will be written to one file if the number
    // of reducers is a multiple of numFiles.
    int fileIndex = (outputVal.hashCode() & Integer.MAX_VALUE) % numFiles;

    // Now use multiple outputs to actually write the data.
    // This will create output files named: multi_0-r-00000, multi_1-r-00000,
    // multi_2-r-00000, multi_3-r-00000 for reducer 0. For reducer 1, the files
    // will be multi_0-r-00001, multi_1-r-00001, multi_2-r-00001, multi_3-r-00001.
    multipleOutputs.getCollector("multi", Integer.toString(fileIndex), reporter)
      .collect(outputKey, outputVal);
}

@Override
public void close() throws IOException {
   // You must do this!!!!
   multipleOutputs.close();
}

This pseudocode was written with the old mapred API in mind. Equivalent APIs exist in the new mapreduce API, though, so you should be all set either way.
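The comment in the reducer about not hashing the key can be checked with a small simulation. The sketch below assumes the job uses Hadoop's default HashPartitioner, which routes a key to reducer (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The class and method names are illustrative only, and the integers 0 to 9999 stand in for key hash codes.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch: why the file index should not be derived from the key's hash.
final class FileIndexDemo {

    /** Non-negative hash modulo bucket count, the same formula used for reducers and files. */
    static int bucket(int hash, int buckets) {
        return (hash & Integer.MAX_VALUE) % buckets;
    }

    /** File indexes that one reducer would use if the key hash picked the file. */
    static Set<Integer> fileIndexesFromKeyHash(int reducer, int numReducers, int numFiles) {
        Set<Integer> used = new TreeSet<>();
        for (int keyHash = 0; keyHash < 10_000; keyHash++) {
            if (bucket(keyHash, numReducers) == reducer) { // this key was routed to our reducer
                used.add(bucket(keyHash, numFiles));
            }
        }
        return used;
    }

    public static void main(String[] args) {
        // 8 reducers is a multiple of 4 files: reducer 3 only ever writes to file 3.
        System.out.println(fileIndexesFromKeyHash(3, 8, 4)); // [3]
        // 7 reducers is not a multiple of 4: all four files are used.
        System.out.println(fileIndexesFromKeyHash(3, 7, 4)); // [0, 1, 2, 3]
    }
}
```

Hashing the output value avoids this correlation, because the value's hash is unrelated to how the key was partitioned.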

==============================
2. There is no property that does this. You'll need to write your own output format and record writer.
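The core of such a record writer is the rollover logic: count the bytes written and start a new file once a limit is reached. The sketch below shows only that logic with plain java.nio so it can run on its own. RollingFileWriter and its methods are hypothetical names, and in a real job the same bookkeeping would sit inside a RecordWriter returned by a custom output format. It counts uncompressed bytes, so with gzip output the limit would have to be scaled by an estimated compression ratio.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: starts a new part file whenever the current one would exceed maxBytes.
final class RollingFileWriter implements AutoCloseable {
    private final Path dir;
    private final String prefix;
    private final long maxBytes;
    private OutputStream out;   // currently open part file, null before the first write
    private long written;       // bytes written to the current part
    private int part;           // number of part files opened so far

    RollingFileWriter(Path dir, String prefix, long maxBytes) {
        this.dir = dir;
        this.prefix = prefix;
        this.maxBytes = maxBytes;
    }

    void write(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        // Roll before writing if this record would push a non-empty file past the limit.
        if (out == null || (written > 0 && written + bytes.length > maxBytes)) {
            roll();
        }
        out.write(bytes);
        written += bytes.length;
    }

    private void roll() throws IOException {
        if (out != null) {
            out.close();
        }
        out = Files.newOutputStream(dir.resolve(prefix + "-" + part++));
        written = 0;
    }

    int partsWritten() {
        return part;
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }

    /** Demo helper: writes `records` 5-byte lines to a temp directory and returns the part count. */
    static int demo(int records, long maxBytes) {
        try {
            Path dir = Files.createTempDirectory("rolling-demo");
            try (RollingFileWriter w = new RollingFileWriter(dir, "part", maxBytes)) {
                for (int i = 0; i < records; i++) {
                    w.write("abcd");
                }
                return w.partsWritten();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For instance, RollingFileWriter.demo(10, 12) fits two 5-byte records per file and therefore opens 5 part files. A record larger than the limit still gets written, in a file of its own.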

from https://stackoverflow.com/questions/10436811/splitting-reducer-output-in-hadoop by cc-by-sa and MIT license
