WordCount: common words of files

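I have managed to run the Hadoop WordCount example in non-distributed mode. The output goes to a file named "part-00000", and I can see that it lists every word from all of the input files combined.

After tracing the WordCount code, I can see that it takes each line and splits it into words on whitespace.

I am trying to think of a way to list only the words that occur in more than one file, along with their occurrences. Can this be achieved in Map/Reduce? -Added- Are these changes appropriate?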

    // changes in the parameters here
    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

        // These are the original lines; I am not using them but left them here...
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // My changes are here too
        private Text outvalue = new Text();
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        private String filename = fileSplit.getPath().getName();

        public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());

                // And here
                outvalue.set(filename);
                output.collect(word, outvalue);
            }
        }
    }

Answer

==============================
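1. You can amend the mapper to output the word as the key and, as the value, a Text holding the name of the file the word came from. Then, in the reducer, you only need to dedup the file names and output the entries where the word appears in more than one file.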
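The reduce-side logic described above can be sketched in plain Java, without any Hadoop classes, to make the dedup step concrete. The class name `CommonWordsSketch`, the method `commonWords`, and the sample file names are all made up for illustration; in a real job the grouping by word is done by the shuffle, and the body of the second loop is roughly what the reducer would do for each key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java simulation of the reduce step; no Hadoop classes are involved.
final class CommonWordsSketch {

    // Each element of mapOutput is a {word, filename} pair, standing in for
    // one (key, value) record emitted by the mapper.
    static Map<String, Set<String>> commonWords(List<String[]> mapOutput) {
        // Group by word and dedup file names (a Set ignores repeats).
        Map<String, Set<String>> filesByWord = new LinkedHashMap<>();
        for (String[] pair : mapOutput) {
            filesByWord.computeIfAbsent(pair[0], w -> new LinkedHashSet<>()).add(pair[1]);
        }
        // Keep only the words that were seen in more than one distinct file.
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : filesByWord.entrySet()) {
            if (e.getValue().size() > 1) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
            new String[] {"hadoop", "a.txt"},
            new String[] {"hadoop", "a.txt"},
            new String[] {"hadoop", "b.txt"},
            new String[] {"pig", "b.txt"});
        System.out.println(commonWords(mapOutput)); // prints {hadoop=[a.txt, b.txt]}
    }
}
```

Here "pig" is dropped because it only occurs in one file, and the repeated ("hadoop", "a.txt") record collapses into a single file name.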

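How you get the name of the file being processed depends on whether you are using the new API or not (the mapred or mapreduce package names). For the new API, I know you can extract the mapper's input split from the Context object using the getInputSplit method (then probably cast the InputSplit to a FileSplit, assuming you are using TextInputFormat). For the old API, I have never tried it, but apparently you can use a configuration property called map.input.file.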
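If the old-API property route is taken, the value is presumably a full path rather than a bare file name (that is an assumption here, as is the example path below), so the mapper would still need to reduce it to the last path component, which is what `getPath().getName()` does on a FileSplit. A minimal plain-Java sketch of that step, with the made-up helper name `fileName`:

```java
// Plain-Java sketch; the helper name and the sample path are hypothetical.
final class InputFileNameSketch {

    // Returns the last component of a slash-separated path or URI string,
    // or the input unchanged when it contains no slash.
    static String fileName(String inputFile) {
        int slash = inputFile.lastIndexOf('/');
        return slash < 0 ? inputFile : inputFile.substring(slash + 1);
    }

    public static void main(String[] args) {
        System.out.println(fileName("hdfs://namenode:8020/user/me/input/a.txt")); // prints a.txt
    }
}
```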

This would also be a good candidate for introducing a Combiner, to dedup multiple occurrences of the same word coming from the same mapper.
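What such a combiner would do for a single word can be sketched in plain Java as follows. The names `CombinerSketch` and `distinctFilenames` are invented for the illustration, and the input list stands in for the values one mapper emitted for one key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Plain-Java simulation of the combine step for one key; no Hadoop classes.
final class CombinerSketch {

    // Collapses repeated file names into distinct ones, keeping first-seen order.
    static List<String> distinctFilenames(Iterable<String> values) {
        LinkedHashSet<String> seen = new LinkedHashSet<>();
        for (String v : values) {
            seen.add(v);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // A word that occurs three times in one file is emitted three times by the mapper.
        System.out.println(distinctFilenames(Arrays.asList("a.txt", "a.txt", "a.txt"))); // prints [a.txt]
    }
}
```

Hadoop does not guarantee how many times a combiner runs, if at all, so the reducer should still dedup the file names itself; the combiner only cuts down the data sent over the network.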

Update

So, in response to your problem: you are trying to use an instance variable named reporter, which does not exist in the class scope of the mapper. Amend it as follows:
```
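import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Nested inside the job class, as in the original WordCount example.
// A Text input key matches KeyValueTextInputFormat; with the default
// TextInputFormat the input key is a LongWritable byte offset instead.
public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
  private final Text word = new Text();      // reused output key
  private final Text outvalue = new Text();  // reused output value (the file name)
  private String filename = null;            // resolved lazily on the first map() call

  public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // The reporter is only available inside map(), so look up the file name here, once.
    if (filename == null) {
      filename = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    }

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());

      // Emit (word, filename) instead of (word, 1).
      outvalue.set(filename);
      output.collect(word, outvalue);
    }
  }
}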
```
(Not really sure why SO isn't respecting the formatting above...)

from https://stackoverflow.com/questions/10086818/wordcount-common-words-of-files by cc-by-sa and MIT license
