How to list all files in a directory and its subdirectories in Hadoop HDFS

I have a folder in HDFS that contains two subfolders. Each of those contains about 30 subfolders, and each of these finally contains XML files. I want to list all the XML files given only the path of the main folder. Locally I can do this with Apache commons-io's FileUtils.listFiles(). I have tried this:

FileStatus[] status = fs.listStatus( new Path( args[ 0 ] ) );

But it only lists the two first-level subfolders and does not go any further. Is there a way to do this in Hadoop?

Solutions

==============================
1. You need to use the FileSystem object and apply some logic to the resulting FileStatus objects to recurse into the subdirectories manually.

You can also apply a PathFilter through the listStatus(Path, PathFilter) method so that only the XML files are returned.

The Hadoop FsShell class has an example of this for the hadoop fs -lsr command, which is a recursive ls: see the source around line 590 (the recursive step is triggered on line 635).
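
A minimal sketch of that idea, assuming Hadoop 2.x or later on the classpath. The `XmlLister` class, the `XML_ONLY` filter and the `listXmlFiles` method are hypothetical names used only for illustration. The filter is applied by hand to files only, because passing it straight to `listStatus(Path, PathFilter)` would also drop every subdirectory whose name does not end in `.xml`, leaving nothing to recurse into.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

class XmlLister {
    // Hypothetical filter: accepts any path whose last component ends in ".xml"
    static final PathFilter XML_ONLY = path -> path.getName().endsWith(".xml");

    // Recurses manually: directories are descended into, files are kept only if the filter accepts them
    static List<Path> listXmlFiles(FileSystem fs, Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                result.addAll(listXmlFiles(fs, status.getPath()));
            } else if (XML_ONLY.accept(status.getPath())) {
                result.add(status.getPath());
            }
        }
        return result;
    }
}
```

It could be called as `XmlLister.listXmlFiles(fs, new Path(args[0]))`, with `fs` obtained as in the question.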

==============================

2. If you are using the Hadoop 2.* API, there is a more elegant solution:

    Configuration conf = getConf();
    Job job = Job.getInstance(conf);
    FileSystem fs = FileSystem.get(conf);

    //the second boolean parameter here sets the recursion to true
    RemoteIterator<LocatedFileStatus> fileStatusListIterator = fs.listFiles(
            new Path("path/to/lib"), true);
    while(fileStatusListIterator.hasNext()){
        LocatedFileStatus fileStatus = fileStatusListIterator.next();
        //do stuff with the file like ...
        job.addFileToClassPath(fileStatus.getPath());
    }
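
Since the question asks for XML files only, the iterator returned by `listFiles` can be filtered by file name. The following is a sketch assuming the Hadoop 2.* API mentioned above; `RecursiveXmlListing` and `listXmlFiles` are hypothetical names, not part of Hadoop.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

class RecursiveXmlListing {
    // Walks the whole tree under root and keeps only the files named *.xml
    static List<Path> listXmlFiles(FileSystem fs, Path root) throws IOException {
        List<Path> xmlFiles = new ArrayList<>();
        // true => recurse into subdirectories; the iterator yields files only, never directories
        RemoteIterator<LocatedFileStatus> iterator = fs.listFiles(root, true);
        while (iterator.hasNext()) {
            Path path = iterator.next().getPath();
            if (path.getName().endsWith(".xml")) {
                xmlFiles.add(path);
            }
        }
        return xmlFiles;
    }
}
```

Because the iterator fetches entries lazily, this avoids holding a FileStatus array for every directory in memory at once.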

==============================

3. Have you tried this?

import java.io.*;
import java.util.*;
import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

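public class cat {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // you need to pass in your hdfs path
        FileStatus[] status = fs.listStatus(new Path("hdfs://test.com:9000/user/test/in"));

        for (FileStatus entry : status) {
            // listStatus is not recursive; subdirectories cannot be opened as files, so skip them
            if (entry.isDirectory()) {
                continue;
            }
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(entry.getPath())))) {
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            } catch (IOException e) {
                // report the real cause instead of a generic "File not found"
                System.err.println("Could not read " + entry.getPath() + ": " + e.getMessage());
            }
        }
    }
}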

==============================

4.

/**
 * @param filePath
 * @param fs
 * @return list of absolute file path present in given path
 * @throws FileNotFoundException
 * @throws IOException
 */
public static List<String> getAllFilePath(Path filePath, FileSystem fs) throws FileNotFoundException, IOException {
    List<String> fileList = new ArrayList<String>();
    FileStatus[] fileStatus = fs.listStatus(filePath);
    for (FileStatus fileStat : fileStatus) {
        if (fileStat.isDirectory()) {
            fileList.addAll(getAllFilePath(fileStat.getPath(), fs));
        } else {
            fileList.add(fileStat.getPath().toString());
        }
    }
    return fileList;
}

Quick example: suppose you have the following file structure:

a  ->  b
   ->  c  -> d
          -> e 
   ->  d  -> f

With the code above, you get:

a/b
a/c/d
a/c/e
a/d/f

If you want only the leaves (i.e. the file names), use the following code in the else block:

 ...
    } else {
        String fileName = fileStat.getPath().toString(); 
        fileList.add(fileName.substring(fileName.lastIndexOf("/") + 1));
    }

This will give:

b
d
e
f

==============================

5. You can now use Spark to do the same thing, and it is faster than other approaches (such as Hadoop MR). Here is a code snippet:

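import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Assumes a SparkContext named sparkContext is in scope, as in the original snippet
def traverseDirectory(filePath: String, recursiveTraverse: Boolean, filePaths: ListBuffer[String]): Unit = {
  val files = FileSystem.get(sparkContext.hadoopConfiguration).listStatus(new Path(filePath))
  files.foreach { fileStatus =>
    if (!fileStatus.isDirectory && fileStatus.getPath.getName.endsWith(".xml")) {
      filePaths += fileStatus.getPath.toString
    } else if (fileStatus.isDirectory && recursiveTraverse) {
      // Descend only when the caller asked for a recursive traversal
      traverseDirectory(fileStatus.getPath.toString, recursiveTraverse, filePaths)
    }
  }
}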

==============================

6. Here is a code snippet that counts the number of files in a particular HDFS directory (I used this number to decide how many reducers to use in a particular ETL job). You can easily adapt it to your needs.

private int calculateNumberOfReducers(String input) throws IOException {
    int numberOfReducers = 0;
    Path inputPath = new Path(input);
    FileSystem fs = inputPath.getFileSystem(getConf());
    FileStatus[] statuses = fs.globStatus(inputPath);
    for(FileStatus status: statuses) {
        if(status.isDirectory()) {
            numberOfReducers += getNumberOfInputFiles(status, fs);
        } else if(status.isFile()) {
            numberOfReducers ++;
        }
    }
    return numberOfReducers;
}

/**
 * Recursively determines number of input files in an HDFS directory
 *
 * @param status instance of FileStatus
 * @param fs instance of FileSystem
 * @return number of input files within particular HDFS directory
 * @throws IOException
 */
private int getNumberOfInputFiles(FileStatus status, FileSystem fs) throws IOException  {
    int inputFileCount = 0;
    if(status.isDirectory()) {
        FileStatus[] files = fs.listStatus(status.getPath());
        for(FileStatus file: files) {
            inputFileCount += getNumberOfInputFiles(file, fs);
        }
    } else {
        inputFileCount ++;
    }

    return inputFileCount;
}

==============================

7. Thanks to Radu Adrian Moldovan for the suggestion.

Here is an implementation using a queue:

private static List<String> listAllFilePath(Path hdfsFilePath, FileSystem fs)
throws FileNotFoundException, IOException {
  List<String> filePathList = new ArrayList<String>();
  Queue<Path> fileQueue = new LinkedList<Path>();
  fileQueue.add(hdfsFilePath);
  while (!fileQueue.isEmpty()) {
    Path filePath = fileQueue.remove();
    if (fs.isFile(filePath)) {
      filePathList.add(filePath.toString());
    } else {
      FileStatus[] fileStatus = fs.listStatus(filePath);
      for (FileStatus fileStat : fileStatus) {
        fileQueue.add(fileStat.getPath());
      }
    }
  }
  return filePathList;
}

==============================
8. Don't use the recursive approach (heap issues) :) Use a queue instead.
```
queue.add(param_dir)
while (queue is not empty) {
  directory = queue.pop
  for each item in directory:
    - if item is a file => add it to the final list
    - if item is a directory => queue.push(item)
}
```
That's easy, enjoy!
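
The pseudocode above can be turned into a small runnable sketch. To keep it self-contained, an in-memory map stands in for HDFS: the `QueueTraversal` class, the `tree` map and the `listFiles` helper are hypothetical names made up for this illustration. On a real cluster, the map lookup would be a call to `fs.listStatus(...)`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class QueueTraversal {
    // Breadth-first listing: directories go onto the queue, files go into the result.
    // Assumption for this sketch: a path is a directory if it appears as a key in the tree map.
    static List<String> listFiles(Map<String, List<String>> tree, String root) {
        List<String> files = new ArrayList<>();
        Queue<String> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String directory = queue.remove();
            for (String item : tree.getOrDefault(directory, List.of())) {
                if (tree.containsKey(item)) {
                    queue.add(item);   // directory: visit it later
                } else {
                    files.add(item);   // file: add to the final list
                }
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Same layout as the a/b, a/c/d, a/c/e, a/d/f example in answer 4
        Map<String, List<String>> tree = Map.of(
                "a", List.of("a/b", "a/c", "a/d"),
                "a/c", List.of("a/c/d", "a/c/e"),
                "a/d", List.of("a/d/f"));
        System.out.println(listFiles(tree, "a")); // prints [a/b, a/c/d, a/c/e, a/d/f]
    }
}
```

The queue holds only the directories that are still waiting to be listed, so the traversal depth never translates into call-stack depth.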

==============================

9. A code snippet covering both the recursive and the non-recursive approach:

//helper method to get the list of files from the HDFS path
public static List<String>
    listFilesFromHDFSPath(Configuration hadoopConfiguration,
                          String hdfsPath,
                          boolean recursive) throws IOException,
                                        IllegalArgumentException
{
    //resulting list of files
    List<String> filePaths = new ArrayList<String>();

    //get path from string and then the filesystem
    Path path = new Path(hdfsPath);  //throws IllegalArgumentException
    FileSystem fs = path.getFileSystem(hadoopConfiguration);

    //if recursive approach is requested
    if(recursive)
    {
        //(heap issues with recursive approach) => using a queue
        Queue<Path> fileQueue = new LinkedList<Path>();

        //add the obtained path to the queue
        fileQueue.add(path);

        //while the fileQueue is not empty
        while (!fileQueue.isEmpty())
        {
            //get the file path from queue
            Path filePath = fileQueue.remove();

            //filePath refers to a file
            if (fs.isFile(filePath))
            {
                filePaths.add(filePath.toString());
            }
            else   //else filePath refers to a directory
            {
                //list paths in the directory and add to the queue
                FileStatus[] fileStatuses = fs.listStatus(filePath);
                for (FileStatus fileStatus : fileStatuses)
                {
                    fileQueue.add(fileStatus.getPath());
                } // for
            } // else

        } // while

    } // if
    else        //non-recursive approach => no heap overhead
    {
        //if the given hdfsPath is actually directory
        if(fs.isDirectory(path))
        {
            FileStatus[] fileStatuses = fs.listStatus(path);

            //loop all file statuses
            for(FileStatus fileStatus : fileStatuses)
            {
                //if the given status is a file, then update the resulting list
                if(fileStatus.isFile())
                    filePaths.add(fileStatus.getPath().toString());
            } // for
        } // if
        else        //it is a file then
        {
            //return the one and only file path to the resulting list
            filePaths.add(path.toString());
        } // else

    } // else

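    //do not close the filesystem here: FileSystem instances are cached and shared by default,
    //so closing it could break other code that is using the same instance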

    //return the resulting list
    return filePaths;
} // listFilesFromHDFSPath

Source: https://stackoverflow.com/questions/11342400/how-to-list-all-files-in-a-directory-and-its-subdirectories-in-hadoop-hdfs (CC BY-SA and MIT license)
