Using pyspark, read/write 2D images on the Hadoop file system

I want to be able to read and write images on an HDFS file system and take advantage of HDFS data locality.

I have a collection of images, where each image is composed of multiple parts.

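I would like to create an archive on the HDFS file system and use Spark to analyze it. Right now I am trying to work out the best way to store the data on HDFS so that I can take full advantage of the Spark + HDFS structure.

As far as I understand, the best way would be to write a SequenceFile wrapper. I have two questions.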

Solution

==============================

1. I have found a solution that works: the binaryFiles method in pyspark 1.2.0 does the job. It is flagged as experimental, but I was able to read TIFF images by combining it with OpenCV.

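import cv2
import numpy as np

# sc is an existing SparkContext (e.g. the one provided by the pyspark shell)
# build an RDD of (path, content) pairs and take one element for testing purposes
L = sc.binaryFiles('hdfs://localhost:9000/*.tif').take(1)

# wrap the raw file bytes of the first record in a uint8 NumPy array
file_bytes = np.asarray(bytearray(L[0][1]), dtype=np.uint8)

# let OpenCV decode the byte buffer; flag 1 is cv2.IMREAD_COLOR (3-channel BGR image)
R = cv2.imdecode(file_bytes, 1)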
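
The snippet above decodes a single file on the driver. A minimal sketch of running the same decoding on the executors, one record at a time, might look like the following. The helper names (decode_tiff, keyed_by_filename, image_shapes) are hypothetical and not part of the original answer, and the use of cv2.IMREAD_UNCHANGED is an assumption made here so that the original bit depth is kept instead of being converted to a colour image.

```python
import os


def decode_tiff(content):
    """Decode raw image bytes into a NumPy array with OpenCV."""
    # Imported inside the function so the imports happen on the executors.
    import cv2
    import numpy as np

    buf = np.frombuffer(bytes(content), dtype=np.uint8)
    # Assumption: IMREAD_UNCHANGED keeps the stored bit depth and channels.
    img = cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)
    if img is None:
        raise ValueError("OpenCV could not decode the image")
    return img


def keyed_by_filename(record, decode=decode_tiff):
    """Turn a (path, content) record into (file name, decoded value)."""
    path, content = record
    return os.path.basename(path), decode(content)


def image_shapes(sc, pattern):
    """Return (file name, array shape) pairs for every file matching pattern.

    sc is an existing SparkContext; pattern is something like
    'hdfs://localhost:9000/*.tif'.
    """
    decoded = sc.binaryFiles(pattern).map(keyed_by_filename)
    return decoded.mapValues(lambda img: img.shape).collect()
```

Only the small shape tuples are collected back to the driver here; the decoded arrays stay on the executors, which is usually what you want for an archive of many images.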

Note what the pyspark help says about binaryFiles:

binaryFiles(path, minPartitions=None)

    :: Experimental

    Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

    Note: Small files are preferred, large file is also allowable, but may cause bad performance.
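
Since the question mentions a SequenceFile wrapper, here is a rough sketch of how the (path, content) records returned by binaryFiles could be packed into a single SequenceFile and read back. This is not from the original answer. It assumes PySpark's documented Writable conversion, in which str keys are written as Text and bytearray values as BytesWritable. The function names and paths are hypothetical.

```python
import os


def to_sequence_record(record):
    """Convert a (path, content) record into a (file name, bytearray) pair."""
    path, content = record
    # Assumption: PySpark writes bytearray values as BytesWritable.
    return os.path.basename(path), bytearray(content)


def pack_images(sc, src_pattern, dst_path):
    """Write every file matching src_pattern into a SequenceFile at dst_path.

    sc is an existing SparkContext; dst_path must not exist yet.
    """
    records = sc.binaryFiles(src_pattern).map(to_sequence_record)
    records.saveAsSequenceFile(dst_path)


def load_packed_images(sc, path):
    """Read the SequenceFile back as an RDD of (file name, bytes) pairs."""
    return sc.sequenceFile(path).mapValues(bytes)
```

Using the bare file name as the key is a simplification: if files in different directories share a name, keeping the full path as the key would avoid collisions.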

from https://stackoverflow.com/questions/28731140/using-pyspark-read-write-2d-images-on-hadoop-file-system by cc-by-sa and MIT license
