Accessing the Maxmind Geo API in Hadoop using the distributed cache

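I am writing a MapReduce job to analyze web logs. My code maps IP addresses to geographic locations, and I am using the Maxmind Geo API (https://github.com/maxmind/geoip-api-java) for that purpose. The code calls LookupService, which needs the database file that holds the IP-to-location mappings. I am trying to pass this database file to the tasks through the distributed cache, and I have tried it in two ways.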

Case 1:

I run the job passing the file from HDFS, but it always fails with a "File not found" error:

sudo -u hdfs hadoop jar \
 WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
/user/hdfs/GeoLiteCity.dat

or

sudo -u hdfs hadoop jar \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
hdfs://sandbox.hortonworks.com:8020/user/hdfs/GeoLiteCity.dat

Driver class code:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.addCacheFile(new Path(args[2]).toUri());

Mapper class code:

public void setup(Context context) throws IOException
{
URI[] uriList = context.getCacheFiles();
Path database_path = new Path(uriList[0].toString());
LookupService cl = new LookupService(database_path.toString(),
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

Case 2:

I run the code passing the file from the local file system via the -files option. Error: a NullPointerException at the line LookupService cl = new LookupService(database_path, ...)

sudo -u hdfs hadoop jar  \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
com.prithvi.mapreduce.logprocessing.ipgeo.GeoLocationDatasetDriver \
-files /tmp/jobs/GeoLiteCity.dat /user/hdfs/input /user/hdfs/out_put \
GeoLiteCity.dat

Driver code:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
String dbfile = args[2];
conf.set("maxmind.geo.database.file", dbfile);

Mapper code:

public void setup(Context context) throws IOException
{
  Configuration conf = context.getConfiguration();
  String database_path = conf.get("maxmind.geo.database.file");
  LookupService cl = new LookupService(database_path,
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

I need this database file on all of my task trackers for the job to run. Can anyone suggest the right way to do this?

Solution

==============================
1. Try this:

In the driver, specify where the file is in HDFS using the Job object, like this:
```
job.addCacheFile(new URI("hdfs://localhost:8020/GeoLite2-City.mmdb#GeoLite2-City.mmdb"));
```
The part after # is the alias name (symbolic link) that Hadoop will create in the task's working directory.
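To make the role of the `#` concrete, here is a minimal sketch using only `java.net.URI` that splits such a cache URI into the HDFS location and the alias. `CacheUriDemo` and `withAlias` are hypothetical names made up for this illustration and are not part of Hadoop.

```java
import java.net.URI;

// Hypothetical helper (not a Hadoop class): shows how a cache URI with a
// "#alias" fragment separates into the HDFS path and the symlink name.
final class CacheUriDemo {

    // Appends the alias as the URI fragment, the same shape as the
    // string passed to job.addCacheFile(...) in the answer.
    static URI withAlias(String hdfsLocation, String alias) {
        return URI.create(hdfsLocation + "#" + alias);
    }

    public static void main(String[] args) {
        URI uri = withAlias("hdfs://localhost:8020/GeoLite2-City.mmdb",
                            "GeoLite2-City.mmdb");
        // The path is what gets fetched from HDFS.
        System.out.println("HDFS path: " + uri.getPath());
        // The fragment is the name the task sees in its working directory.
        System.out.println("Alias:     " + uri.getFragment());
    }
}
```

The alias does not have to match the file name in HDFS. A shorter fragment can be chosen as long as the mapper opens the file under that same name.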

After that, you can access the file from the mapper in the setup() method:
```
@Override
protected void setup(Context context) {
  File file = new File("GeoLite2-City.mmdb");
}
```
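The mapper in Case 1 of the question passes the HDFS URI string straight to `LookupService`, which appears to be why the file is not found: the library opens a local file. Below is a rough sketch of a helper that turns a cache URI into the local file name instead. `CacheFileLocator`, `linkName`, and `localFile` are hypothetical names. The fallback to the last path segment when there is no `#alias` is an assumption about how Hadoop names the link, so verify it on your cluster.

```java
import java.io.File;
import java.net.URI;

// Hypothetical helper (not a Hadoop class): derives the local file that a
// task should open for a given distributed-cache URI.
final class CacheFileLocator {

    // Returns the "#alias" fragment if present; otherwise falls back to the
    // last path segment (assumed default link name, verify on your cluster).
    static String linkName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = cacheUri.getPath();
        if (path == null) {
            throw new IllegalArgumentException("URI has no path: " + cacheUri);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Resolves the link name against the task's working directory.
    static File localFile(URI cacheUri, File workDir) {
        return new File(workDir, linkName(cacheUri));
    }
}
```

Inside `setup()`, this could be called as `CacheFileLocator.localFile(context.getCacheFiles()[0], new File("."))`, and the resulting file's path passed to the `LookupService` constructor shown in the question. Checking `exists()` on the result before opening it gives a clearer failure than an exception from inside the library.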

from https://stackoverflow.com/questions/25193145/accessing-maxmind-geo-api-in-hadoop-using-distributed-cache by cc-by-sa and MIT license
