How to generate all n-grams in Hive

I would like to create a list of n-grams using HiveQL. My idea was to use a regex with a lookahead together with the split function. This does not work, though:

select split('This is my sentence', '(\\S+) +(?=(\\S+))');

The input is a column of the form

|sentence                 |
|-------------------------|
|This is my sentence      |
|This is another sentence |

The output should be:

["This is","is my","my sentence"]
["This is","is another","another sentence"]

Hive has an n-grams UDF, but that function directly calculates the frequency of the n-grams. I would like to have a list of all the n-grams instead.
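For reference, here is a minimal sketch of the built-in approach mentioned above, assuming a hypothetical table `my_table` with a string column `sentence` (both names are placeholders, as is the choice of k = 10). `sentences()` tokenizes the text into an array of word arrays, and the `ngrams()` aggregate returns an estimate of the k most frequent n-grams across all rows, not a per-row list:

```sql
-- my_table is a hypothetical table name; sentence is the input column.
-- Returns the estimated 10 most frequent bigrams over the whole table,
-- each paired with its estimated frequency.
select ngrams(sentences(sentence), 2, 10) as top_bigrams
  from my_table;
```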

Many thanks!

Solution

==============================

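1. This is maybe not the most optimal solution, but it works. Split the sentence by a delimiter (in my example, one or more spaces or commas), explode the words together with their positions, join each word to the one that follows it to get the n-grams, and then assemble an array of n-grams using collect_set (if you need unique n-grams) or collect_list: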

with src as 
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence', 
                     'This is another sentence'
                     ) as sentence
      ) source_data 
        --split and explode words
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)

select s1.sentence, collect_set(concat_ws(' ',s1.word, s2.word)) as ngrams 
      from src s1 
           inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos              
group by s1.sentence;

Result:

OK
This is another sentence        ["This is","is another","another sentence"]
This is my sentence             ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
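
The same pattern can probably be extended to longer n-grams by adding one self-join per extra word. The following is a sketch for trigrams, not part of the original answer: it reuses the `src` CTE from the query above and adds the alias `s3` for the third word. It uses collect_list so that repeated trigrams are kept. Note that Hive does not guarantee the order of elements in the collected array.

```sql
with src as
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence',
                     'This is another sentence'
                     ) as sentence
      ) source_data
        --split and explode words, keeping each word's position
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)

--s2 is the word right after s1, s3 is the word two positions after s1
select s1.sentence, collect_list(concat_ws(' ', s1.word, s2.word, s3.word)) as trigrams
      from src s1
           inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos
           inner join src s3 on s1.sentence=s3.sentence and s1.pos+2=s3.pos
group by s1.sentence;
```

Sentences with fewer than three words produce no rows here, because the inner joins find no match for them.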

from https://stackoverflow.com/questions/52782188/how-to-generate-all-n-grams-in-hive by cc-by-sa and MIT license
