How to query a struct array with Hive (get_json_object)?

I store the following JSON objects in a Hive table:

{
  "main_id": "qwert",
  "features": [
    {
      "scope": "scope1",
      "name": "foo",
      "value": "ab12345",
      "age": 50,
      "somelist": ["abcde","fghij"]
    },
    {
      "scope": "scope2",
      "name": "bar",
      "value": "cd67890"
    },
    {
      "scope": "scope3",
      "name": "baz",
      "value": [
        "A",
        "B",
        "C"
      ]
    }
  ]
}

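"features" is an array of varying length, i.e. all of its objects are optional. The objects can carry arbitrary elements, but every one of them contains "scope", "name" and "value".

This is the Hive table I created: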

CREATE TABLE tbl(
main_id STRING,features array<struct<scope:STRING,name:STRING,value:array<STRING>,age:INT,somelist:array<STRING>>>
)

I need a Hive query that returns the main_id and the value of the struct whose name is "baz", i.e.,

main_id baz_value
qwert ["A","B","C"]

My problem is that the Hive UDF "get_json_object" supports only a limited version of JSONPath. It does not support a path like get_json_object(features, '$.features[?(@.name='baz')]').

How can I query the wanted result with Hive? Would a different Hive table structure make this easier?

Solutions

==============================
1. I found a solution to this problem.

Use the Hive UDTF explode to explode the struct array, i.e., create a second (temporary) table with one record for each struct in the array "features".
```
CREATE TABLE tbl_exploded as
select main_id, 
f.name as f_name,
f.value as f_value
from tbl
LATERAL VIEW explode(features) exploded_table as f
-- optionally filter here instead of in 2nd query:
-- where f.name = 'baz'; 
```
The result of this is:
```
qwert, foo, ab12345
qwert, bar, cd67890
qwert, baz, ["A","B","C"]
```
Now you can select the main_id and value like this:
```
select main_id, f_value from tbl_exploded where f_name = 'baz';
```
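
To make explicit what the two queries above compute, here is a minimal plain-Java sketch with no Hive dependency. The names ExplodeSketch, explode, whereName and sampleRows are hypothetical, and scalar values are shown as one-element lists purely to match the declared array<STRING> column type; how a SerDe would actually load them is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of LATERAL VIEW explode(features) plus a filter on the name field. */
class ExplodeSketch {

    /** One struct of the "features" array (only the fields the query selects). */
    static final class Feature {
        final String name;
        final List<String> value;

        Feature(String name, List<String> value) {
            this.name = name;
            this.value = value;
        }
    }

    /** One row of tbl: main_id plus its features array. */
    static final class Row {
        final String mainId;
        final List<Feature> features;

        Row(String mainId, List<Feature> features) {
            this.mainId = mainId;
            this.features = features;
        }
    }

    /** One record of tbl_exploded: (main_id, f_name, f_value). */
    static final class Exploded {
        final String mainId;
        final String fName;
        final List<String> fValue;

        Exploded(String mainId, String fName, List<String> fValue) {
            this.mainId = mainId;
            this.fName = fName;
            this.fValue = fValue;
        }

        @Override
        public String toString() {
            return mainId + ", " + fName + ", " + fValue;
        }
    }

    /** Emits one record per array element; rows with a null or empty array emit nothing. */
    static List<Exploded> explode(List<Row> rows) {
        List<Exploded> out = new ArrayList<>();
        for (Row row : rows) {
            if (row.features == null) {
                continue;
            }
            for (Feature f : row.features) {
                out.add(new Exploded(row.mainId, f.name, f.value));
            }
        }
        return out;
    }

    /** Counterpart of: where f_name = 'name'. */
    static List<Exploded> whereName(List<Exploded> records, String name) {
        List<Exploded> out = new ArrayList<>();
        for (Exploded e : records) {
            if (name.equals(e.fName)) {
                out.add(e);
            }
        }
        return out;
    }

    /** The sample document from the question (assumption: scalars wrapped as one-element lists). */
    static List<Row> sampleRows() {
        return Arrays.asList(new Row("qwert", Arrays.asList(
                new Feature("foo", Arrays.asList("ab12345")),
                new Feature("bar", Arrays.asList("cd67890")),
                new Feature("baz", Arrays.asList("A", "B", "C")))));
    }

    public static void main(String[] args) {
        for (Exploded e : whereName(explode(sampleRows()), "baz")) {
            System.out.println(e);
        }
    }
}
```

The sketch also shows why the filter can be applied in either query: exploding first and filtering afterwards selects the same records as filtering while exploding.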

==============================

2. This should work.

Parsing JSON with a path:

ADD JAR your-path/ParseJsonWithPath.jar;
CREATE TEMPORARY FUNCTION parseJsonWithPath AS 'com.ntc.hive.udf.ParseJsonWithPath';

SELECT parseJsonWithPath(jsonStr, xpath) FROM ....

The field to parse can be a JSON string (jsonStr); given an xpath, the function returns the part you want.

For example:

jsonStr
{ "book": [
    {
        "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
    },
    {
        "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
    }
  ]
}

xpath
"$.book"
        returns the inner JSON string [....]
"$.book[?(@.price < 10)]"
        returns [8.95]


==============================

3. The UDF pasted below is close to what you need. It takes an array<struct>, a string, and an integer. The string is the name of the field to look up ("name" in your case), and the third argument is the value to match. At the moment it expects an integer, but it should be relatively easy to change that to string/text for your purposes.

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableConstantIntObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableConstantStringObjectInspector;

@Description(name = "extract_value",
    value = "_FUNC_(array<struct<...>>, field_name, int_value) - Returns the first struct in the array whose bigint field field_name equals int_value, or NULL if no element matches",
    extended = "Example:\n SELECT _FUNC_(array_of_structs, 'field_name', 9)")
public class StructFromArrayStructDynamicInt
        extends GenericUDF
{
    private ListObjectInspector listOI;
    private StructObjectInspector structOI;
    private StructField indexField;
    private LongObjectInspector indOI;
    private WritableConstantIntObjectInspector element2OI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] args)
            throws UDFArgumentException
    {
        if (args.length != 3) {
            throw new UDFArgumentLengthException("The function extract_value() requires exactly three arguments.");
        }

        // Validate the argument types before casting any of the inspectors.
        if (args[0].getCategory() != Category.LIST) {
            throw new UDFArgumentTypeException(0, "Type array<struct> is expected to be the argument for extract_value but " + args[0].getTypeName() + " is found instead");
        }
        if (args[1].getCategory() != Category.PRIMITIVE
                || ((PrimitiveObjectInspector) args[1]).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentTypeException(1, "Second argument is expected to be of string type but " + args[1].getTypeName() + " is found instead");
        }
        if (args[2].getCategory() != Category.PRIMITIVE
                || ((PrimitiveObjectInspector) args[2]).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.INT) {
            throw new UDFArgumentTypeException(2, "Third argument is expected to be of int type but " + args[2].getTypeName() + " is found instead");
        }

        listOI = (ListObjectInspector) args[0];
        structOI = (StructObjectInspector) listOI.getListElementObjectInspector();

        // Both the field name and the value to match must be constants (literals in the query).
        WritableConstantStringObjectInspector element1OI = (WritableConstantStringObjectInspector) args[1];
        element2OI = (WritableConstantIntObjectInspector) args[2];

        String indexName = element1OI.getWritableConstantValue().toString();

        indexField = structOI.getStructFieldRef(indexName);
        if (indexField == null) {
            throw new UDFArgumentTypeException(0, "Index field not in input structure");
        }

        // The field to compare must be a primitive of type long (bigint).
        // Its object inspector is stored for use in the evaluate() method.
        ObjectInspector fieldOI = indexField.getFieldObjectInspector();
        if (fieldOI.getCategory() != Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0, "index field must be of primitive type");
        }
        if (((PrimitiveObjectInspector) fieldOI).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.LONG) {
            throw new UDFArgumentTypeException(0, "index field must be of long type");
        }
        indOI = (LongObjectInspector) fieldOI;

        // The function returns one element of the input array, i.e. a struct.
        return listOI.getListElementObjectInspector();
    }

    @Override
    public Object evaluate(DeferredObject[] arguments)
            throws HiveException
    {
        if (arguments.length != 3) {
            return null;
        }

        Object list = arguments[0].get();
        if (list == null) {
            return null;
        }

        int numElements = listOI.getListLength(list);
        long target = element2OI.get(arguments[2].get());

        for (int i = 0; i < numElements; i++) {
            Object element = listOI.getListElement(list, i);
            Object fieldData = structOI.getStructFieldData(element, indexField);
            if (fieldData == null) {
                continue;
            }
            // Read the value through the object inspector instead of casting to Long,
            // so that lazy and writable representations are handled as well.
            if (indOI.get(fieldData) == target) {
                return element;
            }
        }
        return null;
    }

    @Override
    public String getDisplayString(String[] strings) {
        assert (strings.length > 0);
        return "extract_value(" + String.join(", ", strings) + ")";
    }
}

There are other working UDFs that use this code to operate on array<struct>.
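
As a rough illustration of the lookup the UDF performs, and of the string-matching variant the answer describes as easy to derive, here is a plain-Java sketch with no Hive dependency. Structs are modelled as Map<String, Object>, and the names StructLookupSketch, firstMatch, struct and sampleFeatures are hypothetical; a real Hive version would still have to read the field through ObjectInspectors as in the code above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of "return the first struct whose field equals a given string". */
class StructLookupSketch {

    /** Returns the first struct whose field fieldName equals expected, or null if none matches. */
    static Map<String, Object> firstMatch(List<Map<String, Object>> structs,
                                          String fieldName,
                                          String expected) {
        if (structs == null) {
            return null;
        }
        for (Map<String, Object> s : structs) {
            // String.equals(Object) is false for null and for non-string values.
            if (expected.equals(s.get(fieldName))) {
                return s;
            }
        }
        return null;
    }

    /** Builds a struct with the two fields used in this example. */
    static Map<String, Object> struct(String name, Object value) {
        Map<String, Object> s = new LinkedHashMap<>();
        s.put("name", name);
        s.put("value", value);
        return s;
    }

    /** The "features" array from the question, reduced to name and value. */
    static List<Map<String, Object>> sampleFeatures() {
        return Arrays.asList(
                struct("foo", "ab12345"),
                struct("bar", "cd67890"),
                struct("baz", Arrays.asList("A", "B", "C")));
    }

    public static void main(String[] args) {
        Map<String, Object> hit = firstMatch(sampleFeatures(), "name", "baz");
        System.out.println(hit == null ? null : hit.get("value"));
    }
}
```

Like the UDF, the sketch stops at the first matching element and returns the whole struct, so the caller picks the field it needs (here "value") from the result.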

from https://stackoverflow.com/questions/28716165/how-to-query-struct-array-with-hive-get-json-object by cc-by-sa and MIT license
