How to access sub-entities in a JSON file?

I have a JSON file that looks like this:

{
  "employeeDetails":{
    "name": "xxxx",
    "num":"415"
  },
  "work":[
    {
      "monthYear":"01/2007",
      "workdate":"1|2|3|....|31",
      "workhours":"8|8|8....|8"
    },
    {
      "monthYear":"02/2007",
      "workdate":"1|2|3|....|31",
      "workhours":"8|8|8....|8"
    }
  ]
}

I want to get workdate and workhours from this JSON data.

I tried this:

import org.apache.spark.sql.SparkSession

object JSON2 {
  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession.builder()
        .appName("SQL-JSON")
        .master("local[4]")
        .getOrCreate()

    import spark.implicits._

    val employees = spark.read.json("sample.json")
    employees.printSchema()
    employees.select("employeeDetails").show()
  }
}

I get this exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`employeeDetails`' given input columns: [_corrupt_record];;
'Project ['employeeDetails]
+- Relation[_corrupt_record#0] json

I am new to Spark.

Solution

==============================

1. The reason is that Spark only supports JSON files in which "each line must contain a separate, self-contained valid JSON object", to quote the JSON Datasets section of the Spark SQL guide.

If a JSON file is not valid for Spark, its contents are stored under _corrupt_record (a name you can change with the columnNameOfCorruptRecord option).

scala> spark.read.json("employee.json").printSchema
root
 |-- _corrupt_record: string (nullable = true)
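
The columnNameOfCorruptRecord option mentioned above can be set on the reader itself. The following is a minimal sketch rather than a definitive implementation: the object name CorruptRecordColumn, the method read and its parameters are made up for illustration, and only the option name comes from the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CorruptRecordColumn {
  // Reads `path` as line-delimited JSON. Input that Spark cannot parse is
  // kept as a raw string in a column named `columnName` (hypothetical
  // parameter) instead of the default `_corrupt_record`.
  def read(spark: SparkSession, path: String, columnName: String): DataFrame =
    spark.read
      .option("columnNameOfCorruptRecord", columnName)
      .json(path)
}
```

Calling CorruptRecordColumn.read(spark, "employee.json", "bad_json").printSchema on the multi-line file should then list bad_json where _corrupt_record appeared before.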

And your file is invalid not only because it is multi-line JSON, but also because jq (a lightweight and flexible command-line JSON processor) says so.

$ cat incorrect.json
{
  "employeeDetails":{
    "name": "xxxx",
    "num:"415"
  }
  "work":[
  {
    "monthYear":"01/2007"
    "workdate":"1|2|3|....|31",
    "workhours":"8|8|8....|8"
  },
  {
    "monthYear":"02/2007"
    "workdate":"1|2|3|....|31",
    "workhours":"8|8|8....|8"
  }
  ],
}
$ cat incorrect.json | jq
parse error: Expected separator between values at line 4, column 14

Once you have fixed the JSON file, use the following trick to load the multi-line JSON file.

scala> spark.version
res5: String = 2.1.1

scala> val employees = spark.read.json(sc.wholeTextFiles("employee.json").values)
scala> employees.printSchema
root
 |-- employeeDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- num: string (nullable = true)
 |-- work: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- monthYear: string (nullable = true)
 |    |    |-- workdate: string (nullable = true)
 |    |    |-- workhours: string (nullable = true)

scala> employees.select("employeeDetails").show()
+---------------+
|employeeDetails|
+---------------+
|     [xxxx,415]|
+---------------+

As of Spark 2.2 (released very recently and highly recommended), you should use the multiLine option instead. The multiLine option was added in SPARK-20980 (Rename the option wholeFile to multiLine for JSON and CSV).

scala> spark.version
res0: String = 2.2.0

scala> spark.read.option("multiLine", true).json("employee.json").printSchema
root
 |-- employeeDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- num: string (nullable = true)
 |-- work: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- monthYear: string (nullable = true)
 |    |    |-- workdate: string (nullable = true)
 |    |    |-- workhours: string (nullable = true)
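
With the schema above, the workdate and workhours values the question asks about sit inside the work array. The following is a minimal sketch of one way to reach them, assuming a DataFrame with that schema. The object name WorkFields, the method flattenWork and the alias w are hypothetical. explode produces one row per array element, and split turns each pipe-delimited string into an array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split}

object WorkFields {
  // Returns one row per element of the `work` array, with the
  // pipe-delimited `workdate` and `workhours` strings split into arrays.
  def flattenWork(employees: DataFrame): DataFrame =
    employees
      .select(
        col("employeeDetails.name").as("name"),
        explode(col("work")).as("w"))          // one row per month
      .select(
        col("name"),
        col("w.monthYear").as("monthYear"),
        split(col("w.workdate"), "\\|").as("workdate"),   // "|" is a regex metacharacter
        split(col("w.workhours"), "\\|").as("workhours"))
}
```

For example, WorkFields.flattenWork(employees).show(false) should print one row per monthYear entry.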

from https://stackoverflow.com/questions/44814926/how-to-access-sub-entities-in-json-file by cc-by-sa and MIT license
