Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. So far I have come up with the following code, which only replaces a single column name.

for( i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i), 
    df.columns(i).toLowerCase
  );
}

Solution

==============================

1. If the structure is flat:

import spark.implicits._  // pre-imported in spark-shell; needed elsewhere for toDF and $"..."

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

The simplest thing you can do is to use the toDF method:

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)
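
For the question's actual goal (lowercasing every header), the new names can be derived from df.columns instead of being listed by hand. This is a minimal sketch, assuming a local SparkSession; the helper name lowercaseColumns, the app name, and the sample column names are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("rename-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: returns a new DataFrame whose column names are all lowercased.
def lowercaseColumns(df: DataFrame): DataFrame =
  df.toDF(df.columns.map(_.toLowerCase): _*)

// Hypothetical input with mixed-case headers.
val mixed = Seq((1L, "a", 3.0)).toDF("Id", "UserName", "SCORE")
val lowered = lowercaseColumns(mixed)

lowered.columns.foreach(println)
// id
// username
// score
```

Note that toDF returns a new DataFrame, so mixed keeps its original headers and the result has to be assigned to a name.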

If you want to rename individual columns, you can use either select with alias:

df.select($"_1".alias("x1"))

which can easily be generalized to multiple columns:

import org.apache.spark.sql.functions.col

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:

df.withColumnRenamed("_1", "x1")

Use it with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
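
This also explains why the loop in the question appears to do nothing: withColumnRenamed returns a new DataFrame and leaves the one it was called on unchanged, so every result inside the loop is discarded. The sketch below contrasts the two, assuming a local SparkSession and made-up column names:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("fold-rename-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input with upper- and mixed-case headers.
val original = Seq((1L, "a", "foo", 3.0)).toDF("ID", "Name", "Label", "Score")

// The question's approach: each call builds a renamed DataFrame that is thrown away,
// so `original` still has its old headers afterwards.
for (c <- original.columns) original.withColumnRenamed(c, c.toLowerCase)

// foldLeft keeps the result of each rename and feeds it into the next one.
val renamed = original.columns.foldLeft(original)((acc, c) =>
  acc.withColumnRenamed(c, c.toLowerCase))

println(original.columns.mkString(", ")) // ID, Name, Label, Score
println(renamed.columns.mkString(", "))  // id, name, label, score
```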

With nested structures (structs), one possible option is renaming by selecting a whole structure:

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

import org.apache.spark.sql.functions.struct

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

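Note that this may affect nullability metadata: in the output above, the rebuilt structs are reported as nullable = false. Another possibility is to rename by casting: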

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

or:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
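
When every nested field should get the same kind of rename, the target type for the cast can be derived from the existing schema instead of being written out by hand. The following is only a sketch under stated assumptions: renameFields is a hypothetical helper (not part of Spark), a local SparkSession is used, and the example uppercases all field names purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("nested-rename-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: apply f to every field name in a (possibly nested) type,
// keeping data types and nullability unchanged.
def renameFields(dt: DataType, f: String => String): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map(fld =>
      fld.copy(name = f(fld.name), dataType = renameFields(fld.dataType, f))))
  case ArrayType(elem, containsNull) =>
    ArrayType(renameFields(elem, f), containsNull)
  case other => other
}

// Same sample record as above, read from a Dataset[String].
val nestedDf = spark.read.json(Seq(
  """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
).toDS)

// Cast each top-level column to its renamed type, then alias the column itself.
val upper = nestedDf.select(nestedDf.schema.fields.map { fld =>
  col(fld.name).cast(renameFields(fld.dataType, _.toUpperCase)).as(fld.name.toUpperCase)
}: _*)

upper.printSchema()
```

Because the cast matches struct fields by position, the data is carried over as-is and only the names change. Top-level column names containing dots would need backtick quoting, which this sketch does not handle.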

==============================
2. For those of you interested in the PySpark version (actually it's the same in Scala - see the comment below):
```
merchants_df_renamed = merchants_df.toDF(
    'merchant_id', 'category', 'subcategory', 'merchant')

merchants_df_renamed.printSchema()
```
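Result: the printed schema lists the columns under the new names merchant_id, category, subcategory, and merchant.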
==============================
3.
```
import org.apache.spark.sql.DataFrame

// Alias every column of t as prefix p + original name + suffix s.
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
  t.select(t.columns.map(c => t.col(c).as(p + c + s)): _*)
```
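In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns that share a name, and you want to join them but still be able to disambiguate the columns in the resulting table. It sure would be nice if there were a similar way to do this in "normal" SQL.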
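
As a rough illustration of the join use case, the sketch below prefixes two tables that share the column names id and name before joining them. It assumes a local SparkSession; the tables, their contents, and the prefixes u_ and g_ are made up:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("alias-join-sketch").getOrCreate()
import spark.implicits._

// Same helper as above, repeated so the snippet runs on its own.
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
  t.select(t.columns.map(c => t.col(c).as(p + c + s)): _*)

// Hypothetical tables that share the column names "id" and "name".
val users  = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")
val groups = Seq((1L, "admins")).toDF("id", "name")

val u = aliasAllColumns(users, p = "u_")
val g = aliasAllColumns(groups, p = "g_")

// After prefixing, every column in the join result has a unique name.
val joined = u.join(g, u("u_id") === g("g_id"))

joined.columns.foreach(println)
// u_id
// u_name
// g_id
// g_name
```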

from https://stackoverflow.com/questions/35592917/renaming-column-names-of-a-dataframe-in-spark-scala by cc-by-sa and MIT license
