
Flatten a Scala array-type column into multiple columns

Is it possible to flatten an array column in a Scala DataFrame?

As far as I know, selecting each nested field such as `filed.a` works, but I don't want to specify the fields manually.

df.printSchema()
|-- client_version: string (nullable = true)
|-- filed: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
Desired final df:

df.printSchema()

 |-- client_version: string (nullable = true)
 |-- filed_a: string (nullable = true)
 |-- filed_b: string (nullable = true)
 |-- filed_c: string (nullable = true)
 |-- filed_d: string (nullable = true)
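The desired reshaping can be sketched in plain Scala (no SparkSession; the sample data below is hypothetical, mirroring the question's schema): each element of the `filed` array becomes its own row, with the struct fields promoted to `filed_`-prefixed top-level columns.

```scala
// Hypothetical sample mirroring the question's schema (plain Scala, no Spark).
case class S(a: String, b: String, c: String, d: String)

val df = Seq(
  ("1.0", Seq(S("a1", "b1", "c1", "d1"), S("a2", "b2", "c2", "d2")))
)

// "Explode" the array: one output row per struct element,
// with the struct fields promoted to top-level values.
val flattened = df.flatMap { case (version, filed) =>
  filed.map(s => (version, s.a, s.b, s.c, s.d))
}

println(flattened)
// List((1.0,a1,b1,c1,d1), (1.0,a2,b2,c2,d2))
```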

社区小助手 2018-12-19 16:55:14
1 answer
  • 社区小助手 is an administrator of the Spark China community. I regularly post livestream recaps and other useful articles, and I also compile the Spark questions and answers raised in the DingTalk group.

    You can flatten the ArrayType column with `explode` and map the nested struct element names to the desired top-level column names, like this:
    import org.apache.spark.sql.functions._
    import spark.implicits._

    case class S(a: String, b: String, c: String, d: String)

    val df = Seq(
      ("1.0", Seq(S("a1", "b1", "c1", "d1"))),
      ("2.0", Seq(S("a2", "b2", "c2", "d2"), S("a3", "b3", "c3", "d3")))
    ).toDF("client_version", "filed")

    df.printSchema
    // root
    // |-- client_version: string (nullable = true)
    // |-- filed: array (nullable = true)
    // | |-- element: struct (containsNull = true)
    // | | |-- a: string (nullable = true)
    // | | |-- b: string (nullable = true)
    // | | |-- c: string (nullable = true)
    // | | |-- d: string (nullable = true)

    val dfFlattened = df.withColumn("filed_element", explode($"filed"))

    val structElements = dfFlattened.select($"filed_element.*").columns

    val dfResult = dfFlattened.select(
      col("client_version") +: structElements.map(
        c => col(s"filed_element.$c").as(s"filed_$c")
      ): _*
    )

    dfResult.show
    // +--------------+-------+-------+-------+-------+
    // |client_version|filed_a|filed_b|filed_c|filed_d|
    // +--------------+-------+-------+-------+-------+
    // |           1.0|     a1|     b1|     c1|     d1|
    // |           2.0|     a2|     b2|     c2|     d2|
    // |           2.0|     a3|     b3|     c3|     d3|
    // +--------------+-------+-------+-------+-------+

    dfResult.printSchema
    // root
    // |-- client_version: string (nullable = true)
    // |-- filed_a: string (nullable = true)
    // |-- filed_b: string (nullable = true)
    // |-- filed_c: string (nullable = true)
    // |-- filed_d: string (nullable = true)
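The key trick above is deriving the struct's field names from the data rather than hardcoding them (`.select($"filed_element.*").columns`). As a plain-Scala aside (assuming Scala 2.13+, which adds `productElementNames` to case classes), the same names can be read off a case class instance and mapped to the prefixed column names:

```scala
// Assumes Scala 2.13+: productElementNames is available on any case class.
case class S(a: String, b: String, c: String, d: String)

val fieldNames = S("a1", "b1", "c1", "d1").productElementNames.toList
// List(a, b, c, d)

// The same renaming the answer performs with .as(s"filed_$c").
val renamed = fieldNames.map(n => s"filed_$n")

println(renamed)
// List(filed_a, filed_b, filed_c, filed_d)
```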


    Use `explode` to flatten the array by adding more rows, then `select` with the `*` notation to bring the struct's fields back up to the top level.

    import org.apache.spark.sql.functions.{collect_list, explode, struct}
    import spark.implicits._

    val df = Seq(
      ("1", "a", "a", "a"),
      ("1", "b", "b", "b"),
      ("2", "a", "a", "a"),
      ("2", "b", "b", "b"),
      ("2", "c", "c", "c"),
      ("3", "a", "a", "a")
    ).toDF("idx", "A", "B", "C")
      .groupBy("idx")
      .agg(collect_list(struct("A", "B", "C")).as("nested_col"))

    df.printSchema()
    // root
    // |-- idx: string (nullable = true)
    // |-- nested_col: array (nullable = true)
    // | |-- element: struct (containsNull = true)
    // | | |-- A: string (nullable = true)
    // | | |-- B: string (nullable = true)
    // | | |-- C: string (nullable = true)

    df.show
    // +---+--------------------+
    // |idx|          nested_col|
    // +---+--------------------+
    // |  3|         [[a, a, a]]|
    // |  1|[[a, a, a], [b, b...|
    // |  2|[[a, a, a], [b, b...|
    // +---+--------------------+

    val dfExploded = df.withColumn("exploded", explode($"nested_col")).drop("nested_col")

    dfExploded.show
    // +---+---------+
    // |idx| exploded|
    // +---+---------+
    // |  3|[a, a, a]|
    // |  1|[a, a, a]|
    // |  1|[b, b, b]|
    // |  2|[a, a, a]|
    // |  2|[b, b, b]|
    // |  2|[c, c, c]|
    // +---+---------+

    val finalDF = dfExploded.select("idx", "exploded.*")

    finalDF.show
    // +---+---+---+---+
    // |idx|  A|  B|  C|
    // +---+---+---+---+
    // |  3|  a|  a|  a|
    // |  1|  a|  a|  a|
    // |  1|  b|  b|  b|
    // |  2|  a|  a|  a|
    // |  2|  b|  b|  b|
    // |  2|  c|  c|  c|
    // +---+---+---+---+
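The round trip in this second example (nest with `groupBy` + `collect_list(struct(...))`, then un-nest with `explode` + `select("idx", "exploded.*")`) can be sketched on plain Scala collections, without Spark; `Row` here is a hypothetical case class, not Spark's `Row`:

```scala
// Hypothetical Row case class standing in for the (idx, A, B, C) records.
case class Row(idx: String, a: String, b: String, c: String)

val rows = Seq(
  Row("1", "a", "a", "a"), Row("1", "b", "b", "b"),
  Row("2", "a", "a", "a"), Row("2", "b", "b", "b")
)

// Nest: like groupBy("idx") + collect_list(struct("A", "B", "C")).
val nested: Map[String, Seq[(String, String, String)]] =
  rows.groupBy(_.idx).map { case (idx, rs) => idx -> rs.map(r => (r.a, r.b, r.c)) }

// Un-nest: like explode($"nested_col") followed by select("idx", "exploded.*").
val exploded = nested.toSeq.flatMap { case (idx, vs) =>
  vs.map { case (a, b, c) => Row(idx, a, b, c) }
}

// The original rows are recovered (row order aside).
assert(exploded.toSet == rows.toSet)
```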

    2019-07-17 23:23:03