3 多表操作Join
3.1 数据准备
先构建两个DataFrame
scala> val df1 = spark.createDataset(Seq(("a", 1,2), ("b",2,3) )).toDF("k1","k2","k3") df1: org.apache.spark.sql.DataFrame = [k1: string, k2: int ... 1 more field] scala> val df2 = spark.createDataset(Seq(("a", 2,2), ("b",3,3), ("b", 2,1), ("c", 1,1)) ).toDF("k1","k2","k4") df2: org.apache.spark.sql.DataFrame = [k1: string, k2: int ... 1 more field] scala> df1.show +---+---+---+ | k1| k2| k3| +---+---+---+ | a| 1| 2| | b| 2| 3| +---+---+---+ scala> df2.show +---+---+---+ | k1| k2| k4| +---+---+---+ | a| 2| 2| | b| 3| 3| | b| 2| 1| | c| 1| 1| +---+---+---+
3.2 Join算子说明
join比较通用两种调用方式,注意在usingColumns里的字段必须在两个DF中都存在
joinType:默认是 inner
. 必须是以下类型的一种:inner
, cross
, outer
, full
, full_outer
, left
, left_outer
,right, right_outer, left_semi, left_anti.
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
3.2.1 join内连接
//select * from df1 join df2 on df1.key1=df2.key1 //方法一 scala> df1.join(df2,"k1").show +---+---+---+---+---+ | k1| k2| k3| k2| k4| +---+---+---+---+---+ | a| 1| 2| 2| 2| | b| 2| 3| 2| 1| | b| 2| 3| 3| 3| +---+---+---+---+---+ //方法二 scala> df1.join(df2,df1("k1") === df2("k1")).show +---+---+---+---+---+---+ | k1| k2| k3| k1| k2| k4| +---+---+---+---+---+---+ | a| 1| 2| a| 2| 2| | b| 2| 3| b| 2| 1| | b| 2| 3| b| 3| 3| +---+---+---+---+---+---+ //方法三 scala> df1.join(df2,df1("k1") === df2("k1"),"inner").show +---+---+---+---+---+---+ | k1| k2| k3| k1| k2| k4| +---+---+---+---+---+---+ | a| 1| 2| a| 2| 2| | b| 2| 3| b| 2| 1| | b| 2| 3| b| 3| 3| +---+---+---+---+---+---+ //不同字段比较 //select * from df1 join df2 on df1.key2=df2.key4 scala> df1.join(df2,df1("k2") === df2("k4"),"inner").show +---+---+---+---+---+---+ | k1| k2| k3| k1| k2| k4| +---+---+---+---+---+---+ | a| 1| 2| c| 1| 1| | a| 1| 2| b| 2| 1| | b| 2| 3| a| 2| 2| +---+---+---+---+---+---+ //多个字段比较 //select * from df1 join df2 on df1.key1=df2.key1 and df1.key2=df2.key2 scala> df1.join(df2,Seq("k1","k2"),"inner").show +---+---+---+---+ | k1| k2| k3| k4| +---+---+---+---+ | b| 2| 3| 1| +---+---+---+---+
3.2.2 其他join类型,只需把inner改成你需要的类型即可
scala> df1.join(df2,Seq("k1"),"left").show +---+---+---+---+---+ | k1| k2| k3| k2| k4| +---+---+---+---+---+ | a| 1| 2| 2| 2| | b| 2| 3| 2| 1| | b| 2| 3| 3| 3| +---+---+---+---+---+ //左外连接 left_outer可简写为left scala> df1.join(df2,Seq("k1"),"left_outer").show +---+---+---+---+---+ | k1| k2| k3| k2| k4| +---+---+---+---+---+ | a| 1| 2| 2| 2| | b| 2| 3| 2| 1| | b| 2| 3| 3| 3| +---+---+---+---+---+ //左半连接 scala> df1.join(df2,Seq("k1"),"leftsemi").show +---+---+---+ | k1| k2| k3| +---+---+---+ | a| 1| 2| | b| 2| 3| +---+---+---+ scala> df1.join(df2,Seq("k1","k2"),"left").show +---+---+---+----+ | k1| k2| k3| k4| +---+---+---+----+ | a| 1| 2|null| | b| 2| 3| 1| +---+---+---+----+ scala> df1.join(df2,Seq("k1"),"right").show +---+----+----+---+---+ | k1| k2| k3| k2| k4| +---+----+----+---+---+ | a| 1| 2| 2| 2| | b| 2| 3| 3| 3| | b| 2| 3| 2| 1| | c|null|null| 1| 1| +---+----+----+---+---+
left是left_outer的简写
4 单表操作:列的增删改与空值处理