[Spark] Spark SQL Data Type Conversion

2022-06-11

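Preface

Data type conversion comes up in every language and framework. It looks simple, but getting comfortable with all of the data types takes some practice.

Spark SQL data types

Numeric types

ByteType: a 1-byte signed integer. Range: -128 to 127

ShortType: a 2-byte signed integer. Range: -32768 to 32767

IntegerType: a 4-byte signed integer. Range: -2147483648 to 2147483647

LongType: an 8-byte signed integer. Range: -9223372036854775808 to 9223372036854775807

FloatType: a 4-byte single-precision floating-point number

DoubleType: an 8-byte double-precision floating-point number

DecimalType: an arbitrary-precision decimal number, backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary-precision integer unscaled value and a 32-bit integer scale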

String, binary, and boolean types

StringType: a string value

BinaryType: a sequence of bytes

BooleanType: a boolean value
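
The ranges listed above match the bounds of the corresponding Scala primitive types, and the BigDecimal description can be checked directly on a value. The following is a minimal sketch in plain Scala with no Spark dependency; the object name and the sample value 123.45 are made up for illustration.

```scala
object NumericRangesSketch {
  // Hypothetical sample value used to show the unscaled-value/scale pair.
  val sample = new java.math.BigDecimal("123.45")

  def main(args: Array[String]): Unit = {
    // Bounds of the Scala types that back the Spark SQL integer types.
    println(s"ByteType:    ${Byte.MinValue} to ${Byte.MaxValue}")
    println(s"ShortType:   ${Short.MinValue} to ${Short.MaxValue}")
    println(s"IntegerType: ${Int.MinValue} to ${Int.MaxValue}")
    println(s"LongType:    ${Long.MinValue} to ${Long.MaxValue}")

    // A BigDecimal is stored as unscaledValue * 10^(-scale).
    println(s"unscaled value: ${sample.unscaledValue}") // 12345
    println(s"scale:          ${sample.scale}")         // 2
  }
}
```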

Datetime types

TimestampType: a value comprising the fields year, month, day, hour, minute, and second

DateType: a value comprising the fields year, month, and day

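Complex types

ArrayType(elementType, containsNull): a sequence of elements of type elementType. containsNull indicates whether the elements of the array may be null

MapType(keyType, valueType, valueContainsNull): a set of key-value pairs. keyType gives the type of the keys and valueType the type of the values. valueContainsNull indicates whether the values of the map may be null

StructType(fields): a value whose structure is described by a sequence of StructFields (fields)

StructField(name, dataType, nullable): one field of a StructType. name gives the field's name, dataType its data type, and nullable whether its values may be null.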
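
To make the complex types concrete, here is a small sketch that assembles a schema from StructType, StructField, ArrayType, and MapType. It only needs the spark-sql types package on the classpath; the object name and the field names (id, tags, scores) are hypothetical.

```scala
import org.apache.spark.sql.types._

object ComplexTypesSketch {
  // A struct with one plain field, one array field, and one map field.
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    // Array of strings whose elements may be null.
    StructField("tags", ArrayType(StringType, containsNull = true), nullable = true),
    // Map from string keys to double values; the values may not be null.
    StructField("scores", MapType(StringType, DoubleType, valueContainsNull = false), nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    // Prints the schema as an indented tree.
    println(schema.treeString)
  }
}
```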

Spark SQL data types compared with Scala data types
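
As a rough sketch of how the two sides line up, the example below pairs an explicit schema with one Row built from plain Scala and Java values. The pairings in the comments follow the Spark SQL documentation as far as I know (for example DecimalType with java.math.BigDecimal, TimestampType with java.sql.Timestamp, ArrayType with Seq, StructType with Row). The object name, column names, and values are made up for illustration.

```scala
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object TypeMappingSketch {
  // Spark SQL type in each field, matching Scala/Java value type in the comment.
  val schema: StructType = StructType(Seq(
    StructField("b", ByteType),                           // Byte
    StructField("s", ShortType),                          // Short
    StructField("i", IntegerType),                        // Int
    StructField("l", LongType),                           // Long
    StructField("f", FloatType),                          // Float
    StructField("d", DoubleType),                         // Double
    StructField("dec", DecimalType(10, 2)),               // java.math.BigDecimal
    StructField("str", StringType),                       // String
    StructField("bin", BinaryType),                       // Array[Byte]
    StructField("bool", BooleanType),                     // Boolean
    StructField("ts", TimestampType),                     // java.sql.Timestamp
    StructField("date", DateType),                        // java.sql.Date
    StructField("arr", ArrayType(IntegerType)),           // Seq[Int]
    StructField("map", MapType(StringType, IntegerType)), // Map[String, Int]
    StructField("st", StructType(Seq(StructField("x", StringType)))) // Row
  ))

  // One value per field, in schema order.
  val row: Row = Row(
    1.toByte, 2.toShort, 3, 4L, 1.5f, 2.5d,
    new java.math.BigDecimal("12.34"), "text", Array[Byte](1, 2), true,
    Timestamp.valueOf("2022-06-11 10:00:00"), Date.valueOf("2022-06-11"),
    Seq(1, 2), Map("k" -> 1), Row("inner")
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("type-mapping").master("local[*]").getOrCreate()
    val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(row)), schema)
    df.printSchema()
    df.show(truncate = false)
    spark.stop()
  }
}
```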

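Spark SQL data type conversion examples

In one sentence: call the cast method of the Column class.

How to obtain a Column

This was covered in an earlier post; the common forms are: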

df("columnName")            // On a specific `df` DataFrame.
col("columnName")           // A generic column not yet associated with a DataFrame.
col("columnName.field")     // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName"               // Scala short hand for a named column.
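
The forms above need a couple of imports before they compile: col comes from org.apache.spark.sql.functions, and the $"..." shorthand comes from spark.implicits._. A minimal sketch, assuming a local SparkSession and a small in-memory DataFrame (the object name and the sample row are hypothetical):

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnAccessSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("column-access").master("local[*]").getOrCreate()
    import spark.implicits._ // brings $"..." and toDF into scope

    val df = Seq((1, "tom", 23)).toDF("id", "name", "age")

    val byDataFrame: Column = df("age")  // bound to this DataFrame
    val byFunction: Column  = col("age") // resolved when used in a query
    val byShorthand: Column = $"age"     // Scala shorthand for a named column

    // All three refer to the same column when selected from df.
    df.select(byDataFrame, byFunction, byShorthand).show()
    spark.stop()
  }
}
```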

Preparing the test data

1,tom,23
2,jack,24
3,lily,18
4,lucy,19
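
The later snippets read this data from ./data/user. If that file does not exist yet, one way to create it is the sketch below, which uses only java.nio; the object name is hypothetical and the path is taken from the read calls in this post.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object WriteTestData {
  def main(args: Array[String]): Unit = {
    val path = Paths.get("./data/user")
    // Create the ./data directory if it is missing.
    Files.createDirectories(path.getParent)

    val lines = Seq("1,tom,23", "2,jack,24", "3,lily,18", "4,lucy,19")
    // Write one record per line, overwriting any existing file.
    Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
  }
}
```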

Spark entry point code

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("test")
  .master("local[*]") // run locally, using all available cores
  .getOrCreate()

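Checking the default data types

import spark.implicits._ // encoders for map, plus the toDF conversion

spark.read
  .textFile("./data/user")      // Dataset[String], one element per line
  .map(_.split(","))
  .map(x => (x(0), x(1), x(2))) // every field is still a String
  .toDF("id", "name", "age")
  .dtypes
  .foreach(println)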

Result:

(id,StringType)
(name,StringType)
(age,StringType)

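This shows that every column defaults to StringType, since each field comes from splitting a line of text.

Casting the numeric columns to IntegerType

import spark.implicits._ // enables the $"..." column syntax and toDF

spark.read
  .textFile("./data/user")
  .map(_.split(","))
  .map(x => (x(0), x(1), x(2)))
  .toDF("id", "name", "age")
  .select($"id".cast("int"), $"name", $"age".cast("int")) // cast id and age to IntegerType
  .dtypes
  .foreach(println)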

Result:

(id,IntegerType)
(name,StringType)
(age,IntegerType)
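
The select-based cast lists every column explicitly. As an alternative sketch, withColumn can replace a column with its cast version and leave the others untouched. The example uses a small in-memory DataFrame in place of the ./data/user file so that it stands alone; the object name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object WithColumnCastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("with-column-cast").master("local[*]").getOrCreate()
    import spark.implicits._

    // All three columns start out as strings, as in the file-based example.
    val df = Seq(("1", "tom", "23"), ("2", "jack", "24")).toDF("id", "name", "age")

    // Reusing an existing column name replaces that column in place.
    val typed = df
      .withColumn("id", $"id".cast("int"))
      .withColumn("age", $"age".cast("int"))

    typed.dtypes.foreach(println)
    spark.stop()
  }
}
```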

The two overloads of the Column cast method

The first

def cast(to: String): Column

Casts the column to a different data type, using the canonical string representation of the type. The supported types are:

string, boolean, byte, short, int, long, float, double, decimal, date, timestamp.

// Casts colA to integer.
df.select(df("colA").cast("int"))
Since 1.3.0

The second

def cast(to: DataType): Column

Casts the column to a different data type.

// Casts colA to IntegerType.
import org.apache.spark.sql.types.IntegerType
df.select(df("colA").cast(IntegerType))
// equivalent to
df.select(df("colA").cast("int"))
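
The two overloads are not limited to int. The sketch below casts the same string columns once by type name and once by DataType object, then compares the resulting schemas. It assumes a local SparkSession; the object name, column names, and sample values are hypothetical, and the parameterized form "decimal(10,2)" is used on the assumption that the string overload accepts it the same way it accepts the plain type names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, DecimalType, DoubleType}

object CastOverloadsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cast-overloads").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("1", "19.99", "2022-06-11")).toDF("id", "price", "day")

    // String overload: the target type is given by its string form.
    val byName = df.select(
      $"id".cast("double"), $"price".cast("decimal(10,2)"), $"day".cast("date"))

    // DataType overload: the same targets expressed as DataType objects.
    val byType = df.select(
      $"id".cast(DoubleType), $"price".cast(DecimalType(10, 2)), $"day".cast(DateType))

    byName.printSchema()
    // Compares the schemas produced by the two overloads.
    println(s"schemas match: ${byName.schema == byType.schema}")
    spark.stop()
  }
}
```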