Convert null values to empty array in Spark DataFrame

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.

I thought I could do it like so:

val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )

However, this results in the following exception:

java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)

Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?

In case it is relevant, here is the schema for this column:

|-- myCol: array (nullable = true)
|    |-- element: integer (containsNull = false)

1 Answer

深蓝 · Answer 1 · 2021-10-16T23:34:24+0000

You can use a UDF that returns an empty array:

import org.apache.spark.sql.functions.{coalesce, udf, when}

val array_ = udf(() => Array.empty[Int])

combined with when or coalesce:

df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show

In recent versions you can use the array function instead of a UDF:

import org.apache.spark.sql.functions.{array, coalesce, when}

df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show

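Note that array() called without arguments does not produce an array<integer> column on its own (in older Spark versions its element type defaults to string), so this works only if a cast from that element type to the desired type is allowed.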
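
To make the coalesce variant easier to try out, here is a minimal end-to-end sketch. The sample data, the object name EmptyArrayExample, and the local SparkSession settings are assumptions made for illustration; only the coalesce(..., array().cast(...)) expression comes from the answer itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, coalesce, col}

object EmptyArrayExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-array-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: id 3 has no match on the right side.
    val left = Seq(1, 2, 3).toDF("id")
    val right = Seq((1, Seq(10, 11)), (2, Seq(20))).toDF("id", "myCol")

    // The left outer join leaves myCol null for the unmatched row.
    val df = left.join(right, Seq("id"), "left_outer")

    // Replace null arrays with an empty array cast to the wanted element type.
    val cleaned = df.withColumn(
      "myCol",
      coalesce(col("myCol"), array().cast("array<int>"))
    )

    cleaned.printSchema()
    cleaned.show()

    spark.stop()
  }
}
```

It is worth checking the printSchema output after the cast, since the nullability flags of the resulting column may differ from those of the original column.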

The same can of course be done in PySpark. For the legacy solution, you can define a UDF:

from pyspark.sql.functions import coalesce, udf
from pyspark.sql.types import ArrayType, IntegerType

def empty_array(t):
    # t is a DataType instance such as IntegerType(), so it is passed as-is
    return udf(lambda: [], ArrayType(t))()

coalesce(myCol, empty_array(IntegerType()))

and in recent versions just use array:

from pyspark.sql.functions import array, coalesce

coalesce(myCol, array().cast("array<integer>"))
