I have the following three DataFrames:
import spark.implicits._

val df_1 = spark.sparkContext.parallelize(Seq(
("FIDs", "123456")
)).toDF("subComponentName", "FID_HardVer")
val df_2 = spark.sparkContext.parallelize(Seq(
("CLDs", "123456")
)).toDF("subComponentName", "CLD_HardVer")
val df_3 = spark.sparkContext.parallelize(Seq(
("ANYs", "123456")
)).toDF("subComponentName", "ANY_HardVer")
I want to write a function that returns a DataFrame with an added column named HardVer, containing the value of whichever of FID_HardVer, CLD_HardVer, or ANY_HardVer exists in that DataFrame.
Example output would look like this:
df_1:
+----------------+-----------+-------+
|subComponentName|FID_HardVer|HardVer|
+----------------+-----------+-------+
| FIDs| 123456| 123456|
+----------------+-----------+-------+
df_2:
+----------------+-----------+-------+
|subComponentName|CLD_HardVer|HardVer|
+----------------+-----------+-------+
| CLDs| 123456| 123456|
+----------------+-----------+-------+
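That is, calling the function on each DataFrame (with the signature from my attempt below) should produce the tables above:

addHardVer(spark, df_1).show()
addHardVer(spark, df_2).show()
addHardVer(spark, df_3).show()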
This is the code I have tried up until now, but it seems Spark cannot handle this kind of expression: the analyzer resolves every column referenced in the when chain, even in branches whose condition can never match for the given DataFrame.
def addHardVer(spark: SparkSession, df: DataFrame): DataFrame = {
  import spark.implicits._
  import org.apache.spark.sql.functions.{lit, when}

  df.withColumn("HardVer",
    when($"subComponentName" === "FIDs", $"FID_HardVer")
      .when($"subComponentName" === "CLDs", $"CLD_HardVer")
      .when($"subComponentName" === "ANYs", $"ANY_HardVer")
      .otherwise(lit("unknown"))
  )
}
This throws an exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'CLD_HardVer' given input columns: [subComponentName, FID_HardVer];;
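For what it's worth, the direction I am experimenting with is to check df.columns before building the expression, so that the when chain only references columns that actually exist. A minimal sketch (the mapping and the helper name addHardVerSafe are my own, not from the Spark API):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Build the when chain only from the *_HardVer columns that are actually
// present in df, so the analyzer never sees an unresolvable column name.
def addHardVerSafe(df: DataFrame): DataFrame = {
  val mapping = Seq(
    "FIDs" -> "FID_HardVer",
    "CLDs" -> "CLD_HardVer",
    "ANYs" -> "ANY_HardVer"
  )
  val hardVer = mapping
    .filter { case (_, colName) => df.columns.contains(colName) }
    .foldLeft(lit("unknown")) { case (fallback, (name, colName)) =>
      when(col("subComponentName") === name, col(colName)).otherwise(fallback)
    }
  df.withColumn("HardVer", hardVer)
}

On df_2 this would only reference CLD_HardVer, which I believe avoids the AnalysisException above, but I am not sure whether this is the idiomatic way to do it.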
question from:
https://stackoverflow.com/questions/66050330/how-can-i-select-a-column-dependent-of-a-different-columns-content-or-the-name-o