Welcome To Ask or Share your Answers For Others

python - PySpark categorical data vectorization with numerical values associated with it

1 Answer

answered Oct 7, 2021 by 深蓝 (71.8m points)

You can group by UserID and aggregate the Quantity column into an array:

import pyspark.sql.functions as F

df2 = df.groupBy('UserID').agg(F.collect_list('Quantity').alias('Quantity'))

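However, collect_list does not guarantee the order of the collected values, so the quantities may not line up with the same fruits from one user to the next. To get a deterministic order, collect (fruit, quantity) pairs, sort them by fruit name, and keep only the quantity. Using struct rather than array for the pair keeps Quantity numeric; array would coerce both fields to a common type, typically string:

df2 = df.groupBy('UserID').agg(
    # sort the (fruit, quantity) structs by fruit name, then extract the quantity
    F.expr("transform(array_sort(collect_list(struct(`Fruit Purchased`, Quantity))), x -> x.Quantity) AS Quantity")
)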

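Or you can pivot on the fruit column instead. The pivoted columns come out in the same order for every user, so each array has one slot per fruit, and a fruit the user never bought shows up as null: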

df2 = df.groupBy('UserID').pivot('Fruit Purchased').agg(F.first('Quantity'))
df3 = df2.select('UserID', F.array([c for c in df2.columns[1:]]).alias('Quantity'))
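
To make the difference between the last two approaches concrete, here is a plain-Python sketch (no Spark needed) of what each one is expected to produce. The sample rows and the helper names sorted_collect and pivot_collect are made up for illustration, and the sketch assumes the pivot columns are ordered by fruit name:

```python
from collections import defaultdict

# Hypothetical sample rows: (UserID, Fruit Purchased, Quantity)
rows = [
    (1, "Orange", 2),
    (1, "Apple", 5),
    (2, "Banana", 1),
    (2, "Apple", 3),
    (2, "Orange", 4),
]

def sorted_collect(rows):
    """Mimic the sorted collect_list: per user, quantities ordered by fruit name."""
    by_user = defaultdict(list)
    for user, fruit, qty in rows:
        by_user[user].append((fruit, qty))
    # Sorting the (fruit, qty) pairs orders them by fruit name first
    return {user: [qty for _, qty in sorted(pairs)] for user, pairs in by_user.items()}

def pivot_collect(rows):
    """Mimic the pivot: one slot per distinct fruit, None where a user bought none."""
    fruits = sorted({fruit for _, fruit, _ in rows})
    by_user = defaultdict(dict)
    for user, fruit, qty in rows:
        by_user[user].setdefault(fruit, qty)  # keep the first value seen, like F.first
    return {user: [bought.get(f) for f in fruits] for user, bought in by_user.items()}

sorted_result = sorted_collect(rows)
pivot_result = pivot_collect(rows)

print(sorted_result)  # {1: [5, 2], 2: [3, 1, 4]}
print(pivot_result)   # {1: [5, None, 2], 2: [3, 1, 4]}
```

User 1 never bought a Banana, so the sorted version gives a shorter list whose second position means Orange, while the pivot version keeps a fixed position for every fruit. If you need fixed-length vectors, the pivot form is usually the easier starting point, and you can replace the nulls with 0 (for example with df2.fillna(0)) before building the array.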


...