You can group by UserID and aggregate the Quantity column into an array:
import pyspark.sql.functions as F
df2 = df.groupBy('UserID').agg(F.collect_list('Quantity').alias('Quantity'))
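For example, with a toy DataFrame (a sketch; the data and Spark session setup are assumptions, only the column names come from the snippets in this answer):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the column names used above
df = spark.createDataFrame(
    [(1, 'apple', 2), (1, 'banana', 3), (2, 'apple', 1)],
    ['UserID', 'Fruit Purchased', 'Quantity'],
)
df.groupBy('UserID').agg(F.collect_list('Quantity').alias('Quantity')).show()
# e.g. UserID 1 -> [2, 3], UserID 2 -> [1]
# (element order within each array is not guaranteed)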
But collect_list does not guarantee the order of the collected elements, so the quantities may not line up across users. To get a deterministic order, you can sort by fruit name before extracting the quantities:
df2 = df.groupBy('UserID').agg(
    # collect (fruit, quantity) structs, sort by fruit name, then keep only the quantity;
    # struct is used instead of array so Quantity is not coerced to string
    F.expr("transform(array_sort(collect_list(struct(`Fruit Purchased`, Quantity))), x -> x.Quantity) Quantity")
)
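The same logic can be written with DataFrame functions instead of a SQL string (a sketch, assuming PySpark 3.1+ for F.transform with a Python lambda):

df2 = df.groupBy('UserID').agg(
    F.transform(
        # sort the (fruit, quantity) structs by fruit name, then drop the fruit
        F.array_sort(F.collect_list(F.struct('Fruit Purchased', 'Quantity'))),
        lambda x: x['Quantity'],
    ).alias('Quantity')
)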
Alternatively, you can pivot, which creates one column per fruit in alphabetical order, so the resulting array order is deterministic:
df2 = df.groupBy('UserID').pivot('Fruit Purchased').agg(F.first('Quantity'))
df3 = df2.select('UserID', F.array(df2.columns[1:]).alias('Quantity'))
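One caveat: the pivot leaves a null in a fruit's column for every user who never bought that fruit, so the arrays can contain nulls. If a zero default fits your data (an assumption), you can fill before building the array:

# Replace pivot-induced nulls with 0 (assumed to be an acceptable default)
df2 = df2.na.fill(0)
df3 = df2.select('UserID', F.array(df2.columns[1:]).alias('Quantity'))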