Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - PySpark categorical data vectorization with associated numerical values

I'm new to PySpark programming and need some help.

I have a dataset with a categorical feature and a numerical value associated with each category. I would like to vectorize the categorical feature so that the vector carries the associated numerical values. The categorical column has about 3 million possible values.

Question from: https://stackoverflow.com/questions/65837384/pyspark-categorical-data-vectorization-with-numerical-values-associated-with-it

1 Answer

You can group by UserID and aggregate the Quantity column into an array:

import pyspark.sql.functions as F

df2 = df.groupBy('UserID').agg(F.collect_list('Quantity').alias('Quantity'))

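However, collect_list does not guarantee the order of the elements within each group, so the quantities may not follow a consistent fruit order. To get a deterministic order, collect (fruit, quantity) pairs, sort them by fruit, and then keep only the quantities: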

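df2 = df.groupBy('UserID').agg(
    # Sort (fruit, quantity) structs by fruit, then keep only the quantity.
    # struct() keeps Quantity's own type; array() needs one common element type.
    F.expr("transform(array_sort(collect_list(struct(`Fruit Purchased`, Quantity))), x -> x.Quantity) AS Quantity")
)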
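
To make the sort-then-project step concrete, here is a minimal plain-Python model of what that aggregation computes (no Spark involved). The sample rows and the helper name sorted_quantities are made up for illustration; only the column names come from the code above.

```python
from collections import defaultdict

# Hypothetical sample rows: (UserID, Fruit Purchased, Quantity).
rows = [
    (1, "Orange", 3),
    (2, "Apple", 5),
    (1, "Apple", 2),
    (2, "Banana", 1),
]

def sorted_quantities(rows):
    """Group by user, sort each user's (fruit, quantity) pairs by fruit,
    then keep only the quantities."""
    pairs_by_user = defaultdict(list)
    for user_id, fruit, quantity in rows:
        pairs_by_user[user_id].append((fruit, quantity))
    return {
        user_id: [quantity for _, quantity in sorted(pairs)]
        for user_id, pairs in pairs_by_user.items()
    }

result = sorted_quantities(rows)
print(result)  # {1: [2, 3], 2: [5, 1]}
```

Note that the arrays are ordered by fruit within each user, but a given position does not mean the same fruit across users: the second entry is Orange for user 1 and Banana for user 2.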

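Alternatively, you can pivot, which also fixes the order of fruits. A pivot creates one column per distinct fruit, so it only suits a modest number of categories (Spark caps the count via spark.sql.pivotMaxValues, 10000 by default):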

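# One column per distinct fruit; missing (user, fruit) pairs become null.
df2 = df.groupBy('UserID').pivot('Fruit Purchased').agg(F.first('Quantity'))
# Pack the fruit columns (everything after UserID) into a single array column.
df3 = df2.select('UserID', F.array(*df2.columns[1:]).alias('Quantity'))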
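
As a rough illustration of the layout the pivot produces, the following plain-Python sketch (again without Spark) builds one slot per distinct fruit and leaves None where a user has no purchase. The sample rows and the helper name pivot_quantities are hypothetical, and the "first value wins" rule only approximates F.first for data without nulls.

```python
# Hypothetical sample rows: (UserID, Fruit Purchased, Quantity).
rows = [
    (1, "Orange", 3),
    (2, "Apple", 5),
    (1, "Apple", 2),
    (2, "Banana", 1),
]

def pivot_quantities(rows):
    """Give every distinct fruit a fixed slot (sorted by name) and fill each
    user's vector; slots with no purchase stay None, like nulls after a pivot."""
    fruits = sorted({fruit for _, fruit, _ in rows})
    slot = {fruit: i for i, fruit in enumerate(fruits)}
    vectors = {}
    for user_id, fruit, quantity in rows:
        vec = vectors.setdefault(user_id, [None] * len(fruits))
        # Keep the first quantity seen for a (user, fruit) pair.
        if vec[slot[fruit]] is None:
            vec[slot[fruit]] = quantity
    return fruits, vectors

fruits, vectors = pivot_quantities(rows)
print(fruits)   # ['Apple', 'Banana', 'Orange']
print(vectors)  # {1: [2, None, 3], 2: [5, 1, None]}
```

Here every position refers to the same fruit for all users, at the cost of one slot per distinct fruit in every row.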
