I have a PySpark dataframe:
values = [('Lacoste', 'Red', 6, 30), ('Gap', 'Orange', 8, None), ('Lacoste', 'Green', 5, 200),
('Gap', 'Red', 3, None), ('Gap', 'Orange', 5, None), ('Lacoste', 'Green', 3, 150),
('Lacoste', 'Orange', 9, 40), ('Lacoste', 'Red', 4, 70), ('Gap', 'Green', None, 15),
('Lacoste', 'Red', None, 50), ('Gap', 'Orange', 5, 17), ('Lacoste', 'Green', None, 40),
('Banana Republic', 'Orange', None, None)]
ratings = spark.createDataFrame(values, ['Brand', 'Color', 'Rating', 'Price'])
ratings.show()
#+---------------+------+------+-----+
#| Brand| Color|Rating|Price|
#+---------------+------+------+-----+
#| Lacoste| Red| 6| 30|
#| Gap|Orange| 8| null|
#| Lacoste| Green| 5| 200|
#| Gap| Red| 3| null|
#| Gap|Orange| 5| null|
#| Lacoste| Green| 3| 150|
#| Lacoste|Orange| 9| 40|
#| Lacoste| Red| 4| 70|
#| Gap| Green| null| 15|
#| Lacoste| Red| null| 50|
#| Gap|Orange| 5| 17|
#| Lacoste| Green| null| 40|
#|Banana Republic|Orange| null| null|
#+---------------+------+------+-----+
EDITED:
I would like to fill all NAs with the median based on brand and color, and then, where that is still null, with the median based on brand alone. The result should be that the only row with remaining nulls is the Banana Republic row (since there is no other Banana Republic row to take a brand or brand/color median from). The first answer nearly gets me there, but as you can see, I mistakenly hard-coded a column name; I would like this to iterate through the list of column names.
# Assign median based on the brand and color combination
median_columns = ['Rating', 'Price']
from pyspark.sql import Window
import pyspark.sql.functions as f

for item in median_columns:
    brand_window = Window.partitionBy('Brand')
    brand_color_window = Window.partitionBy('Brand', 'Color')
    brand_color_median = f.expr("percentile_approx('item', 0.5)")
    ratings = ratings.withColumn(item,
                                 f.coalesce(f.col(item),
                                            brand_color_median.over(brand_color_window),
                                            brand_color_median.over(brand_window)))
ratings.show()
#+---------------+------+------+-----+
#| Brand| Color|Rating|Price|
#+---------------+------+------+-----+
#| Gap| Green| null| 15.0|
#| Gap|Orange| 5.0| null|
#| Gap|Orange| 5.0| 17.0|
#| Gap|Orange| 8.0| null|
#| Gap| Red| 3.0| null|
#| Lacoste| Green| 5.0|200.0|
#| Lacoste| Green| null| 40.0|
#| Lacoste| Green| 3.0|150.0|
#| Lacoste|Orange| 9.0| 40.0|
#| Lacoste| Red| null| 50.0|
#| Lacoste| Red| 6.0| 30.0|
#| Lacoste| Red| 4.0| 70.0|
#|Banana Republic|Orange| null| null|
#+---------------+------+------+-----+
The nulls don't get overwritten. What am I missing?
question from:
https://stackoverflow.com/questions/65924104/pyspark-iteratively-and-conditionally-compute-median-fill-nas