I have a PySpark dataframe:
values = [('Lacoste', 'Red', 6, 30), ('Gap', 'Orange', 8, None), ('Lacoste', 'Green', 5, 200),
('Gap', 'Red', 3, None), ('Gap', 'Orange', 5, None), ('Lacoste', 'Green', 3, 150),
('Lacoste', 'Orange', 9, 40), ('Lacoste', 'Red', 4, 70), ('Gap', 'Green', None, 15),
('Lacoste', 'Red', None, 50), ('Gap', 'Orange', 5, 17), ('Lacoste', 'Green', None, 40),
('Banana Republic', 'Orange', None, None)]
ratings = spark.createDataFrame(values, ['Brand', 'Color', 'Rating', 'Price'])
ratings.show()
#+---------------+------+------+-----+
#| Brand| Color|Rating|Price|
#+---------------+------+------+-----+
#| Lacoste| Red| 6| 30|
#| Gap|Orange| 8| null|
#| Lacoste| Green| 5| 200|
#| Gap| Red| 3| null|
#| Gap|Orange| 5| null|
#| Lacoste| Green| 3| 150|
#| Lacoste|Orange| 9| 40|
#| Lacoste| Red| 4| 70|
#| Gap| Green| null| 15|
#| Lacoste| Red| null| 50|
#| Gap|Orange| 5| 17|
#| Lacoste| Green| null| 40|
#|Banana Republic|Orange| null| null|
#+---------------+------+------+-----+
EDITED:
I would like to fill all NAs with the median based on brand and color, and then, where that is still null, with the median based on brand alone. The result should be that the only row with remaining nulls is the Banana Republic row (since there is no other Banana Republic row to take a brand or brand/color median from). The first answer nearly gets me there, but as you can see, I mistakenly hard-coded a column name; I would like this to iterate through the list of column names.
# Assign median based on the brand and color combination
median_columns = ['Rating', 'Price']
from pyspark.sql import Window
import pyspark.sql.functions as f

for item in median_columns:
    brand_window = Window.partitionBy('Brand')
    brand_color_window = Window.partitionBy('Brand', 'Color')
    brand_color_median = f.expr("percentile_approx('item', 0.5)")
    ratings = ratings.withColumn(item,
                                 f.coalesce(f.col(item),
                                            brand_color_median.over(brand_color_window),
                                            brand_color_median.over(brand_window)))
ratings.show()
#+---------------+------+------+-----+
#| Brand| Color|Rating|Price|
#+---------------+------+------+-----+
#| Gap| Green| null| 15.0|
#| Gap|Orange| 5.0| null|
#| Gap|Orange| 5.0| 17.0|
#| Gap|Orange| 8.0| null|
#| Gap| Red| 3.0| null|
#| Lacoste| Green| 5.0|200.0|
#| Lacoste| Green| null| 40.0|
#| Lacoste| Green| 3.0|150.0|
#| Lacoste|Orange| 9.0| 40.0|
#| Lacoste| Red| null| 50.0|
#| Lacoste| Red| 6.0| 30.0|
#| Lacoste| Red| 4.0| 70.0|
#|Banana Republic|Orange| null| null|
#+---------------+------+------+-----+
The nulls don't get overwritten. What am I missing?
question from:
https://stackoverflow.com/questions/65924104/pyspark-iteratively-and-conditionally-compute-median-fill-nas