Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

python - How to take a column of lists of dictionary values and create new columns using their values (not keys)

I'm analyzing political advertisements from Facebook, using a dataset released by ProPublica.

There's an entire column of 'targets' that I want to analyze, but it's formatted such that every observation is a list of dicts in string form (e.g. "[{k1: v1}, {k2: v2}]").

import pandas as pd

data = {0: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]', 1: '[{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]', 2: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]', 3: '[]', 4: '[{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]'}

df = pd.DataFrame.from_dict(data, orient='index', columns=['targets'])

# display(df)
                                                                                                                                                                                                                                                                            targets
0                                                   [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]
1                                                 [{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]
2                                                                                                                               [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]
3                                                                                                                                                                                                                                                                                []
4  [{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]

I need each "target" value to become a column header, with the corresponding "segment" value as the value in that column.

Or is the solution to write a function that reads each dictionary key within each row and counts its frequency?

This is what it's supposed to look like as the output:

            Age MinAge                                   Retargeting             Region  ...                          Interest Location Granularity            Country Gender
0  21 and older     21  people who may be similar to their customers  the United States  ...                               NaN                  NaN                NaN    NaN
1  18 and older     18                                           NaN                NaN  ...  Republican Party (United States)              country  the United States    NaN
2  18 and older     18                                           NaN                NaN  ...                               NaN              country  the United States  women

Someone on Reddit posted this solution, but it fails with a KeyError:

import json

for id,row in enumerate(df.targets):
    for d in json.loads(row):
        df.loc[id,d['target']] = d['segment']

df = df.drop(columns=['targets'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-53-339ae1670258> in <module>
      2 for id,row in enumerate(df.targets):
      3     for d in json.loads(row):
----> 4         df.loc[id,d['target']] = d['segment']
      5 
      6 df = df.drop(columns=['targets'])

KeyError: 'segment'
Question from: https://stackoverflow.com/questions/65623631/how-to-take-a-column-of-lists-of-dictionary-values-and-create-new-columns-using


1 Answer

  • def fix() is not vectorized; even so, it takes only 591 ms to apply to the 222186 rows in the file.
  • Replace NaN in the column using .fillna(); otherwise literal_eval raises ValueError: malformed node or string: nan
  • Replace 'null' with 'None'; otherwise literal_eval raises ValueError: malformed node or string: <_ast.Name object at 0x000002219927A0A0>
  • The values in the 'targets' rows are all str type, which can be converted to lists with ast.literal_eval.
  • def fix() iterates through the dicts in each list and uses only their values to build key-value pairs, thereby converting each list of dicts to a single dict. Dicts that do not have exactly two values (e.g. {"target": "List"}) are skipped.
    • Empty lists are replaced with empty dicts, which is required for .json_normalize() to work on the column.
  • pandas.json_normalize() can then easily be used on the column.
  • Also see How to split a pandas column with a list of dicts into separate columns for each key for an alternate approach with the same data.
    • As shown in the accepted answer there, it's actually easier to get all the unique counts when the 'targets' column is expanded long-wise (tidy format), using .groupby and aggregating with .count().
import pandas as pd
from ast import literal_eval

# load the file
df = pd.read_csv('en-US.csv')

# replace NaNs with '[]', otherwise literal_eval will error
df.targets = df.targets.fillna('[]')

# replace null with None, otherwise literal_eval will error
df.targets = df.targets.str.replace('null', 'None')

# convert the strings to lists of dicts
df.targets = df.targets.apply(literal_eval)

# function to transform the list of dicts in each row
def fix(col):
    dd = dict()
    for d in col:
        values = list(d.values())
        if len(values) == 2:
            dd[values[0]] = values[1]
    return dd

# apply the function to targets
df.targets = df.targets.apply(fix)

# display(df.targets.head())
                                                                                                                                  targets
0     {'Age': '18 and older', 'MinAge': '18', 'Segment': 'Multicultural affinity: African American (US).', 'Region': 'the United States'}
1   {'Age': '45 and older', 'MinAge': '45', 'Retargeting': 'people who may be similar to their customers', 'Region': 'the United States'}
2                                                                              {'Age': '18 and older', 'MinAge': '18', 'Region': 'Texas'}
3                                                                                                                                      {}
4  {'Interest': 'The Washington Post', 'Gender': 'men', 'Age': '34 to 49', 'MinAge': '34', 'MaxAge': '49', 'Region': 'the United States'}

# normalize the targets column
normalized = pd.json_normalize(df.targets)

# join normalized back to df if desired
df = df.join(normalized).drop(columns=['targets'])
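
The two preprocessing steps above (.fillna('[]') and the 'null' to 'None' replacement) exist because literal_eval only accepts strings containing Python literals. The following is a minimal sketch that reproduces both failure modes without the CSV file. The helper name try_eval and the sample string with a null segment are made up for illustration; they are not taken from the dataset.

```python
from ast import literal_eval


def try_eval(value):
    """Return the parsed value, or the exception class name if literal_eval rejects it."""
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__


# A missing cell is read by pandas as a float NaN, which is not a string literal.
nan_result = try_eval(float("nan"))

# Hypothetical row: JSON's null is not a Python literal, so it is rejected as well.
null_result = try_eval('[{"target": "List", "segment": null}]')

# After replacing null with None, the same string parses to a list of dicts.
fixed_result = try_eval('[{"target": "List", "segment": None}]')

print(nan_result, null_result, fixed_result)
```

Note that a plain string replacement also rewrites any 'null' that happens to occur inside a segment's text, so it may be worth checking the data for such values first.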

normalized wide-format, for the sample data

# display(normalized.head())
            Age MinAge                                         Segment             Region                                   Retargeting             Interest Gender MaxAge
0  18 and older     18  Multicultural affinity: African American (US).  the United States                                           NaN                  NaN    NaN    NaN
1  45 and older     45                                             NaN  the United States  people who may be similar to their customers                  NaN    NaN    NaN
2  18 and older     18                                             NaN              Texas                                           NaN                  NaN    NaN    NaN
3           NaN    NaN                                             NaN                NaN                                           NaN                  NaN    NaN    NaN
4      34 to 49     34                                             NaN  the United States                                           NaN  The Washington Post    men     49
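
As a possible alternative to literal_eval, the strings in the sample data are valid JSON, so json.loads can parse them directly and it maps null to None without any string replacement. The sketch below swaps in json.loads and looks values up by key rather than by position; it is not the method used above. It also shows why the loop in the question raised KeyError: 'segment', since the entry {"target": "List"} has no "segment" key. The helper name to_flat_dict is hypothetical, the third row is a shortened copy of the question's last row, and whether every row in the full file is valid JSON is an assumption.

```python
import json

import pandas as pd

# Rows 2 and 3 from the question's sample data, plus a shortened copy of row 4.
data = {
    0: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, '
       '{"target": "Region", "segment": "Texas"}, {"target": "List"}]',
    1: '[]',
    2: '[{"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}]',
}
df = pd.DataFrame.from_dict(data, orient='index', columns=['targets'])


def to_flat_dict(raw):
    """Parse one JSON string and map each target name to its segment."""
    if not isinstance(raw, str):  # a missing cell arrives as a float NaN
        return {}
    flat = {}
    for d in json.loads(raw):
        # Entries such as {"target": "List"} have no "segment"; skip them
        # instead of indexing d['segment'], which is what raised the KeyError.
        if 'segment' in d:
            flat[d['target']] = d['segment']
    return flat


# Build one dict per row, then expand the dicts into columns.
wide = pd.json_normalize(df['targets'].apply(to_flat_dict).tolist())
print(wide)
```

Skipping entries without a segment mirrors the len(values) == 2 check in fix(); if those targets matter for the analysis, storing a placeholder value instead would keep them as columns.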

normalized wide-format, for the full dataset

  • As the .info() output shows, the targets column contains a number of different keys, but not every row contains every key, so there are many NaNs.
  • To get the unique value counts in this wide format, count one column at a time, e.g. normalized.Age.value_counts().
normalized.info()
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222186 entries, 0 to 222185
Data columns (total 26 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   Age                              157816 non-null  object
 1   MinAge                           156531 non-null  object
 2   Segment                          12288 non-null   object
 3   Region                           111638 non-null  object
 4   Retargeting                      39286 non-null   object
 5   Interest                         31514 non-null   object
 6   Gender                           7194 non-null    object
 7   MaxAge                           7767 non-null    object
 8   City                             23685 non-null   object
 9   State                            23685 non-null   object
 10  Website                          6235 non-null    object
 11  Language                         2584 non-null    object
 12  Audience Owner                   17859 non-null   object
 13  Location Granularity             29770 non-null   object
 14  Location Type                    29770 non-null   object
 15  Agency                           400 non-null     object
 16  List                             5034 non-null    object
 17  Custom Audience Match Key        1144 non-null    object
 18  Mobile App                       50 non-null      object
 19  Country                          22118 non-null   object
 20  Activity on the Facebook Family  3382 non-null    object
 21  Like                             855 non-null     object
 22  Education                        151 non-null     object
 23  Job Title                        15 non-null      object
 24  Relationship Status              22 non-null      object
 25  Employer                         4 non-null       object
dtypes: object(26)
memory usage: 44.1+ MB
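
In the wide format, counts are taken one column at a time, as with normalized.Age.value_counts(). The long (tidy) layout mentioned at the top of this answer gives all counts in one pass. The following is a rough sketch of that idea on a small hypothetical stand-in for the parsed targets column; the sample rows and the names parsed, exploded, long_df and counts are assumptions for illustration, and it uses .size() on the groups rather than .count().

```python
import pandas as pd

# Hypothetical stand-in for the parsed 'targets' column: one list of dicts per ad.
parsed = pd.Series([
    [{"target": "Age", "segment": "18 and older"}, {"target": "Region", "segment": "Texas"}],
    [],
    [{"target": "Age", "segment": "18 and older"}, {"target": "Gender", "segment": "men"}],
    [{"target": "Age", "segment": "34 to 49"}, {"target": "List"}],
], name="targets")

# One row per dict; an empty list becomes NaN and is dropped.
exploded = parsed.explode().dropna()

# Expand each dict into 'target' and 'segment' columns, keeping the ad's index.
long_df = pd.DataFrame(exploded.tolist(), index=exploded.index)

# Count every (target, segment) pair at once; rows with a missing segment
# are left out because groupby drops NaN keys by default.
counts = long_df.groupby(["target", "segment"]).size()
print(counts)
```

Because the original index is kept, long_df can still be joined back to the other columns of the ads table when a count needs to be broken down further.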
