Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

python - Working with pandas suffixes after merging dfs

General problem: I am performing a merge on two dataframes that sometimes have the same name for their date columns. Later, I have to perform operations using the two date columns. To access the date columns in df_merged, I am using a lot of extra code. I am looking for a more elegant solution.

Now, in detail: I have two dataframes. They each have a date column, whose name can be modified. Df1:

import pandas as pd

date1_column_name = 'date'
d1 = {'id': [1, 1, 2], date1_column_name: [pd.Timestamp('2020-10-16'), pd.Timestamp('2020-10-18'), pd.Timestamp('2020-12-4')], 'var1': [1, 2, 2]}
df1 = pd.DataFrame(data=d1)
...
>> df1 
   id        date  var1
0   1  2020-10-16     1
1   1  2020-10-18     2
2   2  2020-12-04     2

Df2:

date2_column_name = 'date'
d2 = {'id': [1, 1, 2, 2], date2_column_name: [pd.Timestamp('2020-10-11'), pd.Timestamp('2020-10-18'), pd.Timestamp('2020-12-3'), pd.Timestamp('2020-12-13')], 'var2': [1, 2, 2, 5]}
df2 = pd.DataFrame(data=d2)
...
>> df2 
   id        date  var2
0   1  2020-10-11     1
1   1  2020-10-18     2
2   2  2020-12-03     2
3   2  2020-12-13     5              

I am performing an outer merge on df1 and df2 on the column id:

df_merged = pd.merge(df1, df2, on='id', how='outer', suffixes=('_x', '_y')).fillna(pd.NA)
...
>> df_merged
   id      date_x  var1      date_y  var2
0   1  2020-10-16     1  2020-10-11     1
1   1  2020-10-16     1  2020-10-18     2
2   1  2020-10-18     2  2020-10-11     1
3   1  2020-10-18     2  2020-10-18     2
4   2  2020-12-04     2  2020-12-03     2
5   2  2020-12-04     2  2020-12-13     5
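
The date_x and date_y columns in the output above appear only because both inputs use the same name for their date column. A minimal sketch of that behavior, using small made-up frames (`left`, `right_same`, `right_other` and the column name `date2` are illustrative, not from the question):

```python
import pandas as pd

# Illustrative frames; the names and values are assumptions for this sketch.
left = pd.DataFrame({'id': [1, 2],
                     'date': pd.to_datetime(['2020-10-16', '2020-12-04'])})
right_same = pd.DataFrame({'id': [1, 2],
                           'date': pd.to_datetime(['2020-10-11', '2020-12-03'])})
# Same data, but the date column carries a different name.
right_other = right_same.rename(columns={'date': 'date2'})

# Overlapping non-key column names: the suffixes are appended.
same = pd.merge(left, right_same, on='id', how='outer', suffixes=('_x', '_y'))
# No overlap: the suffixes argument has no effect.
other = pd.merge(left, right_other, on='id', how='outer', suffixes=('_x', '_y'))

print(same.columns.tolist())   # ['id', 'date_x', 'date_y']
print(other.columns.tolist())  # ['id', 'date', 'date2']
```

So whether the merged frame contains date or date_x depends on the inputs, which is why the column-name variables have to be adjusted conditionally.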

If the date columns have the same name, pandas adds the _x and _y suffixes. I now want to work with those two date columns. My problem is that the suffixes make the rest of my workflow messy, because they force me to take several extra steps:

  • First, I add the suffixes manually to the column-name variables and rename the column in df1:
if date1_column_name == date2_column_name:
    df1 = df1.rename(columns={date1_column_name: date1_column_name + '_x'})
    date1_column_name = date1_column_name + '_x'
    date2_column_name = date2_column_name + '_y'   
  • After changing the variables date1_column_name and date2_column_name, I can calculate the time difference between the two date columns:
df_merged['time_difference'] = abs((df_merged[date1_column_name] - df_merged[date2_column_name]).dt.days)
  • Eventually, I need to drop the date_y column anyway, so I end up stripping the _x suffix too (mainly for aesthetics). Note that this is why I renamed the column in df1: otherwise I couldn't use df1.columns, since it would contain only "date", not "date_x":
variables_of_interest = ['var2']
df_merged = df_merged[df1.columns.to_list() + variables_of_interest]
# Remove only a trailing '_x'; str.rstrip('_x') would strip any trailing '_' or 'x' characters.
df_merged.columns = df_merged.columns.str.replace(r'_x$', '', regex=True)
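
A note on the last step: stripping a suffix from column names is easy to get subtly wrong, because `str.rstrip` treats its argument as a set of characters rather than as a suffix. A small sketch with made-up column names (`max` and `index` are hypothetical, chosen to show the problem):

```python
import pandas as pd

# Hypothetical column names; 'max' and 'index' happen to end in 'x'.
cols = pd.Index(['date_x', 'max', 'index'])

# rstrip removes every trailing character that is '_' or 'x'.
stripped = cols.str.rstrip('_x')
print(stripped.tolist())   # ['date', 'ma', 'inde']

# An anchored regex removes only a literal trailing '_x'.
replaced = cols.str.replace(r'_x$', '', regex=True)
print(replaced.tolist())   # ['date', 'max', 'index']
```

With the columns in this question (id, date_x, var1, var2) both variants give the same result, so the difference only matters once a column name ends in "x" or "_".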

I feel like I am overlooking a way more elegant solution to this. I would greatly appreciate any help to make this code cleaner and most of all, shorter.
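
One possible way to shorten the steps above, offered as a sketch rather than the definitive answer: rename the date column of df2 to a temporary name before merging, so the two date columns never collide and no suffix is ever added. The name `OTHER_DATE` and its value are hypothetical, and keeping time_difference in the final frame is an assumption about the intended result.

```python
import pandas as pd

date1_column_name = 'date'
date2_column_name = 'date'
df1 = pd.DataFrame({
    'id': [1, 1, 2],
    date1_column_name: pd.to_datetime(['2020-10-16', '2020-10-18', '2020-12-04']),
    'var1': [1, 2, 2],
})
df2 = pd.DataFrame({
    'id': [1, 1, 2, 2],
    date2_column_name: pd.to_datetime(
        ['2020-10-11', '2020-10-18', '2020-12-03', '2020-12-13']),
    'var2': [1, 2, 2, 5],
})

# Hypothetical temporary name; it must not clash with any real column.
OTHER_DATE = '_other_date'

# Rename on a copy of df2, so df1, df2 and both name variables stay untouched.
df_merged = df1.merge(
    df2.rename(columns={date2_column_name: OTHER_DATE}),
    on='id',
    how='outer',
)

# Absolute difference in whole days between the two date columns.
df_merged['time_difference'] = (
    df_merged[date1_column_name] - df_merged[OTHER_DATE]
).dt.days.abs()

# Selecting the wanted columns drops the temporary date column implicitly.
variables_of_interest = ['var2']
df_merged = df_merged[
    df1.columns.to_list() + variables_of_interest + ['time_difference']
]
print(df_merged)
```

Because the left-hand column keeps its original name, df1.columns can be used directly for the final selection, and neither the conditional renaming nor the suffix stripping is needed.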

Question from: https://stackoverflow.com/questions/65918893/working-with-pandas-suffixes-after-merging-dfs
