General problem:
I am merging two dataframes that sometimes have the same name for their date columns. Later, I have to perform operations using the two date columns. To access the date columns in df_merged, I end up writing a lot of extra code. I am looking for a more elegant solution.
Now, in detail:
I have two dataframes. Each has a date column whose name can vary, so I keep it in a variable. Df1:
import pandas as pd
date1_column_name = 'date'
d1 = {'id': [1, 1, 2], date1_column_name: [pd.Timestamp('2020-10-16'), pd.Timestamp('2020-10-18'), pd.Timestamp('2020-12-4')], 'var1': [1, 2, 2]}
df1 = pd.DataFrame(data=d1)
...
>> df1
id date var1
0 1 2020-10-16 1
1 1 2020-10-18 2
2 2 2020-12-04 2
Df2:
date2_column_name = 'date'
d2 = {'id': [1, 1, 2, 2], date2_column_name: [pd.Timestamp('2020-10-11'), pd.Timestamp('2020-10-18'), pd.Timestamp('2020-12-3'), pd.Timestamp('2020-12-13')], 'var2': [1, 2, 2, 5]}
df2 = pd.DataFrame(data=d2)
...
>> df2
id date var2
0 1 2020-10-11 1
1 1 2020-10-18 2
2 2 2020-12-03 2
3 2 2020-12-13 5
I am performing an outer merge on df1 and df2 on the column id:
df_merged = pd.merge(df1, df2, on='id', how='outer', suffixes=['_x', '_y']).fillna(pd.NA)
...
>> df_merged
id date_x var1 date_y var2
0 1 2020-10-16 1 2020-10-11 1
1 1 2020-10-16 1 2020-10-18 2
2 1 2020-10-18 2 2020-10-11 1
3 1 2020-10-18 2 2020-10-18 2
4 2 2020-12-04 2 2020-12-03 2
5 2 2020-12-04 2 2020-12-13 5
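The date_x and date_y columns in the output above appear only because both frames use the same non-key column name. A minimal sketch of that behaviour in isolation (the frames `left`, `right_same` and `right_other` are made up for illustration and are not part of my real data):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2],
                     "date": pd.to_datetime(["2020-10-16", "2020-12-04"])})
right_same = pd.DataFrame({"id": [1, 2],
                           "date": pd.to_datetime(["2020-10-11", "2020-12-03"])})
# Same data, but the date column has a different name.
right_other = right_same.rename(columns={"date": "date2"})

# Overlapping non-key column name: both sides receive a suffix.
cols_same = pd.merge(left, right_same, on="id", how="outer",
                     suffixes=("_x", "_y")).columns.tolist()

# No overlap: the suffixes argument has no effect.
cols_other = pd.merge(left, right_other, on="id", how="outer",
                      suffixes=("_x", "_y")).columns.tolist()

print(cols_same)   # ['id', 'date_x', 'date_y']
print(cols_other)  # ['id', 'date', 'date2']
```

So whether the suffixes show up depends entirely on whether the two names collide, which is why the column-name variables have to be adjusted conditionally.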
If the date columns have the same names, pandas adds the _x and _y suffixes. I now want to work with those two date columns. My problem is that the suffixes make my further work process really messy since it makes me take several extra steps:
- First, I add the suffixes manually to the column-name variables and rename the column in df1:
if date1_column_name == date2_column_name:
    df1 = df1.rename(columns={date1_column_name: date1_column_name + '_x'})
    date1_column_name = date1_column_name + '_x'
    date2_column_name = date2_column_name + '_y'
- After changing the variables date1_column_name and date2_column_name, I can calculate the time difference between the two date columns:
df_merged['time_difference'] = abs((df_merged[date1_column_name] - df_merged[date2_column_name]).dt.days)
- Eventually, I need to drop the date_y column anyway, so I end up stripping the _x suffix too (mainly for aesthetics). Note that this is why I renamed the column in df1: otherwise I couldn't use df1.columns, since it would contain "date" rather than "date_x":
variables_of_interest = ['var2']
df_merged = df_merged[df1.columns.to_list() + variables_of_interest]
df_merged.columns = df_merged.columns.str.replace(r'_x$', '', regex=True) # strip _x suffix (rstrip('_x') would remove any trailing '_' or 'x' characters)
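A side note on the suffix-stripping step: `str.rstrip('_x')` treats its argument as a set of characters rather than a literal suffix, so it can damage unrelated column names. A small sketch with made-up column names (`index_x` and `max` are hypothetical and only there to show the effect):

```python
import pandas as pd

cols = pd.Index(["id", "date_x", "index_x", "max"])

# rstrip removes every trailing '_' or 'x' character, not the suffix "_x".
stripped_chars = cols.str.rstrip("_x").tolist()

# An anchored regex removes only a literal "_x" at the end of the name.
stripped_suffix = cols.str.replace(r"_x$", "", regex=True).tolist()

print(stripped_chars)   # ['id', 'date', 'inde', 'ma']
print(stripped_suffix)  # ['id', 'date', 'index', 'max']
```

With only id, date_x, var1 and var2 in the frame both variants give the same result, but the regex version seems safer once other column names are involved.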
I feel like I am overlooking a way more elegant solution to this. I would greatly appreciate any help to make this code cleaner and most of all, shorter.
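One direction I could imagine, shown only as a rough sketch and not as a confirmed solution: rename the date column of df2 to a temporary name just for the merge, so that no suffixes are created in the first place. The name `tmp_date2` is hypothetical and assumed not to clash with any existing column. The sketch also assumes the date column is the only non-key column the two frames share.

```python
import pandas as pd

date1_column_name = "date"
date2_column_name = "date"

df1 = pd.DataFrame({
    "id": [1, 1, 2],
    date1_column_name: pd.to_datetime(["2020-10-16", "2020-10-18", "2020-12-04"]),
    "var1": [1, 2, 2],
})
df2 = pd.DataFrame({
    "id": [1, 1, 2, 2],
    date2_column_name: pd.to_datetime(
        ["2020-10-11", "2020-10-18", "2020-12-03", "2020-12-13"]),
    "var2": [1, 2, 2, 5],
})

# Hypothetical temporary name, assumed to be unused in both frames.
tmp_date2 = "_date2_tmp"

# Rename only for the merge; df1 and df2 themselves stay unchanged.
df_merged = df1.merge(
    df2.rename(columns={date2_column_name: tmp_date2}),
    on="id",
    how="outer",
)

# The original column-name variable for df1 is still valid after the merge.
df_merged["time_difference"] = (
    df_merged[date1_column_name] - df_merged[tmp_date2]
).abs().dt.days

# Drop the temporary date column; nothing needs to be stripped afterwards.
df_merged = df_merged.drop(columns=tmp_date2)

print(df_merged.columns.tolist())
# ['id', 'date', 'var1', 'var2', 'time_difference']
```

This keeps date1_column_name usable as-is and avoids both the conditional renaming and the suffix stripping, at the cost of having to pick a temporary name that is guaranteed to be free.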
question from:
https://stackoverflow.com/questions/65918893/working-with-pandas-suffixes-after-merging-dfs