I have some data that looks like:
df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104], "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'],
"val": [9, 2, 4, 7, 6, 3, 2],
"dates": [pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)]})
I want group this data into IDs and var1 where the dates are consecutive, if a day is missed I want to start a new record.
For example the final output should be:
df_end_result = pd.DataFrame({"id": [102, 102, 103, 103, 104], "var1": ['a', 'b', 'b', 'a', 'c'],
"val": [13, 2, 13, 3, 2],
"start_date": [pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)],
"end_date": [pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)]})
I have tried this a few ways and keep failing, the length of time that something can exist for is unknown and the possible number of var1 can change with each id and with date window as well.
For example I have tried to identify consecutive days like this, but it always returns ['count_days'] == 0 (clearly something is wrong!). Then I thought I could take date(min) and date(min)+count_days to get 'start_date' and 'end_date'
s = df_raw_dates.groupby(['id','var1']).dates.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['count_days'] = np.where(s1, s1.groupby(df.id).cumsum(), 0)
I have also tried:
df = df_raw_dates.groupby(['id', 'var1']).agg({'val': 'sum', 'date': ['first', 'last']}).reset_index()
Which gets me closer, but I don't think this deals with the consecutive days problem but instead provides the earliest and latest day which unfortunately isn't something that I can take forward.
EDIT: adding more context
Another approach is:
df = df_raw_dates.groupby(['id', 'dates']).size().reset_index().rename(columns={0: 'del'}).drop('del', axis=1)
which provides a list of ids and dates, but I am getting stuck with finding min max consecutive dates within this new window
Extended example that has a break in the date range for group (102,'a')
.
df_raw_dates = pd.DataFrame(
{
"id": [102, 102, 102, 103, 103, 103, 104, 102, 102, 102, 102, 108, 108],
"var1": ["a", "b", "a", "b", "b", "a", "c", "a", "a", "a", "a", "a", "a"],
"val": [9, 2, 4, 7, 6, 3, 2, 1, 2, 3, 4, 99, 99],
"dates": [
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 7),
pd.Timestamp(2020, 1, 8),
pd.Timestamp(2020, 1, 9),
pd.Timestamp(2020, 1, 21),
pd.Timestamp(2020, 1, 25),
],
}
)
Further example
This is using the anwser below from wwii
import pandas as pd
import collections
df_raw_dates1 = pd.DataFrame(
{
"id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
"var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
"val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
"dates": [
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18)
],
}
)
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k,g in gb:
# print(g)
eyed, var1 = k
dt = g['dates']
in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
filt = g.loc[in_block]
breaks = filt['dates'].diff() != day
groups = breaks.cumsum()
date_groups = g.groupby(groups)
# print(k,groups,groups.any())
# accomodate groups with only one date
if not groups.any():
new_df['id'].append(eyed)
new_df['var1'].append(var1)
new_df['val'].append(g.val.sum())
new_df['start'].append(g.dates.min())
new_df['end'].append(g.dates.max())
continue
for _,date_range in date_groups:
start,end = date_range['dates'].min(), date_range['dates'].max()
val = date_range.val.sum()
new_df['id'].append(eyed)
new_df['var1'].append(var1)
new_df['val'].append(val)
new_df['start'].append(start)
new_df['end'].append(end)
print(pd.DataFrame(new_df))
>>> id var1 val start end
0 100 a 0.0 2021-01-22 2021-01-22
1 100 b 0.0 2021-01-22 2021-01-22
2 100 d 0.0 2021-01-22 2021-01-22
3 105 a 0.0 2021-01-22 2021-01-22
4 105 a 1.0 2021-01-21 2021-01-21
5 105 a 0.0 2021-01-20 2021-01-20
6 105 a 10.0 2021-01-19 2021-01-19
7 105 b 2.0 2021-01-22 2021-01-22
8 105 b 9.0 2021-01-21 2021-01-21
9 105 b 5.0 2021-01-20 2021-01-20
10 105 b 13.0 2021-01-19 2021-01-19
From the above I would have expected the rows 3,4,5,6 to be grouped together and 7,8,9,10 also. I am not sure why this example now breaks?
Not sure what the difference with this example and the extended example above is and why this seems to not work?
question from:
https://stackoverflow.com/questions/65849441/grouping-data-by-id-var1-into-consecutive-dates-in-python-using-pandas