Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
117 views
in Technique[技术] by (71.8m points)

Roll data forward in python keeping the days consistent

I have a set of data that will remain static. I haven't determined the best format to store it yet, probably a DataFrame, but I'm open to suggestions.

My aim is to roll the data forward, so my static dataset is up to date, but I need to keep the days consistent. For example, if my static data looks like this:


         Date  Day Temp
0  01/01/2021  Fri   20
1  02/01/2021  Sat   17
2  03/01/2021  Sun   17
3  04/01/2021  Mon   16
4  05/01/2021  Tue   20
5  06/01/2021  Wed   16
6  07/01/2021  Thu   21
7  08/01/2021  Fri   19

The last data I will have is for 08/01/2021. If I were to run this data on 09/01/2021, it would look like:


Date  Day Temp
0  02/01/2021  Sat   17
1  03/01/2021  Sun   17
2  04/01/2021  Mon   16
3  05/01/2021  Tue   20
4  06/01/2021  Wed   16
5  07/01/2021  Thu   21
6  08/01/2021  Fri   19
7  09/01/2021  Sat   17

The 09/01/2021 is a Saturday, so it rolled the data from the last Saturday in the dataset, 02/01/2021, forward.

I won't be able to determine ahead of time of many days the dataset will contain, and I may have to roll the data forward multiple times, for example, the last data in the static dataset might be 15/05/2018, and I need to roll it on to today.

I've looked at numpy roll and pandas shift, but I'm struggling to keep the days consistent.

Out of interest, it is possible to improve this even further. Rather than matching the oldest day and rolling it forward, it first tries to match on the oldest data with the same month and day - so that the monthly temperatures remain consistent?

Thanks

question from:https://stackoverflow.com/questions/66054997/roll-data-forward-in-python-keeping-the-days-consistent

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

If the static data isn't guaranteed to have sequential dates, etc. you can use a groupby to the last record that matches the day of week and month. If there isn't any data matching that day of week, this will give a KeyError.

Here is your initial df:

df = pd.DataFrame({'Date': {0: pd.to_datetime('2021-01-01'),
1: pd.to_datetime('2021-01-02'),
2: pd.to_datetime('2021-01-03'), 
3: pd.to_datetime('2021-01-04'),
4: pd.to_datetime('2021-01-05'),
5: pd.to_datetime('2021-01-06'),
6: pd.to_datetime('2021-01-07'),
7: pd.to_datetime('2021-01-08')},
'Day': {0: 'Fri',  1: 'Sat',  2: 'Sun', 3: 'Mon', 
         4: 'Tue', 5: 'Wed', 6: 'Thu', 7: 'Fri'},
 'Temp': {0: 20, 1: 17, 2: 17, 3: 16, 4: 20, 
          5: 16, 6: 21, 7: 19}})

If you want to roll off the first record, you can do that multiple ways... this uses shift(), although that will make the df['Temp'] column a float because of the NaN, if that matters.

df = df.shift(-1).dropna()

The next part builds lookup tables from your static data.

  • The first is indexed by month and day_of_week, and retains the most recent value of each pair -- I think that is what you wanted. If you really did want the oldest date, just use first instead of last.
  • The second one, lookup2, drops the month index and gives you data to fall back to to get the most recent matching day_of_week, ignoring the month.

Code:

lookup = df.groupby([df['Date'].dt.month, df['Date'].dt.day_of_week]).last()
lookup2 = lookup.droplevel(0)
lookup2 = lookup2[~lookup2.index.duplicated(keep='last')]

The code to insert records is below. I enclosed it in a loop that adds the next day each time so you can test it adding multiple days in a row:

ndays = 5
for i in range(ndays):
    next_date = df.iloc[-1]['Date'] + pd.Timedelta("1D")
    try:
        temp = lookup.loc[(next_date.month, next_date.day_of_week), 'Temp']
    except KeyError:
        temp = lookup2.loc[next_date.day_of_week, 'Temp']
    df = df.append( {'Date' : next_date, 'Day' : next_date.strftime("%a"), 'Temp' : temp}, ignore_index=True)

print(df)
         Date  Day  Temp
0  2021-01-02  Sat  17.0
1  2021-01-03  Sun  17.0
2  2021-01-04  Mon  16.0
3  2021-01-05  Tue  20.0
4  2021-01-06  Wed  16.0
5  2021-01-07  Thu  21.0
6  2021-01-08  Fri  19.0
7  2021-01-09  Sat  17.0
8  2021-01-10  Sun  17.0
9  2021-01-11  Mon  16.0
10 2021-01-12  Tue  20.0
11 2021-01-13  Wed  16.0

In any case, hopefully this is at least on the right track for what you asked, and you can modify it to suit your needs.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...