- I can't tell you why, but I can tell you what.
The issue is caused by using date_format
and when the datetime
is the index.
- As noted by Stef: BUG (Performance): Performance of to_csv varies significantly depending on when/how index is set #37484
date_format='%Y-%m-%d %H:%M:%S'
is making the process slow, because it does not seem to be vectorized.
- This parameter seems to cause the process to write the row to a the file, and then change the format in the file, or it changes the date format, for the row, and then writes it.
- I can see that this is happening, because I stop the process, and the file is partially written.
- In this case, the time for all the data is
00:00:00
, which pandas doesn't display, however if there is a time other than 00:00:00
in the column, then all time components are displayed.
- If you write to a file without
date_format
, when all the time components are 00:00:00
, the format of the timestamp will be '%Y-%m-%d'
.
start = datetime(2020,1,1,0,0,0)
df.to_csv('df.csv', sep=',', quoting=csv.QUOTE_ALL)
# resulting csv
"timestamp","i"
"2020-01-01","0"
"2020-01-01","1"
"2020-01-01","2"
- If you write to a file without
date_format
, when all the time components are 00:00:01
, the format of the timestamp will be '%Y-%m-%d %H:%M:%S'
.
start = datetime(2020,1,1,0,0,1)
df.to_csv('df.csv', sep=',', quoting=csv.QUOTE_ALL)
# resulting csv
"timestamp","i"
"2020-01-01 00:00:01","0"
"2020-01-01 00:00:01","1"
"2020-01-01 00:00:01","2"
Resolve the issue by setting the format before .to_csv()
, or reset the datetime
index.
- If you need to set the format of a datetime column, do it before writing to the csv.
- Use
df.index.strftime('%Y-%m-%d %H:%M:%S')
or df[some column].dt.strftime('%Y-%m-%d %H:%M:%S')
.
'%Y-%m-%d %H:%M:%S'
is already the default format for a datetime
dtype
, so it's not necessary to reformat a datetime
formatted column to a string
.
df.index = df.index.strftime('%Y-%m-%d %H:%M:%S')
df.to_csv('df.csv', sep=',', quoting=csv.QUOTE_ALL)
df.reset_index(inplace=True)
df.to_csv('df.csv', sep=',', date_format='%Y-%m-%d %H:%M:%S', index=False, quoting=csv.QUOTE_ALL)
Test Data
- The code to write the test dataframe can be consolidated
datetime(2020,1,1,0,0,0)
is already a datetime
dtype
, so there is no reason to do df['timestamp'] = pd.to_datetime(df['timestamp'])
import pandas as pd
from datetime import datetime
cols = 2000000
df = pd.DataFrame({'i': range(cols)}, index=[datetime(2020,1,1,0,0,1)] * cols)
df.to_csv('df.csv', sep=',', quoting=csv.QUOTE_ALL)