0 votes
159 views
in Technique by (71.8m points)

python - pandas 1.2.1 to_csv performance with datetime as the index and setting date_format

I'm observing a massive performance regression after upgrading to Python 3.8 and pandas 1.2.1.

The following simple code takes nearly 8 minutes to complete:

import sys
import pandas as pd
import csv
import datetime

pd.show_versions()

start = datetime.datetime(2020,1,1,0,0,0)
print(str(start))
data = { 'timestamp': [], 'i': []}
for i in range(0, 2000000):
        data['timestamp'].append(str(start))
        data['i'].append(i)

df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
print(df.tail(10), flush=True)

df.to_csv('/DATA1/TEMP/df.csv', sep=',', date_format='%Y-%m-%d %H:%M:%S', quoting=csv.QUOTE_ALL)
print("DONE", flush=True)

The output of this script in my virtual environment is:

/home/user/venv-test/lib/python3.8/site-packages/setuptools/distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
  warnings.warn(

INSTALLED VERSIONS
------------------
commit           : 9d598a5e1eee26df95b3910e3f2934890d062caa
python           : 3.8.7.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1062.el7.x86_64
Version          : #1 SMP Wed Aug 7 18:08:02 UTC 2019
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.2.1
numpy            : 1.19.5
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 21.0
setuptools       : 49.2.1
Cython           : 0.29.21
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.0
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
2020-01-01 00:00:00
                  i
timestamp          
2020-01-01  1999990
2020-01-01  1999991
2020-01-01  1999992
2020-01-01  1999993
2020-01-01  1999994
2020-01-01  1999995
2020-01-01  1999996
2020-01-01  1999997
2020-01-01  1999998
2020-01-01  1999999
DONE

real    7m49.337s
user    7m39.805s
sys     0m14.665s

In my production environment this is run on a dataframe with about 20 more columns, and the time of the to_csv() call went from a few minutes to more than 40 hours.

Am I missing something obvious here? Or is there maybe some known bug in this specific version of pandas?

I checked that my system is not I/O bound by dumping the file to several different partitions on different hard drives. Additionally, the process stays at 100% CPU usage while dumping the data frame to CSV, which would be inconsistent with an I/O-bound system.
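
To confirm where the CPU time goes, one way is to profile the to_csv call on a smaller frame. This is only a minimal sketch; the 200,000-row count and the output file name are placeholders:

import cProfile
import csv
import datetime
import pstats

import pandas as pd

n = 200000
df = pd.DataFrame({'i': range(n)}, index=[datetime.datetime(2020, 1, 1)] * n)
df.index.name = 'timestamp'

profiler = cProfile.Profile()
profiler.enable()
df.to_csv('df_profiled.csv', sep=',',
          date_format='%Y-%m-%d %H:%M:%S', quoting=csv.QUOTE_ALL)
profiler.disable()

# If the slowdown is in formatting rather than I/O, the top cumulative
# entries should be date-formatting helpers, not file writes.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)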

Question from: https://stackoverflow.com/questions/65903287/pandas-1-2-1-to-csv-performance-with-datetime-as-the-index-and-setting-date-form


1 Answer

0 votes
by (71.8m points)
  • I can't tell you why, but I can tell you what.

The issue is caused by using date_format when the datetime is the index.

  • As noted by Stef: "BUG (Performance): Performance of to_csv varies significantly depending on when/how index is set" (#37484)
  • date_format='%Y-%m-%d %H:%M:%S' is what makes the process slow, because it does not appear to be vectorized (see the timing sketch after the examples below).
    • This parameter seems to cause the process to write each row to the file and then change the format in the file, or to change the date format for the row and then write it.
    • I can see this is happening because, when I stop the process, the file is only partially written.
  • In this case, the time for all the data is 00:00:00, which pandas doesn't display; however, if any time in the column is other than 00:00:00, then all time components are displayed.
    • If you write to a file without date_format and all the time components are 00:00:00, the timestamps are written as '%Y-%m-%d'.
start = datetime(2020,1,1,0,0,0)
df.to_csv('df.csv', sep=',', quoting=csv.QUOTE_ALL)

# resulting csv
"timestamp","i"
"2020-01-01","0"
"2020-01-01","1"
"2020-01-01","2"
  • If you write to a file without date_format and the time components are 00:00:01 (i.e., not all midnight), the timestamps are written as '%Y-%m-%d %H:%M:%S'.
start = datetime(2020,1,1,0,0,1)
df.to_csv('df.csv', sep=',', quoting=csv.QUOTE_ALL)

# resulting csv
"timestamp","i"
"2020-01-01 00:00:01","0"
"2020-01-01 00:00:01","1"
"2020-01-01 00:00:01","2"

Resolve the issue by setting the format before calling .to_csv(), or by resetting the datetime index.

  • If you need to set the format of a datetime column, do it before writing to the csv.
  • Use df.index.strftime('%Y-%m-%d %H:%M:%S') or df['some_column'].dt.strftime('%Y-%m-%d %H:%M:%S').
  • '%Y-%m-%d %H:%M:%S' is already the default format for a datetime dtype, so it's not necessary to reformat a datetime formatted column to a string.
# Option 1: format the datetime index as strings before writing
df.index = df.index.strftime('%Y-%m-%d %H:%M:%S')
df.to_csv('df.csv', sep=',', quoting=csv.QUOTE_ALL)

# Option 2: reset the index so the timestamp is a regular column
df.reset_index(inplace=True)
df.to_csv('df.csv', sep=',', date_format='%Y-%m-%d %H:%M:%S', index=False, quoting=csv.QUOTE_ALL)
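
Either option writes the full timestamp even when every time component is 00:00:00. A quick read-back of the first few lines (a sketch, reusing the df.csv file name from the example above) confirms it:

# Peek at the first lines of the file written above to confirm the
# timestamps carry the full '%Y-%m-%d %H:%M:%S' format.
with open('df.csv') as f:
    for _ in range(3):
        print(f.readline().rstrip())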

Test Data

  • The code to write the test dataframe can be consolidated
  • A list of datetime(2020,1,1,0,0,0) objects already produces a datetime index, so there is no reason to do df['timestamp'] = pd.to_datetime(df['timestamp'])
import csv
from datetime import datetime

import pandas as pd

rows = 2000000
df = pd.DataFrame({'i': range(rows)}, index=[datetime(2020,1,1,0,0,1)] * rows)
df.index.name = 'timestamp'

df.to_csv('df.csv', sep=',', quoting=csv.QUOTE_ALL)
