I have a 900 MB pipe-delimited text file that I need to load into a pandas DataFrame and, ultimately, ingest into a Postgres database.
I've tried looping over chunks and concatenating them into one DataFrame, but it didn't work:
import pandas as pd

df = pd.DataFrame()
for chunk in pd.read_csv(r"my_file.txt", sep='|', chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)  # append each chunk to the growing DataFrame
What else should I try? Any help for a n00b is much appreciated. Thank you!
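For context, the end state I'm picturing is something like the sketch below (connection string and table name are made up, and it assumes SQLAlchemy plus a driver like psycopg2 for the Postgres side):

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection details; swap in your own user/password/host/db
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# read the pipe-delimited file in chunks and append each chunk to Postgres,
# so the full 900 MB never has to sit in memory as a single DataFrame
for chunk in pd.read_csv(r"my_file.txt", sep='|', chunksize=100000):
    chunk.to_sql('my_table', engine, if_exists='append', index=False)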
EDIT (adding more detail): when I try to read the entire file and check the number of rows, using:
data = pd.read_csv(r"my_file.txt", sep='|')
print('Total rows: {0}'.format(len(data)))  # row count
print(list(data))  # column names
I get a DtypeWarning on ~50 columns (of ~300) telling me to specify the dtype option on import. I also get a MemoryError:
MemoryError: Unable to allocate 410. MiB for an array with shape (277, 388455) and data type object
Out of curiosity, I tried reading different nrows values to see when the DtypeWarning and the MemoryError first appear: I can read the first 2,000 rows without either, and the first 240,000 rows without the MemoryError but still with the DtypeWarning on ~50 of the ~300 columns.
Will I need to specify the dtype in read_csv() for each column to avoid the warning?
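If so, I'm imagining something like this (column names are made up purely to illustrate, since the real file has ~300 columns):

# made-up column names, just to show the shape of the dtype mapping
dtype_map = {'account_id': 'Int64', 'status_code': str, 'balance': float}
data = pd.read_csv(r"my_file.txt", sep='|', dtype=dtype_map)

# or, to sidestep type inference entirely, read every column as a string
data = pd.read_csv(r"my_file.txt", sep='|', dtype=str)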
Additionally, I'm unsure how to handle the MemoryError; as one commenter mentioned below, 900 MB isn't exactly wildly massive.
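In case it helps, this is the kind of sanity check I could run on a sample to estimate the in-memory footprint (the 50,000-row sample size is arbitrary):

# read an arbitrary sample and measure how much memory it actually uses
sample = pd.read_csv(r"my_file.txt", sep='|', nrows=50000)
print('Sample memory: {0:.1f} MB'.format(sample.memory_usage(deep=True).sum() / 1e6))
print(sample.dtypes.value_counts())  # how many columns ended up as object vs numeric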
question from:
https://stackoverflow.com/questions/65854245/best-way-to-convert-large-text-file-900-mb-300-columns-pipe-delim-to-pandas