
0 votes
221 views
in Technique by (71.8m points)

python - Store very huge data in txt file format with Tab Separated values

I am loading the dataset from a SQL database using pd.read_sql(). I tried to store 100 million rows and 300 columns in an Excel/CSV file, but it failed because of Excel's limit of 1,048,576 rows.

So I am trying to store the same data as a .tsv file using

df.to_csv("data.txt", header=True, index=False, sep='\t', mode='a')

I can't find any documented row limit for a tab-separated .txt file.

Is this approach good to go, or is there a better option?
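For context, a chunked version of the same export (a minimal sketch; the sqlite3 connection, query, and chunk size below are placeholders for the real SQL source) would look like this:

import pandas as pd
import sqlite3  # placeholder: any DB-API connection or SQLAlchemy engine works with pd.read_sql

conn = sqlite3.connect("mydata.db")  # placeholder connection

# Stream the table in chunks instead of pulling all 100 million rows at once,
# appending each chunk to a single tab-separated file.
first = True
for chunk in pd.read_sql("SELECT * FROM my_table", conn, chunksize=1_000_000):
    chunk.to_csv("data.tsv", sep="\t", index=False,
                 header=first,               # write the header only for the first chunk
                 mode="w" if first else "a")
    first = False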

question from:https://stackoverflow.com/questions/65880539/store-very-huge-data-in-txt-file-format-with-tab-separated-values

1 Answer

0 votes
by (71.8m points)

The only thing here that I am not sure about is how pandas works internally. Apart from that, your approach is totally fine. Hadoop widely uses the .tsv format to store and process data, and there is no such thing as "the limitation of a .tsv file". A file is just a sequence of bytes; \t and \n are just characters with no special treatment. The limit you hit is imposed by Microsoft Excel, not by the OS. For example, it was lower a long time ago, and other spreadsheet applications may impose different limits.
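As a quick sanity check (a hypothetical sketch, not part of the original answer), you can write far more rows than Excel's 1,048,576-row limit to a plain .tsv file and read them all back:

# Write 2 million tab-separated rows -- well past Excel's 1,048,576-row limit.
with open("demo.tsv", "wt") as f:
    f.write("id\tvalue\n")
    for i in range(2_000_000):
        f.write(f"{i}\t{i * 2}\n")

# Count the lines back: the OS and file system do not care how many rows there are.
with open("demo.tsv", "rt") as f:
    print(sum(1 for _ in f))  # 2000001 (header + 2,000,000 data rows)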

If you open('your_file.tsv', 'wt') and later call readline, bytes are simply read until a \n is reached; nothing else happens. There is no rule about how many \t characters are allowed before a \n, or how many \n characters are allowed in a file. They are all just bytes, and a file can hold as many characters as the OS allows.
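A minimal illustration of that, reusing the demo.tsv written above: each readline() simply returns the bytes up to the next \n, and splitting on \t recovers the columns:

# Read the file back one line at a time; only the current line is held in memory.
with open("demo.tsv", "rt") as f:
    header = f.readline().rstrip("\n").split("\t")     # ['id', 'value']
    first_row = f.readline().rstrip("\n").split("\t")  # ['0', '0']
    print(header, first_row)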

The maximum file size varies across OSs and file systems; according to NTFS vs FAT vs exFAT, the maximum file size on an NTFS file system is almost 16 TB. In practice, though, splitting one big file into multiple files of a reasonable size is a good idea, for example so that you can distribute them easily.
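A hedged sketch of that splitting, assuming the data is already arriving in chunks (the fake_chunks generator, file names, and sizes are placeholders for the real pd.read_sql stream):

import numpy as np
import pandas as pd

# Placeholder source: stands in for pd.read_sql(..., chunksize=...).
def fake_chunks(n_chunks=3, rows_per_chunk=1_000_000):
    for i in range(n_chunks):
        start = i * rows_per_chunk
        yield pd.DataFrame({"id": np.arange(start, start + rows_per_chunk),
                            "value": np.random.rand(rows_per_chunk)})

# Write each chunk to its own part file instead of one huge monolith.
for part, chunk in enumerate(fake_chunks()):
    chunk.to_csv(f"data_part_{part:04d}.tsv", sep="\t", index=False)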

To process data this large, you should take an iterative or distributed approach, for example with Hadoop.
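The iterative half of that advice can be sketched with pandas alone (the chunk size and the per-chunk work are placeholders); the same pattern is what Hadoop or Spark distribute across machines:

import pandas as pd

# Process the big TSV in fixed-size chunks instead of loading it all at once.
total_rows = 0
for chunk in pd.read_csv("data.tsv", sep="\t", chunksize=1_000_000):
    total_rows += len(chunk)   # replace with real per-chunk filtering/aggregation
print("rows processed:", total_rows)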



...