The only thing here that I am not sure about is how pandas works internally. Besides that, your approach is totally fine. Hadoop widely uses the .tsv format to store and process data. And there is no such thing as "the limitation of a .tsv file". A file is just a sequence of bytes.
`\t` and `\n` are just characters, no different from any others. The limitation you encountered is imposed by Microsoft Excel, not by the OS. For example, it was lower a long time ago, and other spreadsheet applications can impose different limits.
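As a quick illustration, here is a minimal sketch (the file name and data are made up) that writes a .tsv with nothing but plain `open` and `write`; the tabs and newlines are ordinary characters like any other:

```python
# A .tsv is just text: columns separated by '\t', rows by '\n'.
# The file name and rows below are only an example.
rows = [("id", "name", "score"), (1, "alice", 3.5), (2, "bob", 4.0)]

with open('example.tsv', 'wt', encoding='utf-8') as f:
    for row in rows:
        f.write('\t'.join(str(field) for field in row) + '\n')
```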
If you `open('your_file.tsv', 'rt')` and call `readline`, bytes are simply read until a `\n` is reached. Nothing else happens. There is no rule about how many `\t`s are allowed before a `\n`, or how many `\n`s are allowed in a file. They are all just bytes, and a file can hold as many characters as the OS allows.
It varies across OSes and file systems; according to NTFS vs FAT vs exFAT, the maximum file size on an NTFS file system is almost 16 TB. In practice, though, splitting a big file into multiple files of a reasonable size is a good idea; for example, you can distribute them easily.
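One simple way to split is sketched below; the file names and the chunk size are arbitrary, so adjust them to your data:

```python
# Split a big .tsv into smaller files of `lines_per_chunk` rows each,
# repeating the header in every part. Names/sizes are illustrative only.
lines_per_chunk = 1_000_000

with open('big_file.tsv', 'rt', encoding='utf-8') as src:
    header = src.readline()
    chunk, out = 0, None
    for i, line in enumerate(src):
        if i % lines_per_chunk == 0:
            if out:
                out.close()
            out = open(f'big_file_part{chunk}.tsv', 'wt', encoding='utf-8')
            out.write(header)
            chunk += 1
        out.write(line)
    if out:
        out.close()
```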
To process data that big, you should take an iterative or distributed approach, for example with Hadoop.
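With pandas specifically, one iterative option is to read the file in chunks via `read_csv` with `chunksize`; the file name and chunk size below are only illustrative:

```python
import pandas as pd

# Process the .tsv in chunks instead of loading it all at once.
total = 0
for chunk in pd.read_csv('big_file.tsv', sep='\t', chunksize=100_000):
    total += len(chunk)  # replace with your per-chunk processing
print(total)
```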