Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
993 views
in Technique[技术] by (71.8m points)

list - How do you split reading a large csv file into evenly-sized chunks in Python?

In a basic I had the next process.

import csv
reader = csv.reader(open('huge_file.csv', 'rb'))

for line in reader:
    process_line(line)

See this related question. I want to send the process line every 100 rows, to implement batch sharding.

The problem about implementing the related answer is that csv object is unsubscriptable and can not use len.

>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable

How can I solve this?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Just make your reader subscriptable by wrapping it into a list. Obviously this will break on really large files (see alternatives in the Updates below):

>>> reader = csv.reader(open('big.csv', 'rb'))
>>> lines = list(reader)
>>> print lines[:100]
...

Further reading: How do you split a list into evenly sized chunks in Python?


Update 1 (list version): Another possible way would just process each chuck, as it arrives while iterating over the lines:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

chunk, chunksize = [], 100

def process_chunk(chuck):
    print len(chuck)
    # do something useful ...

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]  # or: chunk = []
    chunk.append(line)

# process the remainder
process_chunk(chunk)

Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

def gen_chunks(reader, chunksize=100):
    """ 
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices. 
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]  # or: chunk = []
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader):
    print chunk # process chunk

# test gen_chunk on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print chunk # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]

There is a minor gotcha, as @totalhack points out:

Be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

56.8k users

...