Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
230 views
in Technique[技术] by (71.8m points)

How do I extract bytes with offsets from a huge block efficiently in Python?

Let's suppose I have a block of bytes like this:

block = b'0123456789AB'

I want to extract each sequence of 3 bytes from each chunk of 4 bytes and join them together. The result for the block above should be:

b'01245689A'  # 3, 7 and B are missed

I could solve this issue with such script:

block = b'0123456789AB'
result = b''
for i in range(0, len(block), 4):
    result += block[i:i + 3]
print(result)

But as it's known, Python is quite inefficient with such for-loops and bytes concatenations, thus my approach will never end if I apply it for a really huge block of bytes. So is there a faster way to perform?

question from:https://stackoverflow.com/questions/66047126/how-do-i-extract-bytes-with-offsets-from-a-huge-block-efficiently-in-python

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Make it mutable and delete the the unwanted slice?

>>> tmp = bytearray(block)
>>> del tmp[3::4]
>>> bytes(tmp)
b'01245689A'

If your chunks are large and you want to remove almost all bytes, it might become faster to instead collect what you do want, similar to yours. Although yours potentially takes quadratic time, better use join:

>>> b''.join([block[i : i+3] for i in range(0, len(block), 4)])
b'01245689A'

(Btw according to PEP 8 it should be block[i : i+3], not block[i:i + 3], and for good reason.)

Although that builds a lot of objects, which could be a memory problem. And for your stated case, it's much faster than yours but much slower than my bytearray one.

Benchmark with block = b'0123456789AB' * 100_000 (much smaller than the 1GB you mentioned in the comments below):

    0.00 ms      0.00 ms      0.00 ms  baseline
15267.60 ms  14724.33 ms  14712.70 ms  original
    2.46 ms      2.46 ms      3.45 ms  Kelly_Bundy_bytearray
   83.66 ms     85.27 ms    122.88 ms  Kelly_Bundy_join

Benchmark code:

import timeit

def baseline(block):
    pass

def original(block):
    result = b''
    for i in range(0, len(block), 4):
        result += block[i:i + 3]
    return result

def Kelly_Bundy_bytearray(block):
    tmp = bytearray(block)
    del tmp[3::4]
    return bytes(tmp)

def Kelly_Bundy_join(block):
    return b''.join([block[i : i+3] for i in range(0, len(block), 4)])

funcs = [
    baseline,
    original,
    Kelly_Bundy_bytearray,
    Kelly_Bundy_join,
    ]

block = b'0123456789AB' * 100_000
args = block,
number = 10**0

expect = original(*args)
for func in funcs:
    print(func(*args) == expect, func.__name__)
print()

tss = [[] for _ in funcs]
for _ in range(3):
    for func, ts in zip(funcs, tss):
        t = min(timeit.repeat(lambda: func(*args), number=number)) / number
        ts.append(t)
        print(*('%8.2f ms ' % (1e3 * t) for t in ts), func.__name__)
    print()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...