I'm trying to split text into chunks to send to Google's text-to-speech engine, which accepts at most 5000 characters per query. I want to split longer texts at a whitespace character so that each chunk is no longer than 5000 characters. My current code (using a chunk size of 15 instead of 5000):
def split_text(text) -> list:
    start = 0
    chunk_size = 15
    chunk = ''
    chunks = []
    chunks_remaining = True
    while chunks_remaining:
        end = start + chunk_size
        if end >= len(text):
            chunks_remaining = False
        chunk = text[start:end]
        end = chunk.rfind(' ') + start
        chunks.append(text[start:end] + "...")
        start = end+1
    return chunks

def main():
    text = "This is just a text string for demonstrative purposes."
    chunks = split_text(text)
    print(chunks)
Is there a way to replace chunk.rfind(' ') with something that accepts any whitespace character?
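i = -1
# Walk backwards from the end of the chunk until a whitespace character is found.
while -i <= len(chunk):
    if chunk[i] in (' ', '\t', '\n', '\r'):
        # Convert the negative index into a position in the full text.
        end = start + len(chunk) + i
        break
    i -= 1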
Would something like that work for you? It would scan the chunk from the end for any whitespace character.
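If you would rather not list the characters by hand, here is a rough sketch of the same idea using a regular expression. The helper name `last_whitespace_index` is made up for this example, and falling back to a hard cut when a chunk contains no whitespace is my own assumption, so adjust it to whatever your text-to-speech input needs. For `str` patterns, `\s` in Python's `re` module matches spaces, tabs, newlines and other Unicode whitespace.

```python
import re


def last_whitespace_index(chunk: str) -> int:
    """Return the index of the last whitespace character in chunk, or -1.

    Hypothetical helper: same return convention as str.rfind, but it
    accepts any character matched by the regex whitespace class.
    """
    last = -1
    for match in re.finditer(r"\s", chunk):
        last = match.start()
    return last


def split_text(text: str, chunk_size: int = 5000) -> list:
    """Split text into chunks of at most chunk_size characters at whitespace."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        if end >= len(text):
            # The remainder fits into a single chunk, so keep it whole.
            chunks.append(text[start:])
            break
        # Look one character past the limit so that a whitespace character
        # directly after a full-length chunk still counts as a split point.
        cut = last_whitespace_index(text[start:end + 1])
        if cut <= 0:
            # Assumption: no usable whitespace, so cut hard at chunk_size.
            chunks.append(text[start:end])
            start = end
        else:
            chunks.append(text[start:start + cut])
            start = start + cut + 1  # skip the whitespace character itself
    return chunks


if __name__ == "__main__":
    text = "This is just a text string for demonstrative purposes."
    print(split_text(text, chunk_size=15))
```

Unlike the code in the question, this version does not append "..." to each chunk and keeps the final chunk intact instead of trimming it at its last space.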