Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Python - Split a list with a long string based on 2 keywords

I have a list containing one long string. How can I split the string to extract the sections from 'MyKeyword' to 'MyData'? These words appear multiple times in the string, so I'd like to split on them and keep MyKeyword and MyData in the output if possible.

Current data example:

['MyKeyword This is my data MyData. MyKeyword and chunk of text here. Random text. MyData is this etc etc ']

Desired output:

['MyKeyword This is my data', 'MyData.', 'MyKeyword and chunk of text here. Random text.','MyData is this etc etc ']

Current code:


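from itertools import groupby
#linelist = ["a", "b", "", "c", "d", "e", "", "a"]
# output2 is the list shown above: a single long string inside a list
output2 = ['MyKeyword This is my data MyData. MyKeyword and chunk of text here. Random text. MyData is this etc etc ']
split_at = "MyKeyword"
# groupby compares each whole list element with split_at, so the long string itself is never split
[list(g) for k, g in groupby(output2, lambda x: x != split_at) if k]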

Question from: https://stackoverflow.com/questions/65918346/python-split-a-list-with-a-long-string-based-on-2-keywords

Fight dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss gazes back…

1 Answer


You can use a regular expression, matching all the text from MyKeyword to MyData in lazy mode:

>>> import re
>>> re.findall(r"MyKeyword.*?MyData\.?", "MyKeyword This is my data, MyData. MyKeyword and chunk of text here. Random text. MyData is this etc etc ")
['MyKeyword This is my data, MyData.', 'MyKeyword and chunk of text here. Random text. MyData']
  • .*? means zero or more characters, matched in lazy mode (*?), i.e. as few as possible;
  • \.? means an optional period (the backslash is needed because an unescaped . matches any character).
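
To see what lazy mode changes, here is a small sketch comparing the greedy and lazy quantifiers on the same sample string. The variable names (text, greedy, lazy) are only illustrative.

```python
import re

text = ("MyKeyword This is my data, MyData. MyKeyword and chunk of text here. "
        "Random text. MyData is this etc etc ")

# Greedy: .* grabs as much as possible, so a single match runs from the
# first MyKeyword all the way to the last MyData.
greedy = re.findall(r"MyKeyword.*MyData", text)

# Lazy: .*? stops at the nearest MyData, so each section is matched separately.
lazy = re.findall(r"MyKeyword.*?MyData", text)

print(len(greedy), len(lazy))  # 1 2
```

With the greedy quantifier, both sections are merged into one match, which is why the lazy form is used here.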

EDIT (according to the new requirement):

The regex you need is something like

MyKeyword.*?(?= ?MyData|$)|MyData.*?(?= ?MyKeyword|$)

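It starts at the point where it matches MyKeyword (resp. MyData), and then it captures as few characters as possible, as above, until it reaches MyData (resp. MyKeyword) or the end of the string. Because the stopping condition is a lookahead, the next keyword is not consumed and becomes the start of the next match.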

Indeed:

  • | is a special character which means "or"
  • $ matches the end of the string
  • " ?" (a space followed by ?) is an optional space
  • (?=<expr>) is called positive lookahead and it means "followed by <expr>"
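
As a minimal sketch, the regex above can be applied to the list from the question like this. The names data, pattern and chunks are illustrative, and the code assumes every string in the list stays on a single line (. does not match a newline unless re.DOTALL is passed).

```python
import re

# Sample list from the question: a single long string inside a list.
data = ['MyKeyword This is my data MyData. MyKeyword and chunk of text here. '
        'Random text. MyData is this etc etc ']

# A chunk starts at one keyword and stops just before the other keyword
# (optionally preceded by a space) or at the end of the string.
pattern = re.compile(r"MyKeyword.*?(?= ?MyData|$)|MyData.*?(?= ?MyKeyword|$)")

# Apply the pattern to every string in the list and flatten the results.
chunks = [chunk for text in data for chunk in pattern.findall(text)]
print(chunks)
# ['MyKeyword This is my data', 'MyData.',
#  'MyKeyword and chunk of text here. Random text.', 'MyData is this etc etc ']
```

The result matches the desired output in the question, including the trailing space of the last chunk, since the final match runs to the end of the string.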
