regex - Python Regular Expression matching multiple lines (re.DOTALL)

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to parse a string with multiple lines.

Suppose it is:

text = '''
Section1
stuff belonging to section1
stuff belonging to section1
stuff belonging to section1
Section2
stuff belonging to section2
stuff belonging to section2
stuff belonging to section2
'''

I want to use the finditer method of the re module to get a dictionary like:

{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\n'}
{'section': 'Section2', 'section_data': 'stuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2\n'}

I tried the following:

import re
re_sections = re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+)", re.DOTALL)
sections_it = re_sections.finditer(text)

for m in sections_it:
    print(m.groupdict())

But this results in:

{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\nSection2\nstuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2\n'}

So the section_data also matches Section2.

I also tried to make the second group match anything except what the first group matched, but this produces no output at all:

re_sections = re.compile(r"(?P<section>Section\d)\s+(?P<section_data>^(?P=section))", re.DOTALL)

I know I could use the following regex, but I'm looking for a version where I don't have to spell out what the second group looks like:

re_sections = re.compile(r"(?P<section>Section\d)\s+(?P<section_data>[a-z12\s]+)", re.DOTALL)

Thank you very much!

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:56:16+0000

Use a look-ahead to match everything up to the next section header, or the end of the string:

re_sections = re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=(?:Section\d|$))", re.DOTALL)

Note that this needs a non-greedy .+? as well, otherwise it'll still match all the way to the end first.
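To see why the non-greedy quantifier matters, here is a minimal sketch, assuming Python 3 and the sample text from the question. The names LOOKAHEAD, greedy, and lazy are only illustrative. Both patterns use the same look-ahead; only the quantifier on section_data differs.

```python
import re

text = '''
Section1
stuff belonging to section1
stuff belonging to section1
stuff belonging to section1
Section2
stuff belonging to section2
stuff belonging to section2
stuff belonging to section2
'''

# Same look-ahead in both patterns: stop before the next header or at the end.
LOOKAHEAD = r"(?=Section\d|$)"

# Greedy .+ runs to the end of the string first; the look-ahead is already
# satisfied there by $, so the regex never backtracks to the "Section2" header.
greedy = re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+)" + LOOKAHEAD, re.DOTALL)

# Lazy .+? grows one character at a time and stops at the first position
# where the look-ahead succeeds.
lazy = re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+?)" + LOOKAHEAD, re.DOTALL)

greedy_matches = [m.groupdict() for m in greedy.finditer(text)]
lazy_matches = [m.groupdict() for m in lazy.finditer(text)]

print(len(greedy_matches))  # 1: Section2 is swallowed into Section1's data
print(len(lazy_matches))    # 2: one match per section
```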

Demo:

>>> re_sections = re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=(?:Section\d|$))", re.DOTALL)
>>> for m in re_sections.finditer(text): print(m.groupdict())
... 
{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\n'}
{'section': 'Section2', 'section_data': 'stuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2'}
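In the demo, the last section comes back without its trailing newline. That happens because, without re.MULTILINE, $ also matches just before a newline at the very end of the string, so the lazy group stops one character early. If the trailing newline should be kept, as in the output the question asks for, one possible variation (a sketch assuming Python 3, not part of the original answer) is to use \Z, which matches only at the true end of the string:

```python
import re

text = '''
Section1
stuff belonging to section1
stuff belonging to section1
stuff belonging to section1
Section2
stuff belonging to section2
stuff belonging to section2
stuff belonging to section2
'''

# \Z instead of $ in the look-ahead: the lazy group must now consume the
# final newline before the look-ahead can succeed.
re_sections = re.compile(
    r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=Section\d|\Z)",
    re.DOTALL,
)

sections = [m.groupdict() for m in re_sections.finditer(text)]
for section in sections:
    print(section)
```

With this change both section_data values end in '\n'. Note that \s* after the header still consumes any whitespace that directly follows it, so leading blank lines in a section are not preserved.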
