Bounty answer
You can use
import re
text = """ HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
1 VIKRAM NATH,HONOURABLE MR. JUSTICE 1 1 0 3 5
J.B.PARDIWALA
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
2 VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. 0 1 0 0 1
PANCHOLI
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
3 VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH 107 4 10 6 127
J. SHASTRI"""
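# Strip digits, spaces and tabs from both ends of every line
text = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
#print(text)
# Capture each block that starts with HONOURABLE at the start of a line,
# plus any following lines that do not start with HONOURABLE
m = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE).*)*)', text, re.M)
for x in m:
    print(x.replace('\n', ' '))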
Output:
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE J.B.PARDIWALA
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. PANCHOLI
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH J. SHASTRI
See the Python demo.
Details:
re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M) removes all spaces, tabs and digits from the start and end of each line in your text.
r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE).*)*)' is a regex that matches the following in the "trimmed" text:
^ - start of a line
HONOURABLE - the word HONOURABLE
\s+ - one or more whitespaces
(.*(?:\n(?!HONOURABLE).*)*) - Capturing group 1:
  .* - the rest of the line
  (?:\n(?!HONOURABLE).*)* - zero or more lines that do not start with HONOURABLE.
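The two steps described above can be wrapped in one small function. This is only a sketch: the name extract_judges and the shortened sample are made up for illustration, and it assumes the escapes in the patterns are \d, \t, \s and \n.

```python
import re

def extract_judges(raw):
    """Hypothetical helper: trim each line, then collect HONOURABLE blocks."""
    # Step 1: drop digits, spaces and tabs at the start and end of every line
    trimmed = re.sub(r'^[\d \t]+|[\d \t]+$', '', raw, flags=re.M)
    # Step 2: capture the text after a line-initial HONOURABLE, including any
    # continuation lines that do not themselves start with HONOURABLE
    blocks = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE).*)*)', trimmed, re.M)
    # Step 3: join continuation lines with a single space
    return [b.replace('\n', ' ') for b in blocks]

# Shortened sample in the same layout as the text above
sample = (
    " HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\n"
    "2 VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. 0 1 0 0 1\n"
    "PANCHOLI"
)
print(extract_judges(sample))
# ['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. PANCHOLI']
```

Note that the trimming step would also strip digits that legitimately begin or end a line, so it only suits input where the edge numbers are table noise.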
Original answer
You can use
HONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)
See the regex demo. If you do not want to have linebreaks in the resulting list items, you may later replace them with .replace('\n', ' '). If you want your matches to also stop at characters such as [ and ], add them to the negated character class, e.g. change [^,] to [^][/,] (which stops at ], [, / and ,).
Details:
HONOURABLE - the word HONOURABLE
\s+ - one or more whitespaces
((?:THE|MR|MS|DR)[^,]*) - Capturing group 1: THE, MR, MS or DR followed by zero or more chars other than a comma.
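To see what the extended negated character class changes, here is a small comparison. The input string is invented for illustration (it is not from the question), and the trailing .strip() is an extra step added here to drop the space left before the bracket.

```python
import re

# Hypothetical input with a bracketed note and a slash after the names
text = ("HONOURABLE MR. JUSTICE A. B. SHAH [CORAM], "
        "HONOURABLE MS. JUSTICE C. D. RAO/OTHERS")

# Original pattern: the match only stops at a comma
basic = re.findall(r"HONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)", text)

# Extended class: the match also stops at ], [ and /
strict = [s.strip() for s in
          re.findall(r"HONOURABLE\s+((?:THE|MR|MS|DR)[^][/,]*)", text)]

print(basic)   # ['MR. JUSTICE A. B. SHAH [CORAM]', 'MS. JUSTICE C. D. RAO/OTHERS']
print(strict)  # ['MR. JUSTICE A. B. SHAH', 'MS. JUSTICE C. D. RAO']
```

Placing ] first inside the class (right after ^) lets it be read as a literal character, so no escaping is needed.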
See a Python demo:
import re
rx = r"HONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)"
text = "HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\nVIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH\nJ. SHASTRI, HONOURABLE MS. ADITI GUPTA"
m = re.findall(rx, text)
print([x.replace('\n','') for x in m])
Output:
['THE CHIEF JUSTICE MR. JUSTICEVIKRAM NATH', 'MR. JUSTICE ASHUTOSHJ. SHASTRI', 'MS. ADITI GUPTA']
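In the output above, the words on either side of each line break are glued together (JUSTICEVIKRAM, ASHUTOSHJ.) because the newline is replaced with an empty string. If that is not wanted, one possible variant is to collapse every run of whitespace into a single space. This is a sketch and assumes the names never need to keep their internal line breaks.

```python
import re

rx = r"HONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)"
text = ("HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\n"
        "VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH\n"
        "J. SHASTRI, HONOURABLE MS. ADITI GUPTA")

# str.split() with no arguments splits on any whitespace run (including \n),
# so joining the pieces with ' ' normalizes line breaks and repeated spaces
names = [" ".join(x.split()) for x in re.findall(rx, text)]
print(names)
# ['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE ASHUTOSH J. SHASTRI', 'MS. ADITI GUPTA']
```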