Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Regex for extracting names starting with Mr.|Mrs|The|DR after honorable

I am trying to write a regex that extracts the names starting with MR, MS, THE or DR that follow the word HONOURABLE.

For example:

      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 3    VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH   107    4     10     6      127
      J. SHASTRI

So, the output should be

[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE J.B.PARDIWALA]
[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE VIPUL M. PANCHOLI]
and so on

but I'm getting

THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH 
MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA

I have tried \s*HONOURABLE\s+(?=THE|MR|MS|DR)([^/[] ]*)

HONOURABLE can be repeated any number of times.

Any help would be appreciated

Thanks in advance!

question from:https://stackoverflow.com/questions/66046399/regex-for-extracting-names-starting-with-mr-mrsthedr-after-honorable


1 Answer


Bounty answer

You can use

import re
text = """     HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 3    VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH   107    4     10     6      127
      J. SHASTRI"""
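# Strip leading/trailing digits, spaces and tabs from every line
text = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
#print(text)
# Capture everything after a line-initial HONOURABLE, up to the next line that starts with HONOURABLE
m = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE).*)*)', text, re.M)
for x in m:
    print(x.replace('\n', ' '))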

Output:

THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE J.B.PARDIWALA
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. PANCHOLI
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH J. SHASTRI

See the Python demo.

Details:

  • re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M) removes all spaces, tabs and digits from the start and end of each line in the text.

  • r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE).*)*)' is a regex that matches the following in the "trimmed" text:

  • ^ - start of a line (because of re.M)

  • HONOURABLE - the literal word HONOURABLE

  • \s+ - one or more whitespace characters

  • (.*(?:\n(?!HONOURABLE).*)*) - Capturing group 1:

    • .* - the rest of the line
    • (?:\n(?!HONOURABLE).*)* - zero or more following lines that do not start with HONOURABLE.
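
The question asks for each row's names as separate items, whereas the pattern above returns one string per row with ",HONOURABLE" still embedded. A minimal sketch of one way to finish the job, assuming every further name in a row is introduced by ",HONOURABLE" as in the sample. The helper name extract_names is made up for illustration and is not part of the answer:

```python
import re

def extract_names(text):
    """Return one list of names per row (hypothetical helper)."""
    # Step 1: drop leading/trailing digits, spaces and tabs on every line.
    trimmed = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
    # Step 2: grab each block that follows a line-initial HONOURABLE.
    blocks = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE).*)*)', trimmed, re.M)
    rows = []
    for block in blocks:
        # Collapse line breaks and runs of spaces into single spaces.
        flat = ' '.join(block.split())
        # Assumption: additional names are separated by ",HONOURABLE".
        rows.append([name.strip() for name in re.split(r'\s*,\s*HONOURABLE\s+', flat)])
    return rows

sample = """     HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI"""

for row in extract_names(sample):
    print(row)
```

Each printed row is a list of two names, which is the shape the question asks for. If a row can contain a comma that is not followed by HONOURABLE, the split pattern leaves that comma inside the name.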

Original answer

You can use

HONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)

See the regex demo. If you do not want line breaks in the resulting list items, you can replace them afterwards with .replace('\n', ' '). If you also want the matches to stop at [, ] or /, add those characters to the negated character class, i.e. change [^,] to [^][/,].

Details:

  • HONOURABLE - the literal text HONOURABLE
  • \s+ - one or more whitespace characters
  • ((?:THE|MR|MS|DR)[^,]*) - Capturing group 1: THE, MR, MS or DR followed by zero or more characters other than a comma.

See a Python demo:

import re
rx = r"HONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)"
# \n marks the line breaks inside the sample
text = "HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\nVIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH\nJ. SHASTRI, HONOURABLE MS. ADITI GUPTA"
m = re.findall(rx, text)
# Remove the line breaks from every match
print([x.replace('\n','') for x in m])

Output:

['THE CHIEF JUSTICE MR. JUSTICEVIKRAM NATH', 'MR. JUSTICE ASHUTOSHJ. SHASTRI', 'MS. ADITI GUPTA']
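
The remark about stopping matches at brackets can be tried out on a bracketed row. A small sketch, assuming input shaped like the expected output in the question. The class is written with escapes as [^\]\[/,], which should behave the same as [^][/,], and the variable names are illustrative only:

```python
import re

# Negated class extended with ], [ and / so a match cannot run past them.
rx = re.compile(r"HONOURABLE\s+((?:THE|MR|MS|DR)[^\]\[/,]*)")

# Assumed input: a bracketed row modelled on the expected output in the question.
row = "[HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\nVIKRAM NATH,HONOURABLE MR. JUSTICE J.B.PARDIWALA]"

# Collapse line breaks and repeated spaces inside every captured name.
names = [" ".join(name.split()) for name in rx.findall(row)]
print(names)
```

With the plain [^,]* class the second item would keep the closing ] at its end. Joining on whitespace also avoids names being glued together where a line break is removed.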
