Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - parsing a sentence - match inflections and skip punctuation

I'm trying to parse sentences in Python. For any sentence I get, I should take only the words that appear after the word 'say' or 'ask' (if neither word appears, I should take the whole sentence). I simply did it with a regular expression:

sen = re.search('(?s)(?<=say|Say).*$', current_game_row["sentence"], re.M | re.I)

(this is only for 'say', but adding 'ask' is not a problem...)

The problem is that if I get a sentence with punctuation such as a comma or colon (, :) right after the word 'say', the match includes it too. Someone suggested that I use nltk tokenization to handle this, but I'm new to Python and don't understand how to use it. I see that nltk provides RegexpParser, but I'm not sure how to use it. Please help me :-)

** I forgot to mention: I also want to recognize 'said', 'asked', etc., and I don't want to match words that merely contain 'say' or 'ask' (I'm not sure there are such words...). In addition, if there are multiple occurrences of 'say' or 'ask', I only want to match the first one in the sentence. **

question from:https://stackoverflow.com/questions/66060945/parsing-a-sentence-match-inflections-and-skip-punctuation


1 Answer


Everything after a Keyword

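We can deal with the unwanted punctuation by using \W+ to eat up all non-word characters that follow the keyword. Wrapping the keywords in \b word boundaries keeps words that merely contain a keyword (Hearsay, masked, flasks) from matching.

import re

sentence = "Hearsay? With masked flasks I said: abracadabra"

# Alternation of every inflection we want to recognize.
keys = '|'.join(['ask', 'asks', 'asked', 'say', 'says', 'said'])
# \b...\b matches whole words only; \W+ skips punctuation and spaces.
result = re.search(rf'\b({keys})\b\W+(.*)', sentence, re.S | re.I)

if result is None:
    print(sentence)
else:
    print(result.group(2))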

Output:

abracadabra 
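If you need this for many rows, the same idea can be wrapped in a small function. This is only a sketch: the names text_after_keyword, KEYWORDS and PATTERN are made up for illustration, and the keyword list is an assumption you would extend with whatever inflections you care about. Because re.search returns the leftmost match, only the first keyword in the sentence is used.

```python
import re

# Hypothetical names; the inflection list is an assumption, extend as needed.
KEYWORDS = ("ask", "asks", "asked", "say", "says", "said")
ALTERNATION = "|".join(KEYWORDS)

# \b on both sides matches whole words only; \W+ skips punctuation and spaces.
PATTERN = re.compile(rf"\b(?:{ALTERNATION})\b\W+(.*)", re.S | re.I)


def text_after_keyword(sentence):
    """Return the text after the first keyword, or the whole sentence if none is found."""
    match = PATTERN.search(sentence)
    return match.group(1) if match else sentence


print(text_after_keyword("Hearsay? With masked flasks I said: abracadabra"))
print(text_after_keyword("No keyword in this one."))
```

The first call prints abracadabra and the second prints the sentence unchanged, since no whole-word keyword is present.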

case-insensitive: You already pass the case-insensitive flag re.I, so we can drop the Say variant from the pattern.

multi-line: You have the re.M option, which directs ^ to match not only at the start of your string but also right after every \n within that string. We can drop this since we do not need to use ^.

dot-matches-all: You have (?s), which directs . to match everything including \n. This is the same as applying the re.S flag.

I'm not sure what the net effect of having both re.M and re.S is. I think your sentence might be a text blob with newlines inside, so I removed re.M and kept (?s) in the form of re.S.
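As far as I can tell from the re documentation, the two flags are independent: re.M only changes how ^ and $ behave, and re.S only changes what . matches. Here is a small check you can run yourself; the sample text and variable names are made up for the demonstration.

```python
import re

text = "intro line\nsay: first\nsecond"

# re.M affects only ^ and $: ^ may also match right after a newline.
anchored_plain = re.search(r"^say", text)        # None, "say" is not at the start of the string
anchored_multi = re.search(r"^say", text, re.M)  # matches at the start of the second line

# re.S affects only ".": it may also match a newline.
dot_plain = re.search(r"say\W+(.*)", text)       # capture stops at the end of the line
dot_all = re.search(r"say\W+(.*)", text, re.S)   # capture runs to the end of the string

# Combining the flags keeps both effects, neither cancels the other.
both = re.search(r"^say\W+(.*)", text, re.M | re.S)

print(repr(dot_plain.group(1)))  # 'first'
print(repr(dot_all.group(1)))    # 'first\nsecond'
print(repr(both.group(1)))       # 'first\nsecond'
```

So dropping re.M is safe here simply because the pattern contains no ^ or $.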

