Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
744 views
in Technique[技术] by (71.8m points)

regex - Fuzzy Regular Expressions

In my work I have with great results used approximate string matching algorithms such as Damerau–Levenshtein distance to make my code less vulnerable to spelling mistakes.

Now I have a need to match strings against simple regular expressions such TV Schedule for dd (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0 while T Schedule for 10. Jan should return 2.

This could be done by generating all strings in the regex (in this case 100x12) and find the best match, but that doesn't seam practical.

Do you have any ideas how to do this effectively?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I found the TRE library, which seems to be able to do exactly fuzzy matching of regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/ It only supports insertion, deletion and substitution though. No transposition. But I guess that works ok.

I tried the accompanying agrep tool with the regexp on the following file:

TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March

and got

$ agrep -s -E 100 '^TV Schedule for dd (Jan|Feb|Mar)$' filename
1:TV Schedule for 10Jan
8:TVSchedule for Jan 10
7:T Schedule for 10 Jan 2010
3:TV Schedule for 10 March
15:Tv plan for March

Thanks a lot for all your suggestions.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...