Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


ruby - Why won't a longer token in an alternation be matched?

I am using Ruby 2.1, but the same thing can be reproduced on the Rubular site.

If this is my string:

儘管中國婦幼衛生監測辦公室制定的

And I do a regex match with this expression:

(中國婦幼衛生監測辦公室制定|管中)

I am expecting to get the longer token as a match.

中國婦幼衛生監測辦公室制定

Instead I get the second alternation as a match.

As far as I know, it does work like that with non-Chinese characters.

If this is my string:

foobar

And I use this regex:

(foobar|foo)

The returned match is foobar. If the order is reversed, then the match is foo. That makes sense to me.



1 Answer


Your assumption that regex matches a longer alternation is incorrect.

If you have a bit of time, let's look at how your regex works...

Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.

There are two pointers, one on the Pattern:

(cdefghijkl|bcd)

The other on your String:

abcdefghijklmnopqrstuvw

The pointer on the String moves from left to right. As soon as the engine can return a match, it will:

[image; source: gyazo.com]

Let's turn that into a more "sequential" sequence for understanding:

[image; source: gyazo.com]
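The walk-through above can be reproduced in Ruby. This is only a minimal sketch: it reuses the sample string and pattern from this answer and the string from the question, and the variable names (`str`, `pattern`, `m`, `zh`) are made up for illustration. It suggests that the reported match is the one that starts earliest in the string, not the longest one.

```ruby
str = "abcdefghijklmnopqrstuvw"
pattern = /(cdefghijkl|bcd)/

m = pattern.match(str)
puts m[0]        # the alternative that actually matched
puts m.begin(0)  # index in str where that match starts

# Where each alternative would start if it were searched for on its own.
puts str.index("bcd")
puts str.index("cdefghijkl")

# Same effect with the string from the question: the shorter token
# starts one character earlier, so it is the one that gets reported.
zh = "儘管中國婦幼衛生監測辦公室制定的"
puts zh[/(中國婦幼衛生監測辦公室制定|管中)/]
```

The shorter alternative starts at an earlier index than the longer one in both strings, which is why the engine returns it before the longer token is ever attempted at its own starting position.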

Your foobar example is a different topic. As I mentioned in this post:

How regex works: The state machine always reads from left to right. The pattern ,|,, is equivalent to , because at any given position the first alternation always matches before the second can be tried.
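A small Ruby sketch of that point, assuming nothing beyond the patterns already quoted in this thread (the variable names are illustrative): when both alternatives can match at the same starting position, the one written first wins.

```ruby
# Both alternatives can match at index 0, so the order in the pattern decides.
long_first  = "foobar"[/(foobar|foo)/]
short_first = "foobar"[/(foo|foobar)/]

# The second alternative of ,|,, is never reached, because a single
# comma already satisfies the first alternative.
comma = ",,"[/,|,,/]

puts long_first
puts short_first
puts comma
```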

That's good, Unihedron, but how do I force it to the first alternation?

Look!*

^(?:.*?\Kcdefghijkl|.*?\Kbcd)

Here is a regex demo.

This regex first attempts to match the entire string with the first alternation. Only if it fails completely will it then attempt to match the second alternation. \K is used here to drop everything matched before it from the reported match, so only the text after \K is returned.


*: \K has been supported in Ruby since 2.0.0.
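As a rough illustration of how this could be used from Ruby 2.0 or later, here is a minimal sketch. The constant-like names `prefer_first` and `zh_pattern` are made up for this example, and the Chinese pattern is simply the same construction applied to the tokens from the question.

```ruby
# Try the long token anywhere in the string first; fall back to the
# short token only if the long one cannot be found at all.
prefer_first = /^(?:.*?\Kcdefghijkl|.*?\Kbcd)/

str = "abcdefghijklmnopqrstuvw"
puts str[prefer_first]     # long token is present, so it is preferred
puts "abcd"[prefer_first]  # long token is absent, so the fallback is used

# Same idea with the tokens from the question.
zh = "儘管中國婦幼衛生監測辦公室制定的"
zh_pattern = /^(?:.*?\K中國婦幼衛生監測辦公室制定|.*?\K管中)/
puts zh[zh_pattern]
```

Because of the `^` anchor and the lazy `.*?`, each alternative scans the whole line on its own before the next one is tried, which is what turns "first in the pattern" into "preferred".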

Ah, I was bored, so I optimized the regex:

^(?:(?:(?!cdefghijkl)c?[^c]*)++\Kcdefghijkl|(?:(?!bcd)b?[^b]*)++\Kbcd)

You can see a demo here.
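A quick way to sanity-check the optimized pattern in Ruby is to compare it with the simpler one on a few inputs. This is only a sketch: the names `simple`, `optimized`, and `samples` are illustrative, the sample strings other than the first are made up, and agreement on these inputs does not prove the two patterns are equivalent in general.

```ruby
simple    = /^(?:.*?\Kcdefghijkl|.*?\Kbcd)/
optimized = /^(?:(?:(?!cdefghijkl)c?[^c]*)++\Kcdefghijkl|(?:(?!bcd)b?[^b]*)++\Kbcd)/

samples = ["abcdefghijklmnopqrstuvw", "abcd", "xxbcdxx"]

# Collect [input, simple result, optimized result] for each sample.
results = samples.map { |s| [s, s[simple], s[optimized]] }

results.each do |input, a, b|
  puts "#{input}: simple=#{a.inspect} optimized=#{b.inspect}"
end
```

The possessive `++` stops the engine from backtracking into the skipped prefix, which is where the speed-up over the lazy `.*?` version is expected to come from.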

