Finding the last occurrence of a word

Published 2019-07-29 13:42

Question:

I have the following string:

<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind

I want to find the last "SEM" start tag (not the SEM end tag) before the "PARTITION" tag. The result should be:

<SEM>is<I>love</I>, <PARTITION />

I have tried this regular expression:

<SEM>[^<]*<PARTITION[ ]/>

but it only works if the final "SEM" and "PARTITION" tags do not have any other tag between them. Any ideas?

Answer 1:

And here's your goofy Regex!!!

(?=[\s\S]*?\<PARTITION)(?![\s\S]+?\<SEM\>)\<SEM\>

What that says is "While ahead somewhere is a PARTITION tag... but while ahead is NOT another SEM tag... match a SEM tag."

Enjoy!

Here's that regex broken down:

(?=[\s\S]*?\<PARTITION) means "While ahead somewhere is a PARTITION tag"
(?![\s\S]+?\<SEM\>) means "While no other SEM tag is anywhere ahead" (the +? needs at least one character, so the SEM tag at the current position does not count)
\<SEM\> means "Match a SEM tag"
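
As a rough check of the pattern above, here is a minimal C# sketch that runs it against input such as the sample string from the question. The class and method names (LookaheadDemo, FindLastSemIndex) are made up for illustration, and the angle brackets are written unescaped on the assumption that they need no escaping in a .NET pattern. Note that the negative lookahead scans the whole rest of the input, so the pattern appears to assume that no <SEM> occurs after the PARTITION tag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class LookaheadDemo
{
    // Same pattern as above, with < and > left unescaped (assumed equivalent).
    private const string Pattern =
        @"(?=[\s\S]*?<PARTITION)(?![\s\S]+?<SEM>)<SEM>";

    // Returns the index of the matched <SEM> start tag, or -1 if nothing matches.
    public static int FindLastSemIndex(string text)
    {
        Match match = Regex.Match(text, Pattern);
        return match.Success ? match.Index : -1;
    }
}
```

The regex matches only the <SEM> tag itself, so the returned index would still have to be combined with the position of the PARTITION tag to cut out the full span.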


Answer 2:

Use String.IndexOf to find PARTITION and String.LastIndexOf to find SEM?

int partitionIndex = text.IndexOf("<PARTITION");
// Search backward from the PARTITION tag for the nearest <SEM> start tag.
int semIndex = text.LastIndexOf("<SEM>", partitionIndex);
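
To show how those two calls could be turned into the substring the question asks for, here is a small sketch. The class and method names (IndexOfDemo, TryExtractLastSemSpan) are hypothetical, and the choices to use ordinal comparison and to treat the first ">" after "<PARTITION" as the end of that tag are my own assumptions rather than part of the answer.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original answer.
public static class IndexOfDemo
{
    // Tries to extract the text from the last <SEM> start tag before the
    // first PARTITION tag through the end of that PARTITION tag.
    public static bool TryExtractLastSemSpan(string text, out string span)
    {
        span = string.Empty;

        int partitionIndex = text.IndexOf("<PARTITION", StringComparison.Ordinal);
        if (partitionIndex < 0) return false; // no PARTITION tag at all

        // Search backward, starting at the PARTITION tag.
        int semIndex = text.LastIndexOf("<SEM>", partitionIndex, StringComparison.Ordinal);
        if (semIndex < 0) return false; // no <SEM> before PARTITION

        // Assumption: the PARTITION tag ends at the next '>' character.
        int partitionEnd = text.IndexOf('>', partitionIndex);
        if (partitionEnd < 0) return false;

        span = text.Substring(semIndex, partitionEnd - semIndex + 1);
        return true;
    }
}
```

Checking for -1 before calling LastIndexOf keeps a missing PARTITION tag from being passed on as a negative start index.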


Answer 3:

If you are going to use a regex to find the last occurrence of something then you might also want to use the right-to-left parsing regex option:

new Regex("...", RegexOptions.RightToLeft);
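
One way this option might be applied to the question is sketched below. The pattern "<SEM>", the class name RightToLeftDemo, and the idea of searching only the text before the PARTITION tag are my assumptions; the answer itself leaves the pattern elided.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class RightToLeftDemo
{
    // With RightToLeft, scanning starts at the end of the input,
    // so the first match reported is the rightmost one.
    private static readonly Regex SemStart =
        new Regex("<SEM>", RegexOptions.RightToLeft);

    // Returns the index of the last <SEM> start tag before the first
    // PARTITION tag, or -1 if either tag is missing.
    public static int FindLastSemBeforePartition(string text)
    {
        int partitionIndex = text.IndexOf("<PARTITION", StringComparison.Ordinal);
        if (partitionIndex < 0) return -1;

        // Only the text before PARTITION is searched. The prefix starts at
        // offset 0, so match.Index is also valid for the full string.
        Match match = SemStart.Match(text.Substring(0, partitionIndex));
        return match.Success ? match.Index : -1;
    }
}
```

For a fixed literal like this, the regex does little more than LastIndexOf would; the option is likely to pay off more when the tag pattern itself is variable.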


Answer 4:

Here is a solution; I tested it at http://regexlib.com/RETester.aspx:

<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/>

Because you want the last one, the way to identify it is to require that no </SEM> end tag appears anywhere after the <SEM> start tag.

I have included "\s*" in case there are spaces inside <SEM> or <PARTITION/>.

Basically, we rule out any later </SEM> with:

(?!.*</SEM>.*)
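
A short sketch of how this pattern behaves in C#, with hypothetical names (NoEndTagDemo, MatchSpan). As far as I can tell, it depends on the last <SEM> having no </SEM> anywhere after it, which holds for the sample string; and since "." does not match a newline by default in .NET, multi-line input would probably need RegexOptions.Singleline.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class NoEndTagDemo
{
    private const string Pattern =
        @"<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/>";

    // Returns the matched text, or an empty string when nothing matches.
    public static string MatchSpan(string text)
    {
        Match match = Regex.Match(text, Pattern);
        return match.Success ? match.Value : string.Empty;
    }
}
```

If the final <SEM> were closed after the PARTITION tag, every <SEM> would have an end tag ahead of it and the pattern would find nothing.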


Answer 5:

Bit quick-and-dirty, but try this:

(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)

and take a look at what's in the C#/.NET equivalent of $2 (the second capture group).

The secret lies in the lazy-matching construct (.*?) --- I assume/hope C# supports this.
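
.NET regexes do support lazy quantifiers, and the second capture group is available as match.Groups[2]. A minimal sketch, with hypothetical class and method names (LazyGroupDemo, SecondGroup):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class LazyGroupDemo
{
    // Group 1 consumes complete <SEM>...</SEM> pairs; group 2 captures the
    // remaining <SEM> up to and including "<PARTITION".
    private const string Pattern =
        @"(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)";

    // Returns the text of capture group 2, or an empty string on no match.
    public static string SecondGroup(string text)
    {
        Match match = Regex.Match(text, Pattern);
        return match.Success ? match.Groups[2].Value : string.Empty;
    }
}
```

The captured text stops at "<PARTITION", so the closing " />" of that tag is not included unless the pattern is extended.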

Clearly, Jon Skeet's solution will perform better, but you may want to use a regex (to simplify breaking up the bits that interest you, for example).

(Disclaimer: I'm a Perl/Python/Ruby person myself...)



Answer 6:

Have you tried this:

<SEM>.*<PARTITION\s*/>

Your regular expression was matching anything but "<" after the "SEM" tag. Therefore it would stop matching as soon as it hit any other tag, such as the nested "I" tag in your example.
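
A quick way to see what this pattern returns, with the tag spelled SEM as in the question and with hypothetical names (GreedyDemo, MatchSpan). Because the regex engine reports the leftmost match and ".*" is greedy, the match appears to begin at the first <SEM> in the input rather than the last, so on its own this pattern may return more than the question asks for.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class GreedyDemo
{
    private const string Pattern = @"<SEM>.*<PARTITION\s*/>";

    // Returns the matched text, or an empty string when nothing matches.
    public static string MatchSpan(string text)
    {
        Match match = Regex.Match(text, Pattern);
        return match.Success ? match.Value : string.Empty;
    }
}
```

Calling MatchSpan on the sample string from the question is a simple way to compare the result with the expected output.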