This is likely a very simple fix but I can't figure it out!
I'm trying to match (up to) 3 capitalized words in a row in the following text:
Russell Lake West
The match should include all 3 words.
This regex will match the first 2 words but not the third (demo here):
(([A-Z][a-z]+)\s{0,2}([A-Z][a-z]+)?\s{0,2}([A-Z][a-z]+)?)
This regex will match all 3 words, but I had to copy/paste the whitespace between Lake and West for it to work (demo here):
(([A-Z][a-z'-]+)\s{0,2}([A-Z][a-z'-]+)? \s{0,2}([A-Z][a-z'-]+)?)
                                       ^ pasted it here
So I assumed that maybe the whitespace isn't being treated as whitespace, but perhaps a newline character or similar, so I tried this (demo here):
[\r\n\t\f\s]West
But it doesn't recognize any of those characters before West, thus returning no results.
Why can't regex101 or Java recognize this apparent whitespace between Lake and West? What's a reliable way to handle this?
There are many kinds of space characters. The one in your demo is a non-breaking space (U+00A0, code point 160 in the Unicode table). By default it doesn't belong to \s (the whitespace character class), because it doesn't mark a place where text can be expected to split into separate parts such as lines.
BTW, \s already covers \r, \n, \t and \f, so listing them next to \s in a character class is redundant.
To match the non-breaking space you can use the \p{Zs} class (Unicode space separators). You can also combine both classes as [\p{Zs}\s], which is written "[\\p{Zs}\\s]" in a Java string literal.
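To see the difference in Java, here is a minimal sketch that runs the question's regex in two variants against the same input. The class name NbspMatchDemo, the helper firstMatch and the sample string are made up for illustration, and the sketch assumes the character between Lake and West really is U+00A0.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NbspMatchDemo {
    // The regex from the question: only \s is allowed between words.
    static final Pattern ASCII_ONLY = Pattern.compile(
            "([A-Z][a-z'-]+)\\s{0,2}([A-Z][a-z'-]+)?\\s{0,2}([A-Z][a-z'-]+)?");

    // Same regex, but the separator also accepts Unicode space separators
    // (\p{Zs}), a category that includes the non-breaking space U+00A0.
    static final Pattern WITH_ZS = Pattern.compile(
            "([A-Z][a-z'-]+)[\\p{Zs}\\s]{0,2}([A-Z][a-z'-]+)?[\\p{Zs}\\s]{0,2}([A-Z][a-z'-]+)?");

    // Hypothetical helper: returns the first match in the text, or null if none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumed input: a regular space after "Russell", U+00A0 after "Lake".
        String text = "Russell Lake\u00A0West";

        // prints: Russell Lake
        System.out.println(firstMatch(ASCII_ONLY, text));

        // prints all three words (the non-breaking space is kept in the match)
        System.out.println(firstMatch(WITH_ZS, text));
    }
}
```

If the matched text is used later, it may be worth replacing U+00A0 with a regular space first, for example with text.replace('\u00A0', ' '), so that downstream comparisons are not affected by the invisible difference.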