I have the following definition for an Identifier:
Identifier --> letter{ letter| digit}
Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.
I've tried this:
    if re.match('\w+(\w\d)?', i):
        return True
    else:
        return False
but when I run my program, every time it meets an integer it thinks that it's a valid identifier.
For example, given
    c = 0 ;
it prints c as a valid identifier, which is fine, but it also prints 0 as a valid identifier.
What am I doing wrong here?
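From the official reference:
    identifier ::= (letter|"_") (letter | digit | "_")*
So the regular expression is one letter or underscore, followed by any number of letters, digits, or underscores.
When compiling it, pass re.UNICODE (for Python 2, just omit the flag).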
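A minimal sketch of a pattern that follows this grammar, assuming the standard `re` module's `\w` and `\d` classes. The names `IDENTIFIER_RE` and `is_identifier` and the sample strings are illustrative, not taken from the reference.

```python
import re

# Assumed pattern: one word character that is not a digit, then any number
# of word characters. \Z anchors at the very end of the string, so a
# trailing newline is rejected.
IDENTIFIER_RE = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)


def is_identifier(text):
    """Return True if text matches the identifier grammar above."""
    return IDENTIFIER_RE.match(text) is not None


# Illustrative samples, including the "0" from the question.
for sample in ["a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n", "0"]:
    print(repr(sample), is_identifier(sample))
```

With this sketch, `a`, `a1`, `_a1`, and `aa_bb` are accepted and the remaining samples are rejected.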
Works like a charm:
    r'[^\d\W]\w*'
For Python 3, you need to handle Unicode letters and digits. If that's a concern, you should get along with a pattern whose first character is matched by [^\d\W].
[^\d\W] matches a character that is not a digit and not "not alphanumeric", which translates to "a character that is a letter or underscore".

str.isidentifier() works. The regex answers incorrectly fail to match some valid Python identifiers and incorrectly match some invalid ones. @martineau's comment gives the example of '℘᧚', where the regex solutions fail.

Why does this happen?
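As a starting point, the failing example can be reproduced directly. This is a small sketch that assumes the pattern under test is `^[^\d\W]\w*\Z`; the names `REGEX` and `name` are illustrative.

```python
import re

# Assumed form of the regex under test (not quoted from any one answer).
REGEX = re.compile(r"^[^\d\W]\w*\Z")

# '℘᧚' written with escapes: U+2118 SCRIPT CAPITAL P, U+19DA NEW TAI LUE THAM DIGIT ONE.
name = "\u2118\u19da"

print(name.isidentifier())            # True
print(REGEX.match(name) is not None)  # False
```

The first character, U+2118, is a math symbol rather than a letter, so `\w` does not match it even though Python accepts it at the start of an identifier.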
Let's define the set of code points that match the given regular expression, and the set that match str.isidentifier.
How many regex matches are not identifiers?
How many identifiers are not regex matches?
Interesting -- which ones?
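One way to build and compare the two sets, sketched under the assumption that the regex in question is `^[^\d\W]\w*\Z` and that only single code points are tested. All variable names are illustrative, and the counts depend on the Unicode version bundled with your Python, so none are quoted here.

```python
import re
import sys

REGEX = re.compile(r"^[^\d\W]\w*\Z")  # assumed form of the regex under test

# Every code point as a one-character string.
all_chars = [chr(cp) for cp in range(sys.maxunicode + 1)]

regex_matches = {ch for ch in all_chars if REGEX.match(ch)}
identifiers = {ch for ch in all_chars if ch.isidentifier()}

regex_only = regex_matches - identifiers        # regex says yes, Python says no
identifier_only = identifiers - regex_matches   # Python says yes, regex says no

print(len(regex_only), len(identifier_only))
print(sorted(identifier_only))
```

The loop visits a little over a million code points, so expect it to take a second or two.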
What's different about these two sets?
They have different Unicode "General Category" values.
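The categories can be tallied with `unicodedata.category`. The sketch below again assumes the regex `^[^\d\W]\w*\Z` and single code points; the counter names are illustrative, and the exact tallies vary with the Unicode version of your Python build.

```python
import re
import sys
import unicodedata
from collections import Counter

REGEX = re.compile(r"^[^\d\W]\w*\Z")  # assumed form of the regex under test

regex_only_categories = Counter()       # matched by the regex, not identifiers
identifier_only_categories = Counter()  # identifiers, not matched by the regex

for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    matches = REGEX.match(ch) is not None
    is_ident = ch.isidentifier()
    if matches and not is_ident:
        regex_only_categories[unicodedata.category(ch)] += 1
    elif is_ident and not matches:
        identifier_only_categories[unicodedata.category(ch)] += 1

# Keys are two-letter General Category codes such as 'No' or 'Sm'.
print(dict(regex_only_categories))
print(dict(identifier_only_categories))
```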
From Wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the re docs, since \d matches only decimal digits.
What about the other way?
That's Mark, nonspacing; Symbol, math; Symbol, other.
Where is this all documented?
Where is it implemented?
https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255
I still want a regular expression
Look at the regex module on PyPI.
It includes filters for "General Category".
\w matches digits as well as letters and the underscore, which is why 0 passes your test. Try
    ^[_a-zA-Z]\w*$
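A short sketch of the difference, assuming ASCII-only identifiers are enough for the file being parsed. `ORIGINAL`, `ASCII_IDENTIFIER`, and `is_identifier` are illustrative names, and the ASCII character classes are a stricter stand-in for `\w`.

```python
import re

ORIGINAL = re.compile(r"\w+(\w\d)?")  # pattern from the question
ASCII_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text):
    """Return True if the whole string is an ASCII identifier."""
    # fullmatch requires the pattern to cover the entire string.
    return ASCII_IDENTIFIER.fullmatch(text) is not None


# \w+ accepts a lone digit, which is the bug described in the question.
print(ORIGINAL.match("0") is not None)  # True

# Tokens from the question's example line "c = 0 ;".
for token in ["c", "=", "0", ";"]:
    print(repr(token), is_identifier(token))
```

`fullmatch` is used here instead of `^...$` because `$` also matches just before a trailing newline, so `re.match(r'^[_a-zA-Z]\w*$', 'abc\n')` succeeds. If underscores should be rejected, as in the grammar at the top of the question, drop `_` from both character classes.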