I'm making a regex so I can find youtube links (can be multiple) in a piece of HTML text posted by an user.
Currently I'm using the following regex to change 'http://www.youtube.com/watch?v=-JyZLS2IhkQ' into displaying the corresponding youtube video:
return re.compile('(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)').sub(tag, value)
(where the variable 'tag' is a bit of html so the video works and 'value' a user post)
Now this works.. until the url is like this:
'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature...'
Now I'm hoping you guys could help me figure how to also match the '&feature...' part so it disappears.
Example HTML:
No replies to this post..
Youtube vid:
http://www.youtube.com/watch?v=-JyZLS2IhkQ
More blabla
Thanks for your thoughts, much appreciated
Stefan
Here how I'm solving it:
TESTS:
What if you used the urlparse module to pick apart the youtube address you find and put it back into the format you want? You could then simplify your regex so that it only finds the entire url and then use urlparse to do the heavy lifting of picking it apart for you.
It's a bit more code, but it allows you to avoid regex madness.
And as hop said, you should use raw strings for the regex.
Here's how I implemented it in my script:
That outputs:
If you just wanted to grab the video id, for example, you would do:
You should specify your regular expressions as raw strings.
You don't have to escape every character that looks special, just the ones which are.
Instead of specifying an empty branch (
(foo|)
) to make something optional, you can use?
.If you want to include
-
in a character set, you have to escape it or put it at right after the opening bracket.You can use special character sets like
\w
(equals[a-zA-Z0-9_]
) to shorten your regex.Now, in order to match the whole URL, you have to think about what can or cannot follow it in the input. Then you put that into a lookahead group (you don't want to consume it).
In this example I took everything except
-
,=
,%
,&
and alphanumerical characters to end the URL (too lazy to think about it any harder).Everything between the v-argument and the end of the URL is non-greedily consumed by
.*?
.Still, I would not put too much faith into this general solution. User input is notoriously hard to parse robustly.