Python regex convert youtube url to youtube video

Posted 2019-04-08 01:22

I'm making a regex so I can find YouTube links (there can be several) in a piece of HTML text posted by a user.

Currently I'm using the following regex to change 'http://www.youtube.com/watch?v=-JyZLS2IhkQ' into displaying the corresponding youtube video:

return re.compile('(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)').sub(tag, value)

(where the variable 'tag' is a bit of html so the video works and 'value' a user post)

This works, until the URL looks like this:

'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature...'

Now I'm hoping you guys could help me figure out how to also match the '&feature...' part so it disappears.

Example HTML:

No replies to this post..

Youtube vid:

http://www.youtube.com/watch?v=-JyZLS2IhkQ

More blabla

Thanks for your thoughts, much appreciated

Stefan

4 answers
chillily
#2 · 2019-04-08 02:06

Here's how I'm solving it:

import re

def youtube_url_validation(url):
    # Groups 1-5 capture the scheme, "www.", host name, TLD and path prefix;
    # group 6 is the 11-character video ID.
    youtube_regex = (
        r'(https?://)?(www\.)?'
        r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
        r'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')

    # Return the match object (or None) so callers can inspect any group.
    return re.match(youtube_regex, url)

TESTS:

youtube_urls_test = [
    'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
    'http://youtu.be/5Y6HSHwhVlY', 
    'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
    'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
    'http://www.youtube.com/',
    'http://www.youtube.com/?feature=ytca']


for url in youtube_urls_test:
    m = youtube_url_validation(url)
    if m:
        print('OK {}'.format(url))
        print(m.groups())
        print(m.group(6))  # the video ID
    else:
        print('FAIL {}'.format(url))
ゆ 、 Hurt°
#3 · 2019-04-08 02:06

What if you used the urlparse module (urllib.parse in Python 3) to pick apart the YouTube address you find and put it back together in the format you want? You could then simplify your regex so that it only finds the entire URL, and let urlparse do the heavy lifting of picking it apart for you.

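# Python 3: these functions live in urllib.parse (urlparse and urllib in Python 2).
from urllib.parse import urlparse, parse_qs, urlunparse, urlencode

youtube_url = urlparse('http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index')
params = parse_qs(youtube_url.query)
# Keep only the video ID; parse_qs maps each key to a list of values.
new_params = {'v': params['v'][0]}

# Rebuild the URL; no backslashes are needed inside parentheses.
cleaned_youtube_url = urlunparse((youtube_url.scheme,
                                  youtube_url.netloc,
                                  youtube_url.path,
                                  youtube_url.params,
                                  urlencode(new_params),
                                  youtube_url.fragment))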

It's a bit more code, but it allows you to avoid regex madness.

And as hop said, you should use raw strings for the regex.
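
To make the "simple regex plus urlparse" idea concrete, here is a minimal sketch for Python 3. The names URL_RE, ID_RE, EMBED and embed_youtube are made up for illustration, the iframe markup is only a stand-in for the asker's 'tag', and the accepted hosts are an assumption based on the question's regex.

```python
import re
from urllib.parse import urlparse, parse_qs

# Deliberately loose: find anything URL-shaped and let urlparse inspect it.
URL_RE = re.compile(r'https?://[^\s<>"\']+')
# parse_qs percent-decodes values, so check the ID before putting it in HTML.
ID_RE = re.compile(r'[A-Za-z0-9_-]+')
# Hypothetical embed markup; substitute whatever HTML your site uses.
EMBED = '<iframe src="https://www.youtube.com/embed/{}"></iframe>'

def embed_youtube(match):
    """Return embed HTML for a YouTube watch URL; leave other URLs untouched."""
    url = match.group(0)
    try:
        parts = urlparse(url)
    except ValueError:  # e.g. a malformed IPv6 host
        return url
    host = parts.netloc.lower()
    if host.startswith('www.'):
        host = host[len('www.'):]
    video_ids = parse_qs(parts.query).get('v', [])
    if (host in ('youtube.com', 'youtube.nl') and parts.path == '/watch'
            and video_ids and ID_RE.fullmatch(video_ids[0])):
        return EMBED.format(video_ids[0])
    return url

post = 'Youtube vid: http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related More blabla'
print(URL_RE.sub(embed_youtube, post))
# Youtube vid: <iframe src="https://www.youtube.com/embed/-JyZLS2IhkQ"></iframe> More blabla
```

Because the callback only looks at the parsed v parameter, extra arguments such as &feature=... disappear without the regex having to know about them.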

再贱就再见
#4 · 2019-04-08 02:08

Here's how I implemented it in my script:

import re

string = "Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg"

youtube = re.findall(r'(https?://)?(www\.)?((youtube\.(com))/watch\?v=([-\w]+)|youtu\.be/([-\w]+))', string)

if youtube:
    print(youtube)

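That outputs a list with one tuple per match; each tuple holds every capture group, and groups that did not take part in the match are empty strings:

[('https://', 'www.', 'youtube.com/watch?v=bS5P_LAqiVg', 'youtube.com', 'com', 'bS5P_LAqiVg', '')]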

If you just wanted to grab the video id, for example, you would do:

groups = [c for c in youtube[0] if c]  # Drop the empty strings left by unmatched groups
video_id = groups[-1]  # The video ID is the last non-empty group
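
As an alternative to filtering the tuple, the ID can be captured on its own. The following is a sketch rather than this answer's code: it assumes the same two URL shapes, uses non-capturing groups (?:...) for everything else, and names the ID group; YOUTUBE_ID_RE and the second sample URL are made up for illustration.

```python
import re

# Same two URL shapes as in the answer, but only the ID is captured.
YOUTUBE_ID_RE = re.compile(
    r'(?:https?://)?(?:www\.)?'
    r'(?:youtube\.com/watch\?v=|youtu\.be/)'
    r'(?P<id>[-\w]+)'
)

text = ("Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg "
        "and this one: youtu.be/5Y6HSHwhVlY")

# finditer yields one match object per URL found in the text.
video_ids = [m.group('id') for m in YOUTUBE_ID_RE.finditer(text)]
print(video_ids)  # ['bS5P_LAqiVg', '5Y6HSHwhVlY']
```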
劳资没心,怎么记你
#5 · 2019-04-08 02:12

You should specify your regular expressions as raw strings.

You don't have to escape every character that looks special, just the ones which are.

Instead of specifying an empty branch ((foo|)) to make something optional, you can use ?.

If you want to include - in a character set, you have to escape it or put it right after the opening bracket.

You can use special character sets like \w (equals [a-zA-Z0-9_] for ASCII text; in Python 3 it also matches non-ASCII letters and digits) to shorten your regex.

r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)'
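
A small sketch of two of these points in Python 3: what a raw string changes, and what the tidied-up regex matches on the problem URL. The sample text in value is made up for illustration.

```python
import re

# In a normal string "\b" is a backspace character; in a raw string it stays
# a backslash plus "b", which the regex engine reads as a word boundary.
plain, raw = '\b', r'\b'

pattern = re.compile(r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)')

# Sample post text (made up for illustration).
value = 'Look: http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related'
m = pattern.search(value)
print(len(plain), len(raw))  # 1 2
print(m.group(4))            # -JyZLS2IhkQ
print(m.group(0))            # http://www.youtube.com/watch?v=-JyZLS2IhkQ
```

The match stops in front of '&feature=related', so a substitution with this pattern alone would leave that tail behind in the text.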

Now, in order to match the whole URL, you have to think about what can or cannot follow it in the input. Then you put that into a lookahead group (you don't want to consume it).

In this example, any character other than -, =, %, & or a word character (\w) ends the URL (too lazy to think about it any harder).

Everything between the v-argument and the end of the URL is non-greedily consumed by .*?.

r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%])'

Still, I would not put too much faith into this general solution. User input is notoriously hard to parse robustly.
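
Here is a sketch of how the lookahead version could be used for the substitution, in Python 3. One caveat: the lookahead as written needs at least one character after the URL, so a URL at the very end of the input does not match. The "|$" below is an assumed tweak to cover that case, and the iframe markup is a hypothetical stand-in for the asker's 'tag'.

```python
import re

# The regex from this answer, unchanged.
original = re.compile(
    r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%])'
)
# Assumed tweak: "|$" lets the lookahead also succeed at the end of the input.
pattern = re.compile(
    r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%]|$)'
)

# Hypothetical embed markup; \4 inserts the captured video ID.
tag = r'<iframe src="https://www.youtube.com/embed/\4"></iframe>'

value = 'Youtube vid: http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related More blabla'
print(pattern.sub(tag, value))
# Youtube vid: <iframe src="https://www.youtube.com/embed/-JyZLS2IhkQ"></iframe> More blabla

bare = 'http://www.youtube.com/watch?v=-JyZLS2IhkQ'
print(original.search(bare))   # None
print(pattern.sub(tag, bare))  # <iframe src="https://www.youtube.com/embed/-JyZLS2IhkQ"></iframe>
```

Note that a parameter value containing a character outside the listed set (a dot, for example) still ends the match early, which is the kind of fragility the answer warns about.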
