Python regex convert youtube url to youtube video

I'm making a regex so I can find youtube links (there can be several) in a piece of HTML text posted by a user.

Currently I'm using the following regex to change 'http://www.youtube.com/watch?v=-JyZLS2IhkQ' into displaying the corresponding youtube video:

return re.compile('(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)').sub(tag, value)

(where the variable 'tag' is a bit of html so the video works and 'value' a user post)

This works, until the URL looks like this:

'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature...'

Now I'm hoping you guys could help me figure out how to also match the '&feature...' part so it disappears.

Example HTML:

No replies to this post..

Youtube vid:

http://www.youtube.com/watch?v=-JyZLS2IhkQ

More blabla

Thanks for your thoughts, much appreciated

Stefan

Tags: python regex url youtube

4 answers

▲ chillily

#2 · 2019-04-08 02:06

Here's how I'm solving it:

import re

def youtube_url_validation(url):
    # Groups: 1 scheme, 2 www, 3 host name, 4 TLD, 5 path prefix, 6 video id
    youtube_regex = (
        r'(https?://)?(www\.)?'
        r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
        r'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')

    # Return the match object (or None) so callers can read any group;
    # the video id is group 6.
    return re.match(youtube_regex, url)

TESTS:

youtube_urls_test = [
    'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
    'http://youtu.be/5Y6HSHwhVlY', 
    'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
    'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&amp;hl=en_US',
    'http://www.youtube.com/',
    'http://www.youtube.com/?feature=ytca']


for url in youtube_urls_test:
    m = youtube_url_validation(url)
    if m:
        print('OK {}'.format(url))
        print(m.groups())
        print(m.group(6))  # the video id
    else:
        print('FAIL {}'.format(url))


ゆ、 Hurt°

#3 · 2019-04-08 02:06

What if you used the urlparse module to pick apart the youtube address you find and put it back into the format you want? You could then simplify your regex so that it only finds the entire url and then use urlparse to do the heavy lifting of picking it apart for you.

# Python 3: these helpers live in urllib.parse
# (Python 2 had them in the urlparse and urllib modules).
from urllib.parse import urlparse, parse_qs, urlunparse, urlencode

youtube_url = urlparse('http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index')
params = parse_qs(youtube_url.query)
new_params = {'v': params['v'][0]}  # keep only the video id

cleaned_youtube_url = urlunparse((youtube_url.scheme,
                                  youtube_url.netloc,
                                  youtube_url.path,
                                  '',
                                  urlencode(new_params),
                                  youtube_url.fragment))

It's a bit more code, but it allows you to avoid regex madness.

And as hop said, you should use raw strings for the regex.
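
The two-step idea above (a loose regex to find the link, then URL parsing to rebuild it) could be sketched roughly like this. It assumes Python 3, where these helpers live in urllib.parse. The names FIND_URL, _clean and clean_youtube_urls are made up for the illustration, and the pattern is deliberately simple rather than complete.

```python
import re
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# Loose pattern: a YouTube watch URL up to the next whitespace,
# double quote or angle bracket. No attempt to validate the query here.
FIND_URL = re.compile(r'https?://(?:www\.)?youtube\.com/watch\?[^\s<>"]+')

def _clean(match):
    """Rebuild one matched URL so that only the v parameter survives."""
    parts = urlparse(match.group(0))
    params = parse_qs(parts.query)
    if 'v' not in params:
        return match.group(0)  # no video id: leave the link untouched
    query = urlencode({'v': params['v'][0]})
    return urlunparse((parts.scheme, parts.netloc, parts.path,
                       '', query, parts.fragment))

def clean_youtube_urls(text):
    """Replace every watch URL found in text with its cleaned form."""
    return FIND_URL.sub(_clean, text)
```

Calling clean_youtube_urls on a post then strips '&feature=...' and any other extra parameters from every link the pattern finds, and the regex itself never has to know about them.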


再贱就再见

#4 · 2019-04-08 02:08

Here's how I implemented it in my script:

import re

string = "Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg"

# findall returns one tuple of groups per match
youtube = re.findall(r'(https?://)?(www\.)?((youtube\.(com))/watch\?v=([-\w]+)|youtu\.be/([-\w]+))', string)

if youtube:
    print(youtube)

That outputs:

[('https://', 'www.', 'youtube.com/watch?v=bS5P_LAqiVg', 'youtube.com', 'com', 'bS5P_LAqiVg', '')]

If you just wanted to grab the video id, for example, you would do:

video_id = [c for c in youtube[0] if c]  # Drop the empty strings left by groups that did not match
video_id = video_id[-1]  # The last non-empty group is the video id
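
If only the id is needed, one alternative is to make the outer groups non-capturing, so the pattern has a single capturing group and findall returns the ids directly. This is a minimal sketch of that idea; VIDEO_ID_RE and find_video_ids are illustrative names, not part of the script above.

```python
import re

# (?:...) groups do not capture, so the only captured group is the id.
VIDEO_ID_RE = re.compile(
    r'(?:https?://)?(?:www\.)?'
    r'(?:youtube\.com/watch\?v=|youtu\.be/)'
    r'([-\w]+)')

def find_video_ids(text):
    """Return the ids of all YouTube links found in text, in order."""
    # With exactly one capturing group, findall returns a list of strings.
    return VIDEO_ID_RE.findall(text)

string = "Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg"
print(find_video_ids(string))
```

This avoids filtering out empty strings afterwards, at the cost of no longer having the scheme and host available as groups.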


劳资没心，怎么记你

#5 · 2019-04-08 02:12

You should specify your regular expressions as raw strings.

You don't have to escape every character that looks special, just the ones which are.

Instead of specifying an empty branch ((foo|)) to make something optional, you can use ?.

If you want to include - in a character set, you have to escape it or put it right after the opening bracket (or right before the closing one).

You can use special character sets like \w (equals [a-zA-Z0-9_]) to shorten your regex.

r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)'
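
To check that these simplifications don't change what gets captured, the pattern from the question and the tidied one can be run on the same input. This is only a small sketch with made-up variable names. Note that the id ends up in group 5 of the original pattern but in group 4 of the tidied one, because the nested (s|) group is gone.

```python
import re

# Pattern from the question, written as a raw string so Python itself
# does not complain about the backslashes.
original = re.compile(
    r'(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)')

# Tidied pattern: ? instead of empty branches, \w for the id, escaped dots.
tidy = re.compile(r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)')

url = 'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related'
id_original = original.search(url).group(5)
id_tidy = tidy.search(url).group(4)
print(id_original, id_tidy)

# The unescaped dot in the original matches any character, so it also
# accepts hosts that are not youtube.com; the tidied pattern does not.
loose = original.search('youtubeXcom/watch?v=abc')
strict = tidy.search('youtubeXcom/watch?v=abc')
print(loose is not None, strict is not None)
```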

Now, in order to match the whole URL, you have to think about what can or cannot follow it in the input. Then you put that into a lookahead group (you don't want to consume it).

In this example I took everything except -, =, %, & and alphanumerical characters to end the URL (too lazy to think about it any harder).

Everything between the v-argument and the end of the URL is non-greedily consumed by .*?.

r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%])'

Still, I would not put too much faith into this general solution. User input is notoriously hard to parse robustly.
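
For the question's actual use case, the lookahead pattern could be plugged into sub roughly as follows. The EMBED markup and the function name embed_videos are assumptions for illustration, since the original 'tag' HTML isn't shown. One caveat that follows from the lookahead: it needs at least one character after the URL, so a link at the very end of the input is not matched by this pattern as written.

```python
import re

YOUTUBE_RE = re.compile(
    r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)'
    r'(&.*?)?(?=[^-\w&=%])')

# Hypothetical replacement markup; \g<4> is the captured video id.
EMBED = r'<iframe src="https://www.youtube.com/embed/\g<4>"></iframe>'

def embed_videos(text):
    """Replace each matched YouTube URL, extra parameters included."""
    return YOUTUBE_RE.sub(EMBED, text)

post = ('Youtube vid: http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related\n'
        'More blabla')
print(embed_videos(post))
```

Because the '&feature=related' part is inside the match, it disappears together with the rest of the URL when the replacement is made.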

