Get only the first link of a URLs list with Beauti

Posted 2019-02-25 09:00

I parsed an entire HTML file and extracted some URLs with the BeautifulSoup module in Python, using this piece of code:

for link in soup.find_all('a'):
    for line in link:
        if "condition" in line:
            print(link.get("href"))

This prints to the shell a series of links that satisfy the condition in the if statement:

http:// ..link1
http:// ..link2
.
.
http:// ..linkn

How can I store only the first link of this list in a variable "output"?

EDIT:

The web page is http://download.cyanogenmod.com/?device=p970, and the script has to return the first short URL (http://get.cm/...) in the HTML page.

Tags: python parsing url beautifulsoup

2 answers

Melony?

Reply #2 · 2019-02-25 10:00

You can do this more easily and clearly in BeautifulSoup without loops.

Assuming your parsed BeautifulSoup object is named soup:

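# tag.text is the tag's text content, so this is a substring test like the one in the question's loop;
# indexing the tag with ['href'] works in both BeautifulSoup 3 and bs4
output = soup.find(lambda tag: tag.name == 'a' and "condition" in tag.text)['href']
print(output)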

Note that the find method returns only the first result, while find_all returns all of them.
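
A minimal sketch of that difference, assuming bs4 (BeautifulSoup 4) on Python 3. The HTML snippet, its link texts and the example.com addresses are invented for illustration, and matches is a hypothetical helper name. It also shows that find returns None when nothing matches, so it is worth checking the result before indexing it:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the link texts and hrefs are invented.
html = """
<a href="http://example.com/other">other</a>
<a href="http://example.com/first">condition one</a>
<a href="http://example.com/second">condition two</a>
"""
soup = BeautifulSoup(html, "html.parser")


def matches(tag):
    # True for <a> tags whose text contains the substring "condition".
    return tag.name == "a" and "condition" in tag.get_text()


first = soup.find(matches)      # first matching tag, or None
every = soup.find_all(matches)  # list of all matching tags

# Guard against None before reading the attribute.
output = first["href"] if first is not None else None
all_hrefs = [tag["href"] for tag in every]

# No <a> has this href, so find() returns None instead of raising.
missing = soup.find("a", href="http://example.com/nope")
```

Here output holds only the first matching link, while all_hrefs holds every matching link in document order.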


我欲成王，谁敢阻挡

Reply #3 · 2019-02-25 10:02

You can do it with a one-liner:

import re

soup.find('a', href=re.compile('^http://get.cm/get'))['href']

To assign it to a variable:

variable=soup.find('a', href=re.compile('^http://get.cm/get'))['href']

I don't know exactly what you are doing, so here is the full code from scratch. Note: if you use bs4, change the imports.

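import urllib2  # Python 2 only
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4: from bs4 import BeautifulSoup
import re

# Download the page and parse it
request = urllib2.Request("http://download.cyanogenmod.com/?device=p970")
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)
# find() returns the first <a> whose href starts with the short-URL prefix
variable = soup.find('a', href=re.compile('^http://get.cm/get'))['href']
print(variable)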

>>> 
http://get.cm/get/4jj
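
For readers on Python 3, where urllib2 no longer exists, a rough equivalent might look like the sketch below. It assumes bs4 is installed. first_short_url and fetch_first_short_url are hypothetical helper names, and the sample markup (including its get.cm links) is invented so the parsing step can be tried without the network. The dots in the pattern are escaped here so they match literally:

```python
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Same prefix as in the answer above, with the dots escaped.
SHORT_URL = re.compile(r"^http://get\.cm/get")


def first_short_url(html):
    """Return the href of the first <a> whose href matches SHORT_URL, or None."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", href=SHORT_URL)
    return link["href"] if link is not None else None


def fetch_first_short_url(page_url):
    """Download page_url and apply first_short_url to its HTML."""
    with urlopen(page_url) as response:
        return first_short_url(response.read())


# Invented sample markup standing in for the download page.
sample = """
<a href="http://example.com/about">about</a>
<a href="http://get.cm/get/aaa">build 1</a>
<a href="http://get.cm/get/bbb">build 2</a>
"""
variable = first_short_url(sample)
print(variable)
```

For the page in the question, the call would be fetch_first_short_url("http://download.cyanogenmod.com/?device=p970"), provided the page is still reachable. Returning None when no link matches avoids the TypeError that indexing the result of a failed find would raise.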
