python extracting HTML tag attributes without regu

Posted 2020-07-30 02:27

Is there any way to extract HTML tag attributes using urllib, urllib2 or BeautifulSoup?

For example:

<a href="xyz" title="xyz">xyz</a>

should give href=xyz, title=xyz

Another thread discusses doing this with regular expressions, but I would like to avoid them.

Thanks

Tags: python html-parsing beautifulsoup

2 answers

▲ chillily

Reply #2 · 2020-07-30 03:12

You could use BeautifulSoup to parse the HTML, and for each <a> tag, use tag.attrs to read the attributes:

In [111]: soup = BeautifulSoup.BeautifulSoup('<a href="xyz" title="xyz">xyz</a>')

In [112]: [tag.attrs for tag in soup.findAll('a')]
Out[112]: [[(u'href', u'xyz'), (u'title', u'xyz')]]
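The session above uses the old BeautifulSoup 3 API. As a rough sketch of the same idea with the newer bs4 package (assuming `beautifulsoup4` is installed; the variable names here are only illustrative), `tag.attrs` comes back as a dict rather than a list of tuples:

```python
from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is installed

html = '<a href="xyz" title="xyz">xyz</a>'

# "html.parser" is the parser bundled with Python, so no extra parser library is needed.
soup = BeautifulSoup(html, "html.parser")

# In bs4 each tag exposes its attributes as a dict via tag.attrs.
attrs_per_link = [tag.attrs for tag in soup.find_all("a")]
print(attrs_per_link)
```

A single attribute can then be read with `tag.get("href")`, which returns None when the attribute is missing.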


倾城　Initia

Reply #3 · 2020-07-30 03:21

Why don't you try the HTMLParser module?

Something like this:

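# Python 3: the HTMLParser module is now html.parser, and urlopen lives in urllib.request
from html.parser import HTMLParser
from urllib.request import urlopen

class parseTitle(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value) # or the code you need.
                if name == 'title':
                    print(value) # or the code you need.


aparser = parseTitle()
u = urlopen('http://stackoverflow.com') # change the address as you like
# feed() expects text, so decode the response bytes first
aparser.feed(u.read().decode('utf-8', errors='replace'))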
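To get the attributes back as data instead of printing them, and to try it without a network request, the same handler idea could be sketched roughly like this. The class name `AnchorAttrCollector` and the `links` attribute are made up for illustration; only the standard library is used, and the input is the example tag from the question:

```python
from html.parser import HTMLParser


class AnchorAttrCollector(HTMLParser):
    """Hypothetical helper: stores the attributes of every <a> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name in lowercase and attrs as (name, value) pairs.
        if tag == "a":
            self.links.append(dict(attrs))


collector = AnchorAttrCollector()
collector.feed('<a href="xyz" title="xyz">xyz</a>')
collector.close()
print(collector.links)
```

Each entry in `collector.links` is a plain dict, so `collector.links[0]["href"]` gives the href of the first link.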
