How to get value of specified tag attribute from X

I have a script that parses some xml. XML contains:

<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>

How do I get the value of the 'TEXT' attribute of the POPULARITY tag (in my case 1417678)? I'm using regular expressions in Python. The regex call:

my_value = re.findall("POPULARITY[^\d]*(\d+)", xml)

It gives me '9511', but I need '1417678'.

Tags: python regex python-2.7 xml-parsing

2 Answers

We Are One

Post #2 · 2020-05-10 09:16

You can use BeautifulSoup

import BeautifulSoup  # BeautifulSoup 3 module (Python 2 only)

xml = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''

soup = BeautifulSoup.BeautifulSoup(xml)

print(soup.find('popularity')['text'])

Output

1417678

倾城　Initia

Post #3 · 2020-05-10 09:17

Your pattern matches the first sequence of decimal digits that occurs after the element's name. The first sequence of digits '(\d+)' after an arbitrary number of non-digits '[^\d]*' is 9511, which comes from the URL attribute rather than from TEXT.
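To make this visible, here is a small sketch that runs the question's pattern against just the POPULARITY element. The extra capturing group around '[^\d]*' and the variable names are added here for illustration only; they are not part of the original pattern.

```python
import re

# Only the POPULARITY element from the question, for brevity.
tag = '<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>'

# Same pattern as in the question, with [^\d]* wrapped in a group
# so we can inspect what it consumed before the digits.
match = re.search(r"POPULARITY([^\d]*)(\d+)", tag)
skipped, digits = match.groups()

print(repr(skipped))  # the non-digit run consumed by [^\d]*
print(digits)         # the first digit run that follows it
```

The skipped part ends right at the opening quote of the URL value, so the digits captured are the ones in "9511.com/".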

To find all values of @TEXT attributes, something like this would work:

my_values = re.findall("<POPULARITY(?:\D+=\"\S*\")*\s+TEXT=\"(\d*)\"", xml)  # note: findall returns a list

Or, if no attribute other than @TEXT has a digits-only value:

my_values = re.findall("<POPULARITY\s+(?:\S+\s+)*\w+=\"(\d+)\"", xml)

Here (?:...) matches the enclosed expression but, unlike (...), does not create a group you can refer to. The special sequences \S and \D are the inverses of their lowercase counterparts: they match anything except whitespace and anything except a digit, respectively.
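A minimal sketch of the difference between the two kinds of group, using the first pattern on the POPULARITY element alone. The variable names are illustrative, and raw strings are used so the backslashes and quotes need no extra escaping.

```python
import re

tag = '<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>'

# With (?:...) the only capturing group is (\d*), so findall
# returns a flat list of strings.
non_capturing = re.findall(r'<POPULARITY(?:\D+="\S*")*\s+TEXT="(\d*)"', tag)

# With (...) there are two capturing groups, so findall returns
# a list of tuples, one entry per group.
capturing = re.findall(r'<POPULARITY(\D+="\S*")*\s+TEXT="(\d*)"', tag)

print(non_capturing)
print(capturing[0][1])  # the TEXT value is the second item of the tuple
```

With the non-capturing form the result can be used directly; with the capturing form you would have to pick the right tuple element each time.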

However, as already mentioned, regular expressions are not meant to be used on XML, because XML is not a regular language.
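As one possible parser-based alternative, here is a sketch using the standard library's xml.etree.ElementTree. It assumes the input looks like the fragment in the question, which has two top-level SD elements and therefore no single root; the ROOT wrapper element is made up here purely so the parser accepts it. If the real document already has a single root element, the wrapper is unnecessary.

```python
import xml.etree.ElementTree as ET

# Shortened version of the fragment from the question.
xml = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''

# Wrap the fragment in a hypothetical <ROOT> element so that it
# becomes a well-formed document with exactly one root.
root = ET.fromstring("<ROOT>" + xml + "</ROOT>")

# ElementTree is case-sensitive, so use the tag and attribute
# names exactly as they appear in the XML.
popularity = root.find(".//POPULARITY")
text_value = popularity.get("TEXT") if popularity is not None else None

print(text_value)
```

Because the lookup is by element and attribute name, the TEXT attributes on TITLE and SPEED and the digits inside URL do not interfere.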
