How to get value of specified tag attribute from X

Posted 2020-05-10 09:01

I have a script that parses some XML. The XML contains:

<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>

How do I get the TEXT attribute value of the POPULARITY tag (in my case 1417678)? I'm using regexp + Python. Regexp string:

my_value = re.findall("POPULARITY[^\d]*(\d+)", xml)

It returns '9511', but I need '1417678'.

2 answers
We Are One
#2 · 2020-05-10 09:16

You can use BeautifulSoup:

# BeautifulSoup 4 (package "beautifulsoup4")
from bs4 import BeautifulSoup

xml = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''

# html.parser lowercases tag and attribute names,
# hence 'popularity' and 'text' below.
soup = BeautifulSoup(xml, 'html.parser')

print(soup.find('popularity')['text'])

Output

1417678
倾城 Initia
#3 · 2020-05-10 09:17

You are just matching the first sequence of decimal digits that occurs after the element's name. The first sequence of digits '(\d+)' after an arbitrary number of non-digits '[^\d]*' is 9511, which comes from the URL attribute.
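The mismatch can be reproduced on the POPULARITY line alone. This is only a minimal sketch: the variable names `line` and `match` are illustrative and not part of the question.

```python
import re

# The POPULARITY element from the question, on its own.
line = '<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>'

# [^\d]* consumes ' URL="', so the group captures the first digits it meets,
# which belong to the URL attribute rather than to TEXT.
match = re.search(r'POPULARITY[^\d]*(\d+)', line)

print(match.group(1))                            # 9511
print(line[match.start(1) - 5:match.start(1)])   # URL="
```

The second print shows the five characters just before the captured digits, confirming that the match sits inside the URL attribute.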

To find all values of @TEXT attributes, something like this would work:

my_values = re.findall(r'<POPULARITY(?:\D+="\S*")*\s+TEXT="(\d*)"', xml)  # returns a list, by the way

Or, if no other attributes will have digit-only values except @TEXT:

re.findall(r'<POPULARITY\s+(?:\S+\s+)*\w+="(\d+)"', xml)

Here (?:...) matches the enclosed expression but, unlike (...), does not create an addressable (capturing) group. The special sequences \S and \D are the inverses of their lowercase counterparts: they match anything but whitespace and anything but a digit, respectively.
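The difference between the two kinds of group is easiest to see in what `re.findall` returns. A small sketch on a made-up input (`sample` is hypothetical, not taken from the question):

```python
import re

sample = 'abab1'  # hypothetical input, chosen only to illustrate grouping

# Non-capturing: (?:ab)+ takes part in the match but is not reported,
# so findall returns just the contents of the single (\d) group.
non_capturing = re.findall(r'(?:ab)+(\d)', sample)

# Capturing: with two groups, findall returns one tuple per match,
# holding one item per group.
capturing = re.findall(r'(ab)+(\d)', sample)

print(non_capturing)  # ['1']
print(capturing)      # [('ab', '1')]
```

This is why the patterns above use (?:...) for the repeated attribute part: the result stays a flat list of TEXT values.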

However, as already mentioned, regular expressions are not meant for parsing XML, because XML is not a regular language.
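For comparison, here is a sketch of the same lookup with a real XML parser from the standard library, `xml.etree.ElementTree`. It assumes the data is otherwise well-formed; because the question's snippet has two top-level SD elements and no single root, the sketch wraps it in an arbitrary `<root>` element, which is an addition and not part of the original data. The XML below is abridged.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the question's data: two sibling <SD> elements, no single root.
xml = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''

# ElementTree needs exactly one root element, so wrap the fragment.
# The name "root" is arbitrary (an assumption of this sketch).
root = ET.fromstring('<root>' + xml + '</root>')

# Tag and attribute names are case-sensitive in XML.
popularity = root.find('.//POPULARITY')
text_value = popularity.get('TEXT')

print(text_value)  # 1417678
```

Unlike the regex versions, this does not depend on attribute order or on which attributes happen to contain digits.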
