How to get value of specified tag attribute from X

2020-05-10 09:07发布

问题:

I have a script that parses some xml. XML contains:

<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>

How to get the 'TEXT' attribute value of tag(in my case 1417678)? I'm using regexp+Python. Regexp string:

my_value = re.findall("POPULARITY[^\d]*(\d+)", xml)

It gets to me '9511' but i need '1417678'.

回答1:

You are just matching the first sequence of decimal digits that occurs after the element's name. The first sequence of digits '(\d+)' after a arbitrary number of non-digits '[^\d]*' is 9511.

In order to findall values of @TEXT attributes, something like this would work:

my_values = re.findall("<POPULARITY(?:\D+=\"\S*\")*\s+TEXT=\"(\d*)\"", xml) # returning a list btw

Or, if no other attributes will have digit-only values except @TEXT:

 re.findall("<POPULARITY\s+(?:\S+\s+)*\w+=\"(\d+)\"", xml)

Where (?:...) matches the embraced expression, but doesn't act as an addressable group, like (...). The special sequences \S and \D are the invertions of their lowercase counterparts, expanding to (anything but) whitespace and digits, respectively.

However, like already mentioned, regex are not meant to be used on XML, because XML is not a regular language.



回答2:

You can use BeautifulSoup

import BeautifulSoup

xml = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''

soup = BeautifulSoup.BeautifulSoup(xml)

print(soup.find('popularity')['text'])

Output

u'1417678'