I have a script that parses some xml. XML contains:
<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>
How to get the 'TEXT' attribute value of tag(in my case 1417678)? I'm using regexp+Python. Regexp string:
my_value = re.findall("POPULARITY[^\d]*(\d+)", xml)
It gets to me '9511' but i need '1417678'.
You are just matching the first sequence of decimal digits that occurs after the element's name. The first sequence of digits '(\d+)'
after a arbitrary number of non-digits '[^\d]*'
is 9511
.
In order to findall
values of @TEXT
attributes, something like this would work:
my_values = re.findall("<POPULARITY(?:\D+=\"\S*\")*\s+TEXT=\"(\d*)\"", xml) # returning a list btw
Or, if no other attributes will have digit-only values except @TEXT
:
re.findall("<POPULARITY\s+(?:\S+\s+)*\w+=\"(\d+)\"", xml)
Where (?:...)
matches the embraced expression, but doesn't act as an addressable group, like (...)
. The special sequences \S
and \D
are the invertions of their lowercase counterparts, expanding to (anything but) whitespace and digits, respectively.
However, like already mentioned, regex are not meant to be used on XML, because XML is not a regular language.
You can use BeautifulSoup
import BeautifulSoup
xml = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''
soup = BeautifulSoup.BeautifulSoup(xml)
print(soup.find('popularity')['text'])
Output
u'1417678'