I have a string:
<font face="ARIAL,HELVETICA" size="-2">
JUL 28 </font>
(It outputs over two lines, so there must be a \n in there.)
I wish to extract the string that's in between the <font></font>
tags. In this case, it's JUL 28, but it might be another date or some other number.
1) What is the best way to extract the value from between the font tags? I was thinking I could extract everything in between "> and </.
edit: second question removed.
Or, you could simply use Beautiful Soup:
Use Scrapy's XPath selectors as documented at http://doc.scrapy.org/en/0.10.3/topics/selectors.html
Alternatively, you can use an HTML parser such as BeautifulSoup, especially if you want to operate on the document in an object-oriented manner.
http://pypi.python.org/pypi/BeautifulSoup/3.2.0
Python has a library called HTMLParser. Also see the following question posted on SO, which is very similar to what you are looking for: How can I use the python HTMLParser library to extract data from a specific div tag?
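As a rough sketch of the HTMLParser approach, assuming Python 3 (where the module lives at html.parser rather than HTMLParser): the class name FontTextParser and the variable names below are made up for illustration, and the input is the string from the question.

```python
from html.parser import HTMLParser


class FontTextParser(HTMLParser):
    """Collects the text found inside <font> ... </font> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # how many <font> tags are currently open
        self.chunks = []   # text pieces seen while inside a <font> tag

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Only keep text that appears between the font tags.
        if self._depth:
            self.chunks.append(data)


html = '<font face="ARIAL,HELVETICA" size="-2">\nJUL 28 </font>'

parser = FontTextParser()
parser.feed(html)
parser.close()

# Join the pieces and drop the surrounding newline and trailing space.
font_text = "".join(parser.chunks).strip()
print(font_text)
```

The strip() call takes care of the newline after the opening tag and the space before the closing one.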
Is grep an option?
The (.*) should match your content.
You have a bunch of options here. You could go for an all-out xml parser like lxml, though you seem to want a domain-specific solution. I'd go with a multiline regex:
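Something along these lines might work; this is a sketch, not necessarily the exact pattern intended above. It assumes the re.DOTALL flag is what lets the match span the line break, and the names html, pattern and text are chosen for illustration.

```python
import re

html = '<font face="ARIAL,HELVETICA" size="-2">\nJUL 28 </font>'

# re.DOTALL lets "." match the newline after the opening tag.
# The non-greedy (.*?) stops at the first closing </font>.
pattern = re.compile(r"<font[^>]*>(.*?)</font>", re.DOTALL | re.IGNORECASE)

match = pattern.search(html)
# Strip the leading newline and trailing space around the value.
text = match.group(1).strip() if match else None
print(text)
```

If no font tag is found, text is None, so check for that before using it.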
Now that you have text, you can turn it into a date pretty easily:
While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, which is a Python lib that can handle broken as well as good HTML fairly well.
Then you just need to parse the date:
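One way to do that with the standard library is datetime.strptime. The sketch below makes two assumptions: the string has no year, so the caller has to supply one (2011 here is only a placeholder), and month abbreviations are matched against the current locale, so "JUL" is expected to parse under an English or C locale. The function name parse_month_day is hypothetical.

```python
from datetime import datetime


def parse_month_day(text, year):
    """Parse a string like 'JUL 28' into a date, using the given year."""
    # %b matches an abbreviated month name, %d the day of the month.
    # The year is appended because the source string does not carry one.
    return datetime.strptime(f"{text.strip()} {year}", "%b %d %Y").date()


# The year is an assumption supplied by the caller, not part of the input.
result = parse_month_day("JUL 28 ", 2011)
print(result.isoformat())
```

Passing the year explicitly also avoids the default of 1900 that strptime uses when no year is present in the format.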