Python string operation: extract text between HTML tags

I have a string:

<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>

(It outputs over two lines, so there must be a \n in there.)

I wish to extract the string that's in between the <font></font> tags. In this case, it's JUL 28, but it might be another date or some other number.

What is the best way to extract the value from between the font tags? I was thinking I could extract everything between "> and </.

edit: second question removed.

Tags: python html string parsing

6 answers

我只想做你的唯一

#2 · 2019-01-18 16:14

Or, you could simply use Beautiful Soup:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping


Root（大扎）

#3 · 2019-01-18 16:14

Use Scrapy's XPath selectors as documented at http://doc.scrapy.org/en/0.10.3/topics/selectors.html

Alternatively, you can use an HTML parser such as BeautifulSoup, especially if you want to operate on the document in an object-oriented manner.

http://pypi.python.org/pypi/BeautifulSoup/3.2.0


爷的心禁止访问

#4 · 2019-01-18 16:19

Python has a library called HTMLParser. Also see the following question on Stack Overflow, which is very similar to what you are looking for:

How can I use the python HTMLParser library to extract data from a specific div tag?
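
As a rough sketch of that approach, assuming Python 3, where the module lives at html.parser: the class name FontTextExtractor and its texts attribute are made up for illustration and are not part of the library. It collects the text of every font element it sees.

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Collect the text found inside <font>...</font> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <font> tags are currently open
        self._buffer = []    # text pieces of the <font> being read
        self.texts = []      # finished, stripped text of each <font>

    def handle_starttag(self, tag, attrs):
        if tag == "font":    # the parser lowercases tag names
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:      # ignore text outside <font> elements
            self._buffer.append(data)


data = """<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>"""

parser = FontTextExtractor()
parser.feed(data)
parser.close()
print(parser.texts)  # ['JUL 28']
```

This needs nothing outside the standard library, at the cost of more code than a regex or Beautiful Soup.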


三岁会撩人

#5 · 2019-01-18 16:29

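Is grep an option?

grep -E "<[^>]*>(.*)</[^>]*>" file

The (.*) should match your content. The -E flag is needed so that the parentheses act as a group rather than literal characters. Keep in mind that grep works line by line, so this only matches when the opening tag, the content, and the closing tag all sit on the same line.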


甜甜的少女心

#6 · 2019-01-18 16:32

You have a bunch of options here. You could go for an all-out xml parser like lxml, though you seem to want a domain-specific solution. I'd go with a multiline regex:

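import re

# re.S lets "." match newlines, so the pattern can span the line break
rex = re.compile(r'<font.*?>(.*?)</font>', re.S)
...
data = """<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>"""

# search() finds the tag anywhere in the string; match() only matches at the start
match = rex.search(data)
if match:
    text = match.group(1).strip()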

Now that you have text, you can turn it into a date pretty easily:

from datetime import datetime
date = datetime.strptime(text, "%b %d")
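
Since the question says the value might also be some other number, here is a hedged sketch of a parse that fails gracefully. The helper name parse_month_day and the year 2019 are assumptions for illustration. The source string carries no year, and "%b %d" on its own yields the year 1900.

```python
from datetime import datetime


def parse_month_day(text, year):
    """Return a datetime for text such as 'JUL 28', or None if it is not a date."""
    try:
        # Appending an assumed year avoids the default year 1900
        return datetime.strptime(f"{text.strip()} {year}", "%b %d %Y")
    except ValueError:
        # Raised when the text does not look like "<month abbreviation> <day>"
        return None


print(parse_month_day("JUL 28", 2019))  # 2019-07-28 00:00:00
print(parse_month_day("12345", 2019))   # None
```

Note that %b matches month abbreviations for the current locale, so this assumes an English (or C) locale.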


看我几分像从前

#7 · 2019-01-18 16:33

While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, which is a Python lib that can handle broken as well as good HTML fairly well.

>>> from bs4 import BeautifulSoup as BSHTML
>>> BS = BSHTML("""
... <font face="ARIAL,HELVETICA" size="-2">  
... JUL 28         </font>""",
... "html.parser")
>>> BS.font.contents[0].strip()
'JUL 28'

Then you just need to parse the date:

>>> from datetime import datetime
>>> datetime.strptime(BS.font.contents[0].strip(), '%b %d')
datetime.datetime(1900, 7, 28, 0, 0)
