regexp for html tags with Matlab

I'm looking for a way to use regexp to remove all HTML tags from a string.
For example, given <HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>, I would like to get just Hello.

I know it will probably look like nested tags, but it's not really, because all I want to do here is to remove anything between two <>.

I'm doing this in MATLAB, but the regexp syntax is the same elsewhere, so feel free to contribute any help.
Thank you.

Tags: regex parsing matlab tags

4 answers

ら.Afraid

#2 · 2020-02-13 18:27

My solution is:

>> str='<HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>';
>> regexprep(str, '<.*?>','')

ans =

Hello
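
The ? after .* makes the quantifier lazy, so each match stops at the nearest closing bracket. Here is a minimal sketch of the difference, using the string from the question (the variable names `lazy` and `greedy` are only for illustration):

```matlab
str = '<HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>';

% Lazy: each match ends at the nearest '>', so tags are removed one by one.
lazy = regexprep(str, '<.*?>', '');

% Greedy: a single match runs from the first '<' to the last '>',
% which swallows the text between the tags as well.
greedy = regexprep(str, '<.*>', '');

disp(lazy)             % Hello
disp(isempty(greedy))  % 1
```

Note that in MATLAB the dot also matches newline characters by default, so the lazy pattern should also cope with tags that span several lines.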


祖国的老花朵

#3 · 2020-02-13 18:27

It is widely accepted that using regexes to parse general HTML is bad form. If your HTML is much more complicated than the example given, then you should use an XML parser instead.

Further discussion is in this famous SO question: RegEx match open tags except XHTML self-contained tags.

If you want to parse the content properly, then download xml_io_tools and use

doc = xml_read('test.html')
doc.b.FONT.CONTENT

If you want to stick with regexes, then use ilya's answer, but with one of the regexes from the linked answer, e.g.,

str = '<HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>';
rx = '<(?:"[^"]*"[''"]*|''[^'']*''[''"]*|[^''">])+>';  % single quotes are doubled inside a MATLAB char literal
regexprep(str, rx, '')
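
The reason for the longer pattern shows up when an attribute value itself contains a closing bracket. The sketch below compares it with the simpler <[^>]*> pattern on a made-up input (the test string and the variable names are assumptions for illustration, not part of the question):

```matlab
% Hypothetical input: the attribute value contains a '>' character.
str = '<a title="x > y">link</a>';

simple = '<[^>]*>';
% Quote-aware pattern; single quotes are doubled inside the char literal.
quoted = '<(?:"[^"]*"[''"]*|''[^'']*''[''"]*|[^''">])+>';

a = regexprep(str, simple, '');  % stops at the '>' inside the attribute
b = regexprep(str, quoted, '');  % skips over quoted attribute values

disp(a)  %  y">link
disp(b)  % link
```

Even the quote-aware pattern only looks at tags; it makes no attempt to handle comments, script blocks, or malformed markup, which is where a real parser is the safer choice.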


够拽才男人

#4 · 2020-02-13 18:30

Since you mentioned that you want to extract "hello" from the above html (say filename.html) file, you can use the following in MATLAB:

doc = xmlread('filename.html'); content = doc.item(0).getTextContent
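
A self-contained way to try this might look like the sketch below. It writes the string from the question to a temporary file first, since xmlread takes a file name; the temporary file and the variable names are assumptions for illustration. It reads the text through getDocumentElement rather than item(0), which is a swapped-in but equivalent route for this input. Keep in mind that xmlread expects well-formed XML, so this only works for HTML that happens to be well formed, like the example here.

```matlab
str = '<HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>';

% Write the markup to a temporary file (hypothetical file, for illustration).
fname = [tempname '.html'];
fid = fopen(fname, 'w');
fprintf(fid, '%s', str);
fclose(fid);

% Parse into a DOM document and collect the text of the root element.
doc = xmlread(fname);
txt = char(doc.getDocumentElement().getTextContent());

delete(fname);
disp(txt)
```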

Hope this helps!


Juvenile、少年°

#5 · 2020-02-13 18:36

To match such a tag, use:

<[^>]*>

See online here at Rubular
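
In MATLAB this pattern can be passed to regexprep to delete the tags, or to regexp with the 'split' option to collect the text between them. This is a rough sketch using the string from the question; the variable names are only illustrative:

```matlab
str = '<HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>';

% Remove every tag and keep whatever text remains.
stripped = regexprep(str, '<[^>]*>', '');

% Alternatively, split on the tags and drop the empty pieces between
% adjacent tags, leaving a cell array of text fragments.
parts = regexp(str, '<[^>]*>', 'split');
parts = parts(~cellfun(@isempty, parts));

disp(stripped)  % Hello
disp(parts{1})  % Hello
```

The split form may be handier when the input contains several separate text fragments that should stay apart rather than be concatenated.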

