Sed remove tags from html file

Posted 2020-01-24 09:27

I need to remove all tags from an HTML file in a bash script using the sed command. I tried this:

sed -r 's/[\<][\/]?[a-zA-Z0-9\=\"\-\#\.\& ]+[\/]?[\>]//g' $1

and this:

sed -r 's/[\<][\/]?[.]*[\/]?[\\]?[\>]//g' $1

but I am still missing something. Any suggestions?

Tags: html regex linux bash

1 answer

再贱就再见

#2 · 2020-01-24 10:30

You can use one of the many HTML-to-text converters. If Perl regular expressions are available, use the non-greedy <.+?>. If it must be sed, use <[^>]*>:

sed -e 's/<[^>]*>//g' file.html

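If there's no room for error, use an HTML parser instead. For example, when a tag is spread over two lines

<div
>Lorem ipsum</div>

this regular expression will not work, because sed applies it to one line at a time and neither line contains both the opening < and the closing > of the <div> tag.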
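If sed is still required for such input, one possible workaround is to read the whole file into the pattern space before substituting, so that a tag broken across lines is seen as one piece. The following is only a sketch: the function name strip_tags_multiline is made up for illustration, and it does not make sed a real HTML parser (comments, scripts, and a > inside an attribute value will still trip it up).

```shell
# strip_tags_multiline is a hypothetical helper name, not part of the original answer.
strip_tags_multiline() {
  # ':a' defines a label. On every line except the last, 'N' appends the next
  # input line to the pattern space and 'ba' jumps back to the label.
  # After the last line has been read, the substitution runs once over the
  # whole input. [^>] also matches the embedded newlines.
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g' "$@"
}

# The two-line example from above now loses both tags.
printf '<div\n>Lorem ipsum</div>\n' | strip_tags_multiline
```

Because the whole input is held in memory at once, this is better suited to small files.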

This regular expression consists of three parts: <, [^>]*, and >

search for the opening <
followed by zero or more characters (*) that are not the closing >
[...] is a character class; when it starts with ^, it matches any character not in the class
and finally look for the closing >
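To see what the three parts match together, it can help to print the matches instead of deleting them. The sketch below assumes a grep that supports the -o option (GNU and BSD grep do, although it is not required by POSIX); the function name list_tags and the sample line are made up for illustration.

```shell
# list_tags is a hypothetical helper name used only for this illustration.
list_tags() {
  # -o prints each match of the pattern on its own line,
  # so every tag the expression would remove becomes visible.
  grep -o '<[^>]*>' "$@"
}

# Prints the four tags, one per line: <p class="x">, <b>, </b>, </p>
printf '%s\n' '<p class="x">Hi <b>there</b></p>' | list_tags
```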

The simpler regular expression <.*> will not work, because * is greedy: it searches for the longest possible match, which runs from the first < to the last closing > in an input line. For example, when you have more than one tag in an input line

<name>Olaf</name> answers questions.

will result in

answers questions.

instead of

Olaf answers questions.
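The difference can be checked by running both expressions on the same line. This is a small sketch; the variable names line, greedy, and negated are chosen here for illustration, and the brackets in the output only make the leading space visible.

```shell
line='<name>Olaf</name> answers questions.'

# Greedy: .* runs from the first < to the last > on the line.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Negated class: [^>]* stops at the first >, so each tag is matched separately.
negated=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

printf '[%s]\n' "$greedy"    # [ answers questions.]
printf '[%s]\n' "$negated"   # [Olaf answers questions.]
```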

See also Repetition with Star and Plus, especially the section Watch Out for The Greediness! and the sections that follow it, for a detailed explanation.
