I need to remove all tags from an HTML file with a bash script using the sed command. I tried this:
sed -r 's/[\<][\/]?[a-zA-Z0-9\=\"\-\#\.\& ]+[\/]?[\>]//g' $1
and with this:
sed -r 's/[\<][\/]?[.]*[\/]?[\\]?[\>]//g' $1
but I am still missing something. Any suggestions?
You can either use one of the many HTML-to-text converters, use a Perl regex if possible:
<.+?>
or, if it must be sed, use:
<[^>]*>
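As a minimal sketch of how the sed variant could be used in a script (the function name `strip_tags` and the sample input are made up for illustration, they are not from the question):

```shell
#!/bin/sh
# strip_tags: hypothetical helper that deletes every <...> sequence
# from the text it reads on standard input.
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Sample input, invented for this example.
printf '%s\n' '<p>Hello <b>world</b>!</p>' | strip_tags
# prints: Hello world!
```

To process the file passed to the script, something like `strip_tags < "$1"` should do. Note that no `-r` option is needed, since the expression uses only basic regular expression syntax.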
If there's no room for errors, use an HTML parser instead. For example, when a tag is spread over two lines,
this regular expression will not work, because sed processes its input one line at a time.
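To illustrate the multi-line problem, here is a small sketch with an invented two-line tag. The second command is a common workaround, not a replacement for a parser: it collects the whole input in sed's pattern space before substituting, and it assumes a sed in which `[^>]` also matches an embedded newline (GNU sed does).

```shell
#!/bin/sh
# Invented sample: the opening tag continues on the second line.
input='<a href="x"
   class="y">link</a> text'

# Line by line: the first line has no closing >, so the split tag survives.
per_line=$(printf '%s\n' "$input" | sed 's/<[^>]*>//g')

# Workaround: append every following line to the pattern space (N),
# loop until the last line is reached, then substitute once.
slurped=$(printf '%s\n' "$input" |
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g')

printf '%s\n' "$per_line"   # still contains: <a href="x"
printf '%s\n' "$slurped"    # prints: link text
```

Even with this workaround, input such as comments or a literal > inside an attribute value can still confuse the expression, which is why a parser is the safer choice.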
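This regular expression consists of three parts: <, [^>]* and >.
< matches the opening angle bracket.
[^>]* matches zero or more characters (that is the *) which are not the closing >. [...] is a character class; when it starts with ^, it matches any character not in the class.
> finally matches the closing angle bracket.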
The simpler regular expression
<.*>
will not work, because it searches for the longest possible match, i.e. up to the last closing >
in an input line. E.g., when you have more than one tag in an input line, everything from the first < to the last > is removed, including the text between the tags.
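The difference can be seen in a quick sketch (the sample line and the variable names are invented for illustration):

```shell
#!/bin/sh
# Invented sample line with two pairs of tags.
line='<b>bold</b> and <i>italic</i> text'

# Greedy: .* runs from the first < to the last > on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//g')

# Negated class: each match stops at the nearest closing >.
negated=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf '[%s]\n' "$greedy"    # prints: [ text]
printf '[%s]\n' "$negated"   # prints: [bold and italic text]
```

The brackets in the output are only there to make the leading space of the greedy result visible.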
See also Repetition with Star and Plus, especially section Watch Out for The Greediness! and following, for a detailed explanation.