How to extract text between HTML tags with multiple OR conditions

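I have been researching how to extract title tags from HTML. I have pretty much figured out that regular expressions and HTML don't mix, and that grep can be used. However, the code I found here looks like this: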

awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}'

Now, this works, but it finds the text between the title tags only once. I would like to know how to make it run on every line. I could do a cat file; while read line; do ...; done. However, I know that is probably not very efficient and that there is a better way.

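Second, I need to keep any line in the file that starts with the string "--". I believe this requires adding an "or" condition to the awk command so that it matches both the title tags and any line starting with "--".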

The input file looks like this:

text text text <title>random text of the title 1</title> random html stuff
--time--
xyz more random text <title>random text of the title 2</title> hmtl text
--time--
some text <title>random text of the title 3</title> more text tags
--time--
text here <title>random text of the title 4</title> random text html
--time--

Desired output:

<title>random text of the title 1</title>
--time--
<title>random text of the title 2</title>
--time--
<title>random text of the title 3</title>
--time--
<title>random text of the title 4</title>
--time--

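I'm not great at awk, but I'm learning. I know there should be an option to print all matches, but it is the OR statement that I'm really stuck on. I'm open to sed or grep if you think either is more efficient. Any help or direction is greatly appreciated.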

For your given input, grep is enough:

$ grep -o '<.*>\|^--.*' ip.html 
<title>random text of the title 1</title>
--time--
<title>random text of the title 2</title>
--time--
<title>random text of the title 3</title>
--time--
<title>random text of the title 4</title>
--time--
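
Since the question asks specifically about an OR condition in awk, here is a minimal sketch of one way it could be written. It is not the code from the question: it swaps the RS trick for two separate pattern rules, which act as the "or", and uses the standard awk function match() with RSTART and RLENGTH. The function name extract_titles and the temporary sample file are only illustrative, and the sketch assumes at most one <title>...</title> pair per line, because .* is greedy and would otherwise span from the first opening tag to the last closing tag on the line.

```shell
# Recreate the first two records of the sample input in a temp file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
text text text <title>random text of the title 1</title> random html stuff
--time--
xyz more random text <title>random text of the title 2</title> hmtl text
--time--
EOF

# extract_titles is a hypothetical helper name used only for this sketch.
# Rule 1: a line starting with "--" is printed unchanged.
# Rule 2: on any other line, match() locates the <title>...</title> span
#         and substr() prints only that span; lines with no match print nothing.
extract_titles() {
    awk '
        /^--/ { print; next }
        match($0, /<title>.*<\/title>/) { print substr($0, RSTART, RLENGTH) }
    ' "$1"
}

extract_titles "$tmp"
rm -f "$tmp"
```

Run on the sample above, this prints the two title elements, each followed by its --time-- line. Unlike the grep pattern '<.*>', the awk pattern names the title tag explicitly, so other tags on the same line would not be swept into the match.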