Extract text between HTML tags

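I have many HTML files that I need to extract text from. If everything is on one line I can do this quite easily, but if the tag wraps or spans multiple lines I can't figure out how to do it. Here is what I mean: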

<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>

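I don't care about the <br> text, unless it helps with joining the wrapped lines. The area I want always starts with "MySection" and ends with </section>. What I would like to end up with is something like this: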

Some text here  another line here  last line of text.

I would prefer something like a VBScript or a command-line option (sed?), but I don't know where to start. Any help?

Answer 1:

Normally you would use the Internet Explorer COM object for this:

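root = "C:\base\dir"

' File system access for enumerating the HTML files.
Set fso = CreateObject("Scripting.FileSystemObject")
Set ie = CreateObject("InternetExplorer.Application")

For Each f In fso.GetFolder(root).Files
  ie.Navigate "file:///" & f.Path
  ' Wait until the page has finished loading.
  While ie.Busy : WScript.Sleep 100 : Wend

  text = ie.document.getElementById("MySection").innerText

  ' Join the lines with a space so adjacent words don't run together.
  WScript.Echo Replace(text, vbNewLine, " ")
Next

ie.Quit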

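However, the <section> tag is not supported before IE 9, and even in IE 9 the COM object doesn't seem to handle it correctly, as getElementById("MySection") returns only the opening tag: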

>>> wsh.echo ie.document.getelementbyid("MySection").outerhtml
<SECTION id=MySection>

You can use a regular expression instead, though:

root = "C:\base\dir"

Set fso = CreateObject("Scripting.FileSystemObject")

' Capture everything between the opening and closing section tags.
' [\s\S] also matches line breaks, which "." does not.
Set re1 = New RegExp
re1.Pattern = "<section id=""MySection"">([\s\S]*?)</section>"
re1.Global  = False
re1.IgnoreCase = True

' Collapse runs of <br> tags and whitespace into a single space.
Set re2 = New RegExp
re2.Pattern = "(<br>|\s)+"
re2.Global  = True
re2.IgnoreCase = True

For Each f In fso.GetFolder(root).Files
  html = fso.OpenTextFile(f.Path).ReadAll

  Set m = re1.Execute(html)
  If m.Count > 0 Then
    ' SubMatches(0) is the text captured by the first group.
    text = Trim(re2.Replace(m(0).SubMatches(0), " "))
    WScript.Echo text
  End If
Next
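
To try the two regular expressions above without a Windows Script Host, the same two steps can be sketched in Python. This is an illustration only, not part of the VBScript answer: it assumes Python's re module interprets these two patterns the same way VBScript's RegExp does, and the names extract_section_text, section_re and filler_re are made up for the example.

```python
import re

# Sample input copied from the question.
html = """<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>"""

# Step 1: capture everything between the tags; [\s\S] also matches newlines.
section_re = re.compile(r'<section id="MySection">([\s\S]*?)</section>',
                        re.IGNORECASE)
# Step 2: collapse runs of <br> tags and whitespace into one space.
filler_re = re.compile(r'(<br>|\s)+', re.IGNORECASE)


def extract_section_text(html):
    """Return the flattened section text, or None if the section is absent."""
    match = section_re.search(html)
    if match is None:
        return None
    return filler_re.sub(" ", match.group(1)).strip()


result = extract_section_text(html)
print(result)
```

Returning None when nothing matches mirrors the m.Count check: a file without the section produces no output instead of stale text.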

Answer 2:

Here is a one-liner solution using perl and the HTML parser from the Mojolicious framework:

perl -MMojo::DOM -E '
    say Mojo::DOM->new( do { undef $/; <> } )->at( q|#MySection| )->text
' index.html

Assuming index.html has the following content:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body id="portada">
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
</body>
</html>

It yields:

Some text here another line here last line of text.
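
Where Perl and Mojolicious are not available, the same parser-based idea can be approximated with Python's standard library. The sketch below is not Mojo::DOM and makes no claim to match its behavior in every case: it assumes the target is a section element identified by its id, it collects the text of all descendants, and the names SectionText and section_text are hypothetical.

```python
from html.parser import HTMLParser


class SectionText(HTMLParser):
    """Collect the text found inside <section id=...>."""

    def __init__(self, section_id):
        super().__init__()
        self.section_id = section_id
        self.depth = 0      # > 0 while inside the target section
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "section":      # nested section: track depth
                self.depth += 1
        elif tag == "section" and dict(attrs).get("id") == self.section_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "section":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def section_text(html, section_id="MySection"):
    parser = SectionText(section_id)
    parser.feed(html)
    parser.close()
    # Collapse all whitespace, including line breaks, to single spaces.
    return " ".join("".join(parser.chunks).split())


sample = """<html><body id="portada">
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
</body></html>"""

print(section_text(sample))
```

Compared with the regular expression, a parser does not depend on the exact spelling of the opening tag, so extra attributes or different quoting on the section element would not break the extraction.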

Source: Extract text between HTML tags
