Convert html to plain text in VBA

2020-01-29 09:03发布

问题:

I have an Excel sheet with cells containing html. How can I batch convert them to plaintext? At the moment there are so many useless tags and styles. I want to write it from scratch but it will be far easier if I can get the plain text out.

I can write a script to convert html to plain text in PHP so if you can't think of a solution in VBA then maybe you can sugest how I might pass the cells data to a website and retrieve the data back.

回答1:

Set a reference to "Microsoft HTML object library".

Function HtmlToText(sHTML) As String
  Dim oDoc As HTMLDocument
  Set oDoc = New HTMLDocument
  oDoc.body.innerHTML = sHTML
  HtmlToText = oDoc.body.innerText
End Function

Tim



回答2:

A very simple way to extract text is to scan the HTML character by character, and accumulate characters outside of angle brackets into a new string.

Function StripTags(ByVal html As String) As String
    Dim text As String
    Dim accumulating As Boolean
    Dim n As Integer
    Dim c As String

    text = ""
    accumulating = True

    n = 1
    Do While n <= Len(html)

        c = Mid(html, n, 1)
        If c = "<" Then
            accumulating = False
        ElseIf c = ">" Then
            accumulating = True
        Else
            If accumulating Then
                text = text & c
            End If
        End If

        n = n + 1
    Loop

    StripTags = text
End Function

This can leave lots of extraneous whitespace, but it will help in removing the tags.



回答3:

Tim's solution was great, worked liked a charm.

I´d like to contribute: Use this code to add the "Microsoft HTML Object Library" in runtime:

Set ID = ThisWorkbook.VBProject.References
ID.AddFromGuid "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}", 2, 5

It worked on Windows XP and Windows 7.



回答4:

Tim's answer is excellent. However, a minor adjustment can be added to avoid one foreseeable error response.

 Function HtmlToText(sHTML) As String
      Dim oDoc As HTMLDocument

      If IsNull(sHTML) Then
        HtmlToText = ""
        Exit Function
        End-If

      Set oDoc = New HTMLDocument
      oDoc.body.innerHTML = sHTML
      HtmlToText = oDoc.body.innerText
    End Function


回答5:

Yes! I managed to solve my problem as well. Thanks everybody/

In my case, I had this sort of input:

<p>Lorem ipsum dolor sit amet.</p>

<p>Ut enim ad minim veniam.</p>

<p>Duis aute irure dolor in reprehenderit.</p>

And I did not want the result to be all jammed together without breaklines.

So I first splitted my input for every <p> tag into an array 'paragraphs', then for each element I used Tim's answer to get the text out of html (very sweet answer btw).

In addition I concatenated each cleaned 'paragraph' with this breakline character Crh(10) for VBA/Excel.

The final code is:

Public Function HtmlToText(ByVal sHTML As String) As String
    Dim oDoc As HTMLDocument
    Dim result As String
    Dim paragraphs() As String

    If IsNull(sHTML) Then
      HtmlToText = ""
      Exit Function
    End If

    result = ""
    paragraphs = Split(sHTML, "<p>")

    For Each paragraph In paragraphs
        Set oDoc = New HTMLDocument
        oDoc.body.innerHTML = paragraph
        result = result & Chr(10) & Chr(10) & oDoc.body.innerText
    Next paragraph

    HtmlToText = result
End Function