Parsing HTML in excel cell to multiple cells

2020-06-30 06:12发布

问题:

I have the following - really messy looking - data in my excel cell, which consists of around 100 rows of HTML Tags:

Find below two examples:

<ul class=""list-unstyled"">
    <li><span title=""Website"" class=""glyphicon glyphicon-link text-gray""></span> <a href=""https://google.org/"" target=""_blank"">Website</a></li>
    <li><span title=""Website"" class=""glyphicon glyphicon-link text-gray""></span> <a href=""https://www.google.com/"" target=""_blank"">Website 2</a></li>
    <li><span title=""Product"" class=""glyphicon glyphicon-search text-gray""></span> <a href=""http://amazon.com"" target=""_blank"">Product</a></li>
    <li><span title=""Product"" class=""glyphicon glyphicon-search text-gray""></span> <a href=""https://amazon.de/"" target=""_blank"">Product 2</a></li>          
    <li><span title=""Tags"" class=""glyphicon glyphicon glyphicon-tag text-gray""></span>
        <small><span class=""label label-warning"">Available</span></small>
        <small><span class=""label label-warning"">Country</span></small>
    </li>
</ul>

or

<ul class=""list-unstyled"">
    <li><span title=""Website"" class=""glyphicon glyphicon-link text-gray""></span> <a href=""https://google.org/"" target=""_blank"">Website</a></li>
    <li><span title=""Website"" class=""glyphicon glyphicon-link text-gray""></span> <a href=""https://www.google.com/"" target=""_blank"">Website 2</a></li>
    <li><span title=""Product"" class=""glyphicon glyphicon-search text-gray""></span> <a href=""http://amazon.com"" target=""_blank"">Product</a></li>    
    <li><span title=""Tags"" class=""glyphicon glyphicon glyphicon-tag text-gray""></span>
        <small><span class=""label label-warning"">Not Available</span></small>
        <small><span class=""label label-warning"">State</span></small>
    </li>
</ul>

My goal is to create a table that looks like the following:

| Website 1           | Website 2           | Website 3 | Product 1         | Product 2          | Product 3 | Available     | Country |
|---------------------|---------------------|-----------|-------------------|--------------------|-----------|---------------|---------|
| https://google.org/ | https://google.com/ |           | http://amazon.com | https://amazon.de/ |           | Available     | Country |
| https://google.org/ | https://google.com/ |           | http://amazon.com |                    |           | Not Available | State   |

I honestly have no clue how to approach this challenge.

Any suggestions from your side?

回答1:

The approach is: create function, that will take HTML code as string as a parameter and will return dictionary with keys same as your table headers. The body of a function is:

Function ParseHTML(str As String) As Scripting.Dictionary
Set ParseHTML = New Scripting.Dictionary

Dim txt As String
Dim website As Long: website = 0
Dim product As Long: product = 0
Dim i As Long: i = 0

Do While True

    'get all text between <li> and <\li> tags
    'then extract all data from it: title attribute and link
    txt = Mid(str, InStr(1, str, "<li>") + 4, InStr(1, str, "</li>") - InStr(1, str, "<li>") - 4)
    'select which case it is: website, product or tags
    Select Case Mid(txt, InStr(1, txt, "title") + 8, InStr(1, txt, "class") - InStr(1, txt, "title") - 11)
        Case Is = "Website"
            website = website + 1
            'here you extract the link
            ParseHTML.Add "Website " & website, Mid(txt, InStr(1, txt, "<a href") + 10, InStr(1, txt, "target") - InStr(1, txt, "<a href") - 13)
        Case Is = "Product"
            product = product + 1
            'here you extract the link
            ParseHTML.Add "Product " & product, Mid(txt, InStr(1, txt, "<a href") + 10, InStr(1, txt, "target") - InStr(1, txt, "<a href") - 13)
        Case Is = "Tags"
            'if we reached Tags, then all websites are over and need different processing
            Exit Do
    End Select
    'delete processed text
    str = Mid(str, InStr(1, str, "</li>") + 5)

Loop

'since in your table you have 3 places for websites and products, so we need to add them
For i = website + 1 To 3
    ParseHTML.Add "Website " & i, ""
Next i
For i = product + 1 To 3
    ParseHTML.Add "Product " & i, ""
Next i

'now txt is the interior of last <li></li> tag and now we focus on what is
'between <small> and </small> tags
'also we don't need str variable anymore, so we can re-use it
str = Mid(txt, InStr(1, txt, "<small>") + 7, InStr(1, txt, "</small>") - InStr(1, txt, "<small>") - 7)
ParseHTML.Add "Available", Mid(str, InStr(1, str, ">") + 1, Len(str) - InStr(1, str, ">") - 7)
'remove processed part of html
txt = Mid(txt, InStr(1, txt, "</small>") + 8)
'take care of last <small> tag
str = Mid(txt, InStr(1, txt, "<small>") + 7, InStr(1, txt, "</small>") - InStr(1, txt, "<small>") - 7)
ParseHTML.Add "Country", Mid(str, InStr(1, str, ">") + 1, Len(str) - InStr(1, str, ">") - 7)


End Function

So, to sum up, the function returns dictionary with keys "Website 1", "Website 2", "Website 3", "Product 1", "Product 2", "Product 3", "Available", "Country".

Now, having that function, it's easy to fill the table you want. Here's one way of doing it:

Sub ProcessHTML()
'determine last row in A column
Dim lastRow As Long: lastRow = Cells(Rows.Count, 1).End(xlUp).Row
Dim dict As Scripting.Dictionary
Dim i As Long
Dim j As Long

For i = 2 To lastRow
    'parse HTML code with our function
    Set dict = ParseHTML(Cells(i, 1).Value)
    For j = 2 To 9
        'write corresponding values from dictionary to cells in table
        Cells(i, j).Value = dict(Cells(1, j).Value)
    Next j
    'get rid of object
    Set dict = Nothing
Next i
End Sub

It works with table arranged like that (already filled):

It's very important to have these headers in column.

Important

Before running anything, in your VBA editor go to: Tools -> References, and in the window that will pop-up you need to select Microsoft Scripting Runtime.



回答2:

I have a few ideas:

If you do this in all VBA ( don't use any libraries), you could search the html as string & look for the <a> & </a> tags. Once you pull out substrings that have this:<a href=""https://google.org/"" target=""_blank"">Website</a> then you do more substring searches for the href & url.

Another option is to use regular expressions. It looks like the VBA script DLL has regex capabilities, you can look at that.

One final option is the HTML agility pack. That is designed for working with HTML. I used it from a .net project once. I don't remember the details now, but I remember it being straightforward to work with.



回答3:

Sub splithtml()
Dim htmlstring As String
Dim rowcount As Integer
Dim website1str As String, website2str As String, website3str As String
Dim product1str As String, product2str As String
Dim spanstr As String

'All the Attribute Nodes to be extracted are hardcoded

 website1str = ">Website</a></li>"
 website2str = ">Website 2</a></li>"
 website3str = ">Website 3</a></li>"
 product1str = ">Product</a></li>"
 product2str = ">Product 2</a></li>"
 spanstr = "</span></small>"

'Create Headers for the xml parsed table
 Cells(1, 2).Value = "Website 1"
 Cells(1, 3).Value = "Website 2"
 Cells(1, 4).Value = "Website 3"
 Cells(1, 5).Value = "Product 1"
 Cells(1, 6).Value = "Product 2"
 Cells(1, 7).Value = "Available"
 Cells(1, 8).Value = "Country"

'Get the number of rows with data in A column
'Assmption:- XML data stored in A column of the sheet

rowcount = Cells(Rows.Count, 1).End(xlUp).row

For i = 2 To rowcount + 1
'Xml is stored in A column and starts from second row, First row is assumed to be header
 htmlstring = Cells(i, 1).Value
'Parses each node and stores in the adjacent column of the column where XML is stored

   htmlstring = GetValue(htmlstring, website1str, i, 2)

   htmlstring = GetValue(htmlstring, website2str, i, 3)

   htmlstring = GetValue(htmlstring, website3str, i, 4)

   htmlstring = GetValue(htmlstring, product1str, i, 5)

   htmlstring = GetValue(htmlstring, product2str, i, 6)

   htmlstring = GetValue(htmlstring, spanstr, i, 7)

   htmlstring = GetValue(htmlstring, spanstr, i, 8)


Next i
End Sub



Function Trimhtml(Mainhtml, Processedhtml)
'Function to  Trim the HTMl nodes that has been parsed
 Dim spanstr As String
 spanstr = "</span></small>"
     Trimhtml = Mainhtml
    If Processedhtml = spanstr Then
      Trimhtml = Mid(Mainhtml, InStr(Mainhtml, Processedhtml) + 15)
    Else
      Trimhtml = Mid(Mainhtml, InStr(Mainhtml, Processedhtml))
    End If
End Function


Function GetValue(Mainhtml, nodevalue, row, column)
'Function to Get Text value from the attribute passed and stored in the row, column passed
 Dim nodestring As String
 Dim FirstPoint As Integer, Secondpoint As Integer
 Dim spanstr As String
 spanstr = "</span></small>"

  If InStr(Mainhtml, nodevalue) > 0 Then
     nodestring = Left$(Mainhtml, InStr(Mainhtml, nodevalue))
     If nodevalue = spanstr Then
       FirstPoint = InStrRev(nodestring, ">")
       Secondpoint = InStrRev(nodestring, "<")
       Returnvalue = Mid(nodestring, FirstPoint + 1, Secondpoint - FirstPoint - 1)
      Else
        FirstPoint = InStr(nodestring, "<a href=")
        Secondpoint = InStr(nodestring, "target=")
        Returnvalue = Mid(nodestring, FirstPoint + 10, Secondpoint - FirstPoint - 13)
       End If
    Cells(row, column).Value = Returnvalue
    GetValue = Trimhtml(Mainhtml, nodevalue)
   Else
    GetValue = Mainhtml
  End If
End Function

I have written a VB script to parse the xml data.

Assumptions:-

  1. Your XML data is stored in COlumn A from Second Row (First row is header)

  2. Xml data is parsed and stored in the adjacent columns. For xml data in A2, the parsed rows are stored in B2:H2

  3. Nodes Website, Website2, Website3, Product1, Product2, Available and Country can only be derived from this code.

  4. If you want to add more nodes in future, Create a replica of the if loop for the new node



回答4:

Assuming that your data is in Cell A2 and you are applying formula in Cell B2 for Websites you can use below formula.

    =IF((LEN($A2)-LEN(SUBSTITUTE($A2,"""""Website""""","")))/(LEN("Website")+4)>=COLUMNS($B$1:B1),TRIM(MID(SUBSTITUTE(SUBSTITUTE($A2,"<a href=""""",REPT(" ",LEN($A2)),COLUMNS($B$1:B1)),""""" target",REPT(" ",LEN($A2)),COLUMNS($B$1:B1)),LEN($A2),LEN($A2))),"")

Copy down and across.

And for Products in cell E2

    =IF((LEN($A2)-LEN(SUBSTITUTE($A2,"""""Product""""","")))/(LEN("Product")+4)>=COLUMNS($E$1:E1),TRIM(MID(SUBSTITUTE(SUBSTITUTE(MID($A2,FIND("""""Product""""",$A2,1),LEN($A2)),"<a href=""""",REPT(" ",LEN($A2)),COLUMNS($E$1:E1)),""""" target",REPT(" ",LEN($A2)),COLUMNS($E$1:E1)),LEN($A2),LEN($A2))),"")

Benefit: It uses native Excel functions so there's no need for VBA. And it is non-array formula i.e. doesn't need CTRL+SHIFT+ENTER.

Disadvantage: Formula is complicated and may prove difficult to maintain.

I have uploaded sample file on Dropbox for your ease of understanding and implementation.

Drop Box Link to sample file



回答5:

You can do this in Excel, if that's what you're looking for.

First, use Text to Columns to parse the data.

  1. In text to columns, select Delimited and hit next
  2. Uncheck all the boxes under Delimiters and check Other
  3. Check Other and enter a double quote in the text box
  4. Hit finish
  5. Copy the rows beginning with
  6. (just the data, not the whole row)
  7. Paste Special elsewhere in the spreadsheet and check Transpose
  8. Remove the blank lines

Hope this is what you were looking for