Gettig Htmlelement based on HtmlAgilityPack.HtmlNo

2019-07-23 06:59发布

问题:

I use HtmlAgilityPack to parse the html document of a webbrowser control. I am able to find my desired HtmlNode, but after getting the HtmlNode, I want to retun the corresponding HtmlElement in the WebbrowserControl.Document.

In fact HtmlAgilityPack parse an offline copy of the live document, while I want to access live elements of the webbrowser Control to access some rendered attributes like currentStyle or runtimeStyle

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p"); 
// this selection could be more sophisticated 
// and the answer shouldn't relay on it.
foreach (HtmlNode node in some_nodes)
{
   HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
   // CorrespondingElementFromWebBrowserControl is what I am searching for
}

If the element had a specific attribute it could be easy but I want a solution which works on any element.

Please help me what can I do about it.

回答1:

In fact there seems to be no direct possibility to change the document directly in the webbroser control. But you can extract the html from it, mnipulate it and write it back again like this:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);

foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
    node.Attributes.Add("TEST", "TEST");
}

StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
    doc.Save(sw);
    webBrowser1.DocumentText = sb.ToString();
}

For direct manipulation you can maybe use the unmanaged pointer webBrowser1.Document.DomDocument to the document, but this is outside of my knowledge.



回答2:

HtmlAgilityPack definitely can't provide access to nodes in live HTML directly. Since you said there is no distinct style/class/id on the element you have to walk through the nodes manually and find matches.

Assuming HTML is reasonably valid (so both browser and HtmlAgilityPack perform normalization similarly) you can walk pairs of elements starting from the root of both trees and selecting the same child node.

Basically you can build "position-based" XPath to node in one tree and select it in another tree. Xpath would look something like (depending you want to pay attention to just positions or position and node name):

 "/*[1]/*[4]/*[2]/*[7]"
 "/body/div[2]/span[1]/p[3]"

Steps:

  1. In using HtmlNode you've found collect all parent nodes up to the root.
  2. Get root of element of HTML in browser
  3. for each level of children find position of corresponding child on HtmlNodes collection on step 1 in its parent and than find live HtmlElement among children of current live node.
  4. Move to newly found child and go back to 3 till found node you are looking for.


回答3:

the XPath attribute of the HtmlAgilityPack.HtmlNode shows the nodes on the path from root to the node. For example \div[1]\div[2]\table[0]. You can traverse this path in the live document to find the corresponding live element. However this path may not be precise as HtmlAgilityPack removes some tags like <form> then before using this solution add the omitted tags back using

HtmlNode.ElementsFlags.Remove("form");

struct DocNode  
{
    public string Name;
    public int Pos;
}
///// structure to hold the name and position of each node in the path

The following method finds the live element according to the XPath

    static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
    {
        var pattern = @"/(.*?)\[(.*?)\]"; // like div[1]
        // Parse the XPath to extract the nodes on the path
        var matches = Regex.Matches(node.XPath, pattern); 
        List<DocNode> PathToNode = new List<DocNode>();
        foreach (Match m in matches) // Make a path of nodes
        {
            DocNode n = new DocNode();
            n.Name = n.Name = m.Groups[1].Value;
            n.Pos = Convert.ToInt32(m.Groups[2].Value)-1;
            PathToNode.Add(n); // add the node to path 
        }

        HtmlElement elem = null; //Traverse to the element using the path
        if (PathToNode.Count > 0)
        {
            elem = doc.Body; //begin from the body
            foreach (DocNode n in PathToNode)
            {
                //Find the corresponding child by its name and position
                elem = GetChild(elem, n);                    
            }
        }
        return elem;
    }

the code for GetChild Method used above

    public static HtmlElement GetChild(HtmlElement el, DocNode node)
    {
        // Find corresponding child of the elemnt 
        // based on the name and position of the node
        int childPos = 0;
        foreach (HtmlElement child in el.Children)
        {
            if (child.TagName.Equals(node.Name, 
               StringComparison.OrdinalIgnoreCase))
            {
                if (childPos == node.Pos)
                {
                    return child;
                }
                childPos++;
            }                
        }
        return null;
    }