I'm parsing an HTML page and I'm new to this kind of parsing. Could you suggest an approach for parsing the following HTML?
HTML code: http://notepad.cc/share/CFRURbrk3r
For each type of room there is a list of sub-rooms, so I want to group them as parent and children in a list of objects, so that each child can be accessed later.
This is as far as I have got, but it does not yet add anything to the objects. Besides Fizzler, is there any other parser I could use in this case?
var uricontent = File.ReadAllText("TestHtml/Bew.html");
var html = new HtmlDocument(); // with HTML Agility pack
html.LoadHtml(uricontent);
var doc = html.DocumentNode;
var rooms = (from r in doc.QuerySelectorAll(".rates")
             from s in r.QuerySelectorAll(".rooms")
             from rd in r.QuerySelectorAll(".rate")
             select new
             {
                 Name = rd.QuerySelector(".rate-description").InnerText.CleanInnerText(),
                 Price = r.QuerySelector(".rate-price").InnerText.CleanInnerText(),
                 RoomType = s.QuerySelector("tr td h2").InnerText.CleanInnerText()
             }).ToArray();
Update:
Personally, I wouldn't use an array. I would use a List<T>. A List<T> lets you add particular nodes at particular positions and group them accordingly.
Then you could simply:
- Loop (foreach)
- Find
- Sort
- Select
This would allow you to quickly filter through the content, since each item is stored in the list.
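As a rough sketch of what I mean, a List<T> of parents that each hold a List<T> of children could look something like the following. The RoomType and Rate classes, the RoomDemo helper, and the sample values are all made up for illustration; they are not taken from your page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical child type: one rate offered for a room type.
public class Rate
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

// Hypothetical parent type: a room type that owns its rates.
public class RoomType
{
    public string Name { get; set; }
    public List<Rate> Rates { get; } = new List<Rate>();
}

public static class RoomDemo
{
    // Builds a small parent/child list from made-up sample data.
    public static List<RoomType> BuildRooms()
    {
        var rooms = new List<RoomType>();

        var deluxe = new RoomType { Name = "Deluxe" };
        deluxe.Rates.Add(new Rate { Name = "Room only", Price = 120m });
        deluxe.Rates.Add(new Rate { Name = "With breakfast", Price = 135m });
        rooms.Add(deluxe);

        var standard = new RoomType { Name = "Standard" };
        standard.Rates.Add(new Rate { Name = "Room only", Price = 80m });
        rooms.Insert(0, standard); // Insert places an item at a specific position.

        return rooms;
    }

    public static void Main()
    {
        List<RoomType> rooms = BuildRooms();

        // Loop: walk every parent and its children.
        foreach (RoomType room in rooms)
        {
            foreach (Rate rate in room.Rates)
            {
                Console.WriteLine($"{room.Name} / {rate.Name}: {rate.Price}");
            }
        }

        // Find: first parent matching a predicate (null if none matches).
        RoomType found = rooms.Find(r => r.Name == "Deluxe");
        Console.WriteLine(found != null ? found.Rates.Count : 0);

        // Sort: order parents in place by their cheapest rate.
        rooms.Sort((a, b) => a.Rates.Min(x => x.Price).CompareTo(b.Rates.Min(x => x.Price)));

        // Select: project the parents to just their names.
        List<string> names = rooms.Select(r => r.Name).ToList();
        Console.WriteLine(string.Join(", ", names));
    }
}
```

Keeping the children inside the parent object means you never have to re-query the HTML to find out which rates belong to which room type.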
Update:
Another item I forgot to mention: the Html Agility Pack can do the following:
- Grab a particular node / element.
- Grab a parent and all of its child nodes / elements.
It can also load remote or local pages.
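To give an idea of the parent / children part, here is a minimal sketch using the Html Agility Pack's XPath methods. I can't see your actual markup, so the sample HTML, the RoomParser class, and the exact XPath expressions are assumptions for illustration; only the class names (rooms, rate, rate-description, rate-price) are borrowed from your selectors, and you would need to adjust the expressions to the real structure.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RoomParser
{
    // Hypothetical markup for illustration only; the real page will differ.
    private const string SampleHtml = @"
<div class='rooms'>
  <h2>Deluxe</h2>
  <div class='rate'><span class='rate-description'>Room only</span><span class='rate-price'>120</span></div>
  <div class='rate'><span class='rate-description'>With breakfast</span><span class='rate-price'>135</span></div>
</div>
<div class='rooms'>
  <h2>Standard</h2>
  <div class='rate'><span class='rate-description'>Room only</span><span class='rate-price'>80</span></div>
</div>";

    // Maps each room type (parent) to the list of its rates (children).
    public static Dictionary<string, List<string>> Parse(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var result = new Dictionary<string, List<string>>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection parents = document.DocumentNode.SelectNodes("//div[@class='rooms']");
        if (parents == null)
        {
            return result;
        }

        foreach (HtmlNode parent in parents)
        {
            HtmlNode heading = parent.SelectSingleNode(".//h2");
            if (heading == null)
            {
                continue;
            }

            var rates = new List<string>();

            // The leading "." keeps the search inside the current parent node.
            HtmlNodeCollection children = parent.SelectNodes(".//div[@class='rate']");
            if (children != null)
            {
                foreach (HtmlNode child in children)
                {
                    HtmlNode description = child.SelectSingleNode(".//span[@class='rate-description']");
                    HtmlNode price = child.SelectSingleNode(".//span[@class='rate-price']");
                    if (description == null || price == null)
                    {
                        continue;
                    }
                    rates.Add(description.InnerText.Trim() + ": " + price.InnerText.Trim());
                }
            }

            result[heading.InnerText.Trim()] = rates;
        }

        return result;
    }

    public static void Main()
    {
        foreach (KeyValuePair<string, List<string>> room in Parse(SampleHtml))
        {
            Console.WriteLine(room.Key);
            foreach (string rate in room.Value)
            {
                Console.WriteLine("  " + rate);
            }
        }
    }
}
```

For a remote page, new HtmlWeb().Load(url) returns an HtmlDocument that can be queried the same way.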
I would download the Html Agility Pack from NuGet. It is powerful and robust, and it will more than likely make it easier to scrape the desired data. You can install it by following these steps:
- Go to Tools.
- Go to NuGet Package Manager.
- Select Package Manager Console.
- If the Package Manager Console did not open, open it from the lower left of Visual Studio.
- Type the following command:
Install-Package HtmlAgilityPack
A great example can be found in this question.
The premise is simple:
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
// Map the document to the HTML page; filePath is the path to a local HTML file.
document.Load(filePath);
// To load from a string containing the HTML instead, call document.LoadHtml(...).
if (document.DocumentNode != null)
{
    // SelectSingleNode returns null when the XPath expression matches nothing.
    HtmlAgilityPack.HtmlNode bodyNode = document.DocumentNode.SelectSingleNode("//body");
    if (bodyNode != null)
    {
        // Do something with bodyNode.
    }
}
This example shows the syntax, but it should be far easier to grab particular nodes out of the page and manipulate them accordingly with the HtmlAgilityPack.
Hopefully this points you in a better direction.