I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I assumed I could just target the divs containing content with XPath, but after looking into how Wikipedia builds its articles, I quickly discovered it wouldn't be so easy. The best way to separate the content once I have the page is to select what sits between two sets of h2 tags.
Example:
<h2>Title</h2> <div>Some Content</div> <h2>Title</h2>
Here I would want to get the div between the two sets of headers. I tried doing this with XPath, but with no luck at all. I am going to look more into XPath because I think that's what I need to achieve this, but before I dig too deep into it, I would like to hear what you think: is XPath the right way to go, or do I have other, easier options? I am writing the application in C#, if that makes any difference.
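As a language-neutral sketch of the kind of selection being described (this is an illustration, not code from the post: Java's built-in XPath engine is used here, but the same expression would work from C# via System.Xml's XmlNode.SelectNodes), one way to grab a div sitting between two h2 siblings is:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import java.io.ByteArrayInputStream;

public class BetweenHeadings {
    public static void main(String[] args) throws Exception {
        // Well-formed version of the question's snippet (assumption: a root element wraps it)
        String xml = "<root><h2>Title1</h2><div>Some Content</div><h2>Title2</h2></root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // Select the div that has an h2 sibling before it AND an h2 sibling after it
        Node div = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("//div[preceding-sibling::h2 and following-sibling::h2]",
                          doc, XPathConstants.NODE);
        System.out.println(div.getTextContent()); // prints: Some Content
    }
}
```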
With the help of kjhughes' suggestion, I managed to get the code working.
I was unable to make the = 'Text' part work, so I replaced it with [text() = 'text'].
That alone wasn't enough, as the title of the content I need is located inside a span in an h2 tag, so I had to adapt the XPath a bit more. This is what I came up with:
I tested it using http://www.xpathtester.com/xpath on this HTML:
Which gave me the following result:
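The OP's exact expression, test HTML, and result were not preserved in this copy. As a hedged reconstruction of the described situation (a simplified Wikipedia-style layout where the section title sits in a span inside the h2; the expression is a plausible form with a text() predicate, not necessarily the one actually used), shown in Java for illustration:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import java.io.ByteArrayInputStream;

public class SpanTitle {
    public static void main(String[] args) throws Exception {
        // Simplified, hypothetical markup: the heading text lives in a span inside the h2
        String xml = "<root>"
                + "<h2><span>History</span></h2>"
                + "<div>History content</div>"
                + "<h2><span>Geography</span></h2>"
                + "<div>Geography content</div>"
                + "</root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // Match the h2 whose span carries the wanted title, then take the next div sibling
        Node div = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("//h2[span[text() = 'History']]/following-sibling::div[1]",
                          doc, XPathConstants.NODE);
        System.out.println(div.getTextContent()); // prints: History content
    }
}
```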
Yes, you're on the right track with XPath -- it's ideal for selecting parts of an XML document.
For example, for this XML, this XPath will select this content between the two h2 titles, as requested.

Update to address OP's self-answer:
For this new XML example, the XPath I provided above can easily be adapted to select this XML, as requested.
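The answer's code blocks also did not survive in this copy. One common way to adapt the between-two-headings idea so that it collects every block belonging to one named section (a sketch of the general technique, not necessarily the answer's exact expression) is to keep each div whose nearest preceding h2 carries the wanted title:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class SectionBlocks {
    public static void main(String[] args) throws Exception {
        // Hypothetical section with more than one block between the headings
        String xml = "<root>"
                + "<h2><span>History</span></h2>"
                + "<div>Para 1</div><div>Para 2</div>"
                + "<h2><span>Geography</span></h2>"
                + "<div>Other</div>"
                + "</root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // preceding-sibling::h2[1] is the NEAREST preceding h2 (the axis counts backwards),
        // so this keeps only divs whose own section heading is 'History'
        NodeList divs = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//div[preceding-sibling::h2[1][span = 'History']]",
                          doc, XPathConstants.NODESET);
        for (int i = 0; i < divs.getLength(); i++) {
            System.out.println(divs.item(i).getTextContent());
        }
        // prints Para 1 and Para 2, but not Other
    }
}
```

The [1] on the preceding-sibling step is what stops the selection at the next h2, since divs after the 'Geography' heading no longer have 'History' as their nearest preceding h2.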