I need to extract some text from a string in Visual Basic, like this:
<div id="div">
<h2 id="id-date">09.09.2010</h2> , here to extract the date
<h3 id="nr">000</h3> , here a number </div>
I need to extract the date and the number, both from within the div. This will also run in a loop, meaning there are more div blocks that need to be parsed. Thank you! Adrian
You should not be parsing HTML with regular expressions because HTML is not a regular language, as stated by Daniel Vandersluis. You can use the HTML Agility Pack instead.
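For illustration, here is a minimal sketch of the HTML Agility Pack approach in C#. It assumes the HtmlAgilityPack NuGet package is referenced and that the markup looks like the sample in the question. The class name and the XPath expressions are assumptions made for this example; the same calls can be made from Visual Basic.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AgilityPackSketch
{
    public static void Main()
    {
        // Sample input modelled on the question; two blocks to show the loop.
        string html =
            "<div id=\"div\"><h2 id=\"id-date\">09.09.2010</h2><h3 id=\"nr\">000</h3></div>" +
            "<div id=\"div\"><h2 id=\"id-date\">10.09.2010</h2><h3 id=\"nr\">001</h3></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select only the divs that contain both expected children.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection divs =
            doc.DocumentNode.SelectNodes("//div[h2[@id='id-date'] and h3[@id='nr']]");
        if (divs == null)
        {
            Console.WriteLine("No matching div blocks found.");
            return;
        }

        foreach (HtmlNode div in divs)
        {
            // The XPath is relative to the current div, so each block is read on its own.
            string date = div.SelectSingleNode("./h2[@id='id-date']").InnerText.Trim();
            string number = div.SelectSingleNode("./h3[@id='nr']").InnerText.Trim();
            Console.WriteLine("Date: {0}, Number: {1}", date, number);
        }
    }
}
```

Because the parser builds a document tree, this keeps working if the tags gain extra attributes or the whitespace changes, which is the main reason to prefer it over a regex.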
Try this, taken from this link:
If your HTML tags have attributes, then here is my solution:

Example (using C#):
Why not just use the HTML Agility Pack?
Parsing HTML with regex is not ideal. Others have suggested the HTML Agility Pack. However, if you can guarantee that your input is well-defined and you always know what to expect then using a regex is possible.
If you can make that guarantee, read on. Otherwise you need to consider the other suggestions or define your input better. In fact, you should define your input better regardless because my answer makes a few assumptions. Some questions to consider:
- Will the HTML always follow the structure <div>...<h2...>...</h2><h3...>...</h3></div>? Or can there be h1-h6 tags?
- Whatever the hN tags, will the date and number always be between the tags with id-date and nr values for the id attribute?

Depending on the answers to these questions the pattern can change. The following code assumes each HTML fragment follows the structure you shared, that it will have an h2 and h3 with date and number, respectively, and that each tag will be on a new line. If you feed it different input it will likely break until the pattern is adjusted to match your input's structure.

The pattern can be on one line but I broke it up for clarity.
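A minimal sketch of such a pattern in C#, assuming exactly the structure from the question (a div containing an h2 with id id-date followed by an h3 with id nr). The class and method names (DivParser, TryParse) and the group names date and number are made up for illustration; the Regex calls are the same from Visual Basic.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivParser
{
    // Split over several lines for readability; the pieces concatenate into one pattern.
    // [^>]* lets each tag carry extra attributes; .*? is lazy so a match stops at the
    // first closing tag it can use.
    private const string Pattern =
        @"<div[^>]*>\s*" +
        @"<h2[^>]*id=""id-date""[^>]*>(?<date>.*?)</h2>.*?" +
        @"<h3[^>]*id=""nr""[^>]*>(?<number>.*?)</h3>.*?" +
        @"</div>";

    // Returns true and fills date/number when the fragment matches the assumed structure.
    public static bool TryParse(string html, out string date, out string number)
    {
        Match m = Regex.Match(html, Pattern, RegexOptions.Singleline);
        date = m.Success ? m.Groups["date"].Value : string.Empty;
        number = m.Success ? m.Groups["number"].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string input =
            "<div id=\"div\">\n" +
            "<h2 id=\"id-date\">09.09.2010</h2>\n" +
            "<h3 id=\"nr\">000</h3>\n" +
            "</div>";

        if (TryParse(input, out string date, out string number))
        {
            Console.WriteLine("Date: {0}, Number: {1}", date, number);
        }
        else
        {
            Console.WriteLine("Input did not match the expected structure.");
        }
    }
}
```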
RegexOptions.Singleline is used to allow the . metacharacter to match \n (newlines).

You also said that this will run in a loop, with more div blocks to be parsed.
Are you looping over separate strings? Or are you expecting multiple occurrences of the above HTML structure in a single string? If the former, the above code should be applied to each string. For the latter you'll want to use Regex.Matches and treat each Match result similarly to the above piece of code.

EDIT: here is some sample code to demonstrate parsing multiple occurrences.
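A minimal sketch of the multiple-occurrence case in C#, assuming a single string that holds several div blocks with the structure from the question. MultiDivParser and ParseAll are hypothetical names chosen for this example, and the pattern is one possible way to express that structure.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultiDivParser
{
    // Same assumed structure as in the question: a div with an h2 (id-date) and an h3 (nr).
    private const string Pattern =
        @"<div[^>]*>\s*" +
        @"<h2[^>]*id=""id-date""[^>]*>(?<date>.*?)</h2>.*?" +
        @"<h3[^>]*id=""nr""[^>]*>(?<number>.*?)</h3>.*?" +
        @"</div>";

    // Collects one (date, number) pair per div block found in the input.
    public static List<(string Date, string Number)> ParseAll(string html)
    {
        var results = new List<(string Date, string Number)>();

        // Regex.Matches returns every non-overlapping match, in document order.
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.Singleline))
        {
            results.Add((m.Groups["date"].Value, m.Groups["number"].Value));
        }

        return results;
    }

    public static void Main()
    {
        string input =
            "<div id=\"div\">\n<h2 id=\"id-date\">09.09.2010</h2>\n<h3 id=\"nr\">000</h3>\n</div>\n" +
            "<div id=\"div\">\n<h2 id=\"id-date\">10.09.2010</h2>\n<h3 id=\"nr\">001</h3>\n</div>";

        foreach (var (date, number) in ParseAll(input))
        {
            Console.WriteLine("Date: {0}, Number: {1}", date, number);
        }
    }
}
```

One caveat with this sketch: if a block is missing its h3, the lazy .*? can run on into the next div, so it is only as reliable as the guarantee about the input's structure.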