Parse simple html with pure C++

Posted 2019-08-07 20:54

Question:

In my application I need to parse simple HTML code using as few external libs as possible. My HTML looks like this:

<p> First Content is P </p><h2>Header</h2><p> Text under header </p>
<h2>Header 2</h2><p> Paragraph </p>
<h3>yep</h3><p> no </p>

My HTML contains only the tags p, h2 and h3. I have the following structure:

struct Elements {
    std::string tag;
    std::string content;
};

std::vector<Elements> elems;

My goal is that, after parsing, each Elements in the vector contains data like this:

tag = "h2"
content = "Header"

and

tag = "p"
content = "First Content is P"

PS: I need to get the elements in the order they appear in the HTML.

Edit:

I just did this in JavaScript and it works fine, but I have basically no idea how to write it in C++:

var a = "<p> First Content is P </p><h2>Header</h2><p> Text under header </p>" +
    "<h2>Header 2</h2><p> Paragraph </p>" +
    "<h3>yep</h3><p> no </p>";

var output = [];

a.replace(/<\b[^>]*>(.*?)<\/(.*?)>/gmi, function(m, key, value) {
    output.push({
        tag: value,
        data: key
    });
})

/*
    output:
        { tag: "p", data: "First Content is P"},
        { tag: "h2", data: "Header" }
        .....
 */

Answer 1:

There are only those three elements, and no missing close tags. It also looks as if there are no attributes on the tags, and there aren't even any elements inside other elements. There's no whitespace inside tags either.

Then you are not parsing HTML. You are parsing a special language that is a subset of HTML (well, not even really a subset since your document doesn't validate).

You might have a good reason not to want to use an HTML parser to parse this special language. For example, the code for a full HTML parser is large-ish and perhaps wouldn't otherwise need to be on the very tiny embedded device you're writing for. More likely this is a learning assignment, and the goal is for you to manipulate strings, not to choose the best tool to produce the output you need. I will assume that you must avoid using an HTML library, without asking further why.

So, how to parse this special language? How to parse anything. Given all the restrictions I have listed above, you could do it very simply:

  • Look for the first instance in the string of any one of three substrings <p>, <h2>, <h3>. This is your opening tag.
  • Find the first instance of the corresponding close tag.
  • Everything between is the contents of the element. In your example you additionally trim whitespace at each end of the content. Construct an Elements object and add it to your vector (btw consider using a singular class name, not plural).
  • Repeat on the remainder of the string.
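
The steps above can be sketched in plain C++ roughly as follows. This is only a minimal sketch under the restrictions already listed (only p, h2 and h3, no attributes, no nesting, no whitespace inside tags). The names Element, trim and parseSimpleHtml are made up for this illustration and are not from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Singular name, as suggested above (hypothetical, for illustration).
struct Element {
    std::string tag;
    std::string content;
};

// Strip leading and trailing whitespace from s.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Assumes well-formed input: only <p>, <h2>, <h3>, no attributes, no nesting.
std::vector<Element> parseSimpleHtml(const std::string& html) {
    static const std::array<const char*, 3> tags = {"p", "h2", "h3"};
    std::vector<Element> out;
    std::size_t pos = 0;
    for (;;) {
        // Step 1: find the earliest opening tag at or after pos.
        std::size_t bestStart = std::string::npos;
        std::string bestTag;
        for (const char* t : tags) {
            const std::size_t p = html.find(std::string("<") + t + ">", pos);
            if (p < bestStart) {  // npos is the largest value, so misses never win
                bestStart = p;
                bestTag = t;
            }
        }
        if (bestStart == std::string::npos) break;  // no more elements

        // Step 2: find the first corresponding close tag.
        const std::size_t contentStart = bestStart + bestTag.size() + 2;
        const std::string close = "</" + bestTag + ">";
        const std::size_t closePos = html.find(close, contentStart);
        if (closePos == std::string::npos) break;  // unterminated element: stop
        assert(closePos >= contentStart);  // find() never returns a hit before its offset

        // Step 3: everything in between is the content, trimmed.
        out.push_back({bestTag, trim(html.substr(contentStart, closePos - contentStart))});

        // Step 4: repeat on the remainder of the string.
        pos = closePos + close.size();
    }
    return out;
}
```

Calling parseSimpleHtml on the sample from the question should give seven elements in document order, the first being tag "p" with content "First Content is P". Text outside the three known tags is simply skipped.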

That's it. You could do that using a regular expression, but my general feeling is that since you said you wanted to do it in C++ then you may as well just do it in C++. No need to bring another language into it, and whatever the merits and limits of regexes, they certainly are another language.
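
For comparison, if you did want the regular-expression route, the standard library has had <regex> since C++11, so it still needs no external lib. The following is a rough sketch, not a recommendation. The pattern is my own adaptation to the three tags, not a translation of the JavaScript one in the question, and the names Element and parseWithRegex are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

struct Element {
    std::string tag;
    std::string content;
};

// Assumes the same restrictions as above: only p, h2, h3, no attributes,
// and no '<' inside the content (that is, no nested elements).
std::vector<Element> parseWithRegex(const std::string& html) {
    // Group 1: tag name. Group 2: content without surrounding whitespace.
    // \1 requires the close tag to match the open tag.
    static const std::regex re(R"(<(p|h2|h3)>\s*([^<]*?)\s*</\1>)");

    std::vector<Element> out;
    for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        assert(m.size() == 3);  // whole match plus the two capture groups
        out.push_back({m[1].str(), m[2].str()});
    }
    return out;
}
```

The iterator walks the matches left to right, so the elements come out in document order. Note that html must outlive the loop, which is why it is taken by reference rather than built as a temporary inside the function.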

However, maybe the extra limits I listed above aren't guaranteed. What if you later want to support spaces inside tags? And attributes? And XML namespaces? And comments? Then you'll wish you'd just used an HTML parser. Therefore what you do for a fixed trivial subset of HTML is different from what you do for a significant subset or one that might become significant in future.



Answer 2:

Just a suggestion. To speed up the parser, change struct Elements to something like

struct Node {
    const char* ptrToNodeStart;  // points into the original document buffer
    int nodeLen;                 // length of the slice
    // constructors etc. ...
};

struct Elements {
    Node tag;
    Node content;
};

The main idea is to avoid memory allocation for tags and content, because you already have the whole document in memory. Just keep it there and operate with pointers. It is much faster: with pointers, the parsing procedure can finish before a single allocation would have completed.

As your parser runs through the document, it creates a new Node (taken from a preallocated pool) and stores the current pointer in Node::ptrToNodeStart. When a new node starts (or the current one is closed), you set Node::nodeLen and complete the Elements entry. That is the idea.

There is a serious problem with struct Elements, though: it does not fit the structure of HTML. An HTML node normally contains other nodes, so Elements would need to be nested. Parsing HTML is an interesting task, even though there are tons of parsers already on the market. Good luck.
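
A hedged sketch of this pointer-plus-length idea, assuming C++17 is available: std::string_view is exactly a pointer and a length, so it stands in for the hand-written Node here (a swap of mine, not the code above). The names ElementView, trimView and parseViews are made up for this illustration. The views are only valid while the source buffer is alive and unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

struct ElementView {
    std::string_view tag;      // slice of the source buffer, no copy
    std::string_view content;  // slice of the source buffer, no copy
};

// Trim whitespace by narrowing the view, no allocation involved.
std::string_view trimView(std::string_view s) {
    constexpr std::string_view ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string_view::npos) return {};
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Assumes well-formed input with only <p>, <h2>, <h3> and no nesting.
std::vector<ElementView> parseViews(std::string_view html) {
    constexpr std::size_t npos = std::string_view::npos;
    std::vector<ElementView> out;  // the only allocation is the vector itself
    std::size_t pos = 0;
    while ((pos = html.find('<', pos)) != npos) {
        const std::size_t nameEnd = html.find('>', pos);
        if (nameEnd == npos) break;
        const std::string_view tag = html.substr(pos + 1, nameEnd - pos - 1);
        if (tag != "p" && tag != "h2" && tag != "h3") {
            ++pos;  // not one of the known opening tags: skip this '<'
            continue;
        }
        // Look for "</tag>" without building a temporary string.
        std::size_t close = nameEnd + 1;
        for (;;) {
            close = html.find("</", close);
            if (close == npos) return out;  // unterminated element: stop
            const std::size_t after = close + 2 + tag.size();
            if (html.compare(close + 2, tag.size(), tag) == 0 &&
                after < html.size() && html[after] == '>') {
                break;
            }
            close += 2;
        }
        assert(close > nameEnd);  // the close tag lies after the opening tag
        out.push_back({tag, trimView(html.substr(nameEnd + 1, close - nameEnd - 1))});
        pos = close + tag.size() + 3;  // skip past "</tag>"
    }
    return out;
}
```

If the number of elements is known roughly in advance, calling reserve on the vector plays the role of the preallocated pool. Like struct Elements, this flat layout still cannot represent nested nodes.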