Detecting HTML5 paragraph endings — HTML Serializa

Posted 2019-07-09 23:28

Question:

I would like to write a Perl program that makes reasonable HTML5 markup in $_ better (“more valid” – I know, it sounds like “more pregnant”). Specifically, I want to properly close paragraphs with </p> tags, just where browsers would close them. It's a step on the way to converting HTML to XHTML. This helps me in subsequent text analysis of full paragraphs.

The HTML5 spec says that

  1. A p element must have a start tag.

  2. A p element’s end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, dir, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hr, menu, nav, ol, p, pre, section, table, or ul element,

  3. or if there is no more content in the parent element and the parent element is not an a element.

Problems:

  1. I believe it is possible to see paragraphs where this is not true. HTML browsers infer and insert <p> by themselves. For example, <h1>HEADER</h1> Now is… will insert a <p> just before the Now is…. Am I mistaken?

  2. Let’s assume the HTML content creator has already inserted the <p> correctly. I now need to search forward until it ends. Detecting an opening from the list of 26 tags which close a paragraph is easy.

  3. But how can I detect whether there is more content in the paragraph's parent element? Can I just search for the next </…> from the set of the above 26 tags, or do I need to code a full stack machine (assuming all contents in paragraphs themselves are valid XHTML) to detect the end of the enclosing container?
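The "easy" part of problem 2 can be sketched in a few lines of core Perl. This is only an illustration: the helper name closes_paragraph is hypothetical, and it assumes it is handed a single tag that has already been isolated from the markup.

```perl
use strict;
use warnings;

# The 26 element names from the rule quoted above; a start tag for any
# of them implicitly ends an open p element.
my @closers = qw(address article aside blockquote dir div dl fieldset
                 footer form h1 h2 h3 h4 h5 h6 header hr menu nav ol p
                 pre section table ul);
my %closes_p = map { $_ => 1 } @closers;

# Hypothetical helper: returns 1 if $tag (e.g. "<DIV class='x'>") is a
# start tag that ends an open paragraph, 0 otherwise.
sub closes_paragraph {
    my ($tag) = @_;
    # End tags, comments and text do not match this pattern.
    my ($name) = $tag =~ m{^<([A-Za-z][A-Za-z0-9]*)} or return 0;
    return $closes_p{ lc $name } ? 1 : 0;
}
```

A hash lookup on the lower-cased element name avoids a long alternation regex and handles tags written in upper case.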

Thanks to @Palec I now understand that paragraphs are an odd concept in HTML. Try this:

<!DOCTYPE html>
<html>
<head>
<style>
    p { color: blue; }
    p:before { content:"[SP]"; }
    p:after { content:"[EP]"; }
</style>
</head>

<body>

l0

<h1> h1 </h1>

l0

<p> para

<p> para </p>

l0

<p>para
<ol>
<li> l0 <p> para </li>
</ol>
l0

</body>
</html>

This shows that not all text is at least a paragraph. I did confuse it with the LaTeX concept… and thought that whatever was at “level 0” was a paragraph by default. It is not.

Answer 1:

Three concepts of paragraph

HTML 5 has two separate concepts: the p element and the paragraph. I will call the latter a structural paragraph. In the real world I found at least two other related concepts: the logical paragraph and the typographical paragraph.

The p element is clear. You know it, you already quoted its description from the spec.

The (structural) paragraph is a somewhat strange concept to me. Maybe it is used by screen readers or whatever. Its definition basically says that it is a non-empty run of phrasing content not interrupted by other types of content (not taking a, ins, del and map into account).

Logical paragraph is what I think human beings consider a paragraph. It is a unit of text that carries a single thought. When another (probably related) thought begins, the paragraph breaks and a new one begins. It is composed from a sequence of sentences.

Each sentence can have not only its linguistic structure, but also can contain formatting. Formatting is not limited to what HTML calls phrasing content, but I’ll add at least multi-line preformatted code snippets, lists, math formulas (possibly spanning multiple lines, display math from TeX) and anything else that can be used in the middle of a sentence or between sentences while not breaking the train of thought. This big difference between logical paragraph and the other two concepts can be seen in my question List or longer code snippet inside paragraph.

A typographical paragraph consists of a sequence of lines, not sentences, and can contain whatever the typographical system can handle inside. I originally thought it was exactly the same concept as the logical paragraph, but it is not.

It came to my mind when thinking about TeX. You may know it from LaTeX, which is just a large set of definitions for TeX and has the same notion of paragraph. Content is buffered until \par (or an empty line, which translates to \par internally) is met, then it is flushed to the output as a single paragraph. What looks like one (logical) paragraph can internally be several paragraphs, as this has to be used to implement some more complicated behavior of the typesetting algorithm. From this point of view it resembles a structural paragraph more.

Answers to your questions

  1. A (structural) paragraph begins after h1 element if just a text node is present. But this is not a p element. It cannot be styled in CSS using p selector, it is not present in the DOM tree of the document etc.

    There are certain places where element tags are not in the markup but still the elements are created. This is the case with those elements whose start tag can be omitted. These are html, head, body, colgroup and tbody. (At least tbody used to behave differently in HTML 4, this behavior comes from XHTML. In HTML it just need not exist.) p element is not the case, however.

  2. If the content creator did not insert <p> correctly (it was not valid HTML 5), how would you be supposed to correct it? Once it is not correct, you cannot generally assume anything about it. Also omitting the end tag is not incorrect! Not a question really in this list item, so going further…

  3. Are you really assuming valid XHTML 5 (i.e. XML serialization of HTML 5, specifically all tags closed)? OK, then you need to track document tree depth info (or stack if you need the data in structured form). Otherwise you would have to implement full HTML 5 parsing as there might be e.g. option with omitted end tag inside (within a select). This would break your depth tracking.

    The paragraph closes when one of the named elements starts or when </p> closing tag is met or when end of parent element is met. Mmmm. When you assume valid XHTML only inside, you still need to implement closing rules for all elements to be able to detect end of parent element… This will not be easy.
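The three closing conditions above can be sketched with a simple stack in core Perl. This is a rough illustration under strong assumptions, not a real HTML 5 parser: it assumes that everything except p end tags is properly nested and closed, that there are no stray "<" characters in text, scripts or styles, and that a stray </p> with no open p can simply be dropped. The names close_paragraphs, %closes_p and %void are hypothetical.

```perl
use strict;
use warnings;

# Start tags that implicitly end an open p element.
my %closes_p = map { $_ => 1 } qw(address article aside blockquote dir div dl
    fieldset footer form h1 h2 h3 h4 h5 h6 header hr menu nav ol p pre
    section table ul);
# Void elements never get an end tag, so they are never pushed.
my %void = map { $_ => 1 } qw(area base br col embed hr img input link meta
    param source track wbr);

# Hypothetical helper: insert the </p> tags implied by the three rules.
sub close_paragraphs {
    my ($html) = @_;
    my @stack;        # names of currently open elements
    my $out = '';
    # Tokens: comments, tags (including the doctype), and runs of text.
    for my $tok ($html =~ /(<!--.*?-->|<[^>]*>|[^<]+)/gs) {
        if ($tok =~ m{^<([A-Za-z][A-Za-z0-9]*)}) {          # start tag
            my $name = lc $1;
            if ($closes_p{$name} && @stack && $stack[-1] eq 'p') {
                $out .= '</p>';        # a listed element starts
                pop @stack;
            }
            push @stack, $name unless $void{$name} || $tok =~ m{/>$};
        }
        elsif ($tok =~ m{^</([A-Za-z][A-Za-z0-9]*)}) {      # end tag
            my $name = lc $1;
            if ($name ne 'p' && @stack && $stack[-1] eq 'p') {
                $out .= '</p>';        # the parent ends: close the p first
                pop @stack;
            }
            if (@stack && $stack[-1] eq $name) { pop @stack }
            elsif ($name eq 'p')               { next }  # stray </p>: drop it
        }
        $out .= $tok;
    }
    $out .= '</p>' if @stack && $stack[-1] eq 'p';          # end of input
    return $out;
}
```

For example, close_paragraphs("<ol><li> l0 <p> para </li></ol>") returns "<ol><li> l0 <p> para </p></li></ol>". As soon as the input contains other omitted end tags (li, option, td and so on) the stack gets out of sync, which is exactly why a real parser is the safer route.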

HTML to XML serialization conversion of HTML 5

In a comment you said that converting HTML 5 to XHTML 5 is your use case.

Do not use regexes!

Regexes were not designed to do such complicated tasks as parsing HTML. Anything you try would be just a heuristic. True regular expressions cannot parse HTML at all, because HTML is not a regular language. Let’s put aside that perlre is much more powerful; with great power comes great responsibility and you should not use the power when it is wrong. There is an extremely famous answer to a question on this topic here on SO, a real piece of art. Jeff Atwood wrote more on the topic, quoting this answer at the beginning and explaining the importance of understanding your tools in the rest of the article.

I believe that a text-level approach to this goal is bad. HTML is often referred to as tag soup and, in contrast with what Wikipedia says, I have seen this term used for the text-level approach to creating and amending HTML in general (namely document.write() and element.innerHTML).

By the way, this is one thing that XHTML solved really well by abolition. In JavaScript you can’t use document.write() with XHTML. If it works, you are using an HTML parser with an XHTML document – use the Content-Type HTTP header with application/xhtml+xml; charset=utf-8 instead of the text/html MIME type you use.

Use DOM

The Clean Solution™ is DOM.

I believe you should implement (or use someone else's implementation of) an HTML parser, get the DOM tree, and write a serializer to XHTML. If the input document is not valid, refuse to process it. Or add switches to your program that tell it how to fix certain errors that the parsing algorithm is not designed to handle. There could be many ways.

I am not sure which parts of the spec you are free to ignore if you are not interested in them. The parsing algorithm is standardized and the error handling is specified too. You could find a shortcut where you don’t need to create a part of the DOM tree and just leave the corresponding part of input unparsed, but you have to be sure that you continue parsing at the right position of input. This could get messy and is definitely error-prone. Therefore I recommend you not to do that.

Practical solution

In practice, it seems you can use at least two existing modules.

Mojolicious is a web framework that contains the Mojo::DOM module. If you do not need DOM manipulation and you want just parsing and serialization, you could use the underlying Mojo::DOM::HTML. HTML can be parsed by Mojo::DOM using my $dom = Mojo::DOM->new($html_markup);, the resulting DOM object can be set to use XML serialization by $dom->xml(1); and the serialization can be returned as $xhtml_markup = "$dom"; or $xhtml_markup = $dom->to_string();. From the Mojo::DOM POD: “Mojo::DOM is a minimalistic and relaxed HTML/XML DOM parser with CSS selector support. It will even try to interpret broken XML, so you should not use it for validation.” Example use is in the answer by amon. You may want to use this solution if you already use Mojolicious; otherwise installing a whole big framework is overkill for this job.

The HTML::HTML5::Parser and HTML::HTML5::Writer modules can be used for parsing and serialization of HTML 5 respectively. They seem to have only a few dependencies. Nice code using these can be found in the answer by tobyink, their author. This should be a solution for those not using Mojolicious already.



Answer 2:

  1. Tags are not inferred during parsing. It's OK for most elements to contain text that isn't included in another tag. You might want to look at the Document Object Model which lies beneath the HTML syntax. There are not only element nodes, but also text nodes.

  2. Yes, it is that easy.

  3. Reorder the problem so that a tag is not closed by a closing tag that may be missing, but that a tag is closed when there is no more input left that belongs inside the tag. Once a tag has been closed, a directly following corresponding closing tag would be discarded.

However, you shouldn't try to make the HTML “more valid”. Either it is valid, or it isn't. HTML5 includes many error-correction rules (one of which this question is about). If there's nothing in the spec, this likely means that it's impossible to fix sensibly.

Also, there already exist many good HTML parsers. For example, with Mojolicious you can do:

use Mojo::DOM;  # load the parser class that is actually used below

my $bad_html = <<'END';
<p> foo
<p> bar
END

my $dom = Mojo::DOM->new($bad_html);  # parse it into a data structure
my $good_html = "$dom";  # stringifying the data structure makes it good HTML

Output:

<p> foo
</p><p> bar
</p>


Answer 3:

OK, this seems to work for me...

#!/usr/bin/env perl

use strict;
use warnings;
use HTML::HTML5::Parser;
use HTML::HTML5::Writer;

my $parser = HTML::HTML5::Parser->new;
my $writer = HTML::HTML5::Writer->new(polyglot => 1);

my $dom = $parser->load_html(IO => \*DATA);

# Loop through all the elements that contain a paragraph
for my $e ( $dom->findnodes('//*[local-name()="p"]/..') )
{
   # Find any text that's floating around free in that element
   for my $t ( $e->findnodes('./text()') )
   {
      # Strip out excess whitespace and skip whitespace-only text nodes
      (my $text = $t->data) =~ s/^\s+|\s+$//g;
      next unless length $text;

      # Create a new paragraph element containing the text
      my $new_node = $e->addNewChild($e->namespaceURI, 'p');
      $new_node->appendText($text);

      # Replace free text with a nice paragraph
      $t->replaceNode($new_node);
   }
}

print $writer->document($dom), "\n";

__DATA__
<!DOCTYPE html>
<html>
<head>
<style>
    p { color: blue; }
    p:before { content:"[SP]"; }
    p:after { content:"[EP]"; }
</style>
</head>

<body>

l0

<h1> h1 </h1>

l0

<p> para

<p> para </p>

l0

<p>para
<ol>
<li> l0 <p> para </li>
</ol>
l0

</body>
</html>


Answer 4:

This may get you started, or I could be completely on the wrong track..

I found p tags that need matches and then just added them before the next instance of

which won't solve the problem but I think it's on the right track. I think you need to find out the depth of the DOM tree between p and where the end p should be and then figure out the best spot for the end tag.

I think you almost have to do this recursively (which this isn't yet) and then possibly even look back to find the parent of p and match right before the end tag.

This is a difficult problem, but here is the rough Perl hack I came up with:

#!/usr/bin/perl -w

use strict;

my $html = $_;

# /s lets "." match newlines, so the body may span several lines
if($html =~ /(.*<body>)(.*)(<\/body>.*)/si) {
    # save the captures now: the inner match below overwrites $1..$3
    my ($head, $inner, $tail) = ($1, $2, $3);
    while($inner =~ /(<p.*?>)(.[^<>]*)(?!((<address|<article|<aside|<blockquote|<dir|<div|<dl|<fieldset|<footer|<form|<h1|<h2|<h3|<h4|<h5|<h6|<header|<hr|<menu|<nav|<ol|<p|<pre|<section|<table|<ul)(.*?>)))([.^<>]*)(?!<\/p>)/gi) {
        #do stuff here?
    }
    $html = "$head$inner$tail";
}


Answer 5:

I think the following Perl code should be conservative enough that it closes paragraphs in many cases without inserting bad closers. YMMV...

  my $list= qr/address|article|aside|blockquote|dir|div|dl|fieldset|footer|form|h1|h2|h3|h4|h5|h6|header|hr|menu|nav|ol|p|pre|section|table|ul|html|body|li|dt|dd/;

  my $last=$_;
  while (s/(\<p\b.*?\>)(.*?)(\<\/?$list\b.*?\>)/fixup($1,$2,$3)/gmse) {
    ($last eq $_) and last;
    $last= $_;
  }

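  sub fixup {
    # $open: the <p ...> tag, $body: its content, $next: the following tag
    # (named this way because $a and $b are reserved for sort)
    my ($open,$body,$next) = @_;
    # the paragraph is already closed explicitly: leave it alone
    ($next =~ /<\/p>/) and return "$open$body$next";
    return "$open$body</p>$next";
  }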