How can I detect whether a string contains HTML (it can be HTML4, HTML5, or just fragments of HTML within text)? I do not need to know the HTML version, only whether the string is plain text or contains HTML. The text is typically multiline and may include empty lines.
Update:
example inputs:
html:
<head><title>I'm title</title></head>
Hello, <b>world</b>
non-html:
<ht fldf d><
<html><head> head <body></body> html
One way I thought of was to parse the text as HTML, collect the start and end tags found, and intersect that set with a known set of acceptable HTML elements.
Example:
Output:
This works for partial text that contains a subset of HTML elements.
NB: This makes use of html5lib, so it may not necessarily work for other document types, but the technique can be adapted easily.
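As a rough sketch of the tag-intersection idea, here is a version that swaps html5lib for the standard library's `html.parser`, so it runs without extra dependencies. The names `TagCollector`, `KNOWN_TAGS` and `looks_like_html` are made up for illustration, and `KNOWN_TAGS` is only a small hand-picked subset of HTML elements, not a complete list.

```python
from html.parser import HTMLParser

# Assumption: a small, hand-picked subset of HTML element names.
# Extend it with whatever elements you consider acceptable.
KNOWN_TAGS = frozenset({
    "html", "head", "title", "body", "div", "span", "p", "a",
    "b", "i", "em", "strong", "ul", "ol", "li", "table", "tr", "td",
    "h1", "h2", "h3", "h4", "h5", "h6",
})


class TagCollector(HTMLParser):
    """Records the names of all start tags and end tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.start_tags = set()
        self.end_tags = set()

    def handle_starttag(self, tag, attrs):
        self.start_tags.add(tag)

    def handle_endtag(self, tag):
        self.end_tags.add(tag)


def looks_like_html(text):
    """True if some known element appears as both a start tag and an end tag."""
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return bool(parser.start_tags & parser.end_tags & KNOWN_TAGS)


print(looks_like_html("Hello, <b>world</b>"))
print(looks_like_html("<ht fldf d><"))
```

With this sketch, a string counts as HTML as soon as it has one matched pair of known tags, so an input with `<body></body>` plus other unclosed tags is still reported as HTML. Requiring every start tag to be closed would be a stricter variant.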
You can use an HTML parser, like BeautifulSoup. Note that it tries its best to parse any HTML, even broken HTML, and how lenient it is depends on the underlying parser.

The check basically tries to find any HTML element inside the string. If one is found, the result is True.

The same check also works on an HTML fragment.
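A minimal sketch of that check, assuming the `bs4` package is installed and using the built-in `html.parser` backend. The helper name `contains_html` is hypothetical.

```python
from bs4 import BeautifulSoup


def contains_html(text):
    """True if BeautifulSoup finds at least one element in the text."""
    # find() with no arguments returns the first tag in the tree, or None.
    soup = BeautifulSoup(text, "html.parser")
    return soup.find() is not None


print(contains_html("Hello, <b>world</b>"))
print(contains_html("Just some text\n\nwith an empty line"))
```

Because the parser is lenient, malformed input that merely resembles a tag may still be treated as an element, so test it against your own non-HTML samples before relying on it.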
Alternatively, you can use lxml.html.

Expanding on the previous post, I would do something like this for something quick and simple:
Check for ending tags. This is the simplest and most robust approach, I believe.
If there is an ending HTML tag, then it looks like HTML; otherwise, not so much.
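The ending-tag check could be sketched with a regular expression along these lines. The pattern and the name `has_end_tag` are my own guess at what "ending tag" should mean here, not a full HTML grammar.

```python
import re

# Matches a closing tag such as </b>, </div > or </h1>.
END_TAG = re.compile(r"</[A-Za-z][A-Za-z0-9-]*\s*>")


def has_end_tag(text):
    """True if the text contains at least one closing tag."""
    return END_TAG.search(text) is not None


print(has_end_tag("Hello, <b>world</b>"))   # True
print(has_end_tag("<ht fldf d><"))          # False
```

Keep in mind that this reports `<html><head> head <body></body> html` as HTML because of the `</body>`, and it misses fragments that only use void elements such as `<br>`.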