How can I detect whether a string contains HTML (it can be HTML4, HTML5, or just fragments of HTML within text)? I do not need to know the HTML version, only whether the string is plain text or contains HTML. The text is typically multiline and may include empty lines.

Update:

example inputs:

html:

<head><title>I'm title</title></head>
Hello, <b>world</b>

non-html:

<ht fldf d><
<html><head> head <body></body> html

Tags: python html parsing detect

4 answers

老娘就宠你

#2 · 2019-03-18 00:01

One way I thought of is to collect the start and end tags found while parsing the text as HTML, then intersect that set with a known set of acceptable HTML elements.

Example:

#!/usr/bin/env python

from __future__ import print_function

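try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

# NOTE: this import works with older html5lib releases; later releases
# moved the sanitizer to html5lib.filters.sanitizer.
from html5lib.sanitizer import HTMLSanitizerMixin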


class TestHTMLParser(HTMLParser):

    def __init__(self, *args, **kwargs):
        HTMLParser.__init__(self, *args, **kwargs)

        self.elements = set()

    def handle_starttag(self, tag, attrs):
        self.elements.add(tag)

    def handle_endtag(self, tag):
        self.elements.add(tag)


def is_html(text):
    elements = set(HTMLSanitizerMixin.acceptable_elements)

    parser = TestHTMLParser()
    parser.feed(text)

    return bool(parser.elements.intersection(elements))


print(is_html("foo bar"))
print(is_html("<p>Hello World!</p>"))
print(is_html("<html><head><title>Title</title></head><body><p>Hello!</p></body></html>"))  # noqa

Output:

$ python foo.py
False
True
True

This works for partial text that contains a subset of HTML elements.

NB: This relies on html5lib, so it may not work for other document types, but the technique can be adapted easily.
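
For readers who cannot import HTMLSanitizerMixin, the same intersection idea can be sketched with the standard library alone. This is only an illustration, not the answer's exact method: the names KNOWN_TAGS, TagCollector and contains_known_tag are hypothetical, and the tag list is a small hand-picked assumption rather than html5lib's list.

```python
from html.parser import HTMLParser

# Assumption: a small hand-picked allow-list of tag names; extend as needed.
KNOWN_TAGS = frozenset({
    "html", "head", "title", "body", "div", "span", "p", "a", "b", "i",
    "em", "strong", "br", "ul", "ol", "li", "table", "tr", "td", "th",
    "h1", "h2", "h3", "h4", "h5", "h6", "img",
})


class TagCollector(HTMLParser):
    """Collect the names of all start and end tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.elements = set()

    def handle_starttag(self, tag, attrs):
        self.elements.add(tag)

    def handle_endtag(self, tag):
        self.elements.add(tag)


def contains_known_tag(text):
    """Return True if the text contains at least one tag from KNOWN_TAGS."""
    parser = TagCollector()
    parser.feed(text)
    return bool(parser.elements & KNOWN_TAGS)
```

With this sketch, "Hello, <b>world</b>" is reported as HTML and "<ht fldf d><" is not. The question's second non-HTML example, "<html><head> head <body></body> html", is still reported as HTML, because the check only looks for known tag names and ignores whether the tags are balanced.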


别忘想泡老子

#3 · 2019-03-18 00:14

You can use an HTML parser such as BeautifulSoup. Note that it tries its best to parse HTML, even broken HTML, and how lenient it is depends on the underlying parser:

>>> from bs4 import BeautifulSoup
>>> html = """<html>
... <head><title>I'm title</title></head>
... </html>"""
>>> non_html = "This is not an html"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
>>> bool(BeautifulSoup(non_html, "html.parser").find())
False

This basically tries to find any HTML element inside the string. If one is found, the result is True.

Another example with an HTML fragment:

>>> html = "Hello, <b>world</b>"
>>> bool(BeautifulSoup(html, "html.parser").find())
True

Alternatively, you can use lxml.html:

>>> import lxml.html
>>> html = 'Hello, <b>world</b>'
>>> non_html = "<ht fldf d><"
>>> lxml.html.fromstring(html).find('.//*') is not None
True
>>> lxml.html.fromstring(non_html).find('.//*') is not None
False


何必那么认真

#4 · 2019-03-18 00:15

Expanding on the previous post, for something quick and simple I would do something like this:

import os

if os.path.exists("file.html"):
    ishtml = False
    # Scan the file line by line for a closing </html> tag.
    with open("file.html", mode="r", encoding="utf-8") as checkfile:
        for line in checkfile:
            if line.strip() == "</html>":
                ishtml = True
                break
    if ishtml:
        print("This is an html file")
    else:
        print("This is not an html file")


可以哭但决不认输i

#5 · 2019-03-18 00:18

Check for ending tags. I believe this is the simplest and most robust approach.

"</html>" in possibly_html

If there is an ending html tag, then it looks like html, otherwise not so much.
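
As a rough sketch, the check can be wrapped in a function and tried on the inputs from the question. The name has_closing_html_tag and the case-insensitive comparison are assumptions added here for illustration.

```python
def has_closing_html_tag(text):
    """Case-insensitive substring test for a closing </html> tag."""
    return "</html>" in text.lower()


# A complete document, taken from another answer in this thread.
full_document = "<html>\n<head><title>I'm title</title></head>\n</html>"

# The HTML fragment from the question.
fragment = "Hello, <b>world</b>"

print(has_closing_html_tag(full_document))
print(has_closing_html_tag(fragment))
```

This only recognises complete documents. The fragment has no closing html tag, so it is reported as non-HTML even though the question treats it as HTML.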

