from mechanize import Browser

br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print(line)  # loop body must be indented; print(...) works on Python 2 and 3
When printing a line in an HTML file, I'm trying to find a way to only show the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>', it will only print 'some text'; '<b>hello</b>' prints 'hello', etc. How would one go about doing this?
I needed a way to strip tags and decode HTML entities to plain text. The following solution is based on Eloff's answer (which I couldn't use because it strips entities).
A quick test:
Result:
Error handling:
An invalid named HTML entity (such as &apos;, which is valid in XML and XHTML, but not plain HTML) will cause a ValueError exception.
[...] ValueError exception.
Security note: Do not confuse HTML stripping (converting HTML into plain text) with HTML sanitizing (converting plain text into HTML). This answer will remove HTML and decode entities into plain text – that does not make the result safe to use in an HTML context.
Example:
&lt;script&gt;alert("Hello");&lt;/script&gt; will be converted to <script>alert("Hello");</script>, which is 100% correct behavior, but obviously not sufficient if the resulting plain text is inserted as-is into an HTML page.
The rule is not hard: any time you insert a plain-text string into HTML output, you should always HTML-escape it (using cgi.escape(s, True), or html.escape(s) on Python 3, where cgi.escape was removed in 3.8), even if you "know" that it doesn't contain HTML (e.g. because you stripped HTML content).
(However, the OP asked about printing the result to the console, in which case no HTML escaping is needed.)
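As a rough sketch of that rule on Python 3: the helper name to_html is made up for illustration, and html.escape is the standard-library replacement for cgi.escape.

```python
import html


def to_html(plain_text):
    """Escape a plain-text string so it can be placed into HTML output."""
    # quote=True (the default) also escapes " and ', which matters inside attributes.
    return html.escape(plain_text, quote=True)


# Text that came out of a tag stripper is still just plain text.
stripped = '<script>alert("Hello");</script>'
print(to_html(stripped))
# &lt;script&gt;alert(&quot;Hello&quot;);&lt;/script&gt;
```

Escaping at the point of output means it does not matter whether the stripper missed something.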
Python 3.4+ version: (with doctest!)
Note that HTMLParser has improved in Python 3 (meaning less code and better error handling).
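A minimal sketch of what such a Python 3 version could look like. This is not the original code: the class name _HTMLToText, the function name html_to_text, and the choice to hide script/style content and turn p/br into newlines are all my assumptions.

```python
from html.parser import HTMLParser


class _HTMLToText(HTMLParser):
    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self._buf = []
        self._hidden = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden = True
        elif tag in ("p", "br"):
            self._buf.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._hidden = False

    def handle_data(self, data):
        if data and not self._hidden:
            self._buf.append(data)

    def get_text(self):
        return "".join(self._buf).strip()


def html_to_text(html):
    """Strip tags and decode entities into plain text.

    >>> html_to_text('<b>Hello &amp; goodbye</b>')
    'Hello & goodbye'
    """
    parser = _HTMLToText()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

The doctest can be run with python -m doctest on the file that contains the function.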
The HTMLParser-based solutions can all be defeated if they run only once:
results in:
which is exactly what you intended to prevent. If you use an HTML parser, repeat the pass and count the replaced tags until zero are replaced:
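A minimal sketch of the "repeat until zero tags are replaced" idea. To keep it short it uses a regex pass instead of HTMLParser, and both the pattern and the nested payload are assumptions chosen for illustration.

```python
import re

# Assumed single-pass stripper: "<", one or more non-"<" characters, ">".
TAG_RE = re.compile(r"<[^<]+?>")


def strip_tags_once(text):
    """One pass only; tags nested inside other tags can survive."""
    return TAG_RE.sub("", text)


def strip_tags_until_clean(text):
    """Repeat the pass until a pass replaces zero tags."""
    while True:
        text, replaced = TAG_RE.subn("", text)
        if replaced == 0:
            return text


# Hypothetical input: removing <b> and </b> assembles a <script> element.
payload = '<<b>script>alert("hacked")<</b>/script>'
print(strip_tags_once(payload))         # <script>alert("hacked")</script>
print(strip_tags_until_clean(payload))  # alert("hacked")
```

re.subn returns the new string together with the number of substitutions, which is the count the loop needs.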
I haven't thought much about the cases it will miss, but you can do a simple regex:
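A minimal sketch of such a regex, assuming the pattern is <[^<]+?> (the names TAG_RE and remove_tags are only illustrative):

```python
import re

# "<", then one or more characters that are not "<" (non-greedy), then ">".
TAG_RE = re.compile(r"<[^<]+?>")


def remove_tags(text):
    """Delete everything that looks like a tag and keep the text in between."""
    return TAG_RE.sub("", text)


print(remove_tags('<a href="whatever.com">some text</a>'))  # some text
print(remove_tags("<b>hello</b>"))                           # hello
```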
For those that don't understand regex, this searches for a string <...>, where the inner content is made of one or more (+) characters that aren't a <. The ? means that it will match the smallest string it can find. For example, given <p>Hello</p>, it will match <p> and </p> separately with the ?. Without it, it will match the entire string <..Hello..>.
If a non-tag < appears in HTML (e.g. 2 < 3), it should be written as an escape sequence &... anyway, so the ^< may be unnecessary.
Why do all of you do it the hard way? You can use the BeautifulSoup get_text() feature.
Here's a solution similar to the currently accepted answer (https://stackoverflow.com/a/925630/95989), except that it uses the internal HTMLParser class directly (i.e. no subclassing), thereby making it significantly more terse:
Short version!
Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn't.
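A minimal sketch of a short version along these lines. The pattern is an assumption (comments, plus anything between angle brackets) and not necessarily MarkupSafe's exact expression; html.escape stands in for whatever escaping function you normally use, and strip_and_escape is a made-up name.

```python
import html
import re

# Assumed pattern: HTML comments, or "<" up to the next ">".
TAG_RE = re.compile(r"(<!--.*?-->|<[^>]*>)")


def strip_and_escape(user_input):
    # Remove well-formed tags, fixing mistakes by legitimate users.
    no_tags = TAG_RE.sub("", user_input)
    # Clean up anything else by escaping it.
    return html.escape(no_tags)


print(strip_and_escape("<i>italicizing</i> things & stuff"))
# italicizing things &amp; stuff
```

Like the quick version described above, this does not decode HTML entities that were already in the input.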
Why can't I just strip the tags and leave it?
It's one thing to keep people from <i>italicizing</i> things, without leaving i's floating around. But it's another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave things like unclosed comments (<!--) and angle-brackets that aren't part of tags (blah <<<><blah) intact. The HTMLParser version can even leave complete tags in, if they're inside an unclosed comment.
What if your template is {{ firstname }} {{ lastname }}? firstname = '<a' and lastname = 'href="http://evil.com/">' will be let through by every tag stripper on this page (except @Medeiros!), because they're not complete tags on their own. Stripping out normal HTML tags is not enough.
Django's strip_tags, an improved (see next heading) version of the top answer to this question, comes with a warning in its documentation: the result is not guaranteed to be HTML-safe, so it must never be marked safe without escaping it first.
Follow their advice!
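The firstname/lastname case can be reproduced in a few lines. This is only a sketch: the regex stripper and the str.format call standing in for the template engine are my assumptions, not anyone's production code.

```python
import html
import re

# Assumed stripper: removes anything from "<" up to the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(value):
    return TAG_RE.sub("", value)


firstname = "<a"
lastname = 'href="http://evil.com/">'

# Neither value is a complete tag on its own, so stripping changes nothing.
rendered = "{} {}".format(strip_tags(firstname), strip_tags(lastname))
print(rendered)  # <a href="http://evil.com/">

# Escaping each value when it is inserted closes the hole.
escaped = "{} {}".format(html.escape(firstname), html.escape(lastname))
print(escaped)   # &lt;a href=&quot;http://evil.com/&quot;&gt;
```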
To strip tags with HTMLParser, you have to run it multiple times.
It's easy to circumvent the top answer to this question.
Look at this string (source and discussion):
The first time HTMLParser sees it, it can't tell that the <img...> is a tag. It looks broken, so HTMLParser doesn't get rid of it. It only takes out the <!-- comments -->, leaving you with
This problem was disclosed to the Django project in March, 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop until running it again doesn't change the string:
Of course, none of this is an issue if you always escape the result of strip_tags().
Update 19 March, 2015: There was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.
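The following is not Django's code, just a minimal sketch of the same idea on Python 3: run one HTMLParser pass, and repeat only while a pass still makes the string shorter. The names _TagStripper, strip_once and strip_tags_repeatedly are made up, and the "stop when nothing gets shorter" check is my assumption about a simple way to guarantee that the loop terminates.

```python
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects text content and re-emits entity references unchanged."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def get_text(self):
        return "".join(self.parts)


def strip_once(value):
    """One parser pass; tags that only form after this pass survive it."""
    stripper = _TagStripper()
    stripper.feed(value)
    stripper.close()
    return stripper.get_text()


def strip_tags_repeatedly(value):
    """Repeat strip_once until a pass no longer shortens the string."""
    while "<" in value and ">" in value:
        new_value = strip_once(value)
        if len(new_value) >= len(value):
            break  # no progress; stopping here also rules out an endless loop
        value = new_value
    return value
```

Entities are passed through untouched on purpose, so that a second pass cannot turn &lt;b&gt; into a real tag and then strip it.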
Good things to copy or use
My example code doesn't handle HTML entities - the Django and MarkupSafe packaged versions do.
My example code is pulled from the excellent MarkupSafe library for cross-site scripting prevention. It's convenient and fast (with C speedups to its native Python version). It's included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django templates from Django 1.7.
Django's strip_tags and other HTML utilities from a recent version are good, but I find them less convenient than MarkupSafe. They're pretty self-contained; you could copy what you need from this file.
If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like "my users can italicize things, but they can't make iframes."
Understand the properties of your tag stripper! Run fuzz tests on it! Here is the code I used to do the research for this answer.
sheepish note - The question itself is about printing to the console, but this is the top Google result for "python strip html from string", so that's why this answer is 99% about the web.