I am looking for a Python module that will help me strip HTML tags but keep the text values. I tried BeautifulSoup before and couldn't figure out how to do this simple task. I searched for Python modules that could do this, but they all seem to depend on other libraries, which does not work well on AppEngine.
Below is some sample code using Ruby's sanitize library; this is what I am after in Python:
require 'rubygems'
require 'sanitize'
html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
Sanitize.clean(html) # => 'foo'
Thanks for your suggestions.
-e
Late, but.
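You can use jinja2.Markup() and its striptags() method (in recent Jinja2 releases the Markup class is imported from markupsafe instead):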
http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags
This gives you a list of (Unicode) strings. If you want to turn it into a single string, use
''.join(thatlist)
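For what it's worth, the same "collect the text pieces into a list, then join them" idea can be sketched with only the standard library, which may help on AppEngine where extra dependencies are a problem. This is a minimal sketch for Python 3, not the code from the answer above: the TextCollector class and the text_only function are hypothetical names made up for this illustration, and html.parser is swapped in as the parser.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []  # list of text pieces, in document order

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def text_only(html):
    """Return the text content of an HTML fragment with tags removed."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
print(text_only(html))  # prints: foo
```

Note that this only removes tags. It is not a sanitizer, and the contents of script or style elements would still come through as text.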
Using lxml:
If you don't want to use separate libraries, you can import Django's standard utilities, namely strip_tags from django.utils.html. For example:
It is also already included in Django templates as the striptags filter, so you don't need anything else; just apply the filter, for example {{ value|striptags }}.
By the way, this is one of the fastest ways.
Prints: