This question already has an answer here: Strip HTML from strings in Python (21 answers)
I have a text like this:
text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""
Using pure Python, with no external modules, I want to get this:
>>> print remove_tags(text)
Title A long text..... a link
I know I can do it using lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using only builtins or the standard library, for 2.6+.
How can I do that?
Python has several XML modules built in. The simplest one for the case that you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention.
Note that stripping tags with a simple pattern match isn't perfect, since if you had something like, say, <a title=">"> it would break. However, it's about the closest you'd get in non-library Python without a really complex function.
However, as lvc mentions, xml.etree is available in the Python Standard Library, so you could probably just adapt it to serve like your existing lxml version.
Using a regex
Using a regex, you can strip out every tag, that is, everything enclosed in <>.
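A minimal sketch of the regex approach, assuming a non-greedy pattern that matches from each < to the nearest >. The names TAG_RE and remove_tags are illustrative (remove_tags is borrowed from the question), and the whitespace collapsing is an addition to get single-line output. Like any regex-based stripping, it is confused by a > inside an attribute value.

```python
import re

# Non-greedy match from "<" to the nearest ">"; DOTALL lets a tag span lines.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)

def remove_tags(text):
    """Delete every <...> span, then collapse runs of whitespace."""
    stripped = TAG_RE.sub("", text)
    return " ".join(stripped.split())

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

print(remove_tags(text))  # Title A long text........ a link
```

This runs on Python 2.6+ and 3.x alike, since print is called with a single argument.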
Using BeautifulSoup
You could also use the additional BeautifulSoup package to extract all the raw text. You will need to explicitly set a parser when calling BeautifulSoup. I recommend "lxml", as mentioned in alternative answers: it is much more robust than the default 'html.parser' (the one available without an additional install).
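A rough sketch of the BeautifulSoup route, assuming the third-party beautifulsoup4 package is installed (it is not part of the standard library). The helper name text_from_html is made up for illustration, and 'html.parser' is the default here only because it needs no extra install; pass "lxml" instead if you have it.

```python
try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None  # package not installed

def text_from_html(html, parser="html.parser"):
    """Return the text content of html with whitespace collapsed."""
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is required for this approach")
    soup = BeautifulSoup(html, parser)
    # get_text() concatenates all text nodes; split/join tidies the spacing.
    return " ".join(soup.get_text().split())
```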
But this does not spare you from using an external library, so I recommend the first solution.
There's a simple way to do this that would work in any C-like language. The style is not Pythonic, but it works in pure Python.
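A sketch of that character-by-character approach, assuming a small state machine with two pieces of state: whether we are inside a tag, and whether we are inside a quoted attribute value. The function name remove_html_markup and the details are illustrative, not necessarily the exact code from the linked videos.

```python
def remove_html_markup(s):
    """Strip tags by scanning s one character at a time."""
    in_tag = False   # True while between "<" and the closing ">"
    quote = None     # the quote character we are inside, tracked only within a tag
    out = []
    for c in s:
        if in_tag:
            if quote:
                if c == quote:       # closing quote ends the attribute value
                    quote = None
            elif c in ('"', "'"):    # opening quote of an attribute value
                quote = c
            elif c == ">":           # ">" outside quotes closes the tag
                in_tag = False
        elif c == "<":
            in_tag = True
        else:
            out.append(c)            # ordinary text outside any tag
    return "".join(out)

print(remove_html_markup('<a title=">">a link</a>'))  # a link
```

Because quotes are tracked only inside tags, a > within an attribute value no longer ends the tag early, and quotation marks in ordinary text pass through untouched. Wrapping the result in " ".join(result.split()) gives the single-line output the question asks for.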
The idea is based on a simple finite-state machine and is explained in detail here: http://youtu.be/2tu9LTDujbw
You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s
PS - If you're interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!