from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print(line)
When printing a line of an HTML file, I want to show only the contents of each HTML element, not the markup itself. For example, '<a href="whatever.com">some text</a>' should print only 'some text', '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?
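One way to approach this with only the standard library is to subclass `html.parser.HTMLParser`, keep the text nodes, and ignore the tags. The following is a minimal sketch assuming Python 3; the names `TagStripper` and `strip_tags` are illustrative and not taken from any answer in this thread.

```python
from html.parser import HTMLParser
from io import StringIO


class TagStripper(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text
        super().__init__(convert_charrefs=True)
        self._buf = StringIO()

    def handle_data(self, data):
        # Called for the text between tags
        self._buf.write(data)

    def get_text(self):
        return self._buf.getvalue()


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()
```

Calling `strip_tags` on each line read from the response would give the text only. Note that the contents of `<script>` and `<style>` elements are also delivered as text by this parser, so they would need separate handling.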
If you need to preserve HTML entities (e.g. &amp;), I added a "handle_entityref" method to Eloff's answer.
An lxml.html-based solution (lxml is built on native C libraries and is therefore typically much faster than a pure-Python solution).
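A minimal sketch of what an lxml.html-based approach could look like, assuming the third-party `lxml` package is installed. `html_to_text` is a hypothetical helper name; `lxml.html.fromstring` and `text_content()` are existing lxml calls.

```python
import lxml.html


def html_to_text(html):
    # Parse the fragment and return the concatenated text of all descendants
    element = lxml.html.fromstring(html)
    return element.text_content()
```

`text_content()` also includes the text inside `<script>` and `<style>` elements, so the input usually needs to be cleaned first if those may be present.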
If you need more control over what exactly is sanitized before converting to text, you might want to use the lxml Cleaner explicitly by passing the options you want to its constructor, for example:
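A hedged sketch of using the Cleaner explicitly. The option set chosen here (`scripts`, `javascript`, `comments`, `style`) is only an assumption about what one might want removed, and `sanitize_to_text` is a hypothetical name. In recent lxml releases the cleaner lives in the separate `lxml_html_clean` package, which has to be installed for this import to work.

```python
import lxml.html
# Newer lxml releases need the separate lxml_html_clean package for this import
from lxml.html.clean import Cleaner


def sanitize_to_text(dirty_html):
    # Drop scripts, inline JavaScript, comments and style blocks
    cleaner = Cleaner(scripts=True, javascript=True, comments=True, style=True)
    doc = lxml.html.fromstring(dirty_html)
    cleaned = cleaner.clean_html(doc)  # returns a cleaned copy of the element
    return cleaned.text_content()
```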
With BeautifulSoup, html2text, or the code from @Eloff, some HTML elements or JavaScript code usually remain in the output.
So you can combine these libraries and then strip the Markdown formatting (Python 3):
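As an illustration of the last step only, here is a rough, standard-library sketch of removing Markdown formatting from text that has already been converted (for instance by html2text). The function name `strip_markdown` and the regular expressions are assumptions; they cover only a few common constructs (links, images, headings, asterisk emphasis, inline code) and are heuristics rather than a full Markdown parser.

```python
import re


def strip_markdown(text):
    # Images: keep the alt text
    text = re.sub(r'!\[([^\]]*)\]\([^)]*\)', r'\1', text)
    # Links: keep the link label
    text = re.sub(r'\[([^\]]*)\]\([^)]*\)', r'\1', text)
    # ATX headings: drop the leading '#' markers
    text = re.sub(r'^\s{0,3}#{1,6}\s+', '', text, flags=re.M)
    # Bold, then italics (asterisk style only, to avoid touching snake_case)
    text = re.sub(r'\*\*(.+?)\*\*', r'\1', text)
    text = re.sub(r'\*(.+?)\*', r'\1', text)
    # Inline code: drop the backticks
    text = re.sub(r'`([^`]*)`', r'\1', text)
    return text.strip()
```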
It works well for me, but it can certainly be improved.