from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print(line)
When printing a line in an HTML file, I'm trying to find a way to show only the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>', it should print only 'some text'; '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?
A Python 3 adaptation of søren-løvborg's answer
The Beautiful Soup package does this immediately for you.
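A minimal sketch of what that might look like, assuming the third-party beautifulsoup4 package is installed. The helper name text_only is made up for illustration, and the built-in 'html.parser' backend is just one possible choice:

```python
def text_only(html):
    """Return the text content of an HTML fragment, with all tags removed."""
    # Third-party dependency (pip install beautifulsoup4); imported here so
    # it is only required when this helper is actually called.
    from bs4 import BeautifulSoup

    # 'html.parser' is the parser bundled with Python; lxml or html5lib
    # could be substituted if they are installed.
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text()


if __name__ == '__main__':
    print(text_only('<a href="whatever.com">some text</a>'))
```

The same call works on a whole page, for example the joined lines returned by br.response().readlines() in the question.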
I have used Eloff's answer successfully for Python 3.1 [many thanks!].
I upgraded to Python 3.2.3, and ran into errors.
The solution, provided here thanks to the responder Thomas K, is to insert
super().__init__()
into the following code:... in order to make it look like this:
... and it will work for Python 3.2.3.
Again, thanks to Thomas K for the fix and for Eloff's original code provided above!
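As a rough sketch of where that call goes, assuming a tag stripper built on the stdlib html.parser.HTMLParser. The names MLStripper and strip_tags are illustrative and may not match the original code exactly:

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        # Needed on Python 3.2 and later: the parent initialiser sets up
        # the parser's internal state before any data is fed in.
        super().__init__()
        self.fed = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.fed.append(data)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    """Return html with all markup removed."""
    stripper = MLStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_data()


if __name__ == '__main__':
    print(strip_tags('<b>hello</b>'))
```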
This method works flawlessly for me and requires no additional installations:
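One stdlib-only sketch in that spirit, which may differ from the snippet originally posted here, uses a regular expression. The name remove_tags_regex is hypothetical, and the pattern is deliberately naive: it will misbehave on a '>' inside an attribute value, and it keeps the contents of script and style elements.

```python
import re
from html import unescape

# Naive pattern: anything from '<' up to the next '>' is treated as a tag.
TAG_RE = re.compile(r'<[^>]+>')


def remove_tags_regex(text):
    """Drop tag-like spans, then decode entities such as &amp;."""
    return unescape(TAG_RE.sub('', text))


if __name__ == '__main__':
    print(remove_tags_regex('<a href="whatever.com">some text</a>'))
```

For untrusted or messy HTML, a real parser is usually the safer choice.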
I'm parsing GitHub READMEs, and I find that the following works really well:
And then
This removes all Markdown and HTML correctly.
I always used this function to strip HTML tags, as it requires only the Python stdlib:
On Python 2
For Python 3
Note: this works only on Python 3.1. On 3.2 or above, you need to call the parent class's __init__ method. See "Using HTMLParser in Python 3.2".