from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
print line
When printing a line in an HTML file, I'm trying to find a way to only show the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>'
, it will only print 'some text', '<b>hello</b>'
prints 'hello', etc. How would one go about doing this?
You can write your own function:
This is a quick fix and can be even more optimized but it will work fine. This code will replace all non empty tags with "" and strips all html tags form a given input text .You can run it using ./file.py input output
There's a simple way to this:
The idea is explained here: http://youtu.be/2tu9LTDujbw
You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s
PS - If you're interested in the class(about smart debugging with python) I give you a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!
You're welcome! :)
If you want to strip all HTML tags the easiest way I found is using BeautifulSoup:
I tried the code of the accepted answer but I was getting "RuntimeError: maximum recursion depth exceeded", which didn't happen with the above block of code.
For one project, I needed so strip HTML, but also css and js. Thus, I made a variation of Eloffs answer:
You can use either a different HTML parser (like lxml, or Beautiful Soup) -- one that offers functions to extract just text. Or, you can run a regex on your line string that strips out the tags. See http://www.amk.ca/python/howto/regex/ for more.