Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case, and I mainly want to get the body text (the article) and maybe even a few tab names here and there. I have tried the suggestion in this SO question, but it returns lots of `<script>` tags and HTML comments, which I don't want. I can't figure out which arguments the function `findAll()` needs in order to get only the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, etc.?
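Roughly, what I tried looks like the sketch below (a reconstruction with toy HTML, since the actual test page is only linked above):

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Test</title><script>var x = 1;</script></head>
<body><!-- a comment --><p>Visible article text.</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Returns every text node -- including the <script> contents and the
# HTML comment, which is exactly the problem described above.
texts = soup.findAll(text=True)
```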
Using BeautifulSoup, this is the easiest way, with the least code, to get just the strings, without empty lines and junk.
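A sketch of that approach using the `stripped_strings` generator (the toy HTML is just for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1>\n\n<p>Some   body text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .stripped_strings yields each text node with surrounding whitespace
# removed, skipping strings that are entirely whitespace.
text = " ".join(soup.stripped_strings)
```

Note that this still includes the contents of `<script>` and `<style>` tags if the page has any; for truly "visible" text you would filter those out as in the other answers.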
The accepted answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ASCII characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page down to its visible text.
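A sketch of that filter (the function names are mine, and the exact list of parent tags to skip is a judgment call):

```python
from bs4 import BeautifulSoup, Comment

def tag_visible(element):
    # Text whose parent is one of these tags is never rendered.
    if element.parent.name in ("style", "script", "head", "title", "meta", "[document]"):
        return False
    if isinstance(element, Comment):  # HTML comments are not visible either
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, "html.parser")
    texts = soup.find_all(string=True)
    return " ".join(t.strip() for t in filter(tag_visible, texts))

page = ("<html><head><title>T</title><script>var x;</script></head>"
        "<body><!--hidden--><p>Hello world</p></body></html>")
print(text_from_html(page))
```

Using `isinstance(element, Comment)` avoids the str() encoding problem entirely, since nothing is ever converted to a byte string.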
If you care about performance, here's another, more efficient way: `soup.strings` is an iterator, and it returns `NavigableString` objects, so you can check the parent's tag name directly, without going through multiple loops.

The title is inside an `<nyt_headline>` tag, which is nested inside an `<h1>` tag and a `<div>` tag with id "article". Should work.
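The `soup.strings` approach described above can be sketched as follows (the set of parent tags to skip is my assumption):

```python
from bs4 import BeautifulSoup

INVISIBLE_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}

def visible_strings(soup):
    # soup.strings is a generator of NavigableString objects; each one
    # knows its parent, so a single pass is enough -- no nested loops.
    for s in soup.strings:
        if s.parent.name not in INVISIBLE_PARENTS:
            yield s

html = "<html><head><script>var x;</script></head><body><p>Article text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
result = list(visible_strings(soup))
```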
The article body is inside an `<nyt_text>` tag, which is nested inside a `<div>` tag with id "articleBody". Inside the `<nyt_text>` element, the text itself is contained within `<p>` tags. Images are not within those `<p>` tags. It's difficult for me to experiment with the syntax, but I expect a working scrape to look something like this.

While I would generally suggest using Beautiful Soup, if anyone is looking to display the visible parts of malformed HTML (e.g. where you have just a segment or line of a web page) for whatever reason, the following will remove content between
`<` and `>` tags:
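A minimal sketch of that tag-stripping approach; note that the tag regex alone leaves `<script>`/`<style>` contents behind, so removing those blocks first is my added assumption:

```python
import re

def strip_tags(fragment):
    # Drop <script>/<style> blocks entirely first, since their inner
    # text is not visible content (assumes simple, well-delimited blocks).
    fragment = re.sub(r"(?is)<(script|style).*?>.*?</\1>", "", fragment)
    # Then remove anything between < and >, even in a malformed fragment.
    return re.sub(r"<[^>]*>", "", fragment)
```

Because it is pure string processing, this works on a lone line of HTML where a real parser might choke or guess at structure.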
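Going back to the article-body answer above, a working scrape might look something like this sketch. `<nyt_text>` is a non-standard tag, but BeautifulSoup treats unknown tags like any other; the toy markup below just mimics the structure described:

```python
from bs4 import BeautifulSoup

# Toy stand-in for the article markup described above: the body text
# lives in <p> tags inside <nyt_text>, inside a <div id="articleBody">.
html = """
<div id="articleBody">
  <nyt_text>
    <p>First paragraph.</p>
    <img src="photo.jpg">
    <p>Second paragraph.</p>
  </nyt_text>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the custom tag, then collect only its <p> descendants,
# which skips the images automatically.
article = soup.find("nyt_text")
paragraphs = [p.get_text() for p in article.find_all("p")]
body = "\n".join(paragraphs)
```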