I'm trying to get rid of <script> tags and the content inside them using BeautifulSoup. I went to the documentation and it seems to be a really simple function to call. More information about the function is here. Here is the content of the HTML page that I have parsed so far...
<body class="pb-theme-normal pb-full-fluid">
<div class="pub_300x250 pub_300x250m pub_728x90 text-ad textAd text_ad text_ads text-ads text-ad-links" id="wp-adb-c" style="width: 1px !important;
height: 1px !important;
position: absolute !important;
left: -10000px !important;
top: -1000px !important;
">
</div>
<div id="pb-f-a">
</div>
<div class="" id="pb-root">
<script>
(function(a){
TWP=window.TWP||{};
TWP.Features=TWP.Features||{};
TWP.Features.Page=TWP.Features.Page||{};
TWP.Features.Page.PostRecommends={};
TWP.Features.Page.PostRecommends.url="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/hybrid.json?callback\x3d?";
TWP.Features.Page.PostRecommends.trackUrl="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/tracker.json?callback\x3d?";
TWP.Features.Page.PostRecommends.profileUrl="https://usersegment.wpdigital.net/usersegments";
TWP.Features.Page.PostRecommends.canonicalUrl=""
})(jQuery);
</script>
</div>
</body>
Imagine you have some web content like that in a BeautifulSoup object called soup_html. If I run soup_html.script.decompose() and then call the object soup_html, the script tags are still there. How can I get rid of the <script> tags and the content inside them?
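from bs4 import BeautifulSoup

markup = 'The html above'  # placeholder for the page source shown above
soup = BeautifulSoup(markup, 'html.parser')  # name the parser explicitly
html_body = soup.body
soup.script.decompose()  # removes only the first <script> tag
html_body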
This would remove a single script element from the "Soup" only. Instead, I think you meant to decompose all of them:
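Removing every script tag can be sketched as a loop over find_all. This is a minimal sketch, assuming bs4 is installed; the markup string is a small stand-in made up for illustration, not the page from the question.

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption for illustration, not the original page).
markup = """
<body>
<div id="pb-root">
<script>var a = 1;</script>
<p>kept</p>
<script>var b = 2;</script>
</div>
</body>
"""

soup = BeautifulSoup(markup, "html.parser")

# find_all returns every <script> tag in the document; decompose()
# removes each tag together with its contents from the tree.
for script in soup.find_all("script"):
    script.decompose()

remaining_scripts = soup.find_all("script")
print(soup.body)
```

After the loop, the printed body should contain the div and the paragraph but no script elements.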
I was able to fix the issue with the following code...
The error was that the with open(... was part of the for match...
Code that did not work...
Calling soup.script.decompose() would only remove it from the soup variable, not the html_body variable. You would have to remove it from the html_body variable as well. (I think.)
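Whether the two variables really hold separate copies can be checked directly. A minimal sketch, assuming a made-up document with a single script tag: in bs4, soup.body returns a Tag that lives inside the same tree as soup rather than a copy, so the check below looks at html_body after decomposing through soup.

```python
from bs4 import BeautifulSoup

# Hypothetical one-script document, used only for this check.
soup = BeautifulSoup(
    "<body><script>x()</script><p>text</p></body>", "html.parser"
)
html_body = soup.body  # a Tag inside soup's tree, not a separate copy

# Decompose the first (and here, only) script via the soup variable.
soup.script.decompose()

# Look for a script tag through the html_body variable.
script_left_in_body = html_body.find("script")
print(script_left_in_body)
```

In this sketch the lookup through html_body finds nothing, which suggests the leftover tags in the question come from decompose() being called on only the first script, not from html_body being a separate object.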
To elaborate on the answer provided by alecxe, here is a full script for anyone's reference:
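A minimal sketch of such a script, assuming bs4 is installed. The helper name strip_scripts and the shortened sample string are hypothetical, chosen for illustration; this is not necessarily the same code the answer had in mind.

```python
from bs4 import BeautifulSoup


def strip_scripts(html: str) -> str:
    """Return the markup with every <script> tag and its contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Decompose each script tag found anywhere in the document.
    for script in soup.find_all("script"):
        script.decompose()
    return str(soup)


# Shortened stand-in for the page in the question (an assumption).
sample = (
    '<body class="pb-theme-normal pb-full-fluid">'
    '<div class="" id="pb-root">'
    "<script>TWP=window.TWP||{};</script>"
    "</div>"
    "</body>"
)

cleaned = strip_scripts(sample)
print(cleaned)
```

Wrapping the loop in a function keeps the parsing and the cleanup in one place, so the same call can be reused for each page that is read in.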