I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the target sites.
The problem is that my target node sometimes contains a <script> tag, so the scraped text contains JavaScript code.
Here is a link to a real example of what I'm working with. In this case my target node is //td[@id='contenuStory']. The problem is that there's a <script> tag in the first child div.
I've spent a lot of time searching for a solution on the web and on SO, but I couldn't find anything. I hope I haven't missed something obvious!
Example
HTML response (only the target node):
<div id="content">
    <div id="part1">Some text</div>
    <script>var s = 'javascript I don't want';</script>
    <div id="part2">Some other text</div>
</div>
What I want in my item:
Some text
Some other text
What I get:
Some text
var s = 'javascript I don't want';
Some other text
My code
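Given an XPath selector, I'm using the following function to extract the text:
def getText(hxs):
    if len(hxs) > 0:
        # string(.) concatenates every descendant text node,
        # including the contents of any <script> element
        l = hxs.select('string(.)')
        if len(l) > 0:
            s = l[0].extract().encode('utf-8')
        else:
            # fall back to the raw markup of the first match
            s = hxs[0].extract().encode('utf-8')
        return s
    else:
        return 0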
I've tried using XPath axes (things like child::script) but to no avail.
Try the utility functions from w3lib.html.
You can try an XPath expression that selects all child text nodes of the descendants of //td[@id='contenuStory'] that are not script nodes.
To add space between the text nodes, you can use something like:
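Here is a sketch of both steps, assuming lxml is available and using the sample markup from the question (so the target is //div[@id='content'] rather than the real page's td). The expression is one reading of the description above and may differ from the one originally posted. In a Scrapy spider the same expression would be passed to the response selector instead of lxml.

```python
import lxml.html

# Sample markup from the question, wrapped in a minimal document.
html = """<html><body>
<div id="content">
<div id="part1">Some text</div>
<script>var s = 'javascript I don't want';</script>
<div id="part2">Some other text</div>
</div>
</body></html>"""

doc = lxml.html.document_fromstring(html)

# Text-node children of the target node and of each descendant element,
# skipping any element that is itself a <script>.
expr = "//div[@id='content']/descendant-or-self::*[not(self::script)]/text()"
text_nodes = doc.xpath(expr)

# Drop whitespace-only nodes, then join the rest with a single space.
text = " ".join(t.strip() for t in text_nodes if t.strip())
print(text)  # Some text Some other text
```

Filtering out whitespace-only nodes before joining avoids runs of blank separators coming from the line breaks between elements.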