I am using Scrapy to crawl a site which seems to be appending random values to the query string at the end of each URL. This is turning the crawl into a sort of infinite loop.
How do I make Scrapy ignore the query string part of the URLs?
See urlparse.urlparse (urllib.parse.urlparse in Python 3).
Example code:
from urlparse import urlparse

o = urlparse('http://url.something.com/bla.html?querystring=stuff')
# rebuild the URL from its scheme, host and path, dropping the query string
url_without_query_string = o.scheme + "://" + o.netloc + o.path
Example output:
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> o = urlparse('http://url.something.com/bla.html?querystring=stuff')
>>> url_without_query_string = o.scheme + "://" + o.netloc + o.path
>>> print url_without_query_string
http://url.something.com/bla.html
>>>
There is a function url_query_cleaner in the w3lib.url module (used by Scrapy itself) to clean URLs, keeping only a list of allowed arguments.
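For example, a minimal sketch (the example URL and the choice of 'id' as the only allowed argument are illustrative):
from w3lib.url import url_query_cleaner

# keep only the whitelisted 'id' argument; all other query parameters are stripped
print(url_query_cleaner('http://example.com/item?id=1&rnd=42', ('id',)))
# -> http://example.com/item?id=1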
Provide some code, so we can help you.
If you are using CrawlSpider and Rules with SgmlLinkExtractor, provide a custom function for the process_value parameter of the SgmlLinkExtractor constructor. See the documentation for BaseSgmlLinkExtractor:
from urlparse import urlparse

def delete_random_garbage_from_url(url):
    # strip the query string, keeping only scheme, host and path
    parsed = urlparse(url)
    return parsed.scheme + "://" + parsed.netloc + parsed.path

Rule(
    SgmlLinkExtractor(
        # ... your allow, deny parameters, etc
        process_value=delete_random_garbage_from_url,
    )
)
You can use the urllib.parse.urlsplit() function. The result is a structured parse result, a named tuple with added functionality. Use the namedtuple._replace() method to alter the parsed result values, then use the SplitResult.geturl() method to get a URL string again. To remove the query string, set the query value to None:
from urllib.parse import urlsplit
updated_url = urlsplit(url)._replace(query=None).geturl()
Demo:
>>> from urllib.parse import urlsplit
>>> url = 'https://example.com/example/path?query_string=everything+after+the+questionmark'
>>> urlsplit(url)._replace(query=None).geturl()
'https://example.com/example/path'
For Python 2, the same function is available under the urlparse.urlsplit() name.
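For instance, assuming Python 2.6 or newer (where SplitResult is a named tuple) and the same example URL as above, the equivalent shell session would be:
>>> from urlparse import urlsplit
>>> urlsplit(url)._replace(query=None).geturl()
'https://example.com/example/path'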
You could also use the urllib.parse.urlparse() function; for URLs without any path parameters, the result would be the same. The two functions differ in how path parameters are handled: urlparse() only supports path parameters for the last segment of the path, while urlsplit() leaves path parameters in place in the path, leaving parsing of such parameters to other code. Since path parameters are rarely used these days (later URL RFCs have dropped the feature altogether), the difference is academic. urlparse() uses urlsplit() under the hood and, for URLs without path parameters, adds nothing other than extra overhead. It is better to just use urlsplit() directly.
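To make the difference concrete, here is what the two functions return for a URL with a path parameter (the example URL is illustrative):
>>> from urllib.parse import urlparse, urlsplit
>>> urlparse('http://example.com/path;param=1?query=stuff')
ParseResult(scheme='http', netloc='example.com', path='/path', params='param=1', query='query=stuff', fragment='')
>>> urlsplit('http://example.com/path;param=1?query=stuff')
SplitResult(scheme='http', netloc='example.com', path='/path;param=1', query='query=stuff', fragment='')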
If you are using BaseSpider, before yielding a new request, manually remove the random values from the query part of the URL using urlparse:
from urlparse import urljoin, urlsplit
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item_urls = hxs.select(".//a[@class='...']/@href").extract()
    for item_url in item_urls:
        item_url = urljoin(response.url, item_url)
        # drop the query string, which carries the random values
        item_url = urlsplit(item_url)._replace(query='').geturl()
        self.log('Found item URL: %s' % item_url)
        yield Request(item_url, callback=self.parse_item)
You can also use this method to remove the query string from a URL by splitting at the first '?':
urllink = "http://url.something.com/bla.html?querystring=stuff"
url_final = urllink.split('?')[0]
print(url_final)
Output will be: http://url.something.com/bla.html
(Note that this also discards any fragment that follows the question mark.)