How do I remove a query string from a URL?

Posted 2019-03-26 00:32

Question:

I am using scrapy to crawl a site which seems to be appending random values to the query string at the end of each URL. This is turning the crawl into a sort of an infinite loop.

How do I make Scrapy ignore the query string part of the URLs?

Answer 1:

See the urlparse module (renamed to urllib.parse in Python 3).

Example code:

from urlparse import urlparse
o = urlparse('http://url.something.com/bla.html?querystring=stuff')

url_without_query_string = o.scheme + "://" + o.netloc + o.path

Example output:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> o = urlparse('http://url.something.com/bla.html?querystring=stuff')
>>> url_without_query_string = o.scheme + "://" + o.netloc + o.path
>>> print url_without_query_string
http://url.something.com/bla.html
>>> 
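On Python 3 the module moved to urllib.parse, so the same approach becomes:

```python
from urllib.parse import urlparse

o = urlparse('http://url.something.com/bla.html?querystring=stuff')
# Rebuild the URL from scheme, host and path, dropping the query string
url_without_query_string = o.scheme + "://" + o.netloc + o.path
print(url_without_query_string)  # http://url.something.com/bla.html
```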


Answer 2:

There is a url_query_cleaner() function in the w3lib.url module (used by Scrapy itself) that cleans URLs, keeping only a list of allowed arguments.
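For illustration, a rough stdlib equivalent of that idea (keep only an allow-list of parameters, assuming simple key=value pairs) could look like this sketch:

```python
from urllib.parse import urlsplit, urlencode, parse_qsl

def keep_only_params(url, allowed):
    """Keep only the query parameters named in `allowed`; drop everything else."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in allowed]
    return parts._replace(query=urlencode(kept)).geturl()

# The hypothetical 'session' parameter stands in for the random tracking value
print(keep_only_params('http://example.com/page?id=7&session=abc123', ['id']))
# http://example.com/page?id=7
```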



Answer 3:

Provide some code, so we can help you.

If you are using CrawlSpider with Rules and SgmlLinkExtractor, pass a custom function to the process_value parameter of the SgmlLinkExtractor constructor.

See documentation for BaseSgmlLinkExtractor

def delete_random_garbage_from_url(url):
    cleaned_url = ... # process url somehow
    return cleaned_url

Rule(
    SgmlLinkExtractor(
         # ... your allow, deny parameters, etc
         process_value=delete_random_garbage_from_url,
    )
)
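As a concrete sketch, assuming the random garbage lives entirely in the query string, delete_random_garbage_from_url could simply strip the query with urllib.parse (the urlparse module on Python 2):

```python
from urllib.parse import urlsplit

def delete_random_garbage_from_url(url):
    """Drop the entire query string, keeping scheme, host, path and fragment."""
    return urlsplit(url)._replace(query='').geturl()

print(delete_random_garbage_from_url('http://url.something.com/bla.html?tracking=xyz'))
# http://url.something.com/bla.html
```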


Answer 4:

You can use the urllib.parse.urlsplit() function. The result is a structured parse result, a named tuple with added functionality.

Use the namedtuple._replace() method to alter the parsed result values, then use the SplitResult.geturl() method to get a URL string again.

To remove the query string, set the query value to None:

from urllib.parse import urlsplit

updated_url = urlsplit(url)._replace(query=None).geturl()

Demo:

>>> from urllib.parse import urlsplit
>>> url = 'https://example.com/example/path?query_string=everything+after+the+questionmark'
>>> urlsplit(url)._replace(query=None).geturl()
'https://example.com/example/path'

For Python 2, the same function is available under the urlparse.urlsplit() name.

You could also use the urllib.parse.urlparse() function; for URLs without any path parameters, the result would be the same. The two functions differ in how path parameters are handled: urlparse() only supports path parameters on the last segment of the path, while urlsplit() leaves path parameters in place in the path, leaving their parsing to other code. Since path parameters are rarely used these days (later URL RFCs have dropped the feature altogether), the difference is academic. urlparse() calls urlsplit() internally, so for URLs without path parameters it adds nothing but overhead; it is better to just use urlsplit() directly.
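The difference shows up on a URL that does carry a path parameter (;key=value on the last path segment):

```python
from urllib.parse import urlparse, urlsplit

url = 'http://example.com/path;key=value?q=1'

# urlparse() splits the path parameter off the last path segment
p = urlparse(url)
print(p.path, p.params)   # /path key=value

# urlsplit() leaves the parameter embedded in the path
s = urlsplit(url)
print(s.path)             # /path;key=value
```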



Answer 5:

If you are using BaseSpider, before yielding a new request, manually remove the random values from the query part of the URL using urlparse:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item_urls = hxs.select(".//a[@class='...']/@href").extract()
    for item_url in item_urls:
        item_url = urlparse.urljoin(response.url, item_url)
        # remove the bad part of the query string here, e.g. drop it entirely:
        item_url = urlparse.urlsplit(item_url)._replace(query='').geturl()
        self.log('Found item URL: %s' % item_url)
        yield Request(item_url, callback=self.parse_item)


Answer 6:

You can also simply split the URL on the first '?' to remove the query string:

urllink = "http://url.something.com/bla.html?querystring=stuff"
url_final = urllink.split('?')[0]
print(url_final)

The output will be: http://url.something.com/bla.html
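Note that splitting on '?' also discards any #fragment that follows the query string. If the fragment matters, a urlsplit-based sketch keeps it:

```python
from urllib.parse import urlsplit

url = 'http://url.something.com/bla.html?querystring=stuff#section2'

print(url.split('?')[0])                          # fragment is lost
print(urlsplit(url)._replace(query='').geturl())  # fragment survives
```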