Java HttpClient seems to be caching content

Posted 2019-06-24 04:30

Question:

I'm building a simple web scraper and I need to fetch the same page a few hundred times. The page contains a dynamic attribute that should change on each request. I've built a multithreaded class based on HttpClient to process the requests, and I'm using an ExecutorService to create a thread pool and run the threads. The problem is that the dynamic attribute sometimes doesn't change between requests, and I end up getting the same value on 3 or 4 consecutive threads. I've read a lot about HttpClient and I really can't find where this problem comes from. Could it be caused by caching or something similar?

Update: here is the code executed in each thread:

HttpContext localContext = new BasicHttpContext();

HttpParams params = new BasicHttpParams();
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);
HttpProtocolParams.setContentCharset(params,
        HTTP.DEFAULT_CONTENT_CHARSET);
HttpProtocolParams.setUseExpectContinue(params, true);

ClientConnectionManager connman = new ThreadSafeClientConnManager();

DefaultHttpClient httpclient = new DefaultHttpClient(connman, params);

HttpHost proxy = new HttpHost(inc_proxy, Integer.valueOf(inc_port));
httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY,
        proxy);

HttpGet httpGet = new HttpGet(url);
httpGet.setHeader("User-Agent",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");

String iden = null;
int timeoutConnection = 10000;
HttpConnectionParams.setConnectionTimeout(httpGet.getParams(),
        timeoutConnection);

try {

    HttpResponse response = httpclient.execute(httpGet, localContext);

    HttpEntity entity = response.getEntity();

    if (entity != null) {

        InputStream instream = entity.getContent();
        String result = convertStreamToString(instream);
        // System.out.printf("Resultado\n %s",result +"\n");
        instream.close();

        iden = StringUtils
                .substringBetween(result,
                        "<input name=\"iden\" value=\"",
                        "\" type=\"hidden\"/>");
        System.out.printf("IDEN:%s\n", iden);
        EntityUtils.consume(entity);
    }

} catch (ClientProtocolException e) {
    // HTTP protocol error (for example, a malformed response)
    System.out.println("Protocol exception: " + e.getMessage());
} catch (IOException e) {
    // Connection problem or aborted transfer
    System.out.println("I/O exception: " + e.getMessage());
}

Answer 1:

HttpClient does not cache by default, that is, when you use only the DefaultHttpClient class. It caches only if you use CachingHttpClient, a decorator for the HttpClient interface that enables caching:

HttpClient client = new CachingHttpClient(new DefaultHttpClient(), cacheConfiguration);

It then uses the caching headers on stored responses, together with conditional requests (If-Modified-Since and If-None-Match), to decide whether to send the request to the remote server or return the result from its cache.

I suspect that your issue is caused by the proxy server sitting between your application and the remote server.

You can test this easily with curl. First run a number of requests that bypass the proxy:

#!/bin/bash

for i in {1..50}
do
  echo "*** Performing request number $i"
  curl -D - http://yourserveraddress.com -o $i -s
done

Then run diff across all the downloaded files. Each file should differ from the others in the attribute you mentioned. Next, add the -x/--proxy <host[:port]> option to curl, run the script again, and compare the files again. If some responses are now identical to others, that points to the proxy server as the cause.
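
The same comparison can be made from Java on the extracted values instead of on whole files. The following is only a minimal sketch: the class name DuplicateCheck and the method repeatedValues are hypothetical, and it assumes each thread's iden value has already been collected into a list.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HttpClient.
final class DuplicateCheck {

    /** Returns each value that occurs more than once, mapped to its number of occurrences. */
    static Map<String, Integer> repeatedValues(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        // Keep only the values that were seen at least twice.
        counts.values().removeIf(count -> count < 2);
        return counts;
    }

    public static void main(String[] args) {
        // Sample data standing in for the iden values collected from the threads.
        List<String> idens = Arrays.asList("a1", "b2", "b2", "c3", "b2");
        System.out.println(repeatedValues(idens)); // prints {b2=3}
    }
}
```

An empty result without the proxy and a non-empty result with it would be consistent with the proxy returning cached responses.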



Answer 2:

Generally speaking, in order to test whether or not HTTP requests are being made over the wire, you can use a "sniffing" tool that analyzes network traffic, for example:

  • Fiddler ( http://fiddler2.com/fiddler2/ ) - I would start with this
  • Wireshark ( http://www.wireshark.org/ ) - more low level

I highly doubt HttpClient is performing caching of any sort (this would imply it needs to store pages in memory or on disk - not one of its capabilities).

While this is not an answer, it's a point to ponder: is it possible that the server (or some proxy in between) is returning cached content? If you are performing many requests (simultaneously or nearly simultaneously) for the same content, the server may be returning cached content because it has decided that the information has not "expired" yet. In fact, the HTTP protocol provides caching directives for exactly this purpose. Here is a site that provides a high-level overview of the different HTTP caching mechanisms:

http://betterexplained.com/articles/how-to-optimize-your-site-with-http-caching/
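
One way to look for such caching from code is to inspect the response headers. The sketch below is an illustration under assumptions: the class CacheHints and its methods are hypothetical, and it relies only on two standard HTTP headers, Age (added when a cache serves a stored response) and Via (added by intermediaries). Some proxies omit these headers, so their absence proves nothing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; header names and values are passed in as a plain map.
final class CacheHints {

    // HTTP header names are case-insensitive, so copy them into a case-insensitive map.
    private static Map<String, String> caseInsensitive(Map<String, String> headers) {
        Map<String, String> copy = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            if (entry.getKey() != null) {
                copy.put(entry.getKey(), entry.getValue());
            }
        }
        return copy;
    }

    /** True if an Age header is present, which suggests a cache served the response. */
    static boolean servedFromCache(Map<String, String> headers) {
        return caseInsensitive(headers).containsKey("Age");
    }

    /** True if a Via header is present, which suggests the response passed through a proxy. */
    static boolean passedThroughProxy(Map<String, String> headers) {
        return caseInsensitive(headers).containsKey("Via");
    }

    public static void main(String[] args) {
        // Made-up sample headers for illustration only.
        Map<String, String> headers = new HashMap<>();
        headers.put("age", "12");
        headers.put("Via", "1.1 proxy.example");
        System.out.println(servedFromCache(headers));
        System.out.println(passedThroughProxy(headers));
    }
}
```

With the HttpClient code from the question, the header values could be read from the response with response.getFirstHeader("Age") and response.getFirstHeader("Via").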

I hope this gives you a starting point. If you have already considered these avenues then that's great.



Answer 3:

You could try appending some unique dummy parameter to the URL on every request to try to defeat any URL-based caching (in the server, or somewhere along the way). It won't work if caching isn't the problem, or if the server is smart enough to reject requests with unknown parameters, or if the server is caching but only based on parameters it cares about, or if your chosen parameter name collides with a parameter the site actually uses.

If the URL you're using is http://www.example.org/index.html, try http://www.example.org/index.html?dummy=1 instead.

Set dummy to a different value for each request.
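
A minimal Java sketch of this idea follows. The class CacheBuster and its methods are hypothetical names, and the unique value (a timestamp plus a shared counter) is just one possible choice; the sketch also assumes the parameter value needs no URL encoding.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for building a different URL on every request.
final class CacheBuster {

    // Shared by all threads so that concurrent requests still get distinct values.
    private static final AtomicLong COUNTER = new AtomicLong();

    /** Appends name=<unique value> to the URL. */
    static String addDummyParam(String url, String name) {
        String value = System.nanoTime() + "-" + COUNTER.incrementAndGet();
        return appendParam(url, name, value);
    }

    /** Appends name=value, respecting an existing query string or fragment. */
    static String appendParam(String url, String name, String value) {
        String fragment = "";
        int hash = url.indexOf('#');
        if (hash >= 0) {
            // The query string must come before the fragment.
            fragment = url.substring(hash);
            url = url.substring(0, hash);
        }
        char separator = url.indexOf('?') >= 0 ? '&' : '?';
        return url + separator + name + "=" + value + fragment;
    }

    public static void main(String[] args) {
        System.out.println(addDummyParam("http://www.example.org/index.html", "dummy"));
    }
}
```

In the code from the question, the result would be passed to the request, for example new HttpGet(CacheBuster.addDummyParam(url, "dummy")).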