Invalid Cookie Header and then it asks for Au

Published 2019-09-16 07:46

Question:

I am trying to crawl a page that requires SiteMinder authentication, so I am passing my username and password in the code itself to access that page and then crawl all the links on it. This is my Controller.java code, from which the MyCrawler class is invoked.

public class Controller {
    public static void main(String[] args) throws Exception {

            CrawlController controller = new CrawlController("/data/crawl/root");

            controller.addSeed("http://ho.somehost.com/");

            // Configure the crawl before starting it: start() blocks until the
            // crawl finishes, so settings applied after it have no effect.
            controller.setPolitenessDelay(200);
            controller.setMaximumCrawlDepth(3);

            controller.start(MyCrawler.class, 10);
    }
}

And this is my MyCrawler.java code, in which I pass my credentials (username and password) for SiteMinder authentication. I also want to confirm whether the authentication should be done in this MyCrawler code or in the Controller code above. The crawler code is based on http://code.google.com/p/crawler4j/

public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {

        System.out.println("RJ:- " +url);

        DefaultHttpClient client = null;

        try
        {
            // Set url
            //URI uri = new URI(url.toString());

            client = new DefaultHttpClient();

            client.getCredentialsProvider().setCredentials(
                    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, null),
                    new UsernamePasswordCredentials("test", "test"));

            // Set timeout
            //client.getParams().setParameter(CoreConnectionPNames.SO_TIMEOUT, 5000);
            HttpGet request = new HttpGet(url.toString());

            HttpResponse response = client.execute(request);
            if(response.getStatusLine().getStatusCode() == 200)
            {
                InputStream responseIS = response.getEntity().getContent();
                BufferedReader reader = new BufferedReader(new InputStreamReader(responseIS));
                String line = reader.readLine();
                while (line != null)
                {
                    System.out.println(line);
                    line = reader.readLine();
                }
            }
            else
            {
                System.out.println("Resource not available");
            }
        }
        catch (ClientProtocolException e)
        {
            System.out.println(e.getMessage());
        }
        catch (ConnectTimeoutException e)
        {
            System.out.println(e.getMessage());
        }
        catch (IOException e)
        {
            System.out.println(e.getMessage());
        }
        catch (Exception e)
        {
            System.out.println(e.getMessage());
        }
        finally
        {
            if ( client != null )
            {
                client.getConnectionManager().shutdown();
            }
        }


        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://")) {
            return true;
        }
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();         
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }   
}

I am printing the URL so that I can see which URLs are being fetched. Two URLs get printed: the actual URL that requires authentication, and then a SiteMinder URL. When I run the project I get the following errors:

RJ:- http://ho.somehost.com/net/pa/ho.xhtml
 WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMCHALLENGE=; expires=Sat, 15 Jan 2011 02:52:54 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 15 Jan 2011 02:52:54 GMT
 WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMIDENTITY=nzFSq2U3g/C3C6/jkj/Ocghyh/njK; expires=Sat, 13 Jul 2013 02:52:54 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 13 Jul 2013 02:52:54 GMT
null
 INFO [Crawler 1] Number of pages fetched per second: 0
RJ:- https://lo.somehost.com/site/no/176/sm.exhtml
 WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMCHALLENGE=; expires=Sat, 15 Jan 2011 02:52:56 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 15 Jan 2011 02:52:56 GMT
 WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMIDENTITY=IqsIPo; expires=Sat, 13 Jul 2013 02:52:56 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 13 Jul 2013 02:52:56 GMT

Any suggestions will be appreciated. If I paste that login URL into a browser, it asks for a username and password, and once I enter them I get the actual screen.

Answer 1:

Extracting the salient contents of the chat discussion for posterity, in case anyone experiences the same issue.

The warning messages indicate that HttpClient was unable to parse the Set-Cookie headers issued by SiteMinder. Analysis of the network traffic using Wireshark revealed the following:

  • No expires attribute was set for the SMSESSION cookie issued by SiteMinder. This is not the cause of the problem; it simply identifies which HTTP responses from the server need to be examined.
  • The warnings were issued for the cookies SMCHALLENGE and SMIDENTITY, so the responses containing the Set-Cookie headers for these two cookies need to be examined.
  • The problem could be in:
    • the cookie values themselves, or
    • the format of the dates in the expires attribute of the cookies.
  • HttpClient bug 923, fixed in HttpClient 4.1.1, may hold the resolution: the fix adds support for both 2- and 4-digit years in cookie expires dates, and the lack of that support in earlier versions could be the cause of the issue.
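If upgrading HttpClient is not immediately possible, another workaround worth trying (a sketch, assuming the same HttpClient 4.0/4.1 DefaultHttpClient API used elsewhere in this question) is to select the lenient browser-compatibility cookie policy, which is far more forgiving of non-standard expires attributes than the default spec:

```java
import org.apache.http.client.params.ClientPNames;
import org.apache.http.client.params.CookiePolicy;
import org.apache.http.impl.client.DefaultHttpClient;

public class LenientCookieClient {
    public static DefaultHttpClient create() {
        DefaultHttpClient client = new DefaultHttpClient();
        // BROWSER_COMPATIBILITY emulates typical browser behaviour and
        // tolerates malformed Set-Cookie expires dates instead of warning.
        client.getParams().setParameter(ClientPNames.COOKIE_POLICY,
                CookiePolicy.BROWSER_COMPATIBILITY);
        return client;
    }
}
```

This is a client-configuration fragment: the MyCrawler code above could call LenientCookieClient.create() instead of constructing DefaultHttpClient directly.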

If the above (the handling of 4-digit years in the cookie expires value) turns out not to be the root cause, then you must specify the date format used to parse the cookie value. This is done by giving HttpClient a list of accepted date patterns, as follows:

import java.util.Arrays;
import org.apache.http.cookie.params.CookieSpecPNames;

HttpGet request = new HttpGet(url.toString());
request.getParams().setParameter(CookieSpecPNames.DATE_PATTERNS,
        Arrays.asList("EEE, d MMM yyyy HH:mm:ss z"));
HttpResponse response = client.execute(request);

instead of the existing calls:

HttpGet request = new HttpGet(url.toString());

HttpResponse response = client.execute(request);

The specified pattern, EEE, d MMM yyyy HH:mm:ss z, is a valid pattern for the dates that appear to be parsed incorrectly (going by the messages in the console). You will need to add further patterns if there are other date formats that HttpClient does not handle correctly. For details of the format, see the SimpleDateFormat class documentation.
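As a quick sanity check that the pattern really matches the failing dates, a plain-JDK snippet (no HttpClient required) can parse one of the expires values taken verbatim from the warnings:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class CookieDateCheck {
    public static void main(String[] args) throws Exception {
        // Same pattern as passed to CookieSpecPNames.DATE_PATTERNS above.
        SimpleDateFormat fmt =
                new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss z", Locale.US);
        // One of the expires values from the warnings in the question.
        Date expires = fmt.parse("Sat, 15 Jan 2011 02:52:54 GMT");
        System.out.println(expires != null); // prints "true"
    }
}
```

If this parse throws a ParseException for a date seen in your own logs, that date's format is the one you need to add to the DATE_PATTERNS list.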