I am trying to crawl a page that requires SiteMinder authentication, so I am passing my username and password in the code itself to access that page and then keep crawling all the links on it. This is my Controller.java code, which is where the MyCrawler class gets called from.
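import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://ho.somehost.com/");
        // Configure the crawl before starting it: start() blocks until the
        // crawl finishes, so settings applied afterwards have no effect.
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(3);
        controller.start(MyCrawler.class, 10);
    }
}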
And this is my MyCrawler.java code, in which I pass my credentials (username and password) for SiteMinder authentication. I would also like to confirm whether the authentication should be done in this MyCrawler code or in the Controller code above. The crawler code is taken from crawler4j (http://code.google.com/p/crawler4j/).
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.impl.client.DefaultHttpClient;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // URLs ending in one of these extensions are never visited.
    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {
        System.out.println("RJ:- " + url);
        DefaultHttpClient client = null;
        try {
            // Set url
            //URI uri = new URI(url.toString());
            // A new client (with the credentials) is created for every candidate URL.
            client = new DefaultHttpClient();
            client.getCredentialsProvider().setCredentials(
                    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, null),
                    new UsernamePasswordCredentials("test", "test"));
            // Set timeout
            //client.getParams().setParameter(CoreConnectionPNames.SO_TIMEOUT, 5000);
            HttpGet request = new HttpGet(url.toString());
            HttpResponse response = client.execute(request);
            if (response.getStatusLine().getStatusCode() == 200) {
                // Dump the response body line by line.
                InputStream responseIS = response.getEntity().getContent();
                BufferedReader reader = new BufferedReader(new InputStreamReader(responseIS));
                String line = reader.readLine();
                while (line != null) {
                    System.out.println(line);
                    line = reader.readLine();
                }
            } else {
                System.out.println("Resource not available");
            }
        } catch (ClientProtocolException e) {
            System.out.println(e.getMessage());
        } catch (ConnectTimeoutException e) {
            System.out.println(e.getMessage());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        } catch (Exception e) {
            // getMessage() may be null, in which case "null" is printed.
            System.out.println(e.getMessage());
        } finally {
            if (client != null) {
                client.getConnectionManager().shutdown();
            }
        }

        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://")) {
            return true;
        }
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();
        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }
}
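To make the URL-filtering part of shouldVisit easier to reason about on its own, here is a minimal sketch that isolates just that logic, with no crawler4j or HttpClient dependency. The class name FilterDemo and the logo.PNG URL are made up for illustration; the pattern and the "http://" prefix check are copied from the code above. One thing it suggests is that an https:// URL fails the prefix check, so the method would return false for it.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone class; only the pattern and the prefix check
// come from MyCrawler above.
public class FilterDemo {

    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Same decision as the tail of MyCrawler.shouldVisit, on a plain String.
    public static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        if (FILTERS.matcher(href).matches()) {
            return false; // filtered by file extension
        }
        return href.startsWith("http://"); // note: "https://" does not match
    }

    public static void main(String[] args) {
        // The two URLs from the crawler output, plus an invented image URL.
        System.out.println(shouldVisit("http://ho.somehost.com/net/pa/ho.xhtml"));        // true
        System.out.println(shouldVisit("https://lo.somehost.com/site/no/176/sm.exhtml")); // false
        System.out.println(shouldVisit("http://ho.somehost.com/logo.PNG"));               // false
    }
}
```

If https pages are meant to be crawled as well, the prefix check would presumably need to accept both schemes.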
I print each URL so that I can see which URLs are being processed. Two URLs get printed: first the actual URL that requires authentication, and then a SiteMinder URL. When I run this project I get the following errors:
RJ:- http://ho.somehost.com/net/pa/ho.xhtml
WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMCHALLENGE=; expires=Sat, 15 Jan 2011 02:52:54 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 15 Jan 2011 02:52:54 GMT
WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMIDENTITY=nzFSq2U3g/C3C6/jkj/Ocghyh/njK; expires=Sat, 13 Jul 2013 02:52:54 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 13 Jul 2013 02:52:54 GMT
null
INFO [Crawler 1] Number of pages fetched per second: 0
RJ:- https://lo.somehost.com/site/no/176/sm.exhtml
WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMCHALLENGE=; expires=Sat, 15 Jan 2011 02:52:56 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 15 Jan 2011 02:52:56 GMT
WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMIDENTITY=IqsIPo; expires=Sat, 13 Jul 2013 02:52:56 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 13 Jul 2013 02:52:56 GMT
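As a side check on the expires values in the warnings above, the following minimal sketch parses them with the JDK's SimpleDateFormat. It assumes the values are meant to be RFC 1123 style timestamps; the class name ExpiresCheck and the format pattern are illustrative choices, not anything taken from crawler4j or HttpClient. It only shows whether the strings themselves are well formed, and says nothing definite about why the cookie parser used by the crawler rejects them.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for checking the date strings from the log output.
public class ExpiresCheck {

    // Parses an "expires" value such as "Sat, 15 Jan 2011 02:52:54 GMT"
    // and returns it as milliseconds since the epoch.
    public static long parseExpires(String value) {
        SimpleDateFormat fmt =
                new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        try {
            return fmt.parse(value).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not in the assumed format: " + value, e);
        }
    }

    public static void main(String[] args) {
        // Values copied from the WARN lines above.
        System.out.println(parseExpires("Sat, 15 Jan 2011 02:52:54 GMT"));
        System.out.println(parseExpires("Sat, 13 Jul 2013 02:52:54 GMT"));
    }
}
```

If both values parse without an exception, the timestamps are at least syntactically ordinary, which would point toward the cookie parsing settings rather than the server's header as the place to look.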
Any suggestions will be appreciated. If I copy and paste that login URL into the browser, it asks for a username and password, and once I enter them I get the actual page.