I am working on a crawler and have to extract data from 200-300 links on Google Scholar. I have a working parser that gets data from the result pages (each page has 1-10 people profiles as the result of my query; I extract the proper links, go to the next page, and repeat). While running my program I hit the following error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.pl/citations%3Fmauthors%3DAGH%2BUniversity%2Bof%2BScience%2Band%2BTechnology%26hl%3Dpl%26view_op%3Dsearch_authors&q=CGMSBFMKrI0YiJHfqgUiGQDxp4NLfGBv6zgPSjfyQ9LBi5F-K1EbGwQ
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
I know it is caused by Google's simple protection against robots. How can I improve my connection
Connection connection =
Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(10000)
.followRedirects(true);
so that I don't get a temporary ban? I know there is a way to check the response, like this:
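Connection.Response response =
Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(10000)
.ignoreHttpErrors(true) // without this, execute() throws HttpStatusException on a 503 instead of returning
.execute();
int statusCode = response.statusCode();
if (statusCode == 200) { ... }
else if (statusCode == 503) { /* do reconnect magic */ }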
But what should I do when I get a 503 error? Do I have to use a proxy? A random wait time between connections? I hope there is a better idea than saving my results to a file, doing a manual hard restart of the router, and trying again with a new IP :P
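To make the "random wait time between connections" idea concrete, this is roughly what I have in mind: a minimal sketch of retrying with exponential backoff and random jitter. It uses only the standard library so it can be run in isolation. The names `PoliteRetry`, `TemporarilyBlockedException`, `backoffCapMillis`, `jitteredDelayMillis` and `fetchWithRetry` are hypothetical helpers of mine, not part of Jsoup, and the delay values are guesses. I have no evidence that any particular delay keeps Google from answering with 503.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class PoliteRetry {

    /** Hypothetical marker exception: the fetch callback throws it when it sees a 503. */
    public static class TemporarilyBlockedException extends Exception {
        public TemporarilyBlockedException(String message) {
            super(message);
        }
    }

    /** Upper bound of the wait before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis. */
    public static long backoffCapMillis(int attempt, long baseMillis, long maxMillis) {
        long cap = baseMillis;
        for (int i = 0; i < attempt && cap < maxMillis; i++) {
            cap *= 2;
        }
        return Math.min(cap, maxMillis);
    }

    /** Random delay in [cap/2, cap], so consecutive requests are not evenly spaced. */
    public static long jitteredDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long cap = backoffCapMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(cap / 2, cap + 1);
    }

    /** Calls `fetch` until it succeeds or maxAttempts is reached, sleeping longer after each 503. */
    public static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts,
                                       long baseMillis, long maxMillis) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return fetch.call();
            } catch (TemporarilyBlockedException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e; // give up and let the caller decide (e.g. save progress and stop)
                }
                Thread.sleep(jitteredDelayMillis(attempt, baseMillis, maxMillis));
            }
        }
    }
}
```

In my crawler the `fetch` callback would wrap the Jsoup call: build the connection with `.ignoreHttpErrors(true)`, call `execute()`, throw `TemporarilyBlockedException` when `statusCode()` is 503, and return `response.parse()` otherwise. A call could look like `fetchWithRetry(callback, 5, 2000, 60000)`, but those numbers are arbitrary. Is something like this enough, or is a proxy unavoidable?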