I am trying to use CURL to get web pages from a paricualr website however it gives this error:
curl -q -v -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://www.saiglobal.com/ --output ./Downloads/test.html
....
* SSL certificate verify ok.
} [5 bytes data]
> GET / HTTP/1.1
> Host: www.saiglobal.com
> User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
> Accept: */*
>
0 0 0 0 0 0 0 0 --:--:-- 0:11:53 --:--:-- 0* OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
* stopped the pause stream!
0 0 0 0 0 0 0 0 --:--:-- 0:11:53 --:--:-- 0
* Closing connection 0
} [5 bytes data]
curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
I am not sure what is going on. I can't find a lot of useful info regarding to the error message. On my Mac, the errorno is 60 instead of 104.
However, using Chrome on these machines can load the page without any issue. One of the machines' CURL version is 7.58.0.
Any help is appreciated.
The problem is not the certificate of this site. From the debug output it can be clearly seen that the TLS handshake is done successfully and outside this handshake the certificate does not matter.
But, it can be seen that the site www.saiglobal.com
is CDN protected by Akamai CDN and Akamai features some kind of bot detection:
$ dig www.saiglobal.com
...
www.saiglobal.com. 45 IN CNAME www.saiglobal.com.edgekey.net.
www.saiglobal.com.edgekey.net. 62 IN CNAME e9158.a.akamaiedge.net.
This bot detection is known to use some heuristics in order to distinguish bots from normal browsers and detection of a bot might result in a status code 403 access denied or in a simple hang of the site - see Scraping attempts getting 403 error or Requests SSL connection timeout.
In this specific case it seems to currently help if some specific HTTP headers are added, specifically Accept-Encoding
, Accept-Language
, Connection
with a value of keep-alive
and User-Agent
which matches somehow Mozilla
. Failure to add these headers or having the wrong values will result in a hang.
The following works currently for me:
$ curl -q -v \
-H "Connection: keep-alive" \
-H "Accept-Encoding: identity" \
-H "Accept-Language: en-US" \
-H "User-Agent: Mozilla/5.0" \
https://www.saiglobal.com/
Note that this deliberately tries to bypass the bot detection. It might stop working if Akamai makes changes to the bot detection.
Please note also that the owner of the site has explicitly enable bot detection for a reason. This means that with deliberately bypassing the detection for your own gain (like providing some service based on scraped information) you might get into legal problems.