Our team crawls websites to keep our information up to date. I was running into security exceptions when crawling HTTPS pages because Java refuses to accept the self-signed certificates that some of those pages present.
Rather than keeping a list of certificates to accept (which could be difficult to maintain in the future), I'm using the workaround provided by neu242 to disable SSL certificate validation:
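import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public static void disableCertificateValidation()
{
    // Create a trust manager that does not validate certificate chains
    TrustManager[] trustAllCerts = new TrustManager[] {
        new X509TrustManager() {
            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException
            {
                // Intentionally empty: every client certificate is accepted
            }
            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException
            {
                // Intentionally empty: every server certificate is accepted
            }
        }};
    // Ignore differences between given hostname and certificate hostname
    HostnameVerifier hv = new HostnameVerifier() {
        @Override
        public boolean verify(String hostname, SSLSession session)
        {
            return true;
        }
    };
    // Install the all-trusting trust manager as the default for every
    // HttpsURLConnection created in this JVM
    try {
        SSLContext sc = SSLContext.getInstance("TLS");
        sc.init(null, trustAllCerts, new SecureRandom());
        HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
        HttpsURLConnection.setDefaultHostnameVerifier(hv);
    } catch (GeneralSecurityException e) {
        // Fail loudly instead of silently leaving validation enabled
        throw new IllegalStateException("Could not install trust-all SSL context", e);
    }
}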
Of course, this poses a significant security risk in general. However, if I only use this code in my downloading program (which downloads images and PDF documents), and the program never sends any sensitive information, what security risks actually exist? From what I understand, this trust manager is set only for the running JVM (the server running the program won't have certificate validation disabled at the OS level). Further, if my requests for the images and documents were intercepted, my code would try to parse the response as an image or PDF, respectively, and wouldn't launch any malicious software. Is there a security risk I'm missing somewhere?
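On the "only for the running JVM" point: the two setDefault... calls apply to every HttpsURLConnection created anywhere in that JVM, not just to the downloader's own requests. A minimal sketch, assuming the downloader opens its connections through one helper, of confining the relaxed checks to individual connections instead. The class name ScopedTrustAll and both method names are made up for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper: relaxes validation per connection, not JVM-wide.
final class ScopedTrustAll {

    // Builds a socket factory whose trust manager accepts any certificate chain.
    static SSLSocketFactory trustAllFactory() {
        TrustManager[] trustAll = { new X509TrustManager() {
            @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        } };
        try {
            SSLContext sc = SSLContext.getInstance("TLS");
            sc.init(null, trustAll, new SecureRandom());
            return sc.getSocketFactory();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not build trust-all socket factory", e);
        }
    }

    // Creates (without connecting yet) an HTTPS connection that skips certificate
    // and hostname checks for this one object; the JVM defaults stay untouched.
    static HttpsURLConnection openWithoutValidation(String url, SSLSocketFactory factory) {
        try {
            HttpsURLConnection conn =
                (HttpsURLConnection) URI.create(url).toURL().openConnection();
            conn.setSSLSocketFactory(factory);
            conn.setHostnameVerifier((hostname, session) -> true);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The factory can be built once and reused, so any other code in the same JVM that uses HttpsURLConnection keeps normal validation.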
The only realistic "interception" scenario is when the client computer is on a wireless network at a random place, e.g. a coffee shop. Since your scraping server is obviously not, in practice there is zero risk in disabling certificate validation.
The risk that you are facing is that a malicious server could place itself between you and the origin server (a man-in-the-middle attack). In other words, you would THINK that you are receiving documents from the real server, but in fact you would be receiving them from the attacker's server. So it depends on the types of documents and what you do with them...
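As one illustration of "what you do with them": a rough sketch, assuming each download is fully read into a byte array first, that checks the leading magic bytes before the data is handed to an image or PDF parser. DownloadSniffer and sniff are hypothetical names. This only confirms the file type; it cannot tell you whether a well-formed PDF or image actually came from the real server.

```java
import java.util.Arrays;

// Hypothetical helper: identifies a download by its leading "magic" bytes.
final class DownloadSniffer {
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    // Returns "pdf", "png", "jpeg", or "unknown" based on the file signature.
    static String sniff(byte[] data) {
        if (startsWith(data, PDF)) return "pdf";
        if (startsWith(data, PNG)) return "png";
        if (startsWith(data, JPEG)) return "jpeg";
        return "unknown";
    }

    // True when data begins with exactly the bytes in prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Anything that comes back as "unknown" (an HTML error page, for example) can be rejected before it reaches the parser.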
You can disable cert validation. Your communication will still be encrypted; you just won't be able to authenticate the identity of the server. I don't see any issues.
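If authenticating the server does turn out to matter for some hosts, there is a middle ground between trusting everything and the JVM defaults: trust only the certificates stored in a keystore you maintain. A minimal sketch, assuming the self-signed certificates have already been imported into a keystore file; CrawlerTrustStore and contextFor are hypothetical names:

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper: builds an SSLContext that trusts only the given keystore.
final class CrawlerTrustStore {

    // keystoreStream may be null, which yields an empty keystore (nothing trusted).
    static SSLContext contextFor(InputStream keystoreStream, char[] password) {
        try {
            KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
            ks.load(keystoreStream, password);

            // Standard trust managers, but anchored on our keystore only
            TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init(ks);

            SSLContext sc = SSLContext.getInstance("TLS");
            sc.init(null, tmf.getTrustManagers(), null);
            return sc;
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException("Could not build SSL context from keystore", e);
        }
    }
}
```

The resulting context's getSocketFactory() can be set on individual connections with setSSLSocketFactory, so hostname and chain checks stay on for everything else.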