Check for validity of a URL in Java, so as not to crash on a 404 error

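Essentially, like a bulletproof tank, I want my program to absorb 404 errors and keep on rolling, crushing the interwebs and leaving corpses dead and bloodied in its wake, or whatever.

I keep getting this error: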

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
at Q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:29)
at Q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
at Q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:38)
at Q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
at Q.Runner.main(Runner.java:35)

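But I don't understand why, because I check whether I have a valid URL before I navigate to it. What is incorrect about my checking procedure?

I have tried researching other Stack Overflow questions on this subject, but they are not very authoritative. I have also implemented many of the solutions from this one and this one, and so far nothing has worked.

I am using the Apache Commons URL validator. This is the most recent code I have been using: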

    // get its normal wiki disambig page
    String URL_check = "https://en.wikipedia.org/wiki/" + associated_alias;

    UrlValidator urlValidator = new UrlValidator();

    if ( urlValidator.isValid( URL_check ) ) 
    {
       Document docx = Jsoup.connect( URL_check ).get();
        //this can handle the less structured ones.

and

    //check the validity of the URL
    String URL_czech = "https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=" + associated_alias + "&submit=Search";

    UrlValidator urlValidator = new UrlValidator();

    if ( urlValidator.isValid( URL_czech ) ) 
    {
        URL wikidata_page = new URL( URL_czech );
        URLConnection wiki_connection = wikidata_page.openConnection();
        BufferedReader wiki_data_pagecontent = new BufferedReader(
                                                   new InputStreamReader(
                                                        wiki_connection.getInputStream()));

Answer 1:

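URLConnection throws an error when the page you are downloading returns an error status (4xx or 5xx) instead of a 2xx success code such as 200 or 201. Rather than passing Jsoup a URL or a String to parse your document, consider passing it an input stream that contains the page's data.

Using the HttpURLConnection class, we can try to download the page with getInputStream() inside a try/catch block and, if that fails, try to download it through getErrorStream() instead.

Consider this snippet, which will download your wiki page even if it returns a 404: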

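    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String URL_czech = "https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29";

    URL wikidata_page = new URL(URL_czech);
    HttpURLConnection wiki_connection = (HttpURLConnection) wikidata_page.openConnection();
    InputStream wikiInputStream = null;

    try {
        // try to connect and use the input stream
        wiki_connection.connect();
        wikiInputStream = wiki_connection.getInputStream();
    } catch (IOException e) {
        // failed (e.g. 404), try using the error stream instead;
        // note that getErrorStream() returns null if the server sent no error body
        wikiInputStream = wiki_connection.getErrorStream();
    }
    // parse the input stream using Jsoup (a null charset lets Jsoup detect the encoding)
    Document doc = Jsoup.parse(wikiInputStream, null, wikidata_page.getProtocol() + "://" + wikidata_page.getHost() + "/");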
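
If you would rather skip missing pages than parse the 404 body, a variation is to look at the status code first and only then choose which stream to read. The following is only a rough sketch under some assumptions: the names `StatusAwareFetch`, `Page`, `fetch` and `startStubServer` are made up for illustration, a tiny local server (the JDK's built-in `com.sun.net.httpserver`) stands in for Wikipedia so the example runs offline, and `readAllBytes()` needs Java 9 or newer. It also illustrates why the validator check in the question does not help: a URL can be perfectly well-formed and still answer with 404.

```java
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class StatusAwareFetch {

    /** Hypothetical holder: the HTTP status plus whatever body the server sent. */
    static final class Page {
        final int status;
        final String body;

        Page(int status, String body) {
            this.status = status;
            this.body = body;
        }

        boolean isSuccess() {
            return status >= 200 && status < 300;
        }
    }

    /** Fetches a URL without throwing on HTTP error statuses such as 404. */
    static Page fetch(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            try {
                int status = conn.getResponseCode();
                // getInputStream() throws for 4xx/5xx, so pick the stream by status.
                InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
                if (in == null) {
                    return new Page(status, ""); // the server sent no body at all
                }
                try (InputStream body = in) {
                    return new Page(status, new String(body.readAllBytes(), StandardCharsets.UTF_8));
                }
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            // Real network failures (DNS, refused connection) still surface here.
            throw new UncheckedIOException(e);
        }
    }

    /** Local stand-in for the real site: "/exists" answers 200, anything else 404. */
    static HttpServer startStubServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                boolean found = exchange.getRequestURI().getPath().equals("/exists");
                byte[] body = (found ? "hello" : "no such page").getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(found ? 200 : 404, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startStubServer();
        try {
            String base = "http://127.0.0.1:" + server.getAddress().getPort();
            for (String path : new String[] {"/exists", "/missing"}) {
                Page page = fetch(base + path);
                // Both URLs are syntactically valid; only the status tells them apart.
                System.out.println(path + " -> " + page.status
                        + (page.isSuccess() ? " (parse it)" : " (skip it)"));
            }
        } finally {
            server.stop(0);
        }
    }
}
```

In the real program you would presumably call `fetch` with the Wikipedia URL and hand `page.body` to `Jsoup.parse(String html, String baseUri)` only when `page.isSuccess()` is true. If you prefer to stay inside jsoup, its `Connection` also offers `ignoreHttpErrors(true)`, which together with `execute()` and `statusCode()` should give a similar result without the manual stream handling.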