Java: Read in text files from a directory, from th

Posted 2019-04-15 14:58

Question:

Does anybody know how to recursively read in files from a specific directory on the internet, in Java? I want to read in all the text files from this web directory: http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/

I know how to read in multiple files that are in a folder on my computer, and I know how to read in a single file from the internet. But how can I read in multiple files from the internet without hardcoding the URLs?

Stuff I tried:

// List the files on my Desktop
final File folder = new File("/Users/crystal/Desktop");
File[] listOfFiles = folder.listFiles();  // null if the path is not a readable directory

if (listOfFiles != null) {
    for (File fileEntry : listOfFiles) {
        if (!fileEntry.isDirectory()) {
            System.out.println(fileEntry.getName());
        }
    }
}

Another thing I tried:

// Reading data from the web 
try 
{
    // Create a URL object
    URL url = new URL("http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/5_1_1.txt");

    // Read all of the text returned by the HTTP server
    BufferedReader in = new BufferedReader (new InputStreamReader(url.openStream()));

    String htmlText;      // String that holds current file line

    // Read through file one line at a time. Print line
    while ((htmlText = in.readLine()) != null) 
    {
        System.out.println(htmlText);
    }
    in.close();
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    // If another exception is generated, print a stack trace
    e.printStackTrace();
}

Thanks!

Answer 1:

Since the URL you mentioned has directory indexes enabled (the server returns an HTML page listing the files), you're in luck. You've got a few options here.

  1. Parse the HTML to find the href attribute of the <a> tags, using SAX2 or any other XML parser (provided the index page is well-formed enough for an XML parser to accept). HtmlUnit would also work, I think.
  2. Use a little regexp magic to match every string between <a href=" and "> and use those as the URLs to read from.

Once you've got a list of all the URLs you need, your second piece of code should work just fine: iterate over the list and construct a URL object from each entry.
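To make the "iterate over your list" step concrete, here is a minimal sketch that wraps the question's single-file reading code in a loop. The class and method names (MultiUrlReader, readText, readAll) are made up for illustration, and UTF-8 is an assumed encoding; the only URL used is the one file named in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the original question or answer.
public class MultiUrlReader {

    // Reads the whole resource behind one absolute URL into a String.
    // Assumption: the files are UTF-8 (or plain ASCII) text.
    public static String readText(String address) throws IOException {
        URL url = URI.create(address).toURL();
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Reads every URL in the list, keeping the order in which they were given.
    public static Map<String, String> readAll(List<String> addresses) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        for (String address : addresses) {
            contents.put(address, readText(address));
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        // In practice this list would come from parsing the index page.
        List<String> urls = Arrays.asList(
            "http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/5_1_1.txt");
        for (Map.Entry<String, String> entry : readAll(urls).entrySet()) {
            System.out.println("== " + entry.getKey());
            System.out.print(entry.getValue());
        }
    }
}
```

This version stops at the first URL that fails to load; catching the IOException inside the loop instead would let it skip a broken link and carry on.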

Here's a sample regex that should match what you want. It does catch a few extra links, but you should be able to filter those out.

<a\ href="(.+?)">
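As a rough sketch of how that regex could be applied in Java: the code below runs it over the index page's HTML, keeps only links ending in .txt (one possible way to filter out the extra links), and resolves each against the directory URL. The class and method names are hypothetical, and the HTML in main is a made-up sample, not the real page's markup. The backslash before the space is dropped because Java doesn't need it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class IndexLinkExtractor {

    // Same pattern as above, without the backslash before the space.
    private static final Pattern LINK = Pattern.compile("<a href=\"(.+?)\">");

    // Returns absolute URLs for every link in the index HTML that ends in ".txt".
    public static List<String> extractTxtLinks(String indexHtml, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(indexHtml);
        while (m.find()) {
            String href = m.group(1);
            // Drops sort links, the parent-directory link, and so on.
            if (href.endsWith(".txt")) {
                // Turn a relative href such as "5_1_1.txt" into an absolute URL.
                links.add(base.resolve(href).toString());
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample of what an index page might contain.
        String sample =
              "<a href=\"?C=N;O=D\">Name</a>\n"
            + "<a href=\"/~davidson/courses/170-S11/\">Parent Directory</a>\n"
            + "<a href=\"5_1_1.txt\">5_1_1.txt</a>\n";
        String base = "http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/";
        for (String link : extractTxtLinks(sample, base)) {
            System.out.println(link);
        }
    }
}
```

The pattern only matches anchors written exactly as `<a href="...">`. If the server emits extra attributes or different quoting, a real HTML parser is the safer choice.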


Tags: java file input