Crawl online directories and parse online pdf docu

Posted 2019-09-07 11:00

Question:

I need to crawl an online directory, for example http://svn.apache.org/repos/asf/, and whenever the crawl comes across a pdf, docx, txt, or odt file, I need to parse it and extract its text.

I am using Files.walk to crawl locally on my laptop and the Apache Tika library to parse the text, and that works just fine, but I don't know how to do the same with an online directory.

Here's the code that walks my PC and parses the files, so you have an idea of what I'm doing:

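import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import org.apache.tika.exception.TikaException;
import org.xml.sax.SAXException;

public static void GetFiles() throws IOException {
    // PathXml is the root directory (e.g. "/home/user/"), read from an XML file.
    // Files.walk (Java 8) returns a lazy stream that keeps directory handles open,
    // so it is closed here with try-with-resources.
    try (Stream<Path> paths = Files.walk(Paths.get(PathXml))) {
        paths.filter(Files::isRegularFile).forEach(filePath -> {
            String name = filePath.toString();
            // Only the document types of interest are handed to Tika.
            if (name.endsWith(".pdf") || name.endsWith(".docx")
                    || name.endsWith(".txt") || name.endsWith(".odt")) {
                try {
                    TikaReader.ParsedText(name);
                } catch (IOException | SAXException | TikaException e) {
                    e.printStackTrace();
                }
                System.out.println(filePath);
            }
        });
    }
}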

and here's the TikaReader method:

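import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static String ParsedText(String file) throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // try-with-resources closes the stream even if parsing fails.
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, handler, metadata);
        String text = handler.toString();
        System.out.println(text);
        return text;
    }
}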

So again, how can I do the same thing with the given online directory above?
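
For reference, here is a rough sketch of the direction I have in mind. It is a guess at an approach, not something I have run against the real server. It assumes the online directory is served as plain HTML index pages whose entries are `<a href="...">` links, that subdirectory links end in "/", and that the root URL itself ends in "/". The class name `ListingCrawler` and the methods `extractLinks`, `isDocument`, `crawl`, and `httpFetch` are all made up for illustration. It uses only the JDK, and the page fetcher is passed in as a function so the crawl logic can be tried without touching the network.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: walks HTML directory listings the way Files.walk walks a disk.
final class ListingCrawler {
    // Matches href="..." or href='...'; links containing '?' or '#' do not match
    // and are skipped (this drops the column-sort links some index pages emit).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#?]+)[\"']", Pattern.CASE_INSENSITIVE);
    private static final List<String> EXTENSIONS =
            Arrays.asList(".pdf", ".docx", ".txt", ".odt");

    // Returns every link on the page as an absolute URI resolved against the page URL.
    static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1)).normalize());
            } catch (IllegalArgumentException e) {
                // Malformed link (e.g. unescaped spaces): ignore it.
            }
        }
        return links;
    }

    // True when the URI path ends with one of the wanted extensions (case-insensitive).
    static boolean isDocument(URI uri) {
        String path = uri.getPath();
        if (path == null) {
            return false;
        }
        String lower = path.toLowerCase(Locale.ROOT);
        return EXTENSIONS.stream().anyMatch(lower::endsWith);
    }

    // Breadth-first walk starting at root; onDocument receives each matching file URL.
    static void crawl(URI root, Function<URI, String> fetchListing, Consumer<URI> onDocument) {
        Set<URI> seen = new HashSet<>();
        Deque<URI> queue = new ArrayDeque<>();
        seen.add(root);
        queue.add(root);
        while (!queue.isEmpty()) {
            URI dir = queue.poll();
            for (URI link : extractLinks(fetchListing.apply(dir), dir)) {
                // Stay below the root (skips "../" and off-site links) and never revisit.
                if (!link.toString().startsWith(root.toString()) || !seen.add(link)) {
                    continue;
                }
                if (link.getPath().endsWith("/")) {
                    queue.add(link);            // assumed to be a subdirectory listing
                } else if (isDocument(link)) {
                    onDocument.accept(link);    // a pdf/docx/txt/odt file
                }
            }
        }
    }

    // Simple fetcher: downloads a listing page as text, or "" if it cannot be read.
    static String httpFetch(URI uri) {
        try (InputStream in = uri.toURL().openStream();
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException | UncheckedIOException e) {
            return "";
        }
    }
}
```

A call might look like `ListingCrawler.crawl(URI.create("http://svn.apache.org/repos/asf/"), ListingCrawler::httpFetch, System.out::println)`. To bring Tika in, the callback could presumably open `uri.toURL().openStream()` and pass that stream to `parser.parse(stream, handler, metadata)` in place of the `FileInputStream` used in `ParsedText`. That repository is very large, so a depth or page limit would probably be needed in practice.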