I need to be able to crawl an online directory such as this one: http://svn.apache.org/repos/asf/
Whenever a .pdf, .docx, .txt, or .odt file comes up during the crawl, I need to parse it and extract its text.

I am using Files.walk to crawl around locally on my laptop and the Apache Tika library to parse the text, and that works just fine, but I don't really know how to do the same with an online directory.

Here's the code that goes through my PC and parses the files, just so you have an idea of what I'm doing:
public static void GetFiles() throws IOException {
    // PathXml is the root directory (e.g. "/home/user/") read from an XML config file.
    Files.walk(Paths.get(PathXml)).forEach(filePath -> { // crawling process (Java 8)
        if (Files.isRegularFile(filePath)) {
            String name = filePath.toString();
            if (name.endsWith(".pdf") || name.endsWith(".docx")
                    || name.endsWith(".txt") || name.endsWith(".odt")) {
                try {
                    TikaReader.ParsedText(name);
                } catch (IOException | SAXException | TikaException e) {
                    e.printStackTrace();
                }
                System.out.println(filePath);
            }
        }
    });
}
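(As a side note, I think the same local walk could also be written a bit more compactly with stream filters; this is just an equivalent restatement of the code above, and the extensions list is my own naming, nothing required by Tika:)

public static void getFilesFiltered(String root) throws IOException {
    // Same behaviour as GetFiles() above, expressed as a stream pipeline (Java 8 compatible).
    // Needs java.util.Arrays and java.util.List in addition to the imports already used above.
    List<String> extensions = Arrays.asList(".pdf", ".docx", ".txt", ".odt");
    Files.walk(Paths.get(root))
         .filter(Files::isRegularFile)
         .filter(p -> extensions.stream().anyMatch(p.toString()::endsWith))
         .forEach(p -> {
             try {
                 TikaReader.ParsedText(p.toString());
                 System.out.println(p);
             } catch (IOException | SAXException | TikaException e) {
                 e.printStackTrace();
             }
         });
}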
And here's the TikaReader method:
public static String ParsedText(String file) throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // try-with-resources closes the stream automatically.
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, handler, metadata);
        System.out.println(handler.toString());
        return handler.toString();
    }
}
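One thing I noticed is that the Tika part shouldn't care where the bytes come from: if I already had a direct URL to a single file, I think something like the following would work (this is an untested sketch; parseRemote is just a name I made up, and it assumes the URL points straight at a downloadable document):

public static String parseRemote(String fileUrl) throws IOException, SAXException, TikaException {
    // Same parsing as ParsedText(), but reading from a URL instead of a local file.
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try (InputStream stream = new URL(fileUrl).openStream()) { // needs java.net.URL
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}

But that only covers a single file whose URL I already know; my real problem is discovering the files in the first place.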
So again, how can I do the same thing with the given online directory above?
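In case it helps, the direction I've been considering is to treat the SVN web view as plain HTML and follow the links in each directory listing, for example with jsoup, but I haven't gotten this working and it's purely a sketch (the jsoup dependency, the crawlDirectory name, and the reuse of parseRemote from above are all my own assumptions):

public static void crawlDirectory(String url) throws IOException {
    // Fetch the HTML directory listing and follow every link in it.
    Document listing = Jsoup.connect(url).get(); // org.jsoup.Jsoup / org.jsoup.nodes.Document / org.jsoup.nodes.Element
    for (Element link : listing.select("a[href]")) {
        String target = link.absUrl("href");
        if (!target.startsWith(url)) {
            continue; // skip the parent-directory link and anything outside the start directory
        }
        if (target.endsWith("/")) {
            crawlDirectory(target); // recurse into subdirectories
        } else if (target.endsWith(".pdf") || target.endsWith(".docx")
                || target.endsWith(".txt") || target.endsWith(".odt")) {
            try {
                System.out.println(target);
                parseRemote(target); // the URL-based Tika call sketched earlier
            } catch (IOException | SAXException | TikaException e) {
                e.printStackTrace();
            }
        }
    }
}

The recursion just mirrors what Files.walk gives me locally; I don't know whether that's a reasonable way to traverse an SVN web view, which is really what I'm asking.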