I need to write a Java client application which, when given the below URL, will enumerate the directories/files recursively beneath it. I also need to get the last modified timestamp for each since I'm only concerned with changes since a known timestamp.
http://www.myserver.com/testproduct/
For example, suppose the following exist on the server.
http://www.myserver.com/testproduct/red/file1.txt
http://www.myserver.com/testproduct/red/file2.txt
http://www.myserver.com/testproduct/red/black/file3.txt
http://www.myserver.com/testproduct/red/black/file4.txt
http://www.myserver.com/testproduct/orange/anotherfile.html
http://www.myserver.com/testproduct/orange/mymovie.avi
http://www.myserver.com/testproduct/readme.txt
I need to, starting at the specified URL (http://www.myserver.com/testproduct/), enumerate the directories and files recursively beneath it, along with the last modified timestamp of each. Once I have the list of directories/files, I'll be selectively downloading some of the files based on timestamp and other client-side filters.
The server is running Apache and is configured to allow directory listing.
I did some experimentation using Apache's HttpClient Java class, and when I request the contents of http://www.myserver.com/testproduct/ I get back an HTML file, which of course is the same thing you see if you go there in your browser: an HTML page showing the contents of the folder.
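For reference, the experiment was roughly this (a minimal sketch, assuming Apache HttpClient 4.3+; the URL is the example one from above):

    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class ListingFetch {
        public static void main(String[] args) throws Exception {
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                HttpGet get = new HttpGet("http://www.myserver.com/testproduct/");
                // The body is Apache's auto-generated index page, not a machine-readable listing.
                String html = EntityUtils.toString(client.execute(get).getEntity());
                System.out.println(html);
            }
        }
    }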
Is this the only way to do it, i.e., scraping the resulting HTML page to parse out the files and directories? Also, I'm not sure I can reliably distinguish files from directories based on the HTML returned.
Is there a better way to enumerate directories and files without page scraping the resultant HTML?
If you have any control over the server, you should ask them to implement WebDAV, which is meant for precisely this sort of scenario. Apache comes with a mod_dav module that just needs to be configured. On the Java client side, see this question.
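To illustrate why WebDAV fits, here is a hedged sketch using the Sardine WebDAV client library (my choice of library, not something mandated above). A PROPFIND gives you collections (directories), member resources, and last-modified timestamps directly, which is exactly what the question asks for:

    import com.github.sardine.DavResource;
    import com.github.sardine.Sardine;
    import com.github.sardine.SardineFactory;
    import java.io.IOException;

    public class DavWalker {
        // Host prefix is the example server from the question; adjust as needed.
        private static final String BASE = "http://www.myserver.com";

        public static void main(String[] args) throws IOException {
            Sardine sardine = SardineFactory.begin();  // use begin(user, pass) if the share needs auth
            walk(sardine, BASE + "/testproduct/");
        }

        static void walk(Sardine sardine, String url) throws IOException {
            for (DavResource res : sardine.list(url)) {        // issues a PROPFIND with depth 1
                String path = res.getHref().getPath();
                if (url.endsWith(path)) continue;              // the listing includes the collection itself; skip it
                System.out.println(path + "  modified: " + res.getModified());
                if (res.isDirectory()) {
                    walk(sardine, BASE + path);                // recurse into sub-collections
                }
            }
        }
    }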
If your application is not on the same machine as the server, then there isn't much you can do besides scraping the data you're looking for. If you already know every file that exists on the server, you can simply issue a web request for each one and download it. However, if you only know the root path or a single product page, then you will essentially have to crawl the web site and extract the links from each page you fetch. You would only follow URLs that are on the same host and that you haven't seen/crawled before (a sketch of such a crawler follows the example below).
For example:
if http://www.myserver.com/testproduct/
contains links to
http://www.myserver.com/testproduct/red/file1.txt
http://www.myserver.com/testproduct/red/file2.txt
http://www.devboost.com/
http://www.myspace.com/
http://blog.devboost.com/
http://beta.devboost.com/
http://www.myserver.com/testproduct/red/file2.txt
Then you would ignore any link that does not start with the host www.myserver.com, as well as any URL you have already visited (note that file2.txt appears twice above).
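A minimal sketch of that crawl, assuming the jsoup library for HTML parsing (an assumption on my part) and relying on the Apache listing convention that directory links end with a trailing slash:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.util.HashSet;
    import java.util.Set;

    public class ListingCrawler {
        private final Set<String> visited = new HashSet<>();

        // Recursively walk an Apache directory listing starting at baseUrl.
        public void crawl(String baseUrl) throws Exception {
            if (!visited.add(baseUrl)) return;              // skip pages we've already crawled
            Document doc = Jsoup.connect(baseUrl).get();
            for (Element link : doc.select("a[href]")) {
                String target = link.absUrl("href");        // resolve relative hrefs
                if (target.contains("?")) continue;         // skip Apache's column-sorting links (?C=N;O=D)
                if (!target.startsWith(baseUrl)) continue;  // stay on-host, below the start URL; drops "Parent Directory"
                if (target.endsWith("/")) {
                    crawl(target);                          // Apache renders directories with a trailing slash
                } else if (visited.add(target)) {
                    System.out.println("file: " + target);  // a leaf file; candidate for download
                }
            }
        }

        public static void main(String[] args) throws Exception {
            new ListingCrawler().crawl("http://www.myserver.com/testproduct/");
        }
    }

The trailing-slash check is also how you can distinguish files from directories in the scraped HTML, which addresses the reliability concern in the question.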
Regarding directories and timestamps: as pointed out in the comments, HTTP itself has no notion of directory browsing. For last-modified timestamps you are mostly out of luck as well, although for static files Apache does send a Last-Modified header, which you can read with a HEAD request without downloading the body.
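For completeness, a small sketch of that check using the JDK's HttpURLConnection (the URL is one of the example files from the question):

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Date;

    public class LastModifiedCheck {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://www.myserver.com/testproduct/readme.txt");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("HEAD");               // fetch headers only, no body
            long lastModified = conn.getLastModified();  // parses Last-Modified; 0 if the header is absent
            if (lastModified != 0) {
                System.out.println("Last modified: " + new Date(lastModified));
            }
            conn.disconnect();
        }
    }

Keep in mind the header is only trustworthy for static files; dynamically generated responses may omit it or report the time of generation.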
More importantly, I don't know how much it would benefit you to know that a file has not changed when that file generates dynamic content. For example, it's extremely likely that the file responsible for displaying a product page hasn't changed in a LONG time: usually the same file renders every product in the database, especially if the site is built on an MVC-type framework. In other words, you would have to parse the HTML and determine whether there are any changes you care about, then process the file accordingly.