I'm looking for something, but I don't know exactly how it can be done. I don't have deep knowledge of crawling, scraping, etc., but I believe that's the kind of technology I'm looking for.
- I have a list of around 100 websites that I'd like to monitor constantly, at least once every 3 or 4 days. On these websites I'd look for logical matches, like:
Text contains 'ABC' AND doesn't contain 'BCZ', OR text contains 'XYZ' AND doesn't contain 'ATM', and so on and so forth.
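(For anyone wondering what I mean, a rule like that is simple to express in code. Here is a minimal sketch in Python, using the example keywords above and assuming the page or document text has already been extracted as a string:)

```python
# Minimal sketch of one matching rule: match if the text contains 'ABC' but
# not 'BCZ', or contains 'XYZ' but not 'ATM'. Keywords are just the examples
# from the post above.
def matches(text: str) -> bool:
    t = text.upper()  # compare case-insensitively
    return (("ABC" in t and "BCZ" not in t)
            or ("XYZ" in t and "ATM" not in t))

print(matches("some ABC here"))    # True
print(matches("ABC with BCZ"))     # False
```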
The tool would have to search these websites' content in the following formats:
- Web pages
- DOC files
- DOCX files
- XLS files
- XLSX files
- TXT files
- RTF files
- PDF files
- RAR and ZIP files
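(The archive formats are the part I imagine needs the most plumbing: the tool would have to unpack each ZIP/RAR and run the same matching over every member. A sketch of the ZIP side with Python's standard `zipfile` module is below; RAR would need a third-party library such as `rarfile`, which I'm not showing:)

```python
import io
import zipfile

# Hypothetical helper: unpack a ZIP held in memory and yield (name, bytes) for
# each member, so the same text-matching rules can be applied to the contents.
def iter_zip_members(data: bytes):
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                yield info.filename, zf.read(info)

# Build a small ZIP in memory just to demonstrate:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("report.txt", "text containing ABC")

members = dict(iter_zip_members(buf.getvalue()))
print(members["report.txt"].decode())  # text containing ABC
```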
The matches would have to be incremental (I only want the most recent ones, from the previous X days).
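(I assume the usual way to get incremental results is to remember what was already reported. A hypothetical sketch, keeping a hash of every previously matched document so each run only surfaces new ones:)

```python
import hashlib

# Hypothetical "seen" store: remember a hash of every document already
# reported, so a crawl run only surfaces matches that are new since last time.
# In a real tool this set would be persisted to disk or a database.
def is_new(content: bytes, seen: set) -> bool:
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

seen = set()
print(is_new(b"doc-1", seen))  # True  (first time)
print(is_new(b"doc-1", seen))  # False (already reported)
```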
Most importantly, out of these 100 websites, around 40 require user authentication (which I have already).
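(From what I understand, handling a login usually means keeping the session cookie between requests. A standard-library sketch of a cookie-aware opener in Python; the login URL and form fields are placeholders that would differ per site:)

```python
import http.cookiejar
import urllib.request

# Hypothetical sketch: a cookie-aware opener so that a login POST's session
# cookie is stored and automatically resent on later requests to that site.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# opener.open("https://example.com/login", data=b"user=me&pass=secret")
# ...subsequent opener.open(...) calls to the same site reuse the cookies.
```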
Whenever there's a match, I'd like to download the file and record:
- File
- Link
- Date/time
- Report of matches
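(To illustrate the report part: each hit could simply be appended as a row of link, timestamp, and matched rule to a running CSV file. A hypothetical sketch, with the saved file itself handled separately:)

```python
import csv
import datetime

# Hypothetical match log: append one CSV row per hit with the link, the
# date/time of the match, and which rule fired.
def log_match(report_path: str, link: str, rule: str) -> None:
    with open(report_path, "a", newline="") as f:
        csv.writer(f).writerow(
            [link, datetime.datetime.now().isoformat(), rule]
        )
```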
I've been playing around with tools like import.io, but I haven't figured out how to do it properly!
Does anyone know exactly what kind of technology I'm looking for? Who (what kind of specialist or programmer) could build this for me? Would it be too hard for a programmer who understands data crawling to build?
Sorry for the long post