I'm looking for something, and I don't know exactly how it can be done. I don't have deep knowledge of crawling, scraping, etc., but I believe that's the kind of technology I'm looking for.
- I have a list of around 100 websites that I'd like to monitor constantly, at least once every 3 or 4 days. On these websites I'd look for logical matches, like:
Text contains 'ABC' AND doesn't contain 'BCZ', OR text contains 'XYZ' AND doesn't contain 'ATM', and so on and so forth.
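For what it's worth, this kind of boolean rule is simple for a programmer to express in code. A minimal sketch in Python (the rule structure and function name are my own illustration, not from any particular tool):

```python
# A "rule" is a list of OR-groups; each group is (must_contain, must_not_contain).
# The rule matches if ANY group has all its required terms present in the text
# and none of its forbidden terms.

def matches(text, rule):
    text = text.lower()
    for required, forbidden in rule:
        if all(term.lower() in text for term in required) and \
           not any(term.lower() in text for term in forbidden):
            return True
    return False

# "contains 'ABC' AND NOT 'BCZ', OR contains 'XYZ' AND NOT 'ATM'"
rule = [
    (["ABC"], ["BCZ"]),
    (["XYZ"], ["ATM"]),
]

print(matches("some text with abc inside", rule))  # True
print(matches("abc but also bcz", rule))           # False
```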
The tool would have to look into these websites in:
- Web pages
- DOC files
- DOCX files
- XLS files
- XLSX files
- TXT files
- RTF files
- PDF files
- RAR and ZIP files
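To give a sense of the work involved: a crawler would need a per-format text extractor before it can run the keyword rules. Here's a sketch of the dispatch logic, with only TXT and ZIP implemented via Python's standard library; DOC/DOCX/XLS(X)/PDF/RTF/RAR would each need a third-party extractor (libraries like python-docx, openpyxl, pdfminer, and rarfile exist for this, named here only as examples):

```python
import zipfile
from pathlib import Path

def extract_text(path):
    """Return the plain text of a downloaded file, chosen by extension.

    Only .txt and .zip are handled here with the standard library;
    the remaining formats are left as stubs for third-party extractors.
    """
    ext = Path(path).suffix.lower()
    if ext == ".txt":
        return Path(path).read_text(errors="replace")
    if ext == ".zip":
        # Recurse into archive members and concatenate their text.
        parts = []
        with zipfile.ZipFile(path) as zf:
            for name in zf.namelist():
                if Path(name).suffix.lower() == ".txt":
                    parts.append(zf.read(name).decode(errors="replace"))
                # Other member types would go through the same stubs below.
        return "\n".join(parts)
    if ext in (".doc", ".docx", ".xls", ".xlsx", ".rtf", ".pdf", ".rar"):
        raise NotImplementedError(f"plug in an extractor for {ext}")
    raise ValueError(f"unsupported file type: {ext}")
```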
The matches would have to be incremental (I just want the most recent ones, from the previous X days).
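The incremental part is usually handled by persisting what was already reported. A minimal sketch, assuming each match is identified by its URL (the state file name and window size are illustrative):

```python
import json
import time
from pathlib import Path

SEEN_FILE = Path("seen.json")  # illustrative path for persisted state
WINDOW_DAYS = 4                # "previous X days"

def load_seen():
    """Load the url -> unix-timestamp map of previously reported matches."""
    if SEEN_FILE.exists():
        return json.loads(SEEN_FILE.read_text())
    return {}

def new_matches(urls, seen, now=None):
    """Return only URLs not reported within the last WINDOW_DAYS days,
    and record them in `seen`."""
    now = now or time.time()
    cutoff = now - WINDOW_DAYS * 86400
    fresh = [u for u in urls if seen.get(u, 0) < cutoff]
    for u in fresh:
        seen[u] = now
    return fresh

def save_seen(seen):
    SEEN_FILE.write_text(json.dumps(seen))
```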
Most importantly, out of these 100 websites, around 40 require user authentication (for which I already have credentials).
Whenever there's a match, I'd like to download:
- File
- Link
- Date/time
- Report of matches
I've been playing around with tools like import.io, but I haven't figured out how to do it properly!
Does anyone know exactly what kind of technology I'm looking for? Who (what kind of specialist or programmer) could build this for me? Is it too hard to build for a programmer who understands data crawling?
Sorry for the long post
For the 60 websites that don't require authentication:
You can use a tool like backstitch to mark the websites you want to monitor and get an interactive thumbnail feed of pages whose content contains your keywords. Backstitch supports boolean operators (the AND / OR functionality you described) and has an API that may let you export the results in the format you need.
Their support team (and CEO) have been very helpful in the past with describing how their API can be used for custom search cases. Good luck!
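For the ~40 authenticated sites, which hosted tools generally can't reach, a programmer would typically script the login with an HTTP session and reuse its cookies for later requests. A minimal sketch with Python's standard library (the login URL and form field names are placeholders you'd have to adapt per site):

```python
import http.cookiejar
import urllib.parse
import urllib.request

def make_session():
    """Build an opener that keeps cookies across requests, i.e. stays logged in."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )

def login_request(login_url, username, password):
    """Prepare a POST login request; the form field names are site-specific."""
    data = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode()
    return urllib.request.Request(login_url, data=data, method="POST")

# Usage (not executed here; requires network access and real credentials):
# opener = make_session()
# opener.open(login_request("https://example.com/login", "user", "pass"))
# html = opener.open("https://example.com/protected-page").read().decode()
```

Frameworks like Scrapy wrap this same session handling, so a crawling specialist wouldn't be starting from scratch.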