I am working on a metrics project for my team. I have to load several different reports into a central repository and then create tables and reports off of this data.
The data sources are:
- CSV files
- PDFs
- Ad-hoc/manual data.
I was playing with Talend and MySQL. I am a little confused as to how to load the CSV files. Should I have a collection of directories and one or more scheduled tasks that load the files?
Another thought was to write a custom file processor that will load the file based on naming convention. What are your thoughts?
"PDF" is something complicated. as it is pdf... "ad-hoc/manual data" needs much more details.
But if we focus on csv and your question is related to those guys only if I'm right, I'd do this by writing an app which calls an SP in my mySQL DB, handing over the full path to the csv (and any additional data, such as table's "user friendly name" if needed - or any other meta-data you'd like to store) which executes an import using mySQL Load Data.
The reason is, there can be many rules in the "business logic" after a csv was imported, and it's easier to maintain an app according to changing business requirements, than changing DB behavior all the time, and, if something goes terribly wrong my DB would be safe and only the "import manager app" fails - as I don't have to store neither that nor the csvs on the same system where my DB is.
DBs, relational DBs are basically about storing data, and retrieving data rapidly based on 'set theory', not about taking care of how the data gets into the system.
So think about these questions before you start implementing anything:
- What happens to the csv after it was processed? Can be deleted? Should be moved to e.g. a "processed" folder? Should remain/stay where it was intact?
- If should stay where it was and as it was, what should I do to know I already processed the file? (set the "ready to archive" flag, for instance? Touch the "last modified" date and set it to 1950.01.01? Add an extra property to the file?
- What should I do if csv import fails (e.g. invalid data in file, or NULL value where it shouldn't have NULLs)? Display error? Mark csv as unusable? Send e-mail? Move to "processing_failed" folder?
- What to do if file count grows huge in the input folder?
- How easily can I change the import/process/etc if business logic changes, or csv format changes?
and so on. Think through all the options you have and decide.
I hope I answered your question ;)