I am working on a metrics project for my team. I have to load several different reports into a central repository and then build tables and reports from that data.
The data sources are:
- CSV files
- PDFs
- Ad-hoc/manual data
I was playing with Talend and MySQL. I am a little confused as to how to load the CSV files. Should I have a collection of directories and one or more scheduled tasks that load the files?
Another thought was to write a custom file processor that loads each file based on a naming convention, something like the sketch below. What are your thoughts?
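For concreteness, here is roughly what I mean by the naming-convention idea; the drop directory, filename prefixes, and table names are just placeholders:

```python
from pathlib import Path

# Placeholder mapping from filename prefix to target MySQL table.
PREFIX_TO_TABLE = {
    "sales_": "sales_metrics",
    "tickets_": "ticket_metrics",
}

INBOX = Path("/data/metrics/inbox")  # placeholder drop directory

def route(path):
    """Pick a target table from the file's naming convention, or None."""
    for prefix, table in PREFIX_TO_TABLE.items():
        if path.name.startswith(prefix):
            return table
    return None

# A scheduled task could run this loop over whatever has landed so far.
for csv_file in sorted(INBOX.glob("*.csv")):
    table = route(csv_file)
    if table is None:
        print(f"skipping {csv_file.name}: no naming rule matched")
        continue
    print(f"would load {csv_file.name} into {table}")
```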
"PDF" is something complicated. as it is pdf... "ad-hoc/manual data" needs much more details.
But if we focus on the CSVs, and if I understand you correctly your question is really about those, I would write an app that hands the full path of the CSV to my MySQL database (along with any extra metadata you want to store, such as the table's "user friendly name"), runs the import with MySQL's LOAD DATA, and then calls a stored procedure for the follow-up steps. (Note that MySQL does not allow LOAD DATA inside a stored procedure itself, so the app has to issue that statement directly.)
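A minimal sketch of that import-manager app, assuming mysql-connector-python, a staging table called `staging_csv`, and a hypothetical post-import procedure `after_csv_import` (the server also needs `local_infile` enabled):

```python
import mysql.connector  # pip install mysql-connector-python

def import_csv(path, friendly_name):
    """Load one CSV into a staging table, then run the post-import rules.

    Table, procedure, and connection details are placeholders.
    """
    conn = mysql.connector.connect(
        host="db.example.internal",   # assumption: DB on a separate host
        user="metrics_loader",
        password="...",               # use real credential management
        database="metrics",
        allow_local_infile=True,      # needed for LOAD DATA LOCAL
    )
    try:
        cur = conn.cursor()
        # LOAD DATA LOCAL streams the file from this machine, so the
        # CSV never has to sit on the DB server itself.
        cur.execute(
            """
            LOAD DATA LOCAL INFILE %s
            INTO TABLE staging_csv
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
            LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
            """,
            (path,),
        )
        # Hypothetical stored procedure holding the business rules that
        # should run after each import.
        cur.callproc("after_csv_import", (friendly_name,))
        conn.commit()
    finally:
        conn.close()
```

Because LOAD DATA LOCAL streams the file from the client, the CSVs can stay on the app host, which is exactly the isolation argument in the next paragraph.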
The reason is that there can be a lot of "business logic" rules to apply after a CSV is imported, and it is easier to maintain an app as business requirements change than to keep changing database behavior. And if something goes terribly wrong, the database stays safe and only the "import manager" app fails, since neither the app nor the CSVs have to live on the same system as the database.
Relational databases are fundamentally about storing data and retrieving it rapidly based on set theory, not about taking care of how the data gets into the system.
So think about questions like these before you start implementing anything:
- Where do the files land, and who or what drops them there?
- What triggers an import: a schedule, a file watcher, a person?
- What happens when a file is malformed or a load fails halfway through?
- How do you avoid importing the same file twice?

...and so on. Think through all the options you have and decide.
I hope I answered your question ;)