What strategy to migrate data from a spreadsheet to an RDBMS?

Posted 2020-07-23 06:40

Question:

This is linked to my other question, when to move from a spreadsheet to an RDBMS.

Having decided to move from an Excel workbook to an RDBMS, here is what I propose to do.

The existing data is loosely structured across two sheets in a workbook. The first sheet contains the main records; the second sheet holds additional data.

My target DBMS is MySQL, but I'm open to suggestions.

  1. Define the RDBMS schema (a minimal sketch follows this list).
  2. Define, say, web services to interface with the database, so that the same interface can be used for both the UI and the migration.
  3. Define a migration script to:
    • Read each group of affiliated rows from the spreadsheet
    • Apply validation/constraints
    • Write to the RDBMS using the web service
  4. Define macros/functions/modules in the spreadsheet to enforce validation where possible. This will allow use of the existing system while the new one comes up. At the same time, it will (I hope) reduce migration failures when the move is eventually made.
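
For illustration, here is roughly what I have in mind for step 1 in MySQL; all table and column names are hypothetical placeholders, since the real columns depend on the sheets:

    -- One table per sheet: the second sheet's rows link back to the first.
    CREATE TABLE main_record (
        id         INT AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(255) NOT NULL,   -- placeholder columns; the real
        created_on DATE                     -- ones come from the first sheet
    );

    CREATE TABLE additional_data (
        id             INT AUTO_INCREMENT PRIMARY KEY,
        main_record_id INT NOT NULL,
        note           TEXT,                -- placeholder for the extra fields
        FOREIGN KEY (main_record_id) REFERENCES main_record (id)
    );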

What strategy would you follow?

Answer 1:

There are two aspects to this question.

Data migration

Your first step will be to "define the RDBMS schema", but how far are you going to go with it? Spreadsheets are notoriously un-normalized and so contain lots of duplication. You say in your other question that "Data is loosely structured, and there are no explicit constraints." If you want to transform that into a rigorously defined schema (at least 3NF) then you are going to have to do some cleansing. SQL is the best tool for data manipulation.

I suggest you build two staging tables, one for each worksheet. Define the columns as loosely as possible (big strings, basically) so that it is easy to load the spreadsheets' data. Once you have the data loaded into the staging tables you can run queries to assess the data quality, for example (a couple of these checks are sketched after the list):

  • how many duplicate primary keys?
  • how many different data formats?
  • what are the look-up codes?
  • do all the rows in the second worksheet have parent records in the first?
  • how consistent are code formats, data types, etc?
  • and so on.
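
For instance, with loosely typed staging tables, the first and fourth checks might look like this in MySQL (all table and column names are hypothetical):

    -- Staging tables: everything is a big string so the raw data loads cleanly.
    CREATE TABLE stg_main       (record_id VARCHAR(500), name VARCHAR(500), created_on VARCHAR(500));
    CREATE TABLE stg_additional (record_id VARCHAR(500), note VARCHAR(2000));

    -- How many duplicate primary keys?
    SELECT record_id, COUNT(*)
    FROM   stg_main
    GROUP  BY record_id
    HAVING COUNT(*) > 1;

    -- Do all the rows in the second worksheet have parent records in the first?
    SELECT s.*
    FROM   stg_additional s
           LEFT JOIN stg_main m ON m.record_id = s.record_id
    WHERE  m.record_id IS NULL;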

These investigations will give you a good basis for writing the SQL with which you can populate your actual schema.
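
A populate step might then be as simple as this (hypothetical names again, with a couple of trivial cleansing functions shown):

    -- Move cleansed data from staging into the real schema.
    INSERT INTO main_record (name, created_on)
    SELECT DISTINCT TRIM(name),
           STR_TO_DATE(created_on, '%d/%m/%Y')   -- assumes one known date format
    FROM   stg_main;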

Or it might be that the data is so hopeless that you decide to stick with just the two tables. I think that is an unlikely outcome (most applications have some underlying structure; we just have to dig deep enough).

Data Loading

Your best bet is to export the spreadsheets to CSV format. Excel has a wizard to do this; use it (rather than doing Save As...). If the spreadsheets contain any free text at all, the chances are you will have sentences which contain commas, so make sure you choose a really safe multi-character separator, such as ^^~

Most RDBMS tools have a facility to import data from CSV files. PostgreSQL and MySQL are the obvious options for an NGO (I presume cost is a consideration), but both SQL Server and Oracle come in free (if restricted) Express editions. SQL Server obviously has the best integration with Excel. Oracle has a nifty feature called external tables, which allows us to define a table whose data is held in a CSV file, removing the need for staging tables.
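
In MySQL, for example, loading one of the exported files into the staging table sketched above is a one-liner (hypothetical file and table names, using the ^^~ separator suggested earlier):

    -- Load the CSV export into a staging table.
    LOAD DATA LOCAL INFILE 'main_sheet.csv'
    INTO TABLE stg_main
    FIELDS TERMINATED BY '^^~'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;   -- skip the header row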

One other thing to consider is Google App Engine. This uses Bigtable rather than an RDBMS, but that might be better suited to your loosely structured data. I suggest it because you mentioned Google Docs as an alternative solution. GAE is an attractive option because it is free (more or less; they start charging if usage exceeds some very generous thresholds) and it would solve the app-sharing issue with those other NGOs. Obviously your organisation may have some qualms about Google hosting their data; it depends on what field they are operating in and the sensitivity of the information.



Answer 2:

Obviously, you need to create a target DB and the necessary table structure. I would skip the web services and write a Groovy script which reads the .xls (using the Apache POI library), validates the data, and saves it in the database.

In my view, anything more involved (web services, GUI...) is not justified: these kinds of tasks are very well suited to scripts because scripts are concise and extremely flexible, while things like performance and code-base scalability are less of an issue here. Once you have something that works, you will be able to adapt the script to any future document with different data anomalies in a matter of minutes or a few hours.

This is all assuming your data isn't in perfect order and needs to be filtered and/or cleaned.

Alternatively, if the data and validation rules aren't too complex, you can probably get good results by using a visual data-transfer tool like Kettle: you just define the .xls as your source and the database table as the target, add some validation/filter rules if needed, and trigger the loading process. Quite painless.



Answer 3:

If you'd rather use a tool than roll your own, check out SeekWell, which lets you write to your database from Google Sheets. Once you define your schema, select the tables into a Sheet, then edit or insert records and mark them for the appropriate action (e.g., update, insert, etc.). Set the schedule for the update and you're done. Read more about it here. Disclaimer: I'm a co-founder.

Hope that helps!



Answer 4:

You might be doing more work than you need to. Excel spreadsheets can be saved as CSV or XML files, and many RDBMS clients support importing these files directly into tables.

This could allow you to skip writing web-service wrappers and migration scripts. Your database constraints would still be properly enforced during any import. If your RDBMS data model or schema is very different from your Excel spreadsheets, however, then some translation would of course have to take place via scripts or XSLT.