I am trying to use data from the USDA found here: http://www.ars.usda.gov/Services/docs.htm?docid=23634
ASCII (8.6Mb) - This file contains the SR26 data in ASCII (ISO/IEC 8859-1), delimited files. These tables are organized in a relational format, and are best used with a relational database management system (RDBMS), which will allow you to form your own queries of the database and generate custom reports.
I am new to manipulating data like this and think I would like to get it in CSV, maybe? But, then I might lose the relationships so maybe I should go right to PGSQL. Not sure how to approach this.
Looking for guidance, thanks.
The zip contains a number of files:
inflating: DATA_SRC.txt
inflating: DATSRCLN.txt
inflating: DERIV_CD.txt
inflating: FD_GROUP.txt
inflating: FOOD_DES.txt
inflating: FOOTNOTE.txt
inflating: LANGDESC.txt
inflating: LANGUAL.txt
inflating: NUT_DATA.txt
inflating: NUTR_DEF.txt
inflating: sr26_doc.pdf
inflating: SRC_CD.txt
inflating: WEIGHT.txt
each of which appears to be in a bizarre almost-CSV-like format, e.g. NUTR_DEF.txt
:
~287~^~g~^~GALS~^~Galactose~^~2~^~2100~
~291~^~g~^~FIBTG~^~Fiber, total dietary~^~1~^~1200~
plus sr26_doc.pdf
, the documentation.
Creating table definitions
So what you need to do here is create SQL table definitions for the database - with one table for each input file. You need the CREATE TABLE
command for this; see the PostgreSQL documentation.
Page 35 of the PDF should help you - "Figure 1. Relationships among files in the USDA National Nutrient Database for Standard Reference". The following pages describe the file formats, telling you what each column means. You can write CREATE TABLE
statements based on this description.
Here's an example, for FOOD_DES.txt
(food description), the first entry.
CREATE TABLE food_des (
"NDB_No" varchar(5) NOT NULL PRIMARY KEY,
"FdGrp_Cd" varchar(4) NOT NULL,
"Long_Desc" varchar(200) NOT NULL,
"Shrt_Desc" varchar(60) NOT NULL,
"ComName" varchar(100),
"ManufacName" varchar(65),
"Survey" varchar(1),
"Ref_desc" varchar(135),
"Refuse" smallint,
"SciName" varchar(65),
"N_Factor" NUMERIC(4,2),
"Pro_Factor" NUMERIC(4,2),
"Fat_Factor" NUMERIC(4,2),
"CHO_Factor" NUMERIC(4,2)
);
That's a pretty literal copy of the description. It's not how I'd design the table
I've used NUMERIC
arbitrary-precision decimal floating point types for accuracy in non-integer numeric types. If performance is more important than accuracy, you can use float4
instead.
For relationships, you use FOREIGN KEY
constraints - just colname coltype REFERENCES othertable(othercol)
is sufficient to create one.
Important: I double quoted the column names to preserve the same name as in the definitions. That means you have to always double quote them when you refer to them, e.g. SELECT "NDB_No" FROM food_des;
. If you don't want that, just leave off the double quotes - or pick different names. You don't have to stick to the clumsy abbreviated column names they used, and it's quite reasonable to write:
CREATE TABLE food_description (
ndb_no varchar(5) NOT NULL PRIMARY KEY,
foodgroup_code varchar(4) NOT NULL,
long_description varchar(200) NOT NULL,
short_description varchar(60) NOT NULL,
common_name varchar(100),
manufacturer_name varchar(65),
etc. Similarly, if you're working with Rails, you can convert the table definitions to follow Rails's conventions, especially if you then intend to do the data loading via Rails.
Loading data
If these were sane, sensible delimited files you could then just load each table using the psql
command \copy
, or PgAdmin-III's "import" option.
It is actually CSV, they've just decided to use totally bizarre delimiter and quote chars. Import via psql
with:
\copy food_des FROM 'FOOD_DES.txt' (FORMAT CSV, DELIMITER '^', QUOTE '~');
or the equivalent in whatever tool you use to talk to PostgreSQL.
The results are a sensible looking table:
craig=> select * from food_des limit 2;
NDB_No | FdGrp_Cd | Long_Desc | Shrt_Desc | ComName | ManufacName | Survey | Ref_desc | Refuse | SciName | N_Factor | Pro_Factor | Fat_Factor | CHO_Factor
--------+----------+----------------------------+--------------------------+---------+-------------+--------+----------+--------+---------+----------+------------+------------+------------
01001 | 0100 | Butter, salted | BUTTER,WITH SALT | | | Y | | 0 | | 6.38 | 4.27 | 8.79 | 3.87
01002 | 0100 | Butter, whipped, with salt | BUTTER,WHIPPED,WITH SALT | | | Y | | 0 | | 6.38 | 4.27 | 8.79 | 3.87
(2 rows)
Similarly, if using Rails you can use whatever Rails CSV library you want and bulk-load into models.
Using the tips from @craig-ringer I've created a script to migrate the original USDA Nutrition Database files to postgreSQL.
Is it very rough at the moment, but should help to get started.
Check it out here:
https://github.com/rlucha/usda-nutrition-database-migration
They have a version of that database for Microsoft access, if you have it you could download to access then convert to MySQL using Access To MySQL.