How can I import data from ASCII (ISO/IEC 8859-1) files?

Posted 2019-01-17 08:24

I am trying to use data from the USDA found here: http://www.ars.usda.gov/Services/docs.htm?docid=23634

ASCII (8.6Mb) - This file contains the SR26 data in ASCII (ISO/IEC 8859-1), delimited files. These tables are organized in a relational format, and are best used with a relational database management system (RDBMS), which will allow you to form your own queries of the database and generate custom reports.

I am new to manipulating data like this and think I would like to get it into CSV, maybe? But then I might lose the relationships, so maybe I should go straight to PostgreSQL. I'm not sure how to approach this.

Looking for guidance, thanks.

3 Answers

Answer 2 · 2019-01-17 08:45

Using the tips from @craig-ringer, I've created a script to migrate the original USDA Nutrition Database files to PostgreSQL.

It is very rough at the moment, but it should help you get started.

Check it out here: https://github.com/rlucha/usda-nutrition-database-migration

SAY GOODBYE · Answer 3 · 2019-01-17 08:46

They have a version of that database for Microsoft Access. If you have Access, you could download that version and then convert it to MySQL using the Access To MySQL tool.

再贱就再见 · Answer 4 · 2019-01-17 08:54

The zip contains a number of files:

  inflating: DATA_SRC.txt            
  inflating: DATSRCLN.txt            
  inflating: DERIV_CD.txt            
  inflating: FD_GROUP.txt            
  inflating: FOOD_DES.txt            
  inflating: FOOTNOTE.txt            
  inflating: LANGDESC.txt            
  inflating: LANGUAL.txt             
  inflating: NUT_DATA.txt            
  inflating: NUTR_DEF.txt            
  inflating: sr26_doc.pdf            
  inflating: SRC_CD.txt              
  inflating: WEIGHT.txt         

each of which appears to be in a bizarre almost-CSV-like format, e.g. NUTR_DEF.txt:

~287~^~g~^~GALS~^~Galactose~^~2~^~2100~
~291~^~g~^~FIBTG~^~Fiber, total dietary~^~1~^~1200~

plus sr26_doc.pdf, the documentation.

Creating table definitions

So what you need to do here is create SQL table definitions for the database - with one table for each input file. You need the CREATE TABLE command for this; see the PostgreSQL documentation.

Page 35 of the PDF should help you - "Figure 1. Relationships among files in the USDA National Nutrient Database for Standard Reference". The following pages describe the file formats, telling you what each column means. You can write CREATE TABLE statements based on this description.

Here's an example, for FOOD_DES.txt (food description), the first entry.

CREATE TABLE food_des (
    "NDB_No"      varchar(5) NOT NULL PRIMARY KEY,
    "FdGrp_Cd"    varchar(4) NOT NULL,
    "Long_Desc"   varchar(200) NOT NULL,
    "Shrt_Desc"   varchar(60) NOT NULL,
    "ComName"     varchar(100),
    "ManufacName" varchar(65),
    "Survey"      varchar(1),
    "Ref_desc"    varchar(135),
    "Refuse"      smallint,
    "SciName"     varchar(65),
    "N_Factor"    NUMERIC(4,2),
    "Pro_Factor"  NUMERIC(4,2),
    "Fat_Factor"  NUMERIC(4,2),
    "CHO_Factor"  NUMERIC(4,2)
);

That's a pretty literal copy of the description. It's not how I'd design the table.

I've used NUMERIC, an arbitrary-precision decimal type, for the non-integer numeric columns so the values stay exact. If performance is more important than accuracy, you can use float4 instead.

For relationships, you use FOREIGN KEY constraints - just colname coltype REFERENCES othertable(othercol) is sufficient to create one; see the sketch below.
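
For example, food_des's "FdGrp_Cd" refers to a row in FD_GROUP. Here's a minimal sketch of how that looks; the fd_group column names and lengths are my assumptions based on the PDF's description, so verify them against sr26_doc.pdf before using this.

-- Hypothetical sketch: column names/lengths assumed, check sr26_doc.pdf.
CREATE TABLE fd_group (
    "FdGrp_Cd"   varchar(4) NOT NULL PRIMARY KEY,
    "FdGrp_Desc" varchar(60) NOT NULL
);

-- Trimmed-down food_des showing only the foreign key column; in practice
-- you'd keep the full column list from the definition above.
CREATE TABLE food_des (
    "NDB_No"   varchar(5) NOT NULL PRIMARY KEY,
    "FdGrp_Cd" varchar(4) NOT NULL REFERENCES fd_group("FdGrp_Cd")
    -- ... remaining columns as in the full definition above ...
);

Note that the referenced table must already exist, with a primary key or unique constraint on the referenced column, before you can create the foreign key - so create fd_group first, or add the constraints afterwards with ALTER TABLE.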

Important: I double-quoted the column names to preserve the same names as in the definitions. That means you always have to double quote them when you refer to them, e.g. SELECT "NDB_No" FROM food_des;. If you don't want that, just leave off the double quotes - or pick different names. You don't have to stick to the clumsy abbreviated column names they used, and it's quite reasonable to write:

CREATE TABLE food_description (
    ndb_no              varchar(5) NOT NULL PRIMARY KEY,
    foodgroup_code      varchar(4) NOT NULL,
    long_description    varchar(200) NOT NULL,
    short_description   varchar(60) NOT NULL,
    common_name         varchar(100),
    manufacturer_name   varchar(65),

etc. Similarly, if you're working with Rails, you can convert the table definitions to follow Rails's conventions, especially if you then intend to do the data loading via Rails.

Loading data

If these were sane, sensible delimited files, you could then just load each table using the psql command \copy, or PgAdmin-III's "import" option.

It is actually CSV; they've just decided to use totally bizarre delimiter and quote characters. Import via psql with:

\copy food_des FROM 'FOOD_DES.txt' (FORMAT CSV, DELIMITER '^', QUOTE '~');

or the equivalent in whatever tool you use to talk to PostgreSQL.
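
One caveat: the files are ISO/IEC 8859-1 encoded, so if your database uses UTF-8 the copy can fail on accented characters unless you declare the source encoding. COPY (and therefore \copy) has accepted an ENCODING option since PostgreSQL 9.1; a sketch, untested against these exact files:

\copy food_des FROM 'FOOD_DES.txt' (FORMAT CSV, DELIMITER '^', QUOTE '~', ENCODING 'LATIN1');

Alternatively, running SET client_encoding TO 'LATIN1'; before the \copy achieves the same thing.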

The results are a sensible looking table:

craig=> select * from food_des limit 2;
 NDB_No | FdGrp_Cd |         Long_Desc          |        Shrt_Desc         | ComName | ManufacName | Survey | Ref_desc | Refuse | SciName | N_Factor | Pro_Factor | Fat_Factor | CHO_Factor 
--------+----------+----------------------------+--------------------------+---------+-------------+--------+----------+--------+---------+----------+------------+------------+------------
 01001  | 0100     | Butter, salted             | BUTTER,WITH SALT         |         |             | Y      |          |      0 |         |     6.38 |       4.27 |       8.79 |       3.87
 01002  | 0100     | Butter, whipped, with salt | BUTTER,WHIPPED,WITH SALT |         |             | Y      |          |      0 |         |     6.38 |       4.27 |       8.79 |       3.87
(2 rows)

Similarly, if using Rails you can use whatever Rails CSV library you want and bulk-load into models.
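
Once the data is in, the relational structure pays off directly in SQL. For example, a hypothetical query joining foods to their food groups (this assumes the fd_group table sketched earlier):

SELECT f."NDB_No", f."Long_Desc", g."FdGrp_Desc"
FROM food_des f
JOIN fd_group g ON g."FdGrp_Cd" = f."FdGrp_Cd"
LIMIT 5;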
