XML data to PostgreSQL database

2019-01-04 12:48发布

What would be the best way to insert XML data (that I get from a webpage) into PostgreSQL database?
I'm using Java and need a little help finding a good way to read this data into the database.

3条回答
爱情/是我丢掉的垃圾
2楼-- · 2019-01-04 13:17

Postgres has (thanks to Daniel Lyons for pointing it out) native XML support which you can use to store your table. If you however want to shred your XML data manually, there are different possibilities to represent XML data in a database. The first question should be, if you want a very generic solution, that will be able to store any XML document or one that is specific to your domain (i.e. only allows XML documents of a certain structure). Depending on that, you will have a very flexible, universal representation which is however harder to query (the SQL needed will be quite complicated). If you have a more specific approach, the queries will be simpler, but you will need to create new tables or add new attributes to existing talbes every time you want to store another type of document or add a field to an existing document; so changing the schema will be harder (which is one major advantage of XML). This presentation should give you some ideas what are the different possibilities.

Also, you might consider to switch to some DB that supports Xquery, like DB2. The ability to natively query using XQuery, a language targeted at processing XML, will simplify things a lot.

UPDATE: Given your comment, your XML data (that you linked to) is perfectly relational. It can be mapped 1:1 to the following table:

CREATE TABLE mynt (
    ID          SERIAL     ,
    myntnafn    CHAR(3)    ,
    myntheiti   Varchar(255) ,
    kaupgengi   Decimal(15,2) ,
    midgengi    Decimal(15,2) ,
    solugengi   Decimal(15,2) ,
    dagsetning  TimeStamp      
)

So any mynt tag would be a record in the table and the corresponding sub-tags the attributes. The data types I gathered from your data, they might be wrong. The main problem is, IMO, that there is no natural primary key, so I added an autogenerated one.

查看更多
男人必须洒脱
3楼-- · 2019-01-04 13:25

I have a working implementation where I do everything inside PostgreSQL without additional libraries.

Auxiliary parsing function

CREATE OR REPLACE FUNCTION f_xml_extract_val(text, xml)
  RETURNS text AS
$func$
SELECT CASE
        WHEN $1 ~ '@[[:alnum:]_]+$' THEN
           (xpath($1, $2))[1]
        WHEN $1 ~* '/text()$' THEN
           (xpath($1, $2))[1]
        WHEN $1 LIKE '%/' THEN
           (xpath($1 || 'text()', $2))[1]
        ELSE
           (xpath($1 || '/text()', $2))[1]
       END;
$func$  LANGUAGE sql IMMUTABLE STRICT;

Handle multiple values

The above implementation doesn't handle multiple attributes at one xpath. Here is an overloaded version of f_xml_extract_val() for that. With the 3rd parameter you can pick one (the first), all or dist (distinct) values. Multiple values are aggregated to a comma-separated string.

CREATE OR REPLACE FUNCTION f_xml_extract_val(_path text, _node xml, _mode text)
  RETURNS text AS
$func$
DECLARE
   _xpath text := CASE
                   WHEN $1 ~~ '%/'              THEN $1 || 'text()'
                   WHEN lower($1) ~~ '%/text()' THEN $1
                   WHEN $1 ~ '@\w+$'            THEN $1
                   ELSE                              $1 || '/text()'
                  END;
BEGIN

-- fetch one, all or distinct values
CASE $3
    WHEN 'one'  THEN RETURN (xpath(_xpath, $2))[1]::text;
    WHEN 'all'  THEN RETURN array_to_string(xpath(_xpath, $2), ', ');
    WHEN 'dist' THEN RETURN array_to_string(ARRAY(
         SELECT DISTINCT unnest(xpath(_xpath, $2))::text ORDER BY 1), ', ');
    ELSE RAISE EXCEPTION
       'Invalid $3: >>%<<', $3;
END CASE;

END
$func$ LANGUAGE plpgsql IMMUTABLE STRICT;

COMMENT ON FUNCTION f_xml_extract_val(text, xml, text) IS '
# extract element of an xpath from XML document
# Overloaded function to f_xml_extract_val(..)
$3 .. mode is one of: one | all | dist'

Call:

SELECT f_xml_extract_val('//city', x, 'dist');

Main part

Name of target table: tbl; prim. key: id:

CREATE OR REPLACE FUNCTION f_sync_from_xml()
  RETURNS boolean AS
$func$
DECLARE
   datafile text := 'path/to/my_file.xml';  -- only relative path in db dir
   myxml    xml  := pg_read_file(datafile, 0, 100000000);  -- arbitrary 100 MB max.
BEGIN

-- demonstrating 4 variants of how to fetch values for educational purposes
CREATE TEMP TABLE tmp ON COMMIT DROP AS
SELECT (xpath('//some_id/text()', x))[1]::text AS id    -- id is unique  
      ,f_xml_extract_val('//col1', x)          AS col1  -- one value
      ,f_xml_extract_val('//col2/', x, 'all')  AS col2  -- all values incl. dupes
      ,f_xml_extract_val('//col3/', x, 'dist') AS col3  -- distinct values
FROM   unnest(xpath('/xml/path/to/datum', myxml)) x;

-- 1.) DELETE?

-- 2.) UPDATE
UPDATE tbl t
SET   (  col_1,   col2,   col3) =
      (i.col_1, i.col2, i.col3)
FROM   tmp i
WHERE  t.id = i.id
AND   (t.col_1, t.col2, t.col3) IS DISTINCT FROM
      (i.col_1, i.col2, i.col3);

-- 3.) INSERT NEW
INSERT INTO tbl
SELECT i.*
FROM   tmp i
WHERE  NOT EXISTS (SELECT 1 FROM tbl WHERE id = i.id);

END
$func$  LANGUAGE plpgsql VOLATILE;

Important points:

  • This implementation checks on a primary key if the inserted row exists already and updates in this case. Only new rows are inserted.

  • I use a temporary staging table to speed up the procedure.

  • pg_read_file() has restrictions to it. I quote the manual:

    Use of these functions is restricted to superusers.

    And:

    Only files within the database cluster directory and the log_directory can be accessed.

So you have to put your source file there - or create a symbolic link to your actual file/directory.

Or you can provide the file via Java in your case (I did it all inside Postgres).

Or you can import the data into 1 column of 1 row of a temporary table and take it from there.

Or you can use lo_import like demonstrated in this related answer on dba.SE.

  • Tested with Postgres 8.4, 9.0 and 9.1.

  • XML has to be well-formed.

This blog post by Scott Bailey helped me.

查看更多
Explosion°爆炸
4楼-- · 2019-01-04 13:35

PostgreSQL has an XML datatype. There are lots of XML specific functions you can use to query and modify the data, such as with xpath.

From the Java side, you can pretend you're just working with strings, but know that the data is well-formed on the way out and it won't let you store non-well-formed data.

查看更多
登录 后发表回答