Oracle SQL-Loader handling efficiently internal Do

2019-07-30 14:46发布

I have some Oracle SQL Loader challenges and looking for an efficient and simple solution. my source files to be loaded are pipe | delimited, where values are enclosed by Double Quotes ". the problem seems to be that some of the values contains internal Double Quotes.

e.g.: ..."|"a":"b"|"...

this causes my records to be rejected under the excuse of:

no terminator found after TERMINATED and ENCLOSED field

there are various solutions over the web but non seems to fit:

[1] I have tried to replace all internal double quotes in quoting the quotes, but it seems that when applying this function on too many fields on the control files (I have ~2000+ fields and using FILLER to load only a subset) the loader complains again:

SQL*Loader-350: Syntax error at line 7.
Expecting "," or ")", found ",".
field1  char(36) "replace(:field1,'"','""')",

(I do not know why but when applying this solution on a narrow subset of columns it does seem to work)

thing is that potentially all fields may include internal double quotes.

[2] I'm able to load all data when omitting the global optionally enclosed by '"', but then all enclosing quotes becomes part of the data in the target table.

[3] I can omit the global optionally enclosed by '"' statement and place it only at selected fields, while try to "replace(:field1,'"','""')" statement on the remainder, but this is difficult to implement, as I cannot know ahead what are the suspected fields to include internal double quotes.

here are my questions:

  1. is there no simple way to convince the loader to handle with care internal double quotes (when values are enclosed by them)?

  2. if I'm forced to fix the data ad-hock, is there a one liner Linux command to convert only internal double quotes to another string/char, say, single quotes?

  3. if I'm forced to load data with the quotes to the target table, is there a simple way to remove the enclosing double quotes from all fields, all at once (the table has ~1000 columns). is the solution practical performance wise to very large tables?

1条回答
Luminary・发光体
2楼-- · 2019-07-30 14:53

If you never had pipes in the enclosed fields you could do it from the control file. If you can have both pipes and double-quotes within a field then I think you have no choice but to preprocess the files, unfortunately.

Your solution [1], to replace double-quotes with an SQL operator, is happening too late to be useful; the delimiters and enclosures have already been interpreted by SQL*Loader before it does the SQL step. Your solution [2], to ignore the enclosure, would work in combination with [1] - until one of the fields did contain a pipe character. And solution [3] has the same problems as using [1] and/or [2] globally.

The documentation for specifying delimiters mentions that:

Sometimes the punctuation mark that is a delimiter must also be included in the data. To make that possible, two adjacent delimiter characters are interpreted as a single occurrence of the character, and this character is included in the data.

In other words, if you repeated the double-quotes inside the fields then they would be escaped and would appear in the table data. As you can't control the data generation, you could preprocess the files you get to replace all the double-quotes with escaped double quotes. Except you don't want to replace all of them - the ones that are actually real enclosures should not be escaped.

You could use a regular expression to target the relevant characters will skipping others. Not my strong area, but I think you can do this with lookahead and lookbehind assertions.

If you had a file called orig.txt containing:

"1"|A|"B"|"C|D"
"2"|A|"B"|"C"D"
3|A|""B""|"C|D"
4|A|"B"|"C"D|E"F"G|H""

you could do:

perl -pe 's/(?<!^)(?<!\|)"(?!\|)(?!$)/""/g' orig.txt > new.txt

That looks for a double-quote which is not preceded by the line-start anchor or a pipe character; and is not followed by a pipe character or line end anchor; and replaces only those with escaped (doubled) double-quotes. Which would make new.txt contain:

"1"|A|"B"|"C|D"
"2"|A|"B"|"C""D"
3|A|"""B"""|"C|D"
4|A|"B"|"C""D|E""F""G|H"""

The double-quotes at the start and end of fields are not modified, but those in the middle are now escaped. If you then loaded that with a control file with double-quote enclosures:

load data
truncate
into table t42
fields terminated by '|' optionally enclosed by '"'
(
  col1,
  col2,
  col3,
  col4
)

Then you would end up with:

select * from t42 order by col1;

      COL1 COL2       COL3       COL4                
---------- ---------- ---------- --------------------
         1 A          B          C|D                 
         2 A          B          C"D                 
         3 A          "B"        C|D                 
         3 A          B          C"D|E"F"G|H"        

which hopefully matches your original data. There may be edge cases that don't work (like a double-quote followed by a pipe within a field) but there's a limit to what you can do to attempt to interpret someone else's data... There may also be (much) better regular expression patterns, of course.


You could also consider using an external table instead of SQL*Loader, if the data file is (or can be) in an Oracle directory and you have the right permissions. You still have to modify the file, but you could do it automatically with the preprocessor directive, rather than needing to do that explicitly before calling SQL*Loader.

查看更多
登录 后发表回答