Bulk insert flat utf-8 with áåóñ

2019-07-25 14:16发布

问题:

I have a flat file witch I would like to bulk insert into SQL server (2016).

 BULK INSERT table FROM 'path\flatfile.dat'
   WITH 
      (
        FORMATFILE='path\fileformat.fmt', 
        CODEPAGE = '65001', 
        DATAFILETYPE = 'Char'
      );

Flat file has no field terminator:

                      9999030000000001001ab Baujahr                                                  0
                      9999030000000001004from construction year                                      0
                      9999030000000001006à partir d'année de construction                            0
                      9999030000000001007da anno                                                     0
                      9999030000000001008año construcción desde                                      0
                      9999030000000001009Vanaf bouwjaar                                              0
                      9999030000000001010fra byggeår                                                 0
                      9999030000000001011från årsmodell                                              0
                      9999030000000001012fra årsmodell                                               0
                      9999030000000001013Vuosimallista                                               0

but in format file I set the lenght of field

12.0
    7
    1       SQLCHAR           0       22      ""         1     Field1            ""
    2       SQLCHAR           0       4       ""         2     field2            ""
    3       SQLCHAR           0       3       ""         3     field3            ""
    4       SQLCHAR           0       9       ""         4     field4            ""
    5       SQLCHAR           0       3       ""         5     field             ""
    6       SQLCHAR           0       60      ""         6     Textfield         ""
    7       SQLCHAR           0       1       "\r\n"     7     flag              ""

The other files I have with similar format works great except this one. The error I'm getting is truncation of rows which contains caracters áåóñ...

    Msg 4863, Level 16, State 1, Line 6
Bulk load data conversion error (truncation) for row 3, column 7 (LFlag).
Msg 4863, Level 16, State 1, Line 6
Bulk load data conversion error (truncation) for row 5, column 7 (LFlag).
Msg 4863, Level 16, State 1, Line 6
Bulk load data conversion error (truncation) for row 7, column 7 (LFlag).
Msg 4863, Level 16, State 1, Line 6

any ideas ?

回答1:

It's unclear whether the use of format files is fully supported by BCP when working with UTF-8 - your experience suggests that it isn't. It appears that there's some confusion between byte count and character count when importing a UTF-8 file. I think this is possibly a bug in BCP.

With the following test table

USE tempdb
GO
CREATE table t1
(f1  nvarchar(max),
f2  nvarchar(max),
f3  nvarchar(max),
f4  nvarchar(max),
f5  nvarchar(max),
f6  nvarchar(max),
f7  nvarchar(max))

I was able to import your file using the format file in the question, and the same BULK INSERT command.

However, what's notable is that the final column (f7 in my table) is left-padded with a number of leading spaces equal to the number double-byte characters in column 6.

On row 3 (column 6 value à partir d'année de construction), column 7 is padded with two spaces. The same is true of rows 5 and 8, which also have two double-byte characters.

On row 9 (column 6 value fra årsmodell), column 7 is padded with one space. The same is true of row 7.

It looks like the byte-count and character-count are not being tracked correctly by BCP.

Assuming the format of your input file is fixed, the only option I can see to work around this issue would be to import the data file into a table with single nvarchar(max) column, then use LEFT, RIGHT and SUBSTRING statements to extract data values from the rows:

CREATE TABLE t2
(f1  nvarchar(max))

BULK INSERT t2 FROM 'file\location' WITH (CODEPAGE = '65001',DATAFILETYPE = 'Char' );

SELECT LEFT(f1,22), SUBSTRING(f1,23,4), SUBSTRING(f1,27,3),
       SUBSTRING(f1,30,9), SUBSTRING(f1,39,3), SUBSTRING(f1,42,60), 
       SUBSTRING(f1,102,1)
FROM t2

Edit

Tested an XML format file, which doesn't work correctly either. Although using an XML format file allows the file to load without error, the value for column 7 is lost and replaced by a space.

Format file:

<?xml version="1.0"?>  
<BCPFORMAT  
       xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"  
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">  
  <RECORD>  
    <FIELD ID="1" xsi:type="CharFixed" LENGTH="22"/>  
    <FIELD ID="2" xsi:type="CharFixed" LENGTH="4"/>  
    <FIELD ID="3" xsi:type="CharFixed" LENGTH="3"/>  
    <FIELD ID="4" xsi:type="CharFixed" LENGTH="9"/>  
    <FIELD ID="5" xsi:type="CharFixed" LENGTH="3"/>  
    <FIELD ID="6" xsi:type="CharFixed" LENGTH="60"/>  
    <FIELD ID="7" xsi:type="CharFixed" LENGTH="1"/>  
    <FIELD ID="8" xsi:type="CharTerm" TERMINATOR="\r\n"  />
  </RECORD>  
  <ROW>  
    <COLUMN SOURCE="1" NAME="C1" xsi:type="SQLCHAR" />  
    <COLUMN SOURCE="2" NAME="C2" xsi:type="SQLCHAR" />  
    <COLUMN SOURCE="3" NAME="C3" xsi:type="SQLCHAR" />  
    <COLUMN SOURCE="4" NAME="C4" xsi:type="SQLCHAR" />  
    <COLUMN SOURCE="5" NAME="C5" xsi:type="SQLCHAR" />  
    <COLUMN SOURCE="6" NAME="C6" xsi:type="SQLCHAR" />  
    <COLUMN SOURCE="7" NAME="C7" xsi:type="SQLCHAR" />  
  </ROW>  
</BCPFORMAT>

(The BULK INSERT command is the same as in the question, other than the format file name.)