BULK INSERT / OPENROWSET FormatFile Terminator for

2019-08-11 18:37发布

I've written a nice import for my million row CSV that works quite nicely (using OPENROWSET BULK (I didn't use BULK INSERT because I need to cross join with some other columns). The formatfile uses a comma as the Terminator.

Here is an example of the CSV I was using to develop:

Reference, Name, Street
1,Dave Smith, 1 Test Street
2,Sally SMith,1 Test Street

Once I'd got it working, someone reminded me that the data itself could have a comma in it, whoops!!!!

Reference, Name, Street
"1","Dave Smith", "1 Test Street"
"2","Sally Smith","1,Test Street" <-comma in street

How do people deal with a CSV with a comma in the data using FormatFiles? (Or do I just say that the file must be TAB delimited)?

2条回答
欢心
2楼-- · 2019-08-11 18:48

Delimiters within data fields is a common problem with delimited files. Some common tactics to address this include:

  1. Recreate the data file with all occurrences of the delimiter stripped out of the data fields before they are written to file: this eliminates OPENROWSET errors, but does not preserve the integrity of the data.
  2. Recreate the data file with a different delimiter character: in my experience, a tab delimiter is a better choice. It's less common to encounter a tab character than a comma within data. But it's certainly not unheard of. I've seen tabs within data too.
  3. Enclose data fields in double quotes: this requires some tweaks to the XML format file.

Manually editing the data file might be do-able for any of the above options. But it can be tedious, especially for large files. (Just opening a file of several GB's in Notepad.exe is an exercise in patience.) Realistically, you'd want the author to re-create it for you. Option #1 should always "work". But again, there's the data integrity issue you may not be able to live with. Option #2 will probably work for many cases, but it's not bulletproof. Option #3 isn't bulletproof either (it's always possible to have a delimiter within a data field), but it's about as close as you can get. Plus, it preserves data integrity.

Here is one possibility for your XML format file:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' MAX_LENGTH="5"/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' MAX_LENGTH="128" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='"\r\n' MAX_LENGTH="128" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="Reference" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="2" NAME="Name" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="3" NAME="Street" xsi:type="SQLVARYCHAR"/>
 </ROW>
</BCPFORMAT>

Notice the FIELD TERMINATOR: I used single quotes to identify "," as the terminator and "\r\n as the row terminator (the COLUMN 3 terminator). I made an educated guess that Name and Street are up to 128 characters--edit that as needed.

Problems:

  1. OPENROWSET() queries will return Reference with a leading " double quote character. And because of that...
  2. Reference cannot be returned as an INT (or SMALLINT, BIGINT, etc.). It gets returned as a VARCHAR (xsi:type="SQLVARYCHAR")

For the particular data sample provided, I'd remove the double-quotes from Reference data fields, adjust the XML format file so that FIELD ID="1" has TERMINATOR=',"', and further adjust the XML format file so that COLUMN SOURCE="1" has xsi:type="SQLINT".

For some additional info, this blog post may help: Getting Started With OPENROWSET and the BULK Rowset Provider - Part 2

查看更多
Bombasti
3楼-- · 2019-08-11 19:01

If your field terminator can occur in the data, then it may be best to use TAB or PIPE delimiter (or whatever works for your data).

If a terminator character occurs within the data, it is interpreted as a terminator, not as data, and the data after that character is interpreted as belonging to the next field or record. Therefore, choose your terminators carefully to make sure that they never appear in your data

http://msdn.microsoft.com/en-us/library/ms191485.aspx

查看更多
登录 后发表回答