I need to export a fairly large SQL Server table ~100GB to a CSV file. But rather than the output be a single csv file, it should ideally be multiple files say 10 files each 10GB.
I see BCP has a batch_size argument but this still writes all data to a single file? Are there other free utilities for doing what I require? Either where the size of file can be specified in bytes or number of rows?
For bit of context this is so the data can be combined with other sources in a Hive/Hadoop platform, so if there are better ways of exporting the data I'm open for suggestions.
BCP's batch_size argument does not control the output, unfortunately.
Ways I've done this type of splitting:
1 - Simple but non-repeatable: Create a command file (.cmd) that runs a multiple
BCP
s over the table for specific row ranges. This probably requires anIDENTITY(1,1)
based primary key on the table.2 - Simple and repeatable, uses a lot of disk:
BCP
out the entire table to a single file and usesplit
to create as many new files as needed with a given number of bytes in each (note: splitting by lines would be a better idea IMO). Use 'Cygwin' (GnuWin32 no longer maintained) to installsplit
and any other utilities you want.Generates the following files
3 - Complex but repeatable, requires possibly insecure T-SQL: Use the
xp_cmdshell
function to call BCP inside a stored procedure that iterates through the table.FINAL NOTE: If you are using any NVARCHAR fields in your data then you need to use the
-w
flag and be aware that the output will be in UTF-16LE. I would strongly recommend converting that to UTF-8 usingiconv
(from 'Cygwin' again) before trying to do anything with it in Hadoop.I would export the file first and then divide it externally. Assuming you are running on a Windows machine, there are several "free-ware" tools out there that can help. See this other answer on Superuser for more info.
If you've got a single sortable primary key field, you could find the keys that define the boundaries of the rows you want, then just SELECT records using a WHERE that defines the boundaries.
It's similar to what Joe put for #1, but your keys need not be contiguous nor numeric. The following is a simple example:
I think you could use SQL 2012's paging functions
OFFSET
andFETCH
in conjunction with bcp: