Have a pretty straight forward SSIS package:
- OLE DB Source to get data via a view, (all string columns in db table nvarchar or nchar).
- Derived Column to format existing date and add it on to the dataset, (data type DT_WSTR).
- Multicast task to split the dataset between:
- OLE DB Command to update rows as "processed".
- Flat file destination - the connection manager of which is set to Code Page 65001 UTF-8 and Unicode is unchecked. All string columns map to DT_WSTR.
Everytime I run this package an open the flat file in Notepad++ its ANSI, never UTF-8. If I check the Unicode option, the file is UCS-2 Little Endian.
Am I doing something wrong - how can I get the flat file to be UTF-8 encoded?
Thanks
In Source -> Advance Editor -> Component Properties ->
Set Default Code Page to 65001
AlwaysUseDefaultCodePage to True
Then Source->Advance Editor -> Input And OutPut Properties
Check Each Column in External Columns and OutPut Columns and set CodePage to 65001 wherever possible.
That's it.
By the way Excel can not define data inside the file to be UTF - 8. Excel is just a file handler. You can create csv files using notepad also. as long as you fill the csv file with UTF-8 you should be fine.
Adding explanation to the answers ...
setting the CodePage to 65001 (but do NOT check the Unicode checkbox on the file source), should generate a UTF-8 file. (yes, the data types internally also should be nvarchar, etc).
But the file that is produced from SSIS does not have a BOM header (Byte Order Marker), so some programs will assume it is still ASCII, not UTF-8. I've seen this confirmed by MS employees on MSDN, as well as confirmed by testing.
The file append solution is a way around this - by creating a blank file WITH the proper BOM, and then appending data from SSIS, the BOM header remains in place. If you tell SSIS to overwrite the file, it also loses the BOM.
Thanks for the hints here, it helped me figure out the above detail.
I have recently worked on a problem where we come across a situation such as the following:
You are working on a solution using SQL Server Integration Services(Visual Studio 2005).
You are pulling data from your database and trying to place the results into a flat file (.CSV) in UTF-8 format. The solution exports the data perfectly and keeps the special characters in the file because you have used 65001 as the code page.
However, the text file when you open it or try to load it to another process, it says the file is ANSI instead of UTF-8. If you open the file in notepad and do a SAVE AS and change the encode to UTF-8 and then your external process works but this is a tedious manual work.
What I have found that when you specify the Code Page property of the Flat file connection manager, it do generates a UTF-8 file. However, it generates a version of the UTF-8 file which misses something we call as Byte Order Mark.
So if you have a CSV file containing the character AA, the BOM for UTF8 will be 0xef, 0xbb and 0xbf. Even though the file has no BOM, it’s still UTF8.
Unfortunately, in some old legacy systems, the applications search for the BOM to determine the type of the file. It appears that your process is also doing the same.
To workaround the problem you can use the following piece of code in your script task which can be ran after the export process.
using System.IO;
using System.Text;
using System.Threading;
using System.Globalization;
enter code here
static void Main(string[] args)
{
string pattern = "*.csv";
string[] files = Directory.GetFiles(@".\", pattern, SearchOption.AllDirectories);
FileCodePageConverter converter = new FileCodePageConverter();
converter.SetCulture("en-US");
foreach (string file in files)
{
converter.Convert(file, file, "Windows-1252"); // Convert from code page Windows-1250 to UTF-8
}
}
class FileCodePageConverter
{
public void Convert(string path, string path2, string codepage)
{
byte[] buffer = File.ReadAllBytes(path);
if (buffer[0] != 0xef && buffer[0] != 0xbb)
{
byte[] buffer2 = Encoding.Convert(Encoding.GetEncoding(codepage), Encoding.UTF8, buffer);
byte[] utf8 = new byte[] { 0xef, 0xbb, 0xbf };
FileStream fs = File.Create(path2);
fs.Write(utf8, 0, utf8.Length);
fs.Write(buffer2, 0, buffer2.Length);
fs.Close();
}
}
public void SetCulture(string name)
{
Thread.CurrentThread.CurrentCulture = new CultureInfo(name);
Thread.CurrentThread.CurrentUICulture = new CultureInfo(name);
}
}
when you will run the package you will find that all the CSVs in the designated folder will be converted into a UTF8 format which contains the byte order mark.
This way your external process will be able to work with the exported CSV files.
if you are looking only for particular folder...send that variable to script task and use below one..
string sPath;
sPath=Dts.Variables["User::v_ExtractPath"].Value.ToString();
string pattern = "*.txt";
string[] files = Directory.GetFiles(sPath);
I hope this helps!!
OK - seemed to have found an acceptable work-around on SQL Server Forums. Essentially I had to create two UTF-8 template files, use a File Task to copy them to my destination then make sure I was appending data rather than overwriting.