A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899 Value too large for column X
I used `Field.Substring(0, MaxLength);` but still got the error (though not for every record).
Finally I saw what should have been obvious: my string was in ANSI and the field was UTF8. Its length is defined in bytes, not characters.
This gets me to my question. What is the best way to trim my string to fit the MaxLength?
My substring code works by character length. Is there a simple C# function that can trim a UTF8 string intelligently by byte length (i.e. not hack off half a character)?
I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in l'anglais, we can cut exactly on that `n`th byte. Even in another language, we're less than 6 bytes away from a good cutting point.

So I'm going to start from @Oren's suggestion, which is to key off of the leading bit of a UTF8 char value. Let's start by cutting right at the `n+1`th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.

Three possibilities
1. If the first byte after the cut has a `0` in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.
2. If I have a `11` following the cut, the next byte after the cut is the start of a multi-byte character, so that's a good place to cut too!
3. If I have a `10`, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.

That is, though I want to cut the string after the `n`th byte, if that `n+1`th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with `11` and cut just before it.

Code
Notes: I'm using stuff like `Convert.ToByte("11000000", 2)` so that it's easy to tell what bits I'm masking (a little more about bit masking here). In a nutshell, I'm `&`ing to return what's in the byte's first two bits and bringing back `0`s for the rest. Then I check the `XX` from `XX000000` to see if it's `10` or `11`, where appropriate.

I found out today that C# 6.0 might actually support binary representations, which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The `PadLeft` is just because I'm overly OCD about output to the Console.

So here's a function that'll cut you down to a string that's `n` bytes long or the greatest number less than `n` that ends with a "complete" UTF8 character.

I initially wrote this as a string extension. Just add back the `this` before `string str` to put it back into extension format, of course. I removed the `this` so that we could just slap the method into `Program.cs` in a simple console app to demonstrate.
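A minimal sketch of such a function, following the approach above (the name `CutToUTF8Length` and the exact structure are illustrative rather than the original code; the `Console.WriteLine` with the `PadLeft` is just the debug output mentioned in the notes):

```csharp
using System;
using System.Text;

public class Program
{
    // Cut str down to at most byteLength UTF8 bytes without splitting a character.
    public static string CutToUTF8Length(string str, int byteLength)
    {
        byte[] byteArray = Encoding.UTF8.GetBytes(str);

        if (byteArray.Length <= byteLength)
        {
            return str;
        }

        int bytePointer = byteLength;

        // Debug output: the byte just after the cut, padded to eight binary digits.
        Console.WriteLine("Byte at cut: "
            + Convert.ToString(byteArray[bytePointer], 2).PadLeft(8, '0'));

        // If that byte starts with 10, it's a continuation byte, so we're in the
        // middle of a multi-byte character...
        if ((byteArray[bytePointer] & Convert.ToByte("11000000", 2))
            == Convert.ToByte("10000000", 2))
        {
            // ...back up to the byte that starts with 11 (the character's lead
            // byte) and cut just before it.
            while (bytePointer > 0
                && (byteArray[bytePointer] & Convert.ToByte("11000000", 2))
                   != Convert.ToByte("11000000", 2))
            {
                bytePointer--;
            }
        }

        return Encoding.UTF8.GetString(byteArray, 0, bytePointer);
    }
}
```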
Test and expected output

Here's a good test case, with the output it creates below, written expecting to be the `Main` method in a simple console app's `Program.cs`.
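A sketch of what that `Main` could look like, sitting in the same Program class as `CutToUTF8Length` above. The test string is illustrative (five single-byte characters, two three-byte smart quotes, then three more single-byte characters, matching the description below); exact console rendering will depend on your code page.

```csharp
public static void Main(string[] args)
{
    // Illustrative testValue: "abcde" (5 single-byte chars), two 3-byte smart
    // quotes, then "fgh" (3 single-byte chars): 14 UTF8 bytes in total.
    string testValue = "abcde“”fgh";

    for (int i = 0; i <= Encoding.UTF8.GetByteCount(testValue); i++)
    {
        string cut = CutToUTF8Length(testValue, i);

        // PadLeft just keeps the index column tidy in the Console output.
        Console.WriteLine(i.ToString().PadLeft(2) + ": " + cut
            + "  (" + Encoding.UTF8.GetByteCount(cut) + " bytes)");
    }

    Console.ReadLine();
}
```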
Output follows. Notice that the "smart quotes" in `testValue` are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the `?`s output for the second and third bytes of each smart quote in the output.

The first five characters of our `testValue` are single bytes in UTF8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for `8`.

Our next smart quote pops out at 8 + 3 = 11, and then we're back to single byte characters through 14.

So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
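The code for this answer isn't reproduced above; a sketch of the shorter approach might look like this (the name `LimitUtf8ByteCount` is illustrative, and the hex masks `0xC0`/`0x80` play the same role as the `"11000000"`/`"10000000"` masks above):

```csharp
using System.Text;

public static class StringUtf8Extensions
{
    // Cut s so its UTF8 encoding is at most n bytes, without splitting a character.
    public static string LimitUtf8ByteCount(this string s, int n)
    {
        // Cheap check first: most strings won't need trimming at all.
        if (Encoding.UTF8.GetByteCount(s) <= n)
            return s;

        byte[] a = Encoding.UTF8.GetBytes(s);

        // If the byte at the cut is a continuation byte (10xxxxxx), back up
        // past the continuation bytes to the character's lead byte (11xxxxxx)
        // and cut just before it.
        if (n > 0 && (a[n] & 0xC0) == 0x80)
        {
            while (--n > 0 && (a[n] & 0xC0) == 0x80)
            {
            }
        }

        return Encoding.UTF8.GetString(a, 0, n);
    }
}
```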
This is another solution based on binary search:
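Again, the code itself isn't included above; the idea is to binary-search over how many characters to keep, since the UTF8 byte count of a prefix only grows as the prefix gets longer. A sketch (names are illustrative):

```csharp
using System.Text;

public static class Utf8Truncate
{
    // Returns the longest prefix of input whose UTF8 encoding fits in maxBytes.
    public static string LimitByteLength(string input, int maxBytes)
    {
        if (Encoding.UTF8.GetByteCount(input) <= maxBytes)
            return input;

        // Binary search on the number of characters to keep.
        int lo = 0, hi = input.Length;
        while (lo < hi)
        {
            int mid = (lo + hi + 1) / 2;
            if (Encoding.UTF8.GetByteCount(input.Substring(0, mid)) <= maxBytes)
                lo = mid;       // a prefix of mid chars still fits; try longer
            else
                hi = mid - 1;   // too long; try shorter
        }

        // Note: this counts chars, so a surrogate pair could in principle be
        // split; the byte-walking approaches above don't have that issue.
        return input.Substring(0, lo);
    }
}
```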
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
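The DDL being referred to isn't shown above, but it would be something along these lines (table and column names are illustrative):

```sql
CREATE TABLE foo (
    col1 VARCHAR2(10 BYTE),   -- byte length semantics: at most 10 bytes
    col2 VARCHAR2(10 CHAR)    -- character length semantics: at most 10 characters
);
```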
This will create a table where COL1 will store 10 bytes of data and col2 will store 10 characters worth of data. Character length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter `NLS_LENGTH_SEMANTICS` to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
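For example, at the session level (a sketch; the parameter can also be set system-wide rather than per session):

```sql
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
```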
Here are two possible solutions - a LINQ one-liner processing the input left to right and a traditional `for`-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters, so it is hard to give a general suggestion. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).

If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length instead of calculating the byte length of the whole string in each iteration. But I am not sure if this will work because I don't know UTF-8 encoding well enough. I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all characters.
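The code for these two variants isn't shown above; minimal sketches of what they could look like (method names are illustrative) follow. Both recompute the byte count of the whole prefix on every step, which is exactly the inefficiency the last paragraph mentions.

```csharp
using System.Linq;
using System.Text;

public static class ByteLengthTrimmer
{
    // LINQ version, left to right: keep characters while the UTF8 byte count
    // of the prefix built so far still fits within maxBytes.
    public static string LimitByteLengthLinq(string input, int maxBytes)
    {
        return new string(input
            .TakeWhile((c, i) =>
                Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxBytes)
            .ToArray());
    }

    // Loop version, right to left: drop characters from the end until the
    // remaining prefix fits within maxBytes.
    public static string LimitByteLengthLoop(string input, int maxBytes)
    {
        for (int length = input.Length; length > 0; length--)
        {
            if (Encoding.UTF8.GetByteCount(input.Substring(0, length)) <= maxBytes)
                return input.Substring(0, length);
        }
        return string.Empty;
    }
}
```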