Best way to shorten UTF8 string based on byte leng

2019-01-25 07:31发布

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.

I ran into a problem where I'd receive this error message when inserting a particular field:

ORA-12899 Value too large for column X

I used Field.Substring(0, MaxLength); but still got the error (though not for every record).

Finally I saw what should have been obvious, my string was in ANSI and the field was UTF8. Its length is defined in bytes, not characters.

This gets me to my question. What is the best way to trim my string to fix the MaxLength?

My substring code works by character length. Is there simple C# function that can trim a UT8 string intelligently by byte length (ie not hack off half a character) ?

8条回答
何必那么认真
2楼-- · 2019-01-25 07:58

Following Oren Trutner's comment here are two more solutions to the problem:
here we count the number of bytes to remove from the end of the string according to each character at the end of the string, so we don't evaluate the entire string in every iteration.

string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣" 
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length -1;
while(bytesArr.Length - bytesToRemove > maxBytesLength)
{
   bytesToRemove += Encoding.UTF8.GetByteCount(new char[] {str[lastIndexInString]} );
   --lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr,0,bytesArr.Length - bytesToRemove);
//Encoding.UTF8.GetByteCount(trimmedString);//get the actual length, will be <= 朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣潬昣昸昸慢正 

And an even more efficient(and maintainable) solution: get the string from the bytes array according to desired length and cut the last character because it might be corrupted

string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣" 
int maxBytesLength = 30;    
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str),0,maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0,trimmedWithDirtyLastChar.Length - 1);

The only downside with the second solution is that we might cut a perfectly fine last character, but we are already cutting the string, so it might fit with the requirements.
Thanks to Shhade who thought about the second solution

查看更多
虎瘦雄心在
3楼-- · 2019-01-25 08:01

If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.

Check out the Description section of the wikipedia article for more detail.

查看更多
登录 后发表回答