I am running Oracle Database 11g on RHEL 6. When no client NLS_LANG is set, the length of some UTF-8 characters is 2. After NLS_LANG=AMERICAN_AMERICA.UTF8 is set, the length is only 1.
How can the default NLS_LANG be changed for the whole database? I don't want to change this value only per session, or as an environment variable in Linux. Is there any way to set it globally for the database?
SQL> select length('á') from dual;
LENGTH('??')
------------
2
# export NLS_LANG=AMERICAN_AMERICA.UTF8
SQL> select length('á') from dual;
LENGTH('á')
-----------
1
Many thanks for any ideas.
This is what probably happens:
- Your client NLS_LANG claims a character set like CP1252 or ISO-8859-15, whereas your client really uses UTF-8.
- In UTF-8 the symbol á takes two bytes, so your client sends these two bytes while telling Oracle to treat them as CP1252. In CP1252 those two bytes encode two characters, so the database interprets the input as two characters; therefore length('á') equals 2 (and if you insert this string, what gets stored is not á).
- When you set the client character set correctly, Oracle treats the input as a single character, and its length is 1 (still two bytes).
Conclusion: set your client character set properly or you will get translation errors. You won't get illegal characters this way, but you may get strange symbols such as ¿.
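A quick sanity check (a sketch; AMERICAN_AMERICA and AL32UTF8 are assumed values here, since RHEL 6 terminals usually run in UTF-8) is to compare what the terminal actually emits with what NLS_LANG declares before connecting:
$ locale charmap                              # encoding the terminal really uses
UTF-8
$ export NLS_LANG=AMERICAN_AMERICA.AL32UTF8   # declare that same encoding to Oracle
$ sqlplus user/password
SQL> select length('á') from dual;            -- should count 1 character again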
The database character set is set at creation time and is generally changed by exporting the data, creating a blank database with the new character set, and importing.
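To see what your database was created with, you can query the standard NLS_DATABASE_PARAMETERS data dictionary view (in this thread it should report AL32UTF8):
SQL> select value from nls_database_parameters where parameter = 'NLS_CHARACTERSET';
VALUE
----------------------------------------
AL32UTF8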
Since the database character set is AL32UTF8, when you set the client NLS_LANG to UTF8 you are telling Oracle to bypass the character set conversion that normally takes place when data is transmitted between the client and the server. That is extremely dangerous, because if the client sends data in any other encoding, there is a substantial risk that invalid data will be stored in the database.
In this case, the session whose LENGTH call returns 1 is the one sending incorrectly encoded data to the database. Most likely, the client operating system represents data using something like the ISO-8859-1 character set, in which á is a single-byte character (binary value 0xE1). When the client sends the data, it sends the 0xE1 and tells the database "trust me, this is valid UTF-8 data". The database doesn't bother to check the binary data, so it never notices that 0xE1 is not a valid UTF-8 representation of any character. But if some other, correctly configured client later asks for character set conversion and the database has that 0xE1 stored in a column, the conversion code will run, identify that 0xE1 is not a valid UTF-8 character, and return a replacement character (generally '?') rather than á to the correctly configured client.
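If you suspect that corrupt bytes have already been stored, Oracle's DUMP function shows the raw encoding (format 1016 prints hex byte values plus the character set name; the table and column names below are placeholders):
SQL> select dump(some_column, 1016) from some_table;
-- a correctly encoded á shows:  Typ=1 Len=2 CharacterSet=AL32UTF8: c3,a1
-- the bypassed 0xE1 byte shows: Typ=1 Len=1 CharacterSet=AL32UTF8: e1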
Since the database character set is UTF-8, you would expect a properly encoded 'á' character to occupy 2 bytes. UTF-8 is a variable-width character set: characters that are part of the US7ASCII character set are represented with 1 byte, most Western European characters such as 'á' are represented using 2 bytes, and most Asian characters are represented using 3 bytes. A few rare characters require 4 bytes.
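You can observe the variable widths directly with LENGTHB, which counts bytes where LENGTH counts characters (the euro sign is used here simply as a convenient 3-byte example):
SQL> select lengthb('a'), lengthb('á'), lengthb('€') from dual;
-- expected in an AL32UTF8 database: 1, 2, 3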
Your NLS_LANG needs to be set to match the character set that your client system supports. It is rare that you'd have a client system with native UTF-8 support. Since your client setting will override anything set at the database level, each client needs to be configured to set an appropriate environment variable. There is a section in the NLS_LANG FAQ on how to determine the correct NLS_LANG setting for a Unix client.
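There is no database-side parameter that pushes an NLS_LANG out to clients, so if you administer the client machines, one common approach (an assumption about your setup, not the only option) is to export the variable from a system-wide login script so every new shell, and hence every sqlplus session, picks it up:
# /etc/profile.d/oracle_nls.sh   (hypothetical path; any login script works)
export NLS_LANG=AMERICAN_AMERICA.AL32UTF8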