Character set encodings and storage size factors

Posted 2019-06-03 16:57

For Oracle, is there a matrix or table of storage allocations for the various character encodings?

For example:

size_of_col_in(AL32UTF8) == 4 * size_of_col_in(WE8ISO8859P1)

I am interested in that factor of 4.

1 Answer

甜甜的少女心 · 2019-06-03 17:30

Unless you are only interested in the maximum change, your example is not correct.

AL32UTF8 is a variable-length character set. Speaking in general terms, any character that is in the US7ASCII character set will take 1 byte, European characters generally require 2 bytes, various Asian language characters require 3 bytes, and a handful of very rare characters will require 4 bytes. Realistically, if you're talking about converting actual WE8ISO8859P1 data to AL32UTF8, you'll generally see in practice a conversion factor between 1 and 2 that is much closer to 1 than to 2. Without looking up the Unicode mapping for every valid WE8ISO8859P1 character, I'd be surprised if any required 3 or 4 bytes of storage in the AL32UTF8 character set.
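If you want to see those per-character byte counts for yourself, a quick sketch (assuming the database character set is AL32UTF8; the sample characters are just for illustration) is to compare LENGTH, which counts characters, against LENGTHB, which counts bytes:

-- LENGTH counts characters; LENGTHB counts the bytes actually stored
SELECT LENGTH('a')  AS ascii_chars,  LENGTHB('a')  AS ascii_bytes,   -- 1 character, 1 byte
       LENGTH('é')  AS latin_chars,  LENGTHB('é')  AS latin_bytes,   -- 1 character, 2 bytes
       LENGTH('漢') AS asian_chars,  LENGTHB('漢') AS asian_bytes    -- 1 character, 3 bytes
  FROM dual;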

In the Globalization Support Guide, there is a section on character sets that tells you which character sets are single-byte, which are multi-byte, and which of the multi-byte character sets are fixed width. Almost all the multi-byte character sets are variable width so the factor you are looking for will depend on your data.
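To check which character sets your own database uses, you can query the standard NLS views:

-- Database and national character sets for the current database
SELECT parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');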

In most cases, you're better off declaring your columns to use character length semantics rather than byte length semantics anyway, letting the database figure out behind the scenes how much space to allocate. For example, if you declare a column

CREATE TABLE foo (
  col1 VARCHAR2(10 CHAR)
);

Oracle will allocate space for 10 characters of storage regardless of the database character set and regardless of the actual number of bytes required to store that data (subject to the limit of 4000 bytes per VARCHAR2 column). That generally makes defining column sizes much easier because you don't have to oversize the columns in case someone decides to throw 10 4-byte UTF-8 characters into one row and you don't have to explain to users that the column will accept strings of different numbers of characters depending on the language and/or the specific characters chosen.
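A small sketch of the difference, assuming an AL32UTF8 database (the table name and data are made up for illustration):

-- Hypothetical table comparing byte and character length semantics
CREATE TABLE semantics_demo (
  byte_col VARCHAR2(10 BYTE),
  char_col VARCHAR2(10 CHAR)
);

-- Ten 3-byte characters (30 bytes in AL32UTF8): the CHAR column accepts them
INSERT INTO semantics_demo (char_col) VALUES ('漢漢漢漢漢漢漢漢漢漢');

-- The BYTE column rejects the same string with ORA-12899: value too large for column
INSERT INTO semantics_demo (byte_col) VALUES ('漢漢漢漢漢漢漢漢漢漢');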

You can also set the NLS_LENGTH_SEMANTICS initialization parameter so that VARCHAR2(10) uses character length semantics rather than byte length semantics by default (you can still specify VARCHAR2(10 BYTE) if you want byte-length semantics). The Oracle folks who deal with globalization on a regular basis discourage this, though, preferring that you explicitly specify the length semantics when you declare your columns, or at least only set the parameter at the session level.
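For example, setting it at the session level only might look like this (the table name is just for illustration):

-- Character semantics as the default for this session only
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;

CREATE TABLE session_demo (
  col1 VARCHAR2(10),       -- character semantics via the session default
  col2 VARCHAR2(10 BYTE)   -- explicit byte semantics still available
);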
