I have an Oracle db which stores the content of documents originating from all over the world, with different languages. The documents are stored in a table with a BLOB column which stores the documents' content.
I want to find out what the char set is for every doc, with an Oracle procedure. I don't want to use the utility CSSCAN since it seems you have to use it in a separate session, outside of your procedure.
Thanks for your help!
Oracle Globalization Development Kit can detect character sets.
The GDK is included with Oracle but it is not installed in the database by default. To load the .jar files into the database find the jlib directory in the Oracle home and run this operating system command:
Some extra Java privileges are needed, even if your user has DBA. Run this command and then re-connect:
Create a Java class to do the detection. Below is a very simple example that returns the best guess for a string:
Wrap the Java class in a PL/SQL function:
I simulated different character sets by translating a string into different languages, saving the text as different encodings with a text editor, opening the file with hex editor, and converting the hex into a BLOB:
That trivial example works well but I don't know how well it will work with real-world files. There are a lot of features in the GDK, the above code is only a simple starting point. With only minor changes the code can also detect languages as demonstrated in my answer here.