I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.
The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user actually submits the form), or it could be from an uploaded text file, so I really have no control over the input.
What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text);
but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/
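To show what I mean with the 'fiancée' example, here's roughly what seems to go wrong (the behaviour varies a bit across PHP versions, and the explicit candidate list at the end is only one workaround I've seen suggested, not a full solution):

    // 'fiancée' as ISO-8859-1 bytes; é is the single byte 0xE9.
    $text = "fianc\xE9e";

    // mb_detect_encoding() with its default candidate order can report
    // this as UTF-8, even though a lone 0xE9 is not valid UTF-8.
    $from = mb_detect_encoding($text);

    // iconv() gives up at the first byte that is invalid in the claimed
    // source charset, which is how 'fiancée' becomes 'fianc'.
    echo iconv($from, "UTF-8", $text);

    // An explicit candidate list plus strict mode at least guesses sanely:
    $from = mb_detect_encoding($text, 'UTF-8, ISO-8859-1', true);
    echo iconv($from, "UTF-8", $text); // fiancée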
For file uploads, I like the idea of asking the end user to specify the encoding they used and showing them a preview of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).
I've read the other SO questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").
But there must be something that at least has a good try!
I don't think it's a problem. An application knows the source of its input. If it's from a form, use UTF-8 encoding, as in your case. That works. Just verify that the data provided is correctly encoded (validation). Keep in mind that not all databases support UTF-8 in its full range.
If it's a file, don't save it UTF-8 encoded in the database; store it in binary form instead. When you output the file again, use binary output as well; then the whole round trip is completely transparent.
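A quick sketch of what I mean, assuming PDO and a hypothetical files(name, data BLOB) table; the point is that no charset conversion ever touches the bytes:

    <?php
    // Hypothetical connection and table names.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    // Store the upload exactly as received.
    $stmt = $pdo->prepare('INSERT INTO files (name, data) VALUES (?, ?)');
    $stmt->bindValue(1, $_FILES['upload']['name']);
    $stmt->bindValue(2, file_get_contents($_FILES['upload']['tmp_name']), PDO::PARAM_LOB);
    $stmt->execute();

    // Later, send the very same bytes back.
    $stmt = $pdo->prepare('SELECT data FROM files WHERE name = ?');
    $stmt->execute(array($_GET['name']));
    header('Content-Type: application/octet-stream');
    echo $stmt->fetchColumn();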
Your idea of letting the user specify the encoding is nice, but he/she can tell the encoding anyway after downloading the file, as it's binary.
So I must admit I don't see a specific issue raised by your question. But maybe you can add some more details about what your problem is.
You could set up a set of metrics to try to guess which encoding is being used. Again, this is not perfect, but it could catch some of the misses from mb_detect_encoding().
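For instance (my own rough sketch, with a guessed fallback list): text that already validates as UTF-8 is almost certainly UTF-8, so check that first and only guess when validation fails:

    function to_utf8($text)
    {
        // Valid UTF-8 passes through untouched.
        if (mb_check_encoding($text, 'UTF-8')) {
            return $text;
        }
        // Otherwise guess from a short ordered list (tune it to what
        // your users actually send) and drop anything unconvertible.
        $from = mb_detect_encoding($text, 'ISO-8859-1, Windows-1251', true);
        return iconv($from ? $from : 'ISO-8859-1', 'UTF-8//IGNORE', $text);
    }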
You've probably tried this too, but why not just use the mb_convert_encoding function? It will attempt to auto-detect the character set of the text provided, or you can pass it a list.
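Something like this (the candidate list is just an example; order it by what you expect most often):

    // mbstring picks the first encoding in the list that the text is
    // valid in, then converts from that to UTF-8.
    $utf8 = mb_convert_encoding($text, 'UTF-8', 'UTF-8, ISO-8859-1, Windows-1251');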
Also, I tried to run:
and the results are the same for both. How do you see that your text is truncated to 'fianc'? Is it in the DB or in a browser?
In motherland Russia we have 4 popular encodings, so your question is in great demand here.
You cannot detect the encoding by the char codes of symbols alone, because code pages intersect; some code pages in different languages even intersect completely. So we need another approach.
The only way to work with unknown encodings is to work with probabilities. So we do not want to answer the question "what is the encoding of this text?"; we are trying to answer "what is the most likely encoding of this text?".
One guy on a popular Russian tech blog came up with this approach:
Build the probability range of char codes in every encoding you want to support. You can build it from some big texts in your language (e.g. some fiction; use Shakespeare for English and Tolstoy for Russian, lol). You will get a per-char-code probability table for each encoding.
Next, take the text in the unknown encoding and, for every encoding in your "probability dictionary", look up the frequency of every symbol of that unknown-encoded text and sum the probabilities of those symbols. The encoding with the bigger total rating is likely the winner. You get better results for bigger texts.
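A minimal sketch of that scoring, assuming you have saved one corpus file per candidate encoding (the file names are made up):

    <?php
    // Training: per-byte probabilities from a big corpus saved in the
    // candidate encoding.
    function build_profile($corpusFile)
    {
        $bytes  = file_get_contents($corpusFile);
        $total  = strlen($bytes);
        $counts = array_fill(0, 256, 0);
        for ($i = 0; $i < $total; $i++) {
            $counts[ord($bytes[$i])]++;
        }
        foreach ($counts as $byte => $count) {
            $counts[$byte] = $count / $total;
        }
        return $counts;
    }

    $profiles = array(
        'Windows-1251' => build_profile('tolstoy_cp1251.txt'),
        'KOI8-R'       => build_profile('tolstoy_koi8r.txt'),
        'ISO-8859-5'   => build_profile('tolstoy_iso8859_5.txt'),
    );

    // Scoring: sum each byte's corpus probability under every profile;
    // the profile with the highest total is the most likely encoding.
    function guess_encoding($text, $profiles)
    {
        $bestEncoding = null;
        $bestScore    = -1.0;
        foreach ($profiles as $encoding => $profile) {
            $score = 0.0;
            for ($i = 0, $n = strlen($text); $i < $n; $i++) {
                $score += $profile[ord($text[$i])];
            }
            if ($score > $bestScore) {
                $bestScore    = $score;
                $bestEncoding = $encoding;
            }
        }
        return $bestEncoding;
    }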
If you are interested, I can gladly help you with this task. We can greatly increase the accuracy by building a two-charcode (bigram) probability list.
Btw, mb_detect_encoding certainly does not work. Yes, at all. Please take a look at the mb_detect_encoding source code in "ext/mbstring/libmbfl/mbfl/mbfl_ident.c".