I have a PHP file that I created with Vim, but I'm not sure what its encoding is.
When I check the encoding in the terminal with the command file -bi foo
(my operating system is Ubuntu 11.04), I get the following result:
text/html; charset=us-ascii
But when I open the file with gedit, it says its encoding is UTF-8.
Which one is correct? I want the file to be encoded in UTF-8.
My guess is that there's no BOM in the file and that the command file -bi
reads the file and doesn't find any non-ASCII characters, so it assumes that it's ASCII, but in reality it's encoded in UTF-8.
Based on @Celada's answer and @Arthur Zennig's, I have created this simple script:
Well, first of all, note that ASCII is a subset of UTF-8, so if your file contains only ASCII characters, it's equally correct to say that it's encoded in ASCII or that it's encoded in UTF-8.
That being said, file typically only examines a short segment at the beginning of the file to determine its type, so it might report us-ascii even if the file contains non-ASCII characters, as long as they lie beyond that initial segment. On the other hand, gedit might say that the file is UTF-8 even if it's ASCII, because UTF-8 is gedit's preferred character encoding and it intends to save the file as UTF-8 if you add any non-ASCII characters during your edit session. Again, if that's what gedit is saying, it wouldn't be wrong.
Now to your question:
Run this command:
If the output says "0", then the file contains only ASCII characters. It's in ASCII (and it's also valid UTF-8). End of story.
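One command that produces such a count is sketched below, assuming a POSIX shell with tr and wc available. The function name count_non_ascii_bytes and the sample file are hypothetical, introduced only for illustration.

```shell
# Hypothetical helper: count the bytes in a file that fall outside the ASCII range.
count_non_ascii_bytes() {
    # tr -d deletes every byte from octal 000 to 177 (the whole ASCII range);
    # wc -c then counts whatever bytes are left over.
    # LC_ALL=C makes tr treat its input as raw bytes, not multibyte characters.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c
}

# Sample file (an assumption for the demo): a PHP snippet with only ASCII bytes.
sample=$(mktemp)
printf '<?php echo "hello"; ?>\n' > "$sample"
count_non_ascii_bytes "$sample"
rm -f "$sample"
```

A count of 0 means every byte is in the ASCII range; anything larger means the file contains at least one byte outside it.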
Run this command:
If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).
If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8.
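A check along these lines can be sketched with iconv, assuming it is installed and supports UCS-4 as a target encoding (the glibc and GNU libiconv versions do). The function name is_valid_utf8 and the sample files are hypothetical.

```shell
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    # iconv exits with a non-zero status when it meets a byte sequence that
    # is not valid in the source encoding (-f UTF-8). The converted output
    # and any error message are discarded; only the exit status is used.
    iconv -f UTF-8 -t UCS-4 < "$1" > /dev/null 2>&1
}

# Sample files (assumptions for the demo).
good=$(mktemp)
bad=$(mktemp)
printf 'caf\303\251\n' > "$good"   # "é" as the UTF-8 byte pair C3 A9
printf 'caf\351\n' > "$bad"        # "é" as the single Latin-1 byte E9

for f in "$good" "$bad"; do
    if is_valid_utf8 "$f"; then
        echo "valid UTF-8"
    else
        echo "not valid UTF-8"
    fi
done
rm -f "$good" "$bad"
```

The Latin-1 sample fails because the byte E9 announces a three-byte UTF-8 sequence, and the newline that follows is not a valid continuation byte. This is the kind of mismatch that makes text in other encodings unlikely to pass as UTF-8.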
(on Linux)
It also reports a confidence level (between 0 and 1) for its result.