I've copied certain files from a Windows machine to a Linux machine. So all the Windows encoded (windows-1252) files need to be converted to UTF-8. The files which are already in UTF-8 should not be changed. I'm planning to use the recode
utility for that. How can I specify that the recode
utility should only convert windows-1252 encoded files and not the UTF-8 files?
Example usage of recode:
recode windows-1252.. myfile.txt
This would convert myfile.txt
from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt
is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.
you can use iconv:
iconv -f WINDOWS-1252 -t UTF-8 filename.txt
Found this documentation for the TYPE command:
Convert an ASCII (Windows1252) file into a Unicode (UCS-2 le) text file:
The technique above (based on a script by Carlos M.) first creates a file with a Byte Order Mark (BOM) and then appends the content of the original file. CHCP is used to ensure the session is running with the Windows1252 code page so that the characters 0xFF and 0xFE (ÿþ) are interpreted correctly.
UTF-8 does not have a BOM as it is both superfluous and invalid. Where a BOM is helpful is in UTF-16 which may be byte swapped as in the case of Microsoft. UTF-16 if for internal representation in a memory buffer. Use UTF-8 for interchange. By default both UTF-8, anything else derived from US-ASCII and UTF-16 are natural/network byte order. The Microsoft UTF-16 requires a BOM as it is byte swapped.
To covert Windows-1252 to ISO8859-15, I first convert ISO8859-1 to US-ASCII for codes with similar glyphs. I then convert Windows-1252 up to ISO8859-15, other non-ISO8859-15 glyphs to multiple US-ASCII characters.
Use the iconv command.
To make sure the file is in Windows-1252, open it in Notepad (under Windows), then click Save As. Notepad suggests current encoding as the default; if it's Windows-1252 (or any 1-byte codepage, for that matter), it would say "ANSI".
There's no general way to tell if a file is encoded with a specific encoding. Remember that an encoding is nothing more but an "agreement" how the bits in a file should be mapped to characters.
If you don't know which of your files are actually already encoded in UTF-8 and which ones are encoded in windows-1252, you will have to inspect all files and find out yourself. In the worst case that could mean that you have to open every single one of them with either of the two encodings and see whether they "look" correct -- i.e., all characters are displayed correctly. Of course, you may use tool support in order to do that, for instance, if you know for sure that certain characters are contained in the files that have a different mapping in windows-1252 vs. UTF-8, you could grep for them after running the files through 'iconv' as mentioned by Seva Akekseyev.
Another lucky case for you would be, if you know that the files actually contain only characters that are encoded identically in both UTF-8 and windows-1252. In that case, of course, you're done already.
How would you expect recode to know that a file is Windows-1252? In theory, I believe any file is a valid Windows-1252 file, as it maps every possible byte to a character.
Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.
One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.
I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.
Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.
Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.