Internally, our PHP application uses UTF-8, and we process .csv files and fixed-width text files. We have written some nice libraries (classes, essentially) to work with these files.
We recently added the ability for administrators to upload files of these types so they could be processed, and quickly ran into issues across multiple operating systems. We soon realised that the files being read in used different encodings from our application (e.g. Windows-1252 or ISO-8859).
Since it is impossible to control the encoding of the files submitted to us, my question is: what is the best way to handle uploaded text files of different encodings? I can think of two solutions currently:
- When a file is received, detect its encoding and convert it to UTF-8, then re-save it. The rest of the system then only needs to be UTF-8 aware and can ignore 'encoding' issues.
- Change the CSV / fixed-width libraries so they become encoding-aware themselves.
I also thought about the pros and cons of these:
- Converting the input makes the rest of the libraries smaller and reduces duplication; however, it seems wasteful in terms of processing.
- Making the libraries encoding-aware internally seems to involve more code, but might be faster.
Thoughts please?
Edit: I am really interested to know where, architecturally, character encoding conversion should happen: at the point of input, or during the use of the files?
This is tricky, and there is no perfect solution.
phpMyAdmin, for example, lets the user specify the encoding of the uploaded file. Since none of the automatic detection methods is 100% reliable, this is the best way to go IMO, if at all possible.
An import dialog that allows the user to select the right encoding while seeing a preview of what their data looks like in that encoding might be optimal.
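A minimal sketch of the preview step, assuming the upload has already been saved to a temporary path. The function name `previewAsUtf8()`, the list of allowed encodings, and the line limit are hypothetical choices for illustration; only `iconv()` itself is the call suggested here.

```php
<?php
// Hypothetical helper: converts the first few lines of an uploaded file
// from a user-selected encoding to UTF-8 so they can be shown as a preview.
// Returns an array of lines, or false if the choice cannot be right.
function previewAsUtf8(string $path, string $sourceEncoding, int $maxLines = 10)
{
    // Assumed whitelist: only encodings offered in the drop-down are accepted.
    $allowed = ['UTF-8', 'WINDOWS-1252', 'ISO-8859-1', 'ISO-8859-15'];
    if (!in_array($sourceEncoding, $allowed, true)) {
        return false;
    }

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    $lines = [];
    while (count($lines) < $maxLines && ($line = fgets($handle)) !== false) {
        // @ hides the notice iconv() raises on an invalid byte sequence;
        // the false return value is checked instead.
        $converted = @iconv($sourceEncoding, 'UTF-8', $line);
        if ($converted === false) {
            fclose($handle);
            return false;
        }
        $lines[] = rtrim($converted, "\r\n");
    }
    fclose($handle);

    return $lines;
}
```

Reading line by line like this only works for ASCII-compatible encodings; a UTF-16 upload would need to be converted as a whole instead.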
A way to do this could be:
- Receive the uploaded file and store it in a temporary file.
- Display a dialog with a drop-down selection of the most important encodings.
- Have an iframe that, when the selected value in the drop-down changes, converts the contents of the uploaded file using iconv() (source = the selected encoding; target = UTF-8) and shows a preview.
- When the user selects an encoding, do a final iconv() and store the file as UTF-8.
Automatic encoding detection for CSV can be difficult, based on my own experience. It's reliable only for a small subset of encodings (such as the UTF family and a few others). In that regard, Pekka's suggestions aim in the right direction, by placing the burden of identifying the correct encoding on the end user.
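To illustrate why detection only works well for the UTF family, here is a rough sketch that recognises byte order marks and well-formed UTF-8 and gives up on everything else. The function name `detectUnicodeEncoding()` is hypothetical, and returning null is an assumed convention meaning "ask the user".

```php
<?php
// Hypothetical helper: returns an encoding name only when the evidence is
// strong, otherwise null so the caller can fall back to asking the user.
function detectUnicodeEncoding(string $bytes): ?string
{
    // Byte order marks. UTF-32 is checked first because its little-endian
    // BOM starts with the same two bytes as the UTF-16LE BOM.
    $boms = [
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFF\xFE"         => 'UTF-16LE',
        "\xFE\xFF"         => 'UTF-16BE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }

    // No BOM: the /u modifier makes preg_match() fail on malformed UTF-8,
    // so a match means the bytes are valid UTF-8 (plain ASCII included).
    if (preg_match('//u', $bytes) === 1) {
        return 'UTF-8';
    }

    // Single-byte encodings such as Windows-1252 and ISO-8859-1 cannot be
    // told apart reliably from the bytes alone.
    return null;
}
```

A null result is the point where the encoding drop-down becomes necessary.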
Keeping UTF-8 as the internal format is a good idea, but I suggest keeping the charset issues separate from CSV processing, since the format itself has no rules about encoding. While it's true that decoding on the fly is somewhat more efficient, the increase in code complexity might not justify the gain. Keeping the software components specialized is always a good idea.
Character transformations should happen inside the server-side controller, before handing control over to the CSV processor, provided the system adheres to MVC.
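As a sketch of that separation, assuming the controller already knows the source encoding: the conversion runs once at the boundary, and the parser only ever reads UTF-8. Both function names are hypothetical, and `readCsvRows()` is merely a stand-in for the existing CSV library.

```php
<?php
// Hypothetical controller step: convert the upload once, at the boundary.
// Reads the whole file into memory, which assumes uploads of moderate size.
function convertUploadToUtf8(string $sourcePath, string $sourceEncoding, string $targetPath): bool
{
    $raw = file_get_contents($sourcePath);
    if ($raw === false) {
        return false;
    }

    // @ hides the notice iconv() raises on bytes that are invalid in the
    // source encoding; the false return value is checked instead.
    $utf8 = @iconv($sourceEncoding, 'UTF-8', $raw);
    if ($utf8 === false) {
        return false;
    }

    return file_put_contents($targetPath, $utf8) !== false;
}

// Stand-in for the existing CSV library: it has no notion of encodings
// and simply assumes its input is UTF-8.
function readCsvRows(string $utf8Path): array
{
    $rows = [];
    $handle = fopen($utf8Path, 'rb');
    if ($handle === false) {
        return $rows;
    }
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);

    return $rows;
}
```

With this split, supporting another encoding touches only the controller step, and the CSV and fixed-width classes stay unchanged.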