What is the best way to handle uploaded text files?

Internally our PHP application uses UTF-8, and we process .csv files and fixed-width (text) files. We have written some nice libraries (classes, essentially) to work with these files.

We recently added the ability for administrators to upload files of these types so they could be processed, and quickly ran into issues across multiple operating systems. We soon realised that the files being read in used different encodings from our application (e.g. Windows-1252 or ISO-8859).

Since it is impossible to control the encoding of the files submitted to us, my question is: what is the best way to handle uploaded text files of different encodings? I can think of two solutions currently:

1. When a file is received, detect its encoding, convert it to UTF-8, then re-save it. The rest of the system then only needs to be UTF-8 aware and can ignore encoding issues.
2. Change the CSV / fixed-width libraries so that they become encoding-aware themselves.
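
The first option (convert to UTF-8 on receipt) might look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `normalizeToUtf8` and the Windows-1252 fallback are hypothetical, and the code does not really detect anything. It checks whether the bytes are already valid UTF-8 and otherwise trusts the fallback, which is a guess.

```php
<?php
// Hypothetical helper: the name and the default fallback encoding are assumptions.
function normalizeToUtf8(string $bytes, string $fallbackEncoding = 'WINDOWS-1252'): string
{
    // Drop a UTF-8 byte order mark if one is present.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        $bytes = substr($bytes, 3);
    }

    // With the /u modifier, PCRE refuses subjects that are not valid UTF-8,
    // so a successful match means the input is already UTF-8 (or plain ASCII).
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }

    // Not valid UTF-8: assume the fallback encoding. This is a guess, not detection.
    $converted = iconv($fallbackEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from {$fallbackEncoding}");
    }
    return $converted;
}
```

The converted string can then be written back to disk, so everything downstream only ever reads UTF-8.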

I also thought about the pros and cons of each:

1. Converting the input keeps the rest of the libraries smaller and avoids duplication, but the extra conversion pass seems wasteful in terms of processing.
2. Making the libraries encoding-aware seems to involve more code, but might be faster.

Thoughts please?

Edit: I am really interested to know where, architecturally, character encoding conversion should happen: at the point of input, or during the use of the files?

Tags: php text encoding utf-8 character-encoding

2 Answers

孤傲高冷的网名

#2 · 2019-07-19 18:51

This is tricky, and there is no perfect solution.

phpMyAdmin, for example, lets the user specify the encoding of the uploaded file. Since none of the automatic detection methods is 100% reliable, this is the best way to go IMO, if at all possible.

An import dialog that allows the user to select the right encoding while seeing a preview of what their data looks like in that encoding might be optimal.

A way to do this could be:

1. Receive the uploaded file and store it in a temporary file.
2. Display a dialog with a drop-down selection of the most important encodings.
3. Have an iframe that, when the selected value in the drop-down changes, converts the contents of the uploaded file using iconv() (source = the selected encoding; target = UTF-8) and shows a preview.
4. When the user selects an encoding, do a final iconv() and store the file as UTF-8.
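
The preview step above could be sketched along these lines. Everything named here (`PREVIEW_ENCODINGS`, `previewUpload`, the list of encodings itself) is an assumption for illustration. The sketch reads line by line, so it only suits UTF-8 and ASCII-compatible single-byte encodings, not UTF-16.

```php
<?php
// Hypothetical whitelist: a real drop-down would offer whatever encodings matter to your users.
const PREVIEW_ENCODINGS = ['UTF-8', 'WINDOWS-1252', 'ISO-8859-1', 'ISO-8859-15'];

// Return the first lines of the uploaded temp file, converted from the chosen encoding to UTF-8.
function previewUpload(string $tmpPath, string $encoding, int $maxLines = 10): array
{
    // Never pass an arbitrary user-supplied encoding name straight to iconv().
    if (!in_array($encoding, PREVIEW_ENCODINGS, true)) {
        throw new InvalidArgumentException("Unsupported encoding: {$encoding}");
    }

    $handle = fopen($tmpPath, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$tmpPath}");
    }

    $lines = [];
    while (count($lines) < $maxLines && ($line = fgets($handle)) !== false) {
        // iconv() returns false when the bytes are not valid in the source encoding.
        $converted = @iconv($encoding, 'UTF-8', rtrim($line, "\r\n"));
        $lines[] = $converted === false
            ? "[cannot be decoded as {$encoding}]"
            : $converted;
    }
    fclose($handle);

    return $lines;
}
```

The iframe would call something like this each time the drop-down changes. Once the user confirms a choice, the same conversion is applied to the whole file and the result is stored as UTF-8.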


地球回转人心会变

#3 · 2019-07-19 19:02

Automatic encoding detection for CSV can be difficult, in my experience. It's reliable only for a small subset of encodings (such as the UTF family and a few others). In that regard, Pekka's suggestions aim in the right direction by placing the burden of identifying the correct encoding on the end user.

Keeping UTF-8 as the internal format is a good idea, but I suggest keeping charset handling separate from CSV processing, since the CSV format itself has no rules about encoding. While it's true that decoding on the fly is somewhat more efficient, the increase in code complexity might not justify the gain. Keeping software components specialized is generally a good idea.
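
For comparison, decoding on the fly does not necessarily mean teaching the CSV code about charsets. One possible sketch, assuming the iconv extension is available, attaches PHP's `convert.iconv.*` stream filter to the file handle so the reader only ever sees UTF-8. The function name `readCsvAs` is hypothetical.

```php
<?php
// Hypothetical reader: converts while reading, so no converted copy is written to disk.
function readCsvAs(string $path, string $sourceEncoding): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }

    // Built-in conversion filter; the name has the form convert.iconv.<from>/<to>.
    $filter = stream_filter_append(
        $handle,
        'convert.iconv.' . $sourceEncoding . '/UTF-8',
        STREAM_FILTER_READ
    );
    if ($filter === false) {
        fclose($handle);
        throw new RuntimeException("No iconv filter for {$sourceEncoding}");
    }

    $rows = [];
    // fgetcsv() now receives UTF-8 bytes, whatever the file's encoding on disk.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);

    return $rows;
}
```

The CSV logic stays encoding-agnostic either way. The trade-off is that the source encoding must be known on every read, rather than once at upload time.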

Character transformations should happen inside the server-side controller, before handing control over to the CSV processor, provided the system adheres to MVC.
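
As a rough sketch of that boundary, the controller below converts the upload once and hands the processor a path to a UTF-8 file. All names (`importCsvAction`, `$csvProcessor`, the `.utf8` suffix) are assumptions for illustration and not taken from the original application.

```php
<?php
// Hypothetical controller action: conversion happens here, before the CSV code runs.
function importCsvAction(string $tmpPath, string $declaredEncoding, callable $csvProcessor)
{
    $raw = file_get_contents($tmpPath);
    if ($raw === false) {
        throw new RuntimeException("Cannot read {$tmpPath}");
    }

    // Convert from the encoding the user declared (or that was guessed) to UTF-8.
    $utf8 = iconv($declaredEncoding, 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new InvalidArgumentException("File is not valid {$declaredEncoding}");
    }

    // Re-save as UTF-8; the suffix is an arbitrary choice for this sketch.
    $utf8Path = $tmpPath . '.utf8';
    if (file_put_contents($utf8Path, $utf8) === false) {
        throw new RuntimeException("Cannot write {$utf8Path}");
    }

    // The processor never needs to know which encoding the upload used.
    return $csvProcessor($utf8Path);
}
```

Here `$csvProcessor` stands in for the existing CSV or fixed-width classes, which can keep assuming UTF-8 input.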
