What are the best scanner settings for scanning documents (black-and-white text) so that OCR conversion gives the best results, and what are the standard settings and specifications for the PDF and TIFF formats?
Question:
Answer 1:
For OCR, the best scanning settings are:
- 300 dpi resolution for regular text, 400 dpi resolution for particularly small fonts (fine print)
- Black & white for text, greyscale for small fonts, color for pictures
- TIFF format. Use Group 4 compression for black & white (very small file size). If color is needed, use uncompressed TIFF (very large file size).
Some OCR engines have their own preferred settings; following them may help slightly, but the gains are usually minor.
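The list above can be written down as a small lookup. This is only a sketch: the content labels (`"text"`, `"fine_print"`, `"pictures"`), the `ScanSettings` type, and the function name are made up for illustration, and the resolution for pictures and the compression for greyscale are my assumptions, since the list does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSettings:
    dpi: int
    color_mode: str        # "bw", "greyscale", or "color"
    tiff_compression: str  # "group4" or "uncompressed"


def recommend_settings(content: str) -> ScanSettings:
    """Map a (hypothetical) content label to the settings listed above."""
    if content == "text":
        # Regular text: 300 dpi, black & white, Group 4 TIFF.
        return ScanSettings(300, "bw", "group4")
    if content == "fine_print":
        # Small fonts: 400 dpi, greyscale. Group 4 only handles 1-bit
        # images, so uncompressed is assumed here.
        return ScanSettings(400, "greyscale", "uncompressed")
    if content == "pictures":
        # Color for pictures; 300 dpi is an assumption, not from the list.
        return ScanSettings(300, "color", "uncompressed")
    raise ValueError(f"unknown content type: {content!r}")


print(recommend_settings("text"))
```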
Answer 2:
For OCR purposes, I would scan a document at 300 dpi, in B/W or grayscale, and save it as uncompressed TIFF or PNG.
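To get a feel for what "uncompressed" costs at 300 dpi, here is a rough size calculation. It counts pixel data only (no file headers) and assumes a US Letter page (8.5 x 11 inches); the function name is just illustrative.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int, bits_per_pixel: int) -> int:
    """Approximate raw pixel data size of an uncompressed scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# US Letter at 300 dpi is 2550 x 3300 pixels.
print(uncompressed_bytes(8.5, 11, 300, 1))   # B/W:       1052700  (~1 MB)
print(uncompressed_bytes(8.5, 11, 300, 8))   # grayscale: 8415000  (~8.4 MB)
print(uncompressed_bytes(8.5, 11, 300, 24))  # color:     25245000 (~25 MB)
```

So a B/W page stays manageable even uncompressed, while color pages grow quickly.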
Answer 3:
While 300 dpi is optimal for "perfect" inputs, if you are working with imperfect inputs (e.g. from a typewriter or dot-matrix printer), then the high resolution will actually throw Tesseract off. In cases like this, it is better to use a lower resolution to sort of hide the imperfections. E.g. with a dot-matrix printer I get significantly better results at 150 dpi than at 300 dpi.
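One way to picture why the lower resolution helps: halving the resolution averages neighbouring pixels, so the gaps between individual printer dots blur into a more uniform stroke. The sketch below does this in software on a grayscale image stored as a list of rows of 0-255 values. Note that this is a stand-in for illustration; I have only compared actual scans at 150 dpi and 300 dpi, and downsampling a 300 dpi scan afterwards may not behave identically.

```python
def downsample_2x(pixels):
    """Halve the resolution by averaging each 2x2 block of pixels.

    `pixels` is a list of equal-length rows of ints in 0..255.
    A trailing odd row or column is dropped.
    """
    height = len(pixels) - len(pixels) % 2
    width = (len(pixels[0]) - len(pixels[0]) % 2) if pixels else 0
    out = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            total = (pixels[y][x] + pixels[y][x + 1]
                     + pixels[y + 1][x] + pixels[y + 1][x + 1])
            row.append(total // 4)
        out.append(row)
    return out


# A dotted pattern (alternating black and white pixels) becomes even gray.
dots = [[0, 255, 0, 255],
        [255, 0, 255, 0]]
print(downsample_2x(dots))  # [[127, 127]]
```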
Answer 4:
If you want a general answer, 300 DPI is good. The best OCR results usually come from B/W images, and if your image quality is low, you might improve it by applying image processing.
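As one example of such image processing, a grayscale scan can be converted to B/W with an automatically chosen threshold. The sketch below uses Otsu's method, which is my choice for illustration and not the only option; it works on a plain list of rows of 0-255 values and the function names are made up.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels with value <= t are treated as dark (text).
    """
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below t
    sum0 = 0  # sum of their values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels):
    """Convert a grayscale image to pure black (0) and white (255)."""
    t = otsu_threshold(pixels)
    return [[0 if value <= t else 255 for value in row] for row in pixels]


scan = [[10, 12, 200],
        [11, 210, 205]]
print(binarize(scan))  # [[0, 0, 255], [0, 255, 255]]
```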
Also, if you are saving the scanned image and then feeding it to the OCR engine, do NOT use lossy compression such as JPEG. Note that a lossless JPEG mode exists, but it is not commonly supported.