Detect encoding by BOM / its absence

Posted 2019-09-19 04:22

Question:

I am using this code in a batch script to replace text in a file and then move the file to a location. This code is contained within a loop and reads in variables with each pass.

powershell -Command "(gc %inputPath%\%inputFile%) -replace 'Foo', '%bar%' | Out-File '%outputPath%\%outputFile%' -encoding default"

I ran into an issue with all the files being encoded as Unicode (UCS-2 Little Endian) because I lacked the "-encoding default" argument. After adding that argument, ANSI files are fine, but some of the files are UTF-8, and those still come out in the wrong encoding.

These files are configs for executables, and they can be VERY picky about the encoding of their configs.

I've searched a good bit for a way to read what type of encoding the input uses, and I have been unable to find a batch solution that works. Does batch have a means of reading the encoding?

I'll accept powershell solutions, but ONLY if they can be executed from within the batch file. I'd prefer not to use external modules, but may have to if it's the only way.

Answer 1:

Create a plain ASCII text file named dummy.txt and put just two characters in it; I usually use AA. Then do a binary compare of your file against it.

fc /b LIttleEndian.txt dummy.txt

For a UTF-16 Little Endian file you will then see this as your output:

Comparing files LIttleEndian.txt and DUMMY.TXT
00000000: FF 41
00000001: FE 41
FC: LIttleEndian.txt longer than DUMMY.TXT

For UTF-8 with a BOM you will see this:

C:\BatchFiles\Encoding>fc /b utf8.txt dummy.txt
Comparing files UTF8.txt and DUMMY.TXT
00000000: EF 41
00000001: BB 41
FC: UTF8.txt longer than DUMMY.TXT

Use a FOR /F command to parse the output and that should help you determine the encoding used for your input file.

For plain ASCII text there is no BOM. Every byte is below 80 hex, so each hex code starts with a digit from 0 to 7.

C:\BatchFiles\Encoding>fc /b Normaltext.txt dummy.txt
Comparing files Normaltext.txt and DUMMY.TXT
00000000: 4E 41
00000001: 6F 41
FC: Normaltext.txt longer than DUMMY.TXT
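
The FOR /F parsing mentioned above could look roughly like the following. This is only a sketch: it assumes dummy.txt (containing exactly the two characters AA) is in the current directory, and the variable names b0 and b1 are made up for illustration. Because it looks at two bytes only, it does not check the third UTF-8 BOM byte (BF), and it cannot tell UTF-16 LE from UTF-32 LE, which also starts with FF FE.

```batch
@echo off
setlocal
rem Sketch: classify the file given as %1 by its first two bytes.
rem Assumes dummy.txt holds exactly the two ASCII characters AA.
set "b0="
set "b1="

rem fc /b prints one line per differing byte, e.g. "00000000: FF 41".
rem Token 1 is the offset, token 2 is the byte from the input file.
for /f "tokens=1,2" %%A in ('fc /b "%~1" dummy.txt') do (
    if "%%A"=="00000000:" set "b0=%%B"
    if "%%A"=="00000001:" set "b1=%%B"
)

rem If a byte equals 41 (the letter A), fc reports no difference for it,
rem so b0 or b1 stays empty and the file falls through to "no BOM found".
if /i "%b0% %b1%"=="FF FE" (
    echo UTF-16 LE
) else if /i "%b0% %b1%"=="FE FF" (
    echo UTF-16 BE
) else if /i "%b0% %b1%"=="EF BB" (
    echo UTF-8 with BOM
) else (
    echo no BOM found
)
endlocal
```

Saved under a name of your choice and called with the file to test as its only argument, it prints one of the four labels.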


Answer 2:

Here's one more way, relying on the certutil command:

@echo off
:detect_encoding
setLocal
rem %~1 strips surrounding quotes, so quoted paths with spaces also work
if "%~1" EQU "-?" (
    endlocal
    call :help
    exit /b 0
)
if "%~1" EQU "-h" (
    endlocal
    call :help
    exit /b 0
)
if "%~1" EQU "" (
    endlocal
    call :help
    exit /b 0
)


if not exist "%~1" (
    echo file does not exist
    endlocal
    exit /b 54
)

if exist "%~1\" (
    echo this cannot be used against directories
    endlocal
    exit /b 53
)

if "%~z1" EQU "0" (
    echo empty files are not accepted
    endlocal
    exit /b 52
)



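rem -- use the full path so the file does not have to be in the current directory --
set "file=%~f1"
del /Q /F "%file%.hex" >nul 2>&1

rem -- dump the file as plain hex; format 4 omits the address and ASCII columns --
certutil -f -encodehex "%file%" "%file%.hex" 4 >nul

rem -- read the first four bytes from the first line of the hex file --
set "b1="
set "b2="
set "b3="
set "b4="
for /f "usebackq tokens=1-4" %%A in ("%file%.hex") do (
    set "b1=%%A"
    set "b2=%%B"
    set "b3=%%C"
    set "b4=%%D"
    goto :endfor
)
:endfor
del /Q /F "%file%.hex" >nul 2>&1

rem -- check the BOMs; UTF-32 comes before UTF-16 because ff fe 00 00 also starts with ff fe --
if /i "%b1% %b2% %b3%"=="ef bb bf" (echo utf-8& endlocal& exit /b 1)
if /i "%b1% %b2% %b3% %b4%"=="ff fe 00 00" (echo utf-32 LE& endlocal& exit /b 5)
if /i "%b1% %b2% %b3% %b4%"=="00 00 fe ff" (echo utf-32 BE& endlocal& exit /b 4)
if /i "%b1% %b2%"=="ff fe" (echo utf-16 LE& endlocal& exit /b 2)
if /i "%b1% %b2%"=="fe ff" (echo utf-16 BE& endlocal& exit /b 3)

rem -- no BOM: plain ASCII, ANSI, or UTF-8 without a BOM --
echo ASCII & endlocal & exit /b 6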



endLocal
goto :eof

:help
echo.
echo  %~n0 file - Detects encoding of a text file
echo.
echo for each encoding you will receive a text response with a name and an errorlevel code as follows:

echo     1 - UTF-8
echo     2 - UTF-16 LE
echo     3 - UTF-16 BE
echo     4 - UTF-32 BE
echo     5 - UTF-32 LE
echo     6 - ASCII

echo for empty files you will receive error code 52
echo for directories you will receive error code 53
echo for a nonexistent file you will receive error code 54
goto :eof
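
To tie this back to the replace command in the question, the exit code can be mapped to an Out-File encoding name. What follows is a rough sketch, not a tested drop-in. It assumes the script above is saved as detect_encoding.bat (a hypothetical name) next to the calling batch file, that inputPath, inputFile, outputPath, outputFile and bar are already set and contain no spaces, and that Windows PowerShell 5.1 is in use (where utf8 writes a BOM). It relies on the exit codes the script actually returns: 1 for EF BB BF, 2 for FF FE, 3 for FE FF, 5 for FF FE 00 00.

```batch
@echo off
setlocal
rem Sketch only: detect_encoding.bat is an assumed file name for the script above.
rem Run it from the input folder so it also works with a bare file name.
pushd "%inputPath%"
call "%~dp0detect_encoding.bat" %inputFile% >nul
set "rc=%errorlevel%"
popd

rem Map the exit code to an Out-File -Encoding name.
rem Anything without a recognised BOM keeps the ANSI default.
set "enc=default"
if "%rc%"=="1" set "enc=utf8"
if "%rc%"=="2" set "enc=unicode"
if "%rc%"=="3" set "enc=bigendianunicode"
if "%rc%"=="5" set "enc=utf32"

powershell -Command "(gc '%inputPath%\%inputFile%') -replace 'Foo', '%bar%' | Out-File '%outputPath%\%outputFile%' -encoding %enc%"
endlocal
```

Inside a parenthesized FOR loop, %errorlevel% is expanded when the loop is parsed, so there you would need delayed expansion (!errorlevel!) or a called subroutine. Also note that a UTF-8 file without a BOM is reported as ASCII and would be written back as ANSI.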