Windows command line/shell - discarding the UTF-8 BOM

Posted 2019-04-30 02:28

This question is in continuation to another question about selectively appending lines from one file to another.

The regex that I'm using works just fine at matching the lines to keep/to discard. The problem is that the file was composed from a bunch of other files, and sometimes the line I want to keep started out as the first line of a UTF-8 encoded file. This means that the findstr command returns something like:

ï»¿LineToKeep that started out as the first line in its file
LineToKeep another
LineToKeep more lines
ï»¿LineToKeep that started out as the first line in its file
LineToKeep more

It's guaranteed that, apart from the BOM bytes, every line begins with "LineToKeep". How can I get rid of those three UTF-8 BOM bytes, since these Windows shell commands can't handle them properly?

I'm hoping for a way to remove them in place, or perhaps a modification to the findstr command from that previous question.

Since I know each line must begin with "LineToKeep" or "ï»¿LineToKeep", I figure there's a way to compute something like if (Line[3:13] == "LineToKeep") { Line = Line[3:]; } for every line.
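
For illustration, the per-line check described above could be sketched in Python roughly as follows. This is only a sketch, not one of the Windows shell solutions asked for: it assumes the file is read as raw bytes, and the names strip_bom and strip_bom_in_file are made up for this example.

```python
BOM = b"\xef\xbb\xbf"    # the three UTF-8 BOM bytes
PREFIX = b"LineToKeep"   # every line is assumed to start with this


def strip_bom(line: bytes) -> bytes:
    """Drop a leading UTF-8 BOM when it directly precedes the known prefix."""
    if line.startswith(BOM + PREFIX):
        return line[len(BOM):]
    return line


def strip_bom_in_file(path: str) -> None:
    """Rewrite the file at `path` with the BOM removed from every line."""
    # Binary mode leaves the existing line endings (CRLF or LF) untouched.
    with open(path, "rb") as f:
        lines = f.read().splitlines(keepends=True)
    with open(path, "wb") as f:
        f.writelines(strip_bom(line) for line in lines)
```

Calling strip_bom_in_file on the combined file would rewrite it in place. Lines that do not start with the BOM followed by "LineToKeep" are written back unchanged.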

2 Answers
Fickle 薄情
#2 · 2019-04-30 02:50

Another alternative, from the Unix world, that removes the BOM from a file in place:

sed -zbi "1s/^\xEF\xBB\xBF//" filepath

This requires downloading sed 4.4 for Windows from https://github.com/mbuilov/sed-windows, which offers working -z and -b options that prevent corruption of line endings.

干净又极端
#3 · 2019-04-30 02:55

I ended up calling PowerShell from Windows cmd:

powershell . "Get-ChildItem . | Select-String '^LineToKeep' | foreach {$_.Line}"
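
For comparison, the same idea (read the original files with BOM-aware decoding, then keep only the matching lines) might look like this in Python. It is a rough sketch rather than a translation of the command above: it assumes the source files are UTF-8, and the function name lines_to_keep is invented for the example.

```python
from pathlib import Path


def lines_to_keep(directory: str, prefix: str = "LineToKeep"):
    """Yield the lines starting with `prefix` from every file in `directory`."""
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # "utf-8-sig" decodes UTF-8 and drops a BOM at the start of the file;
        # errors="replace" avoids a crash on bytes that are not valid UTF-8.
        with open(path, encoding="utf-8-sig", errors="replace") as f:
            for line in f:
                if line.startswith(prefix):
                    yield line.rstrip("\r\n")
```

Iterating over lines_to_keep(".") and printing each item gives the kept lines without their BOMs. Note that this only handles a BOM at the very start of each file, so it applies to the original files rather than to an already concatenated one.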