Windows command line/shell - discarding the UTF-8

Posted 2019-04-30 02:28

This question is a follow-up to another question about selectively appending lines from one file to another.

The regex that I'm using works fine for matching the lines to keep or discard. The problem is that the file was assembled from a bunch of other files, and sometimes the line I want to keep started out as the first line of a UTF-8 encoded file, so it carries that file's byte order mark (BOM). This means that the findstr command returns something like:

∩╗┐LineToKeep that started out as the first line in its file
LineToKeep another
LineToKeep more lines
∩╗┐LineToKeep that started out as the first line in its file
LineToKeep more

Apart from the BOM bytes, every line is guaranteed to begin with "LineToKeep". How can I get rid of those three UTF-8 BOM bytes (EF BB BF), since the Windows shell commands can't handle them properly?

I'm hoping for a way to remove them in place, or perhaps a modification to the findstr command from that previous question.

Since I know each line must begin with "LineToKeep" or "∩╗┐LineToKeep", I figure there's a way to compute something like if (Line[3:13] == "LineToKeep") { Line = Line[3:]; } for every line, treating the BOM as the first three bytes.
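For illustration only, the per-line check described above could be sketched in Python (not one of the Windows shell tools the question asks about). The sketch works on raw bytes so the BOM really is three bytes long. The function name `strip_embedded_boms` is hypothetical, and it assumes the combined file fits in memory.

```python
BOM = b"\xef\xbb\xbf"    # the three UTF-8 BOM bytes (shown as ∩╗┐ in code page 437)
PREFIX = b"LineToKeep"   # every wanted line starts with this once the BOM is gone


def strip_embedded_boms(data: bytes) -> bytes:
    """Drop the BOM from every line that begins with BOM + PREFIX."""
    cleaned = []
    # keepends=True preserves the original CR LF / LF line endings untouched.
    for line in data.splitlines(keepends=True):
        if line.startswith(BOM + PREFIX):
            line = line[len(BOM):]
        cleaned.append(line)
    return b"".join(cleaned)
```

To apply it, read the combined file in binary mode (`open(path, "rb")`), pass the bytes through the function, and write the result back in binary mode so no newline translation takes place.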

Tags: windows command-line batch-file

2 Answers

Fickle 薄情

Reply #2 · 2019-04-30 02:50

Another alternative, from the Unix world, that removes the BOM from a file in place:

sed -zbi "1s/^\xEF\xBB\xBF//" filepath

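This requires downloading sed 4.4 for Windows from https://github.com/mbuilov/sed-windows, which offers working -z and -b options. With -z the whole file is treated as a single "line" (as long as it contains no NUL bytes), so the 1s address only touches the very start of the file, and -b opens the file in binary mode, which prevents corruption of line endings.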
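If installing sed is not an option, roughly the same in-place edit can be sketched in Python. This is an assumption-laden equivalent rather than part of the answer above: the function name `remove_leading_bom` is made up for the example, and the whole file is read into memory.

```python
from pathlib import Path

BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark


def remove_leading_bom(path: str) -> bool:
    """Strip a UTF-8 BOM from the start of the file; return True if one was removed."""
    file = Path(path)
    # Binary reads and writes leave CR LF line endings exactly as they are.
    data = file.read_bytes()
    if not data.startswith(BOM):
        return False
    file.write_bytes(data[len(BOM):])
    return True
```

Like the sed command, this only handles a BOM at the very start of a file, so it would have to be run on each source file before the files are concatenated.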


干净又极端

Reply #3 · 2019-04-30 02:55

I ended up calling PowerShell from the Windows cmd prompt:

powershell . "Get-ChildItem . | Select-String '^LineToKeep' | foreach {$_.Line}"
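For comparison, the same idea (let the reader consume each file's BOM while decoding, then filter by pattern) can be sketched in Python. This is only an analogue of the PowerShell pipeline, not a description of how Select-String works internally. It assumes every file in the directory is UTF-8 text, and the name `matching_lines` is hypothetical.

```python
import re
from pathlib import Path

PATTERN = re.compile(r"^LineToKeep")


def matching_lines(directory: str = "."):
    """Yield lines starting with 'LineToKeep' from every file directly in directory."""
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # "utf-8-sig" decodes UTF-8 and silently skips a leading BOM if present.
        with path.open(encoding="utf-8-sig", newline="") as handle:
            for line in handle:
                if PATTERN.match(line):
                    yield line.rstrip("\r\n")
```

Calling `print("\n".join(matching_lines(".")))` would then print the kept lines without any BOM residue. Unlike findstr on an already concatenated file, this filters the original source files directly.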
