I have a database unload file with fields separated by the <TAB> character. I am running this file through sed to replace any occurrences of <TAB><TAB> with <TAB>\N<TAB>, so that when the file is loaded into MySQL the \N is interpreted as NULL.

The sed command 's/\t\t/\t\N\t/g;' almost works, except that it only replaces the first of several consecutive instances, e.g. "...<TAB><TAB><TAB>..." becomes "...<TAB>\N<TAB><TAB>...".

If I use 's/\t\t/\t\N\t/g;s/\t\t/\t\N\t/g;' it replaces more instances.

I have a notion that despite the /g modifier this is something to do with the end of one match being the start of another.

Could anyone explain what is happening and suggest a sed command that would work, or do I need to loop?

I know I could probably switch to awk, perl, python but I want to know what is happening in sed.

Tags: shell unix sed

5 answers

Explanation

:repeat is a label, used as the target of branch commands, similar to labels in batch files
/\t\t/ is an address that matches two consecutive tabs. If the pattern matches, the command following the second / is executed.
{} - In this case the command following the address is a group, so all of the commands in the group are executed when the pattern matches.
s|\t\t|\t\n\t|g; - Standard replace of 2 tabs with tab-newline-tab (to insert the literal \N the question asks for, the replacement would be \t\\N\t). I still use the global flag because if you have, say, 15 tabs, you will only need to loop twice rather than 14 times.
b repeat means always branch (goto) to the label repeat

So it goes like this. Keep repeating (goto repeat) as long as there is a match for the pattern of 2 tabs.

While the argument can be made that you could just do two identical global replaces and call it good, this same technique could work in more complicated scenarios.

As @thorn-blake points out, sed just doesn't support advanced features like lookahead, so you need to do a loop like this.
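
Spelled out in full, the loop described above might look like the sketch below. It is an illustration under a few assumptions rather than the answer's original command: fill_nulls is a hypothetical helper name, the replacement inserts the literal \N from the question (not a newline), and the tab is handed to sed as a literal character so the script does not depend on sed understanding \t.

```shell
#!/bin/sh
# fill_nulls is a hypothetical helper name used only for this sketch.
fill_nulls() {
  tab=$(printf '\t')   # a literal tab character
  # Each -e fragment is one line of the sed script:
  # label, address + group start, substitution, branch, group end.
  sed -e ':repeat' \
      -e "/${tab}${tab}/{" \
      -e "s|${tab}${tab}|${tab}\\\\N${tab}|g" \
      -e 'b repeat' \
      -e '}'
}

printf 'a\t\t\tb\n' | fill_nulls
# prints a<TAB>\N<TAB>\N<TAB>b
```

Splitting the script into separate -e fragments keeps the label and the branch command each on their own script line, which sidesteps the differences in how sed implementations terminate label names.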

Short version

Which can be shortened to

sed ':r;/\t\t/{s|\t\t|\t\n\t|g; b r}'

MacOS

And the Mac (yet still Linux/Windows compatible) version:

sed $':r\n/\t\t/{ s|\t\t|\t\\\n\t|g; b r\n}'

Tabs need to be literal in BSD sed; inside $'...' the shell expands \t to a real tab before sed sees it.
Newlines need to be both literal and escaped at the same time, hence the single backslash (written as \\ inside $'...', which turns it into one literal backslash) followed by \n, which becomes an actual newline.
Both label names (:r) and branch commands (b r) must end in a newline. Semicolons and spaces are consumed by the label name/branch command in BSD, which makes it all very confusing.

在下西门庆

Answer 3 · 2019-06-19 17:46

As a workaround, replace every tab with tab + \N; then remove all occurrences of \N which are not immediately followed by a tab.

sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'

... provided your sed uses backslash before grouping parentheses (there are sed dialects which don't want the backslashes; try without them if this doesn't work for you.)
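
A quick way to try this workaround is sketched below. The helper name fill_nulls_two_step is hypothetical, and the tab is passed as a literal character (an assumption made for portability, since not every sed expands \t).

```shell
#!/bin/sh
# fill_nulls_two_step is a hypothetical helper name for this sketch.
fill_nulls_two_step() {
  tab=$(printf '\t')   # a literal tab character
  # Step 1: append \N after every tab.
  # Step 2: drop each \N that is followed by a non-tab character.
  sed -e "s/${tab}/${tab}\\\\N/g" \
      -e "s/\\\\N\\([^${tab}]\\)/\\1/g"
}

printf 'a\t\t\tb\tc\n' | fill_nulls_two_step
# prints a<TAB>\N<TAB>\N<TAB>b<TAB>c
```

Note that a \N left at the very end of a line is not followed by any character, so a trailing empty field also ends up as \N with this approach.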

\"骚年 ilove

Answer 4 · 2019-06-19 17:46

Right, even with /g, sed will not rescan the text it has just replaced. It has read <TAB><TAB> and output <TAB>\N<TAB>, and then continues with the next thing in the input stream; the second tab was consumed by the first match, so it cannot start a new one. See http://www.grymoire.com/Unix/Sed.html#uh-7
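
The effect is easy to reproduce with a visible delimiter. In this small sketch commas stand in for tabs (an assumption made purely so the output is readable); the variable names once and twice are only for illustration.

```shell
#!/bin/sh
# Commas stand in for tabs here so the result is easy to read.
once=$(printf '%s\n' 'a,,,b' | sed 's/,,/,\\N,/g')
twice=$(printf '%s\n' 'a,,,b' | sed 's/,,/,\\N,/g; s/,,/,\\N,/g')

printf '%s\n' "$once"    # a,\N,,b   (the middle comma was consumed by the first match)
printf '%s\n' "$twice"   # a,\N,\N,b (the second pass sees the leftover pair)
```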

In a regex language that supports lookaheads, you can get around this with a lookahead.

女痞

Answer 5 · 2019-06-19 17:48

Well, sed simply works as designed. The input line is scanned once, not multiple times. Maybe it helps to look at the consequences if sed rescanned the input line by default in order to deal with overlapping patterns: in that case even simple substitutions would work quite differently (some might say counter-intuitively), e.g.

s/^/ / inserting a space at the beginning of a line would never terminate
s/$/foo/ appending foo to each line - likewise
s/[A-Z][A-Z]*/CENSORED/ replacing uppercase words with CENSORED - likewise

There are probably many other situations. Of course these could all be remedied with, say, a substitution modifier, but at the time sed was designed, the current behavior was chosen.

Why does sed not replace overlapping patterns
