Reverse newline tokenization in one-token per line files? - Unix

Posted 2019-10-30 04:10

"How to separate tokens using Unix?" shows that a file can be tokenized using sed or xargs.

Is there a way to do the reverse?

[IN]:

some
sentences
are
like
this.

some
sentences
foo
bar
that

[OUT]:

some sentences are like this.
some sentences foo bar that

The only delimiter between sentences is \n\n. I can do it in Python as follows, but is there a Unix way?

def per_section(it):
  """ Read lines from an iterable and yield sections using empty line as delimiter """
  section = []
  for line in it:
    if line.strip('\n'):
      section.append(line)
    elif section:
      # an empty line closes the current section; repeated empty lines are skipped
      yield ''.join(section)
      section = []
  # yield any remaining lines as a section too
  if section:
    yield ''.join(section)

# Python 3: open() decodes the file itself, so codecs.open is not needed
with open('outfile.txt', 'r', encoding='utf8') as f:
  print([section.replace("\n", " ") for section in per_section(f)])

[OUT]:

['some sentences are like this. ', 'some sentences foo bar that ']

Answer 1:

awk makes this kind of task easier:

awk -v RS="" '{$1=$1}7' file

If you want to keep multiple spaces within each line, you can use:

awk -v RS="" -F'\n' '{$1=$1}7' file

With your example:

kent$  cat f
some
sentences
are
like
this.

some
sentences
foo
bar
that

kent$  awk -v RS=""  '{$1=$1}7' f   
some sentences are like this.
some sentences foo bar that
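
For readers more comfortable with Python, the two awk variants above can be approximated as follows. This is only a rough sketch of what paragraph mode (RS="") does, not awk's actual implementation; the function name join_paragraphs and its keep_inner_spaces parameter are made up for illustration.

```python
import re

def join_paragraphs(text, keep_inner_spaces=False):
    """Join each blank-line-separated block of lines into a single line."""
    # Like RS="": runs of empty lines separate records, and leading or
    # trailing newlines do not produce empty records.
    records = [r for r in re.split(r"\n{2,}", text.strip("\n")) if r]
    if keep_inner_spaces:
        # Like -F'\n': only newlines separate fields, so spaces inside a line survive.
        return [" ".join(r.split("\n")) for r in records]
    # Default field splitting: any run of whitespace separates fields.
    return [" ".join(r.split()) for r in records]

sample = "some\nsentences\nare\nlike\nthis.\n\nsome\nsentences\nfoo\nbar\nthat\n"
for line in join_paragraphs(sample):
    print(line)
```

Rebuilding the record from its fields is what {$1=$1} triggers in awk; here the " ".join(...) call plays that role.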


Answer 2:

You can use an awk command like this:

awk -v RS="\n\n" '{gsub("\n"," ",$0);print $0}' file.txt 

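Setting the record separator to \n\n means that each group of lines separated by a blank line is read as one record. Each record is then printed after every \n in it has been replaced with a space character. Note that a multi-character RS is treated as a regular expression by gawk and mawk; POSIX leaves the behavior unspecified.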
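
The same split-then-replace idea can be sketched in Python. This is an illustration under assumptions, not a port of awk: the helper name join_on_double_newline is hypothetical, and unlike awk it drops empty pieces.

```python
def join_on_double_newline(text):
    """Split on the literal separator "\n\n", then flatten each piece."""
    # Each piece corresponds to one record; empty pieces are dropped here.
    records = [r for r in text.split("\n\n") if r]
    # Replace the remaining newlines with spaces, like gsub("\n", " ", $0).
    return [r.replace("\n", " ") for r in records]

sample = "some\nsentences\nare\nlike\nthis.\n\nsome\nsentences\nfoo\nbar\nthat\n"
print(join_on_double_newline(sample))
```

In this sketch the file's final newline stays inside the last piece and becomes a trailing space; call rstrip() on each result if that matters.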



Answer 3:

sed -n --posix 'H;$ {x;s/\n\([^[:cntrl:]]\{1,\}\)/\1 /gp;}' YourFile

This splits on blank lines, so the strings can also differ in length.
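
To see why the substitution works, here is a rough Python imitation of the sed script. It assumes the C locale meaning of [:cntrl:] (bytes 0x00-0x1f and 0x7f), and the function name sed_like_join is invented for this sketch.

```python
import re

def sed_like_join(text):
    """Imitate: sed -n 'H;$ {x;s/\\n\\([^[:cntrl:]]\\{1,\\}\\)/\\1 /gp;}'"""
    # H appends every input line to the hold space with a newline in front of it.
    held = "".join("\n" + line for line in text.splitlines())
    # The s///g turns each "newline + non-empty line" into "line + space".
    # The newline that precedes an empty line is followed by another newline,
    # so it does not match and survives as the break between sentences.
    return re.sub(r"\n([^\x00-\x1f\x7f]+)", r"\1 ", held)

sample = "some\nsentences\nare\nlike\nthis.\n\nsome\nsentences\nfoo\nbar\nthat\n"
print(sed_like_join(sample))
```

Each output line ends with a space in this sketch, because every joined token is followed by one.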



Source: Reverse newline tokenization in one-token per line files? - Unix