How to use sed/grep to extract text between two words?

Posted 2019-01-01 00:32

Question:

I am trying to output a string that contains everything between two words of a string:

input:

"Here is a String"

output:

"is a"

Using:

sed -n '/Here/,/String/p'

includes the endpoints, but I don't want to include them.

Answer 1:

sed -e 's/Here\(.*\)String/\1/'


Answer 2:

GNU grep built with PCRE support (the -P option) also handles positive and negative look-ahead and look-behind. For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'


Answer 3:

You can strip strings in Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$
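The four prefix and suffix operators only differ once a marker occurs more than once. A minimal sketch, assuming a made-up sample string (the variable names are illustrative and not part of the answer above):

```shell
# Illustrative sample in which both markers occur twice.
s='Here is a String and Here is another String'

shortest_prefix=${s#*Here }     # removes up to the FIRST "Here "
longest_prefix=${s##*Here }     # removes up to the LAST "Here "
shortest_suffix=${s% String*}   # removes from the LAST " String" onward
longest_suffix=${s%% String*}   # removes from the FIRST " String" onward

printf '%s\n' "$shortest_prefix" "$longest_prefix" \
              "$shortest_suffix" "$longest_suffix"
```

So ## followed by %% yields the text between the last start marker and the first end marker that remains after it.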

And if you have a GNU grep that includes PCRE, you can use a zero-width assertion:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a


Answer 4:

The accepted answer does not remove text that could be before Here or after String. This will:

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and after String.
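To see the difference, here is a small sketch that runs both expressions on a line with extra text on each side. The sample line and variable names are made up for illustration:

```shell
# Sample line with text before "Here" and after "String" (illustrative).
line='Well, Here is a String indeed'

# Without the outer .*, only the two markers are removed.
without_outer=$(printf '%s\n' "$line" | sed -e 's/Here\(.*\)String/\1/')

# With .* on both sides, the text outside the markers is removed as well.
with_outer=$(printf '%s\n' "$line" | sed -e 's/.*Here\(.*\)String.*/\1/')

# Brackets make the surrounding spaces visible.
printf '[%s]\n' "$without_outer"
printf '[%s]\n' "$with_outer"
```

Note that the spaces next to the markers are still part of the captured group in both cases.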



Answer 5:

With GNU awk:

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a 

grep with the -P (--perl-regexp) option supports \K, which discards everything matched so far from the reported match. In our case, the previously matched text was Here, so it is dropped from the final output.

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a 

If you want the output to be "is a" without the surrounding spaces, you could try the following:

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a
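The two \K forms only behave differently when the end word appears more than once. A rough sketch, assuming GNU grep with PCRE support and a made-up input line:

```shell
# Illustrative input in which the end word "string" occurs twice.
input='Here is a string and another string'

# Greedy .* runs to the LAST "string" on the line.
greedy=$(printf '%s\n' "$input" | grep -oP 'Here\K.*(?=string)')

# (?:(?!string).)* stops just before the FIRST "string".
tempered=$(printf '%s\n' "$input" | grep -oP 'Here\K(?:(?!string).)*')

# Brackets make the surrounding spaces visible.
printf '[%s]\n' "$greedy"
printf '[%s]\n' "$tempered"
```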


Answer 6:

If you have a long file with many multi-line occurrences, it is useful to number the lines first:

cat -n file | sed -n '/Here/,/String/p'
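A similar effect can be sketched with an awk range pattern, which prints the original line number of every line inside a Here...String block. This is an alternative to the cat -n pipeline, not the same command, and the sample file contents are made up:

```shell
# Build a small illustrative file with two Here...String blocks.
sample=$(mktemp)
printf '%s\n' 'intro' 'Here' 'first' 'String' 'middle' 'Here' 'second' 'String' > "$sample"

# The range pattern selects each block; NR is the line number in the file.
numbered=$(awk '/Here/,/String/ { print NR ": " $0 }' "$sample")
printf '%s\n' "$numbered"

rm -f "$sample"
```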


Answer 7:

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file

This prints each occurrence of text between two markers (in this instance Here and String) on its own line and preserves newlines within the text.
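A small sketch of that command on a three-line sample, assuming GNU sed (it relies on \n in the replacement and on the empty regex reusing the last one). The file contents are made up for illustration:

```shell
# Illustrative file: one match on a single line, one spanning two lines.
sample=$(mktemp)
printf '%s\n' 'alpha Here one String beta' 'Here two' 'three String gamma' > "$sample"

# Same script as above, applied to the sample file.
between=$(sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' "$sample")

# The spaces that sit next to the markers are kept in the output.
printf '%s\n' "$between"

rm -f "$sample"
```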



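Answer 8:

All the above solutions have deficiencies when the last search string is repeated elsewhere in the string. I found it best to write a bash function.

    function str_str {
      local str
      # Drop everything up to and including the first occurrence of $2.
      # The inner quotes make the marker match literally, not as a glob.
      str="${1#*"$2"}"
      # Drop everything from the first remaining occurrence of $3 onward.
      str="${str%%"$3"*}"
      echo -n "$str"
    }

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"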
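To illustrate the repeated-end-marker case, here is a sketch comparing a greedy sed substitution with a restatement of the function above. The input string is made up, and printf stands in for echo -n:

```shell
# Restatement of the function above (illustrative copy).
str_str() {
  local str
  str="${1#*"$2"}"       # cut through the first start marker
  str="${str%%"$3"*}"    # cut from the first end marker onward
  printf '%s' "$str"
}

# Illustrative input in which the end marker " string" occurs twice.
input='this is a string and another string'

# Greedy sed capture runs to the LAST " string".
via_sed=$(printf '%s\n' "$input" | sed 's/.*this \(.*\) string.*/\1/')

# The function stops at the FIRST " string" after the start marker.
via_function=$(str_str "$input" 'this ' ' string')

printf '[%s]\n' "$via_sed"
printf '[%s]\n' "$via_function"
```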


Answer 9:

You can use \1 (refer to http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

The content inside the escaped parentheses is stored as \1.
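As a sketch of how the group numbering works, the following compares the escaped-parenthesis form with the extended-regex form, assuming a sed that accepts -E (GNU and BSD sed both do). The third command uses three groups only to show that they are numbered left to right:

```shell
input='Hello is a String'

# Basic regex: groups are written \( ... \).
bre=$(printf '%s\n' "$input" | sed 's/Hello\(.*\)String/\1/')

# Extended regex (-E): plain parentheses form the same group.
ere=$(printf '%s\n' "$input" | sed -E 's/Hello(.*)String/\1/')

# Groups are numbered by their opening parenthesis, left to right.
swapped=$(printf '%s\n' "$input" | sed -E 's/(Hello)(.*)(String)/\3\2\1/')

printf '[%s]\n' "$bre" "$ere" "$swapped"
```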



Answer 10:

Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject lines:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

Per Answer 2 in this thread (How to use sed/grep to extract text between two words?), the first expression below "works" as long as the matched text does not contain a newline:

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Per "Extract text between two strings on different lines":

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]                              

Solution 2.

Per "How can I replace a newline (\n) using sed?":

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace newlines with a space.

Chaining that with Answer 2 in this thread (How to use sed/grep to extract text between two words?), we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]] 

This variant also collapses each run of whitespace into a single space:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

giving

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
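For comparison, here is a rough awk sketch that unfolds the header directly instead of joining the whole file first. It is an alternative technique, not one of the solutions above; it assumes continuation lines start with a space or tab, and the function name, file contents, and addresses are made up for illustration:

```shell
# Print the unfolded Subject header(s) found in the file given as $1.
unfold_subject() {
  awk '
    # A new Subject line starts a header; flush any header still pending.
    /^Subject: / { if (in_subject) print subject
                   in_subject = 1; sub(/^Subject: /, ""); subject = $0; next }
    # Continuation line: strip leading whitespace, join with one space.
    in_subject && /^[ \t]/ { sub(/^[ \t]+/, ""); subject = subject " " $0; next }
    # Any other line ends the header.
    in_subject { print subject; in_subject = 0 }
    END { if (in_subject) print subject }
  ' "$1"
}

# Illustrative message with a Subject folded over three lines.
sample=$(mktemp)
printf '%s\n' 'From: someone@example.com' 'Subject: first part' \
  ' second part' ' third part' 'Message-ID: <id@example.com>' > "$sample"

subject=$(unfold_subject "$sample")
printf '%s\n' "$subject"

rm -f "$sample"
```

Because leading whitespace is stripped from each continuation line before joining, no second pass is needed to remove double spaces.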


Answer 11:

To understand a sed command, it helps to build it up step by step.

Here is your original text:

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$ 

Let's try to remove Here with the substitute command in sed:

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$ 

At this point, I believe you would be able to remove String as well:

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$ 

But this is not your desired output.

To combine the two sed commands, use the -e option:

user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$ 
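One detail that is easy to miss in the terminal: removing only String leaves the space that preceded it. A small sketch that makes this visible with brackets, using illustrative variable names:

```shell
input='Here is a String'

# Second expression removes only the word, so a trailing space remains.
loose=$(printf '%s\n' "$input" | sed -e 's/Here //' -e 's/String//')

# Including the leading space in the pattern removes it as well.
tight=$(printf '%s\n' "$input" | sed -e 's/Here //' -e 's/ String//')

printf '[%s]\n' "$loose"
printf '[%s]\n' "$tight"
```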

Hope this helps