grep: match all characters up to (not including) first blank space

Posted 2020-05-30 06:50

I have a text file that has the following format:

characters(that I want to keep) (space) characters(that I want to remove)

So for example:

foo garbagetext
hello moregarbage
keepthis removethis
(etc.)

I was trying to use the grep command in Linux to keep only the characters in each line up to, but not including, the first blank space. I have made numerous attempts, such as:

grep '*[[:space:]]' text1.txt > text2.txt
grep '*[^\s]' text1.txt > text2.txt
grep '/^[^[[:space:]]]+/' text1.txt > text2.txt

trying to piece together from different examples, but I have had no luck. They all produce a blank text2.txt file. I am new to this. What am I doing wrong?

*EDIT:

The parts I want to keep include capital letters. So I want to keep any/all characters up to and not including the blank space (removing everything from the blank space onward) in each line.

**EDIT:

The garbage text (that I want to remove) can contain anything, including spaces, special characters, etc. So for example:

AA rough, cindery lava [n -S]

After running grep -o '[^ ]*' text1.txt > text2.txt, the line above becomes:

AA
rough,
cindery
lava
[n
-S]

in text2.txt. (All I want to keep is AA)


SOLUTION (provided by Rohit Jain with further input by beny23):

 grep -o '^[^ ]*' text1.txt > text2.txt
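
To see why the ^ anchor matters, here is a minimal sketch that runs both patterns on sample lines from the question. The temporary file is an assumption standing in for text1.txt.

```shell
# Sample input taken from the question; the temp file stands in for text1.txt.
input=$(mktemp)
printf '%s\n' 'foo garbagetext' 'hello moregarbage' 'AA rough, cindery lava [n -S]' > "$input"

# Unanchored: -o prints every run of non-space characters, one per output line.
unanchored=$(grep -o '[^ ]*' "$input")

# Anchored: ^ restricts the match to the run that starts each line.
anchored=$(grep -o '^[^ ]*' "$input")

printf '%s\n' "$anchored"
rm -f "$input"
```

The anchored version prints foo, hello and AA, one per line. The unanchored version prints all ten space-separated tokens, which is the behaviour described in the second edit above.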

4 answers
爷、活的狠高调
#2 · 2020-05-30 07:28

Following up on the answer by @Steve, if you want to use a different separator (e.g., a comma), you can specify it with -F. This is useful when you want the content of each line up to the first comma, such as when reading the value of the first field in a CSV file.

$ awk -F "," '{print $1}' text1.txt > text2.txt
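
A quick sketch of that command on made-up data (the three CSV rows are hypothetical, not from the question). The last row is there to show a limitation: -F splits on every comma and knows nothing about CSV quoting.

```shell
# Hypothetical CSV rows; the third has a comma inside a quoted field.
csv=$(mktemp)
printf '%s\n' 'alice,30,london' 'bob,25,paris' '"Smith, J",41,oslo' > "$csv"

# -F ',' makes the comma the field separator; $1 is everything before the first comma.
first_fields=$(awk -F ',' '{print $1}' "$csv")

printf '%s\n' "$first_fields"
rm -f "$csv"
```

This prints alice, bob and "Smith (with the opening quote), so for files with quoted fields a real CSV parser is probably the safer choice.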
"骚年 ilove
#3 · 2020-05-30 07:38

I use egrep a lot to help "colorize" log lines, so I'm always looking for a new twist on regex. For me, the above works better by adding a \W like this:

$ egrep --color '^\S*\W|bag' /tmp/barf -o
foo
bag
hello
bag
keepthis
(etc.)

Problem is, my log files almost always are time-stamped, so I added a line to the example file:

2013-06-11 date stamped line

and then it doesn't work so well. So I reverted to my previous regex:

egrep --color '^\w*\b|bag' /tmp/barf

but the non-date-stamped lines revealed problems with that. It is hard to see this without colorization...
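
One way to see how the choice of character class affects the date-stamped line, without relying on colour, is to print just the match with -o. This is only a sketch: it assumes GNU grep (\w, \b and \S are GNU extensions) and leaves out the |bag alternative.

```shell
# Assumes GNU grep: \w, \b and \S are GNU extensions, not POSIX.
line='2013-06-11 date stamped line'

# \w does not match '-', so the match stops after the year.
word_match=$(printf '%s\n' "$line" | grep -oE '^\w*\b')

# \S matches any non-whitespace character, so the whole date is kept.
nonspace_match=$(printf '%s\n' "$line" | grep -oE '^\S+')

printf '%s\n%s\n' "$word_match" "$nonspace_match"
```

The first pattern yields 2013 and the second yields 2013-06-11.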

Anthone
#4 · 2020-05-30 07:41

I realize this has long since been answered with the grep solution, but for future generations I'd like to note that there are at least two other solutions for this particular situation, both of which ran faster than grep in the benchmarks below.

Since you are not doing any complex text pattern matching, just taking the first column delimited by a space, you can use some of the utilities which are column-based, such as awk or cut.

Using awk

$ awk '{print $1}' text1.txt > text2.txt

Using cut

$ cut -f1 -d' ' text1.txt > text2.txt
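
The two commands agree on the question's data, but they are not identical in general. The sketch below uses made-up lines (a tab-separated one and one with a leading space) to show where they differ: awk splits on runs of spaces and tabs and ignores leading blanks, while cut -d' ' splits on each single space only.

```shell
# Hypothetical inputs, not from the question.

# Tab-separated line: awk treats the tab as a separator, cut -d' ' does not.
awk_tab=$(printf 'keep\tdrop\n' | awk '{print $1}')
cut_tab=$(printf 'keep\tdrop\n' | cut -f1 -d' ')

# Leading space: awk skips it, cut sees an empty first field.
awk_lead=$(printf ' keep drop\n' | awk '{print $1}')
cut_lead=$(printf ' keep drop\n' | cut -f1 -d' ')

printf 'awk: [%s] [%s]\ncut: [%s] [%s]\n' "$awk_tab" "$awk_lead" "$cut_tab" "$cut_lead"
```

Here awk returns keep in both cases, while cut returns the whole tab-separated line (a line without the delimiter is passed through unchanged) and an empty string for the line with the leading space.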

Benchmarks on a ~1.1MB file

$ time grep -o '^[^ ]*' text1.txt > text2.txt

real    0m0.064s
user    0m0.062s
sys     0m0.001s
$ time awk '{print $1}' text1.txt > text2.txt

real    0m0.021s
user    0m0.017s
sys     0m0.004s
$ time cut -f1 -d' ' text1.txt > text2.txt

real    0m0.007s
user    0m0.004s
sys     0m0.003s

In this run, awk is about 3x faster than grep, and cut is about 3x faster than awk. The absolute difference is negligible for a single run on a small file, but if you're writing a script for re-use, or doing this often on large files, you might appreciate the extra efficiency.
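
If you want to repeat the comparison yourself, a sketch along these lines may help. The generated file and its line format are assumptions (the original test file is not available), and the checksums only confirm that the three commands produce the same output on it.

```shell
# Build a hypothetical test file: 50,000 lines of the form "keyN junk more junk".
big=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 50000; i++) print "key" i " junk more junk" }' > "$big"

# Checksum each tool's output so the results can be compared without storing them.
grep_sum=$(grep -o '^[^ ]*' "$big" | cksum)
awk_sum=$(awk '{print $1}' "$big" | cksum)
cut_sum=$(cut -f1 -d' ' "$big" | cksum)

printf 'grep: %s\nawk:  %s\ncut:  %s\n' "$grep_sum" "$awk_sum" "$cut_sum"
rm -f "$big"
```

Prefixing each command with time, as in the benchmark above, gives the timings. Expect the ratios to vary with the tool versions and the machine.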

劳资没心,怎么记你
#5 · 2020-05-30 07:50

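You are putting the quantifier * in the wrong place: it has to follow the expression it repeats.

Try this instead:

grep -o '^[^[:space:]]*' text1.txt > text2.txt

or, more briefly, with GNU grep:

grep -o '^\S*' text1.txt > text2.txt

\S matches any non-whitespace character (it is a GNU extension; inside a bracket expression use [:space:], because \s is not special there). The anchor ^ matches at the beginning of the line, and -o makes grep print only the matched part instead of the whole line.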
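
Two details are easy to trip over here, so a small sketch may be useful. It assumes GNU grep and uses one sample line from the question. It shows that a leading * in a basic regex is a literal asterisk, and that \s inside a bracket expression is not a whitespace shorthand.

```shell
line='keepthis removethis'

# A leading * in a basic regex is a literal asterisk, so no line matches.
# grep -c counts matching lines; || true hides the non-zero exit status.
star_match=$(printf '%s\n' "$line" | grep -c '*[[:space:]]' || true)

# Inside [...], \s is not "whitespace": the set excludes only '\' and 's'.
bracket_s=$(printf '%s\n' "$line" | grep -o '^[^\s]*')

# The POSIX class works inside brackets; -o prints only the matched part.
posix_class=$(printf '%s\n' "$line" | grep -o '^[^[:space:]]*')

printf '%s\n%s\n%s\n' "$star_match" "$bracket_s" "$posix_class"
```

The results are 0, keepthi (the match stops at the first s, not at the space) and keepthis. Without -o, grep prints every matching line in full, and since ^[^ ]* can match the empty string, that would be every line of the file.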
