I need to check, from a bash script, whether the contents of one file appear inside another file, i.e. whether a given multiline pattern occurs in an input file.
Return value:
I want an exit status like grep's: 0 if any match was found, 1 if no match was found.
Pattern:
- multiline,
- order of lines is important (treated as a single block of lines),
- includes characters such as numbers, letters, ?, &, *, # etc.,
Explanation
Only the following examples should produce matches:
pattern file1 file2 file3 file4
222 111 111 222 222
333 222 222 333 333
333 333 444
444
The following shouldn't:
pattern file1 file2 file3 file4 file5 file6 file7
222 111 111 333 *222 111 111 222
333 *222 222 222 *333 222 222
333 333* 444 111 333
444 333 333
Here's my script:
#!/bin/bash
function writeToFile {
if [ -w "$1" ] ; then
echo "$2" >> "$1"
else
echo -e "$2" | sudo tee -a "$1" > /dev/null
fi
}
function writeOnceToFile {
pcregrep --color -M "$2" "$1"
#echo $?
if [ $? -eq 0 ]; then
echo This file contains text that was added previously
else
writeToFile "$1" "$2"
fi
}
file=file.txt
#1?1
#2?2
#3?3
#4?4
pattern=`cat pattern.txt`
#2?2
#3?3
writeOnceToFile "$file" "$pattern"
I can use the grep command for every line of the pattern, but it fails with this example:
file.txt
#1?1
#2?2
#=== added line
#3?3
#4?4
pattern.txt
#2?2
#3?3
or even if you swap lines 2 and 3:
file=file.txt
#1?1
#3?3
#2?2
#4?4
It returns 0 when it shouldn't.
How can I fix it? Note that I prefer to use natively installed programs (ideally without pcregrep). Maybe sed or awk can solve this problem?
I have a working version using perl.
I thought I had it working with GNU awk, but I didn't. RS=empty string splits on blank lines. See the edit history for the broken awk version.
How can I search for a multiline pattern in a file? shows how to use pcregrep, but I can't see a way to get it to work when the pattern to search may contain regex special characters. -F fixed-string mode doesn't usefully work with multi-line mode: it still treats the pattern as a set of lines to be matched separately. (Not as a multi-line fixed-string to be matched.) I see you were already using pcregrep in your attempt.
BTW, I think you have a bug in your code in the non-sudo case:
function writeToFile {
if [ -w "$1" ] ; then
"$2" >> "$1" # probably you mean echo "$2" >> "$1"
else
echo -e "$2" | sudo tee -a "$1" > /dev/null
fi
}
Anyway, attempts at using line-based tools have met with failure, so it's time to pull out a more serious programming language that doesn't force the newline convention on us. Just read both files into variables, and use a non-regex search:
#!/usr/bin/perl -w
# multi_line_match.pl pattern_file target_file
# exit(0) if a match is found, else exit(1)
#use IO::File;
use File::Slurp;
my $pat = read_file($ARGV[0]);
my $target = read_file($ARGV[1]);
if ((substr($target, 0, length($pat)) eq $pat) or index($target, "\n".$pat) >= 0) {
exit(0);
}
exit(1);
See What is the best way to slurp a file into a string in Perl? to avoid the dependency on File::Slurp (which isn't part of the standard perl distro, or a default Ubuntu 15.04 system). I went for File::Slurp partly for readability of what the program is doing, for non-perl-geeks, compared to:
my $contents = do { local(@ARGV, $/) = $file; <> };
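For anyone who would rather not depend on perl, the same read-both-files-and-search-literally idea can be sketched in plain POSIX shell. This is only a sketch under a few assumptions: the function name contains_block is made up for illustration, both files are small enough to hold in shell variables, neither contains NUL bytes, and trailing blank lines in either file are ignored (command substitution strips them).

```shell
# contains_block PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if the lines of PATTERN_FILE occur in TARGET_FILE as one
# contiguous block of whole lines, 1 if not, 2 if a file cannot be read.
contains_block() {
    nl='
'
    # Command substitution strips trailing newlines; one is re-added below.
    pat=$(cat "$1") || return 2
    target=$(cat "$2") || return 2
    # Quoted text inside a case pattern is matched literally, so ?, *, #
    # and similar characters from the pattern file are not interpreted.
    # The surrounding newlines anchor the match to whole lines.
    case "$nl$target$nl" in
        *"$nl$pat$nl"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

It would be called as contains_block pattern.txt file.txt, and the exit status can be used in an if, the same way as the status of grep.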
I was working on avoiding reading the full file into memory, with an idea from http://www.perlmonks.org/?node_id=98208. I think non-matching cases would usually still read the whole file at once. Also, the logic was pretty complex for handling a match at the front of the file, and I didn't want to spend a long time testing to make sure it was correct for all cases. Here's what I had before giving up:
#IO::File->input_record_separator($pat);
$/ = $pat; # pat must include a trailing newline if you want it to match one
my $fh = IO::File->new($ARGV[2], O_RDONLY)
or die 'Could not open file ', $ARGV[2], ": $!";
$tail = substr($fh->getline, -1); #fast forward to the first match
#print each occurrence in the file
#print IO::File->input_record_separator while $fh->getline;
#FIXME: something clever here to handle the case where $pat matches at the beginning of the file.
do {
# fixme: need to check defined($fh->getline)
if (($tail eq '\n') or ($tail = substr($fh->getline, -1))) {
exit(0); # if there's a 2nd line
}
} while($tail);
exit(1);
$fh->close;
Another idea was to filter patterns and files to be searched through tr '\n' '\r' or something, so they would all be single-lines. (\r being a likely safe choice that wouldn't collide with anything already in a file or a pattern.)
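A rough, untested-in-anger sketch of that tr idea, in case it is useful. It assumes that neither file already contains carriage returns or NUL bytes, that the pattern file ends with a newline, and that grep copes with one very long line (GNU grep does). The function name block_in_file_cr is made up for illustration.

```shell
# block_in_file_cr PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Exit status follows grep: 0 if the block is found, 1 if not.
block_in_file_cr() {
    cr=$(printf '\r')
    # Turn the pattern into a single CR-delimited string. The leading CR
    # means a match can only start at the beginning of a line.
    pat=$cr$(tr '\n' '\r' < "$1") || return 2
    # Give the target the same treatment (with a leading CR so a match at
    # the very start of the file still works), then do a fixed-string search.
    { printf '\r'; tr '\n' '\r' < "$2"; } | grep -qF -e "$pat"
}
```

Because the search uses -F, characters such as ?, * and # in the pattern are taken literally, and because everything is one line, the fixed string really is matched as a single block.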
I would just use diff for this task:
diff pattern <(grep -f file pattern)
Explanation
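What this does is check which lines of pattern are matched by something in file, and then compare that result to pattern itself. If the two are identical, every line of pattern was found in file, i.e. pattern is a subset of file! Note that this does not check the order of the lines, and that grep -f treats each line of file as a regular expression.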
Tests
seq 10 is part of seq 20! Let's check it:
$ diff <(seq 10) <(grep -f <(seq 20) <(seq 10))
$
seq 10 is not exactly inside seq 2 20 (1 is not in the second one):
$ diff -q <(seq 10) <(grep -f <(seq 2 20) <(seq 10))
Files /dev/fd/63 and /dev/fd/62 differ
I went through the problem again and I think awk can handle this better:
awk 'FNR==NR {a[FNR]=$0; next}
FNR==1 && NR>1 {for (i in a) len++}
{for (i=last; i<=len; i++) {
if (a[i]==$0)
{last=i; next}
} status=1}
END {print status+0}' file pattern
The idea is:
- Read the whole file file into memory, in an array a[line_number] = line.
- Count the elements in the array.
- Loop through the file pattern and check whether the current line occurs in file anywhere between the cursor position and the end of file. If it does, move the cursor to the position where it was found. If it does not, set the status to 1, that is, there is a line in pattern that did not occur in file after the previous match.
- Print the status, which will be 0 unless it was set to 1 at some point.
Test
They do match:
$ tail f p
==> f <==
222
333
555
==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
0
They don't:
$ tail f p
==> f <==
333
222
555
==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
1
With seq:
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 2 20) <(seq 10)
1
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 20) <(seq 10)
0