I am facing few problems while extracting block of lines from a file. consider following two files
File-1
1.20/abc/this_is_test_1
perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2
exec perl/RRP/RRP-1.30/JEDI/CommonReq/confAbvExp
perl/LRP/BaseLibs/close-MMM
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE")
this/or/that
File-2
exec 1.20/setup/testird
exec 1.20/sql/temp/Test3
exec 1.20/setup/testxyz
exec 1.20/sql/fondle_opr_sql_labels
exec 1.20/setup/testird
exec 1.20/sql/temp/NEWTest
exec 1.20/setup/testxyz
exec 1.20/sql/fondle_opr_sql_xfer
exec 1.20/setup/testird
exec 1.20/sql/set_sec_not_0
exec 1.20/setup/testpqr
exec 1.20/sql/sql_ba_statuses_on_mult
exec perl/RRP/SetupReq/testdef_ijk
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess1
exec perl/RRP/SetupReq/testdef_ijk
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2
exec perl/RRP/SetupReq/testdef_ijk
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess3
exec 1.20/setup/testird
exec 1.20/sql/sqlmenu_purr_labl
exec 1.20/sql/est_time_at_non_drp_plc
exec 1.20/sql/half_Brd_Supply_mix_single
exec 1.20/setup/testird
exec 1.20/sql/temp/Test
exec 1.20/setup/testird
exec 1.20/sql/temp/Test2
exec perl/LRP/SetupReq/testird_LRP("LRP")
exec perl/BaseLibs/launch_client("LRP")
exec perl/LRP/LRP-classic-4.14/churrip/chorSingle
exec perl/LRP/BaseLibs/setupLRPMMMTab
exec perl/LRP/BaseLibs/launchMMM
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE")
#PAUSE Expand Churrip tree view & open all nodes
exec perl/LRP/LRP-classic-4.14/Corrugator/multipleSeriesWeb
exec perl/BaseLibs/ShutApp("Self Destruction System")
exec perl/LRP/BaseLibs/close-MMM
exec 1.20/setup/testmiddle
exec 1.20/sql/collective_reads
exec 1.20/setup/testinit
exec 1.20/abc/this_is_test_1
exec 1.20/abc/this_is_test_1
exec perl/LRP/SetupReq/abcDEF
exec perl/BaseLibs/launch_client("sqlC","LRP")
exec perl/LRP/LRP-perl-4.20/fireTrigger
Now for every line in File-1 i want to extract relevant block of lines from File-2. A block in File-2 is defined as below
exec 1.20/setup/xxxxx
blah blah blah
blah blah blah
.
.
.
all lines till next setup line is found
for example
exec 1.20/setup/testinit
exec 1.20/abc/this_is_test_1
exec 1.20/abc/this_is_test_1
or
exec perl/LRP/SetupReq/xxxxx
blah blah blah
blah blah blah
.
.
.
all lines till next setup line is found
for example
exec perl/LRP/SetupReq/testird_LRP("LRP")
exec perl/BaseLibs/launch_client("LRP")
exec perl/LRP/LRP-classic-4.14/churrip/chorSingle
exec perl/LRP/BaseLibs/setupLRPMMMTab
exec perl/LRP/BaseLibs/launchMMM
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE")
#PAUSE Expand Churrip tree view & open all nodes
exec perl/LRP/LRP-classic-4.14/Corrugator/multipleSeriesWeb
exec perl/BaseLibs/ShutApp("Self Destruction System")
exec perl/LRP/BaseLibs/close-MMM
I have so far managed to extract relevant blocks from File-2 with help of following script
Shell Script
#set -x
FLBATCHLIST=$1
BATCHFILE=$2
TEMPDIR="/usr/tmp/tempBatchDir"
rm -rf $TEMPDIR/*
WORKFILE="$TEMPDIR/failedTestList.txt"
CPBATCHFILE="$TEMPDIR/orig.test"
TESTSETFILE="$TEMPDIR/testset.txt"
TEMPFILE="$TEMPDIR/temp.txt"
DIFFFILE="$TEMPDIR/diff.txt"
#Output
FAILEDBATCH="$TEMPDIR/FailedBatch.test"
LOGFILE="$TEMPDIR/log.txt"
createBatch ()
{
TESTNAME=$1
#First process the $CPBATCHFILE to not have any blank lines, leading and trailing whitespaces
# delete BOTH leading and trailing whitespace from each line and blank lines from file
sed -i 's/^[[:space:]]*//;s/[[:space:]]*$//g;/^$/d' $CPBATCHFILE
FOUND=0
STATUS=1
while [ $STATUS -ne "0" ]
do
if [ ! -s $CPBATCHFILE ]; then
echo "$CPBATCHFILE is empty" >> $LOGFILE
STATUS=0
fi
awk '/[Ss]etup.*[Tt]est/ || /perl\/[[:alpha:]]*\/[Ss]etup[rR]eq/{if(b) exit; else b=1}1' $CPBATCHFILE > $TESTSETFILE
grep -i "$TESTNAME$" $TESTSETFILE >> $LOGFILE 2>&1
if [ $? -eq "0" ]; then
echo "test found" >> $LOGFILE
cat $TESTSETFILE >> $FAILEDBATCH
FOUND=1
fi
TSTFLLINES=`wc -l < $TESTSETFILE`
CPBTCHLINES=`wc -l < $CPBATCHFILE`
DIFF=`expr $CPBTCHLINES - $TSTFLLINES`
tail -n $DIFF $CPBATCHFILE > $DIFFFILE
mv $DIFFFILE $CPBATCHFILE
done
if [ $FOUND -eq 0 ]; then
echo $TESTNAME > $TEMPDIR/test.txt
ABSTEST=$(echo $TESTNAME | sed 's/\\//g')
echo "FATAL ERROR: Test \"$ABSTEST\" not found in batch" | tee -a $LOGFILE
fi
}
####STARTS HERE####
mkdir -p $TEMPDIR
#cat $TEMPDIR/test.txt
#FLBATCHLIST="$TEMPDIR/test.txt"
# delete run, BOTH leading and trailing whitespace and blank lines from file
sed 's/^[eE][xX][eE][cC]//g;s/^[[:space:]]*//;s/[[:space:]]*$//g;/^$/d' $FLBATCHLIST > $WORKFILE
# escaping special characters like '\' and '.' in the path names for better grepping
sed -i 's/\([\/\.\"]\)/\\\1/g' $WORKFILE
for fltest in $(cat $WORKFILE)
do
echo $fltest >> $LOGFILE
cp $BATCHFILE $CPBATCHFILE
createBatch $fltest
done
sed -i 's/\//\\/g' $FAILEDBATCH
## Clean up
cp $FAILEDBATCH .
THe problem with this script is
It takes some time as it traverses File-2 for each line of File-1. I wanted to know if there is any better solution where i just have to traverse File-2 once.
The script does solve my problem but I am left with file which has duplicate blocks of lines in it. I wanted to know is there a way to remove the duplicate blocks of lines.
This is my output when i execute the script
exec 1.20\setup\testinit
exec 1.20\abc\this_is_test_1
exec 1.20\abc\this_is_test_1
exec perl\RRP\SetupReq\testdef_ijk
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess2
exec perl\RRP\SetupReq\testdef_ijk
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess1
exec perl\RRP\SetupReq\testdef_ijk
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess2
exec perl\RRP\SetupReq\testdef_ijk
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess3
exec perl\LRP\SetupReq\testird_LRP("LRP")
exec perl\BaseLibs\launch_client("LRP")
exec perl\LRP\LRP-classic-4.14\churrip\chorSingle
exec perl\LRP\BaseLibs\setupLRPMMMTab
exec perl\LRP\BaseLibs\launchMMM
exec perl\LRP\BaseLibs\launchLRPCHURRTA("TYRE")
#PAUSE Expand Churrip tree view & open all nodes
exec perl\LRP\LRP-classic-4.14\Corrugator\multipleSeriesWeb
exec perl\BaseLibs\ShutApp("Self Destruction System")
exec perl\LRP\BaseLibs\close-MMM
exec perl\LRP\SetupReq\testird_LRP("LRP")
exec perl\BaseLibs\launch_client("LRP")
exec perl\LRP\LRP-classic-4.14\churrip\chorSingle
exec perl\LRP\BaseLibs\setupLRPMMMTab
exec perl\LRP\BaseLibs\launchMMM
exec perl\LRP\BaseLibs\launchLRPCHURRTA("TYRE")
#PAUSE Expand Churrip tree view & open all nodes
exec perl\LRP\LRP-classic-4.14\Corrugator\multipleSeriesWeb
exec perl\BaseLibs\ShutApp("Self Destruction System")
exec perl\LRP\BaseLibs\close-MMM
I tried searching for my answers over net but wasn't able to find one specific to my needs.
Given File-1 and File-2 Here is what i expect my script to output (I have listed what output i expect for each line in FILE-1)
For line "1.20/abc/this_is_test_1" in FILE-1
Output
exec 1.20/setup/testinit
exec 1.20/abc/this_is_test_1
exec 1.20/abc/this_is_test_1
For line "perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2" in FILE-1
Output
exec perl/RRP/SetupReq/testdef_ijk
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2
For line "exec perl/RRP/RRP-1.30/JEDI/CommonReq/confAbvExp" in FILE-1
Output
do nothing as there is no line matching this is in FILE-2
For line "perl/LRP/BaseLibs/close-MMM" in FILE-1
Output
exec perl/LRP/SetupReq/testird_LRP("LRP")
exec perl/BaseLibs/launch_client("LRP")
exec perl/LRP/LRP-classic-4.14/churrip/chorSingle
exec perl/LRP/BaseLibs/setupLRPMMMTab
exec perl/LRP/BaseLibs/launchMMM
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE")
#PAUSE Expand Churrip tree view & open all nodes
exec perl/LRP/LRP-classic-4.14/Corrugator/multipleSeriesWeb
exec perl/BaseLibs/ShutApp("Self Destruction System")
exec perl/LRP/BaseLibs/close-MMM
For line "exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE")" in FILE-1
Output
Do nothing as it would generate the same black as line "perl/LRP/BaseLibs/close-MMM" in FILE-1 did
For Line "this/or/that" in FILE-1
Output
Do nothing as there is no line matching this is in FILE-2
SO my final output should be similiar (order of blocks doesn't matter) to
exec 1.20/setup/testinit
exec 1.20/abc/this_is_test_1
exec 1.20/abc/this_is_test_1
exec perl/RRP/SetupReq/testdef_ijk
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2
exec perl/LRP/SetupReq/testird_LRP("LRP")
exec perl/BaseLibs/launch_client("LRP")
exec perl/LRP/LRP-classic-4.14/churrip/chorSingle
exec perl/LRP/BaseLibs/setupLRPMMMTab
exec perl/LRP/BaseLibs/launchMMM
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE")
#PAUSE Expand Churrip tree view & open all nodes
exec perl/LRP/LRP-classic-4.14/Corrugator/multipleSeriesWeb
exec perl/BaseLibs/ShutApp("Self Destruction System")
exec perl/LRP/BaseLibs/close-MMM
It would be really great if anyone can give me some pointers on how to proceed. And yes i forgot to mention, this is not a homework question :-) .
Many thanks
Provided that line order does not matter, you can remove duplicates from a file this way, from th ecommand prompt:
To find which lines are present in both files, I used a perl script that created a hash (or associative array, if you will). Then i scanned through File A, added each line to the hash, using the line as the key, and setting the value to 1. Then i did the same for File A, but setting the value to 2, and if the key already existed, i added 2 instead. The result would be going through each file only once, and at the end i knew that if the key had a value of 1, it only existed in File A, if it had a value of 2, it existed only in File B, and if it had a value of 3, it existed in both.
Edit: I found some perl code laying around from a project, doing exactly what i described above. In this code, i was only after the differences, but it should be easy to modify it to your needs
Thanks @tripleee and @Jarmund for your suggestions. From your inputs I was finally able to figure out solution of my problem. I got hint from associative arrays to make unique key for each block, so here is what i did
take file-2 and convert each block into single line
awk '/[Ss]etup.[Tt]est/ || /perl/[[:alpha:]]/[Ss]etup[Rr]eq/{if(b) exit; else b=1}1' file-2 > $TESTSETFILE cat $TESTSETFILE | sed ':a;N;$!ba;s/\n//g;s/ //g' >> $SINGLELINEFILE
Now each line in this file is a unique entry
Maybe this solution is not best but it is lot faster than the previous one.
Here is my new script
Here is the output
The following assumes that the "setup" line is unique for each block. We use this line as a key to an associative array which keeps track of which blocks we have already printed.
The first line of script reads the first file into a variable called
regex
which collects the lines we want to match on from the first file (the idiomNR==FNR
means when the line number of the current file is equal to the line number of all files collected, that is, it is true only when we are reading the first file from the argument list). The rest of the script is fairly straightforward, I should hope.If the "setup" line is not necessarily unique, maybe you can use the value of
collected
as a key.This should (I hope) be robust against multiple lines from
File-1
matching the same block inFile-2
.I know I hinted at a
sed
solution in a comment but this turned out to be the sort of problem whereawk
felt more natural. Of course, it could be done in Perl or Python or what have you just as well.