Shell program code about: regexp and file handling

Posted 2019-09-08 06:13

Question:

I'm writing this little program in shell:

#!/bin/bash

#***************************************************************
# Synopsis:
# Read from an inputfile each line, which has the following format:
#
# llnnn nnnnnnnnnnnnllll STRING lnnnlll n nnnn nnnnnnnnn nnnnnnnnnnnnnnnnnnnn ll ll   
#
# where:
# n is a <positive int>
# l is a <char> (no special chars)
# the last set of ll ll  could be:
#   - NV 
#   - PV 
#
# Ex:
# AVO01  000060229651AVON FOOD OF ARKHAM C A  S060GER   0  1110  000000022  00031433680006534689  NV  PV
#
# The program should check, for each line of the file, the following:
# I) If the nnn of the llnnn field (beginning the line) is numeric,
#    that is, <int>
# II) If the line carries the NV flag (just one set of ll), then
#    copy that line to an outputfile, and add one to a counter.
# III) If the line carries the PV flag (just one set of ll), then
#     copy that line to an outputfile, and add one to a counter.
# 
# NOTICE: could be just one ll. Ex: [...] NV [...]
#                                   [...] PV [...] 
#         or both Ex: [...] NV PV [...] 
#
#
# Execution (after generating the executable):
# ./ inputfile outputfileNOM outputfilePGP
#***************************************************************


# Check the number of arguments that could be passed.
if [[ $# -ne 3 ]]; then
    # echo does not expand "\n" by default, so no trailing escape is used here.
    echo "Error...must be: myShellprogram <inputfile> <outputfileNOM> <outputfilePGP>" >&2
    exit 1
fi

#Inputfile: is in position 1 on the ARGS
inputfile=$1 
#OutputfileNOM: is in position 2 on the ARGS
outputfileNOM=$2
#OutputfilePGP: is in position 3 on the ARGS
outputfilePGP=$3

#Main variables. Change if needed. 
# Flags that could appear in the <inputfile>
#
# ATTENTION!!!: notice that there is a white space
# before the characters, this is important when using
# the regular expression in the conditional:
# if [[  $line =~ $NOM ]]; then [...] 
#
# If the white space is NOT there it would match things like:
# ABCNV ... which is wrong!!
NOM=" NV"
PGP=" PV"
#Counters of occurrences
countNOM=0;
countPGP=0;


#Check if the files exists and have the write/read permissions
if [[ -r $inputfile && -w $outputfileNOM && -w $outputfilePGP ]]; then
    #Read all the lines of the file.
    while read -r line  
        do
            code=${line:3:2} #Store the numeric part of the leading code: 2 chars at offset 3 (e.g. "01" in "AVO01")

            #Check if the code is numeric
            if [[ $code =~ ^[0-9]+$ ]] ; then

                #Check if the actual line has the NOM flag
                if [[  $line =~ $NOM ]]; then
                    echo "$line" >> "$outputfileNOM"
                    (( ++countNOM ))
                fi  

                #Check if the actual line has the PGP flag
                if [[  $line =~ $PGP ]]; then
                    echo "$line" >> "$outputfilePGP"
                    (( ++countPGP ))
                fi

            else
              echo "$code is not numeric" >&2
              exit 1

            fi      

        done < "$inputfile"

    echo "COUNT NOM $countNOM"
    echo "COUNT PGP $countPGP"
else
    echo "FILE: $inputfile does not exist or does not have read permissions"
    echo "FILE: $outputfileNOM does not exist or does not have write permissions"
    echo "FILE: $outputfilePGP does not exist or does not have write permissions"
fi  

I have some questions:

I) When I do:

 if [[ -r $inputfile && -w $outputfileNOM && -w $outputfilePGP ]]; then
 [...]
 else
     echo "FILE: $inputfile does not exist or does not have read permissions"
     echo "FILE: $outputfileNOM does not exist or does not have write permissions"
     echo "FILE: $outputfilePGP does not exist or does not have write permissions"
 fi

I would like the else branch to print only the message that applies. For example, if "$outputfileNOM" does not have write permission, print just that error. BUT I don't want to write a lot of if/else blocks, for example:

if [[ -r $inputfile ]]; then
[...]
if [[ -w $outputfileNOM ]]; then
[...]
else
  (one else for the READ permission, and another else for the WRITE)

Is there a way to do this without nesting, while keeping the code readable?

II) About the:

 if [[ -r $inputfile && -w $outputfileNOM && -w $outputfilePGP ]]

Is it OK to use the -x flag instead of -r or -w? I don't have a clear definition of what this means:

-x FILE
          FILE exists and execute (or search) permission is granted

III) Notice the ATTENTION label in my code. There are several possibilities, for example white space before the flag, after it, or on both sides. I'm trusting the input files to be consistent, but if the format changes, the script will break. What could I do in this case? Is there an elegant way to handle it (exceptions?)

Thank you very much!

Answer 1:

I've been bitten by the =~ operator before.

In principle I'd tell you to quote the argument (i.e. ... =~ "$NOM"), but starting with bash 3.2, quoting the right-hand side of =~ has a special effect. The bash documentation on this, which is rather wordy, says:

o Quoting the string argument to the [[ command's =~ (regexp) operator now forces string matching, as with the other pattern-matching operators.

and

E14) Why does quoting the pattern argument to the regular expression matching conditional operator (=~) cause regexp matching to stop working?

In versions of bash prior to bash-3.2, the effect of quoting the regular expression argument to the [[ command's =~ operator was not specified. The practical effect was that double-quoting the pattern argument required backslashes to quote special pattern characters, which interfered with the backslash processing performed by double-quoted word expansion and was inconsistent with how the == shell pattern matching operator treated quoted characters.

In bash-3.2, the shell was changed to internally quote characters in single- and double-quoted string arguments to the =~ operator, which suppresses the special meaning of the characters special to regular expression processing (`.', `[', `\', `(', `)', `*', `+', `?', `{', `|', `^', and `$') and forces them to be matched literally. This is consistent with how the `==' pattern matching operator treats quoted portions of its pattern argument.

Since the treatment of quoted string arguments was changed, several issues have arisen, chief among them the problem of white space in pattern arguments and the differing treatment of quoted strings between bash-3.1 and bash-3.2. Both problems may be solved by using a shell variable to hold the pattern. Since word splitting is not performed when expanding shell variables in all operands of the [[ command, this allows users to quote patterns as they wish when assigning the variable, then expand the values to a single string that may contain whitespace. The first problem may be solved by using backslashes or any other quoting mechanism to escape the white space in the patterns.
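To make the quoted passage concrete, here is a minimal sketch of the difference between an unquoted and a quoted pattern variable. The sample line and the pattern are made up for illustration, and the sketch assumes bash 3.2 or later with default compatibility settings.

```shell
#!/bin/bash
# Hypothetical sample values, not taken from the question's input file.
line='AVO01 data NV'
pattern='N.'          # as a regex: "N" followed by any single character

# Unquoted expansion: the value is interpreted as a regular expression.
if [[ $line =~ $pattern ]]; then unquoted=match; else unquoted=nomatch; fi

# Quoted expansion (bash 3.2+): the value is matched as the literal string "N.".
if [[ $line =~ "$pattern" ]]; then quoted=match; else quoted=nomatch; fi

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

The unquoted form matches because "N." matches "NV" as a regex. The quoted form does not, because the line contains no literal "N." sequence.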

You might consider something along the lines of NOM="[ ]NV". (Note that I've not tested this.)
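As a quick check of that suggestion, the sketch below runs the bracket-expression pattern against a few made-up lines. The helper name has_nom and the sample lines are assumptions for illustration only.

```shell
#!/bin/bash
NOM='[ ]NV'   # bracket expression holding one space, then the literal NV

# has_nom is a hypothetical helper name used only in this sketch.
has_nom() {
    [[ $1 =~ $NOM ]]
}

for sample in 'AVO01 0001 NV PV' 'AVO01 0001 ABCNV PV' 'AVO01 0001 NVX PV'; do
    if has_nom "$sample"; then
        echo "match:    $sample"
    else
        echo "no match: $sample"
    fi
done
```

The pattern rejects ABCNV, as intended, but it still accepts NVX because nothing constrains the character that follows NV.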



Answer 2:

Well, thanks to the people who helped me. With their suggestions, I will answer my own questions:

About:

I) Although this solution uses conditionals, it is very elegant:

#File error string
estr='ERROR: file %s does not exist or does not have %s permissions.\n'

#Check if the files exist and have the write/read permissions.
#Paths are quoted so names with spaces work; errors go to stderr with a non-zero exit status.
[ -r "$inputfile" ] || { printf "$estr" "<$inputfile>" "read" >&2; exit 1; }
[ -w "$outputfileNOM" ] || { printf "$estr" "<$outputfileNOM>" "write" >&2; exit 1; }
[ -w "$outputfilePGP" ] || { printf "$estr" "<$outputfilePGP>" "write" >&2; exit 1; }

Notice the ; after the exit!
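If the three near-identical lines bother you, one possible variation is to fold the check into a small function. This is only a sketch: require_file is a hypothetical name, and the demo uses a throwaway file from mktemp rather than the script's real arguments.

```shell
#!/bin/bash
# require_file is a hypothetical helper name, not part of the original script.
# Arguments: <test flag such as -r or -w> <path> <permission label for the message>
require_file() {
    local flag=$1 path=$2 label=$3
    if ! test "$flag" "$path"; then
        printf 'ERROR: file <%s> does not exist or does not have %s permissions.\n' \
            "$path" "$label" >&2
        return 1
    fi
}

# Demo on a throwaway file; a real script would pass "$inputfile" and so on.
demo=$(mktemp)
require_file -r "$demo" read && echo "readable: $demo"
require_file -r "$demo.missing" read || echo "check failed as expected"
rm -f "$demo"
```

In the script itself this would read something like `require_file -r "$inputfile" read || exit 1`. Keep in mind that -w is false for a file that does not exist yet, so the output files have to be created beforehand (for example with touch) for the write checks to pass.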

II) From the manual of chmod:

The letters rwxXst select file mode bits for the affected users: read (r), write (w), execute (or search for directories) (x) ...

And from Wikipedia (Filesystem Permissions):

The read permission, which grants the ability to read a file. When set for a directory, this permission grants the ability to read the names of files in the directory (but not to find out any further information about them such as contents, file type, size, ownership, permissions, etc.)

The write permission, which grants the ability to modify a file. When set for a directory, this permission grants the ability to modify entries in the directory. This includes creating files, deleting files, and renaming files.

The execute permission, which grants the ability to execute a file. This permission must be set for executable binaries (for example, a compiled C++ program) or shell scripts (for example, a Perl program) in order to allow the operating system to run them. When set for a directory, this permission grants the ability to traverse its tree in order to access files or subdirectories, but not see the content of files inside the directory (unless read is set).
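To see what -x reports in practice, here is a small sketch that toggles the execute bit on a temporary file and also tests a directory. All paths come from mktemp and are made up for the demo; the behavior described assumes a typical Linux filesystem.

```shell
#!/bin/bash
workdir=$(mktemp -d)
script="$workdir/hello.sh"
printf '#!/bin/bash\necho hello\n' > "$script"

chmod 644 "$script"      # rw-r--r--: readable and writable, no execute bit
if [[ -x $script ]]; then before=yes; else before=no; fi

chmod 755 "$script"      # rwxr-xr-x: execute bit set
if [[ -x $script ]]; then after=yes; else after=no; fi

# For a directory, -x means search permission (the ability to traverse it).
if [[ -x $workdir ]]; then dir_search=yes; else dir_search=no; fi

echo "executable before chmod 755: $before"
echo "executable after chmod 755:  $after"
echo "directory searchable:        $dir_search"
rm -rf "$workdir"
```

So -x is not a substitute for -r or -w on the input and output files: it answers a different question, namely whether the file can be run or the directory entered.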

III) Thanks to @dmckee for the link and to the turtle.

# ATTENTION!!!: notice the \< and \> surrounding
# the characters, this is important when using
# the regular expression in the conditional:
# if [[  $line =~ $NOM ]]; then [...]
#
# If those characters are NOT there it would match things like:
# ABCNV ... which is wrong!!
# They (the \< and \>) indicate that the 'NV' can't be 
# contained in another word.
NOM='\<NV\>'
PGP='\<PV\>'
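One caveat: \< and \> are GNU regex extensions rather than part of POSIX extended regular expressions, so they depend on the regex library bash is linked against. Below is a sketch of an alternative that sticks to POSIX constructs. The pattern values and the helper name classify are assumptions for illustration, not the original script's code.

```shell
#!/bin/bash
# Hypothetical portable variants of the NOM/PGP patterns: the flag must be
# preceded by start-of-line or whitespace and followed by whitespace or end-of-line.
NOM='(^|[[:space:]])NV([[:space:]]|$)'
PGP='(^|[[:space:]])PV([[:space:]]|$)'

# classify is a hypothetical helper: it prints the flags found on a line.
classify() {
    local line=$1 flags=''
    [[ $line =~ $NOM ]] && flags+='NV '
    [[ $line =~ $PGP ]] && flags+='PV '
    echo "${flags% }"
}

for sample in 'AVO01 x  NV  PV' 'AVO01 x ABCNV PVX' 'AVO01 x PV'; do
    echo "[$(classify "$sample")] <- $sample"
done
```

Because the patterns use [[:space:]], they tolerate tabs or runs of several spaces around the flags. They still match a standalone NV or PV anywhere in the line, including inside the free-text STRING field.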


Tags: linux shell