Creating CSV of information extracted from filename

Posted 2019-06-12 02:03

I have a little script that lists the paths of all files in a directory and its subdirectories, and parses each path with a Perl regex.

#!/bin/sh
find * -type f | while read j; do
echo $j | perl -n -e '/\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/ && print "\"0\";\"$1$2$3\";\"$4\";\"$5\";$fl\""' >> bss.csv
echo | readlink -f -n "$j" >>bss.csv
echo \">>bss.csv
done

Output:

"0";"13957";"4121113";"2";"/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"

I am using the readlink from GNU coreutils: -n suppresses newline at the end, -f performs canonicalization by recursively following symlinks on the path.
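As a small illustration of what readlink -f does, here is a sketch that builds a symlink in a scratch directory and resolves it. It assumes GNU coreutils readlink and mktemp are available; the file names target.txt and link.txt are made up for the example. Note that command substitution strips a trailing newline anyway, so -n only matters when the output is appended directly to a file, as in the script above.

```shell
#!/bin/sh
# Create a scratch directory and take its physical path, so the comparison
# below is not confused by a symlinked temp location.
tmp=$(mktemp -d)
tmp=$(cd "$tmp" && pwd -P)

# A regular file and a relative symlink pointing at it (hypothetical names).
touch "$tmp/target.txt"
ln -s target.txt "$tmp/link.txt"

# -f follows the symlink and canonicalizes the path; -n omits the trailing newline.
resolved=$(readlink -f -n "$tmp/link.txt")
echo "$resolved"

rm -rf "$tmp"
```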

The problem is that when an input path does not match the regex, I still get a line containing only the file path.

How can I add a condition so that the path is printed only if the regex matched, and nothing is printed otherwise? I have tried various combinations, but none of them worked properly.

Tags: perl shell
2 answers
冷血范
Reply #2 · 2019-06-12 02:39

Description of solution

In Perl, use if (/…/) {…} else {…} instead of /…/ && …. That way you can run print when the match succeeds and some other code when it does not.

If that is not the problem, and you only want to get rid of the stray readlink output and closing quote, you can call readlink from Perl using backticks.

Resulting code

I turned everything into a single Perl program. It uses File::Find instead of the find command, and Cwd::realpath() instead of readlink -f from GNU coreutils to find the canonical path of each file. I assumed the $fl at the end of the print in Perl is a leftover and ignored it. If you still want to use readlink -f, feel free to change Cwd::realpath($_) to `readlink -f -n '$_'` (including the backticks!), but then it will not work for filenames containing a single quote.

Call this script as ./script-name starting-directory > bss.csv. If the script lives in the directory you are examining, it will appear in the output too, along with bss.csv.

#!/usr/bin/perl
# Usage: ./$0 [<starting-directory>...]
use strict;
use warnings;
use File::Find;
use Cwd;
no warnings 'File::Find';

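sub handleFile {
    return if not -f;    # skip anything that is not a regular file
    if ($File::Find::name =~ /\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/) {
        # Localize both separators so they do not leak into the STDERR print below.
        local $, = ';';
        local $\ = "\n";
        # $5 is undefined when the optional "_<digit>" suffix is absent.
        print map "\"$_\"", 0, $1.$2.$3, $4, $5 // '', Cwd::realpath($_);
    } else {
        print STDERR "File $File::Find::name did not match\n";
    }
}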

find(\&handleFile, @ARGV ? @ARGV : '.');

For reference, I also enclose a polished version of the original program. It calls readlink from Perl as suggested above and makes real use of Perl's -n option, avoiding the while read loop.

#!/bin/sh
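# -l chomps each path and appends a newline to every print.
# Backticks are not executed inside qq{}, so readlink is a separate list item.
find . -type f | perl -l -n -e 'm{/(\d{2})/(\d{2})/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?} && print qq{"0";"$1$2$3";"$4";"$5";"}, `readlink -f -n '\''$_'\''`, qq{"}' > bss.csv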

Other remarks to the original code

  • The echo | before the readlink does nothing and should be removed. readlink does not read its stdin.
  • Where does $fl at the end of print in Perl come from? I assume it is a leftover.
  • Use of generic quotes like qq{} and thoughtful use of delimiters (e.g. in regex matching and other quote-like operators) can save you from quoting hell. I already used this tip above: /…/ became m{…} and "…" became qq{…}. Thx, Slade! See the perlop manpage for more info.
SAY GOODBYE
Reply #3 · 2019-06-12 02:56

If I understand you, you want to capture the following parts of the filename:

/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg
                           ~~ ~~ ~                ~~~ ~~~~~~~ ~
                           1  2  3                4   5       6

But your perl regex doesn't do that. Let's break it apart for better understanding.

/\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/

Sliced into pieces, this would be...

  • \/(\d{2}) - a slash then two digits (with the digits captured)
  • \/(\d{2}) - another slash and two digits
  • \/(\d+) - one more slash and one or more digits
  • .*- - any run of characters until the final hyphen in the input string
  • ([a-zA-Z]+) - one or more alpha characters
  • (?:_(\d{1}))? - an optional underscore followed by a single digit; the outer (?:...) group is non-capturing, but the inner (\d{1}) still captures the digit as $5 (\d{1} is just a verbose way of writing \d)

If you step through your filename, you'll see that there is nothing here to handle the second-to-last string of digits (4121113): after the final hyphen the regex expects letters, not digits.

I'd do this using simpler tools. Sed, for example:

[ghoti@pc ~]$ s="/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"
[ghoti@pc ~]$ echo "$s" | sed -rne 's/.*/"&"/;h;s:.*/([0-9]{2})/([0-9]{2})/([0-9]+)[^[a-zA-Z]]*[^-]+-([0-9]+)(_([0-9]+))?.*:"0";"\1\2\3";"\4";"\6":;G;s/\n/;/;p'
"0";"13957";"4121113";"2";"/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"
[ghoti@pc ~]$ 

I'll break up the sed script for easier reading:

  • s/.*/"&"/; - Put quotes around the filename.
  • h; - Store the filename in Sed's "hold" space, for future use...
  • s: - Start the big substitution...
    • .*/([0-9]{2})/([0-9]{2})/([0-9]+)[^[a-zA-Z]]*[^-]+-([0-9]+)(_([0-9]+))?.* - This is the pattern we want to match for substitution. Similar to what you did in Perl, obviously, but using ERE instead of PCRE.
    • :"0";"\1\2\3";"\4";"\6":; - The replacement pattern, with each backreference (\1, \2, ...) being replaced by the text matched by the corresponding parenthesized group of the RE. Note that \5 is skipped in the replacement string, as that subexpression is only used for grouping.
  • G; - Append the "hold" space to the pattern space
  • s/\n/;/; - and remove the newline between them.
  • p - Print the result.

Note that this solution, as is, assumes that all input lines match the pattern you're looking for. If a line does not match, the substitution is skipped and the output is just the quoted path twice, joined by a semicolon, so you should add a check to the script that skips non-matching lines.
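One possible way to add that check, as a sketch: use sed's t command to continue only when the substitution succeeded, and delete the line otherwise. The function name to_csv and the second sample path are made up, the middle of the pattern is simplified to [^-]*- (an assumption, not the pattern used above), and -E is used as the portable spelling of GNU sed's -r.

```shell
#!/bin/sh
# Steps, one -e per command:
#   h      save the raw path in the hold space
#   s      build the first four fields; fails on non-matching lines
#   t ok   jump to :ok only if the substitution succeeded
#   d      otherwise drop the line and start the next cycle
#   G, s   append the saved path and turn the newline into ;"path"
to_csv() {
    sed -En \
        -e 'h' \
        -e 's:.*/([0-9]{2})/([0-9]{2})/([0-9]+)[^-]*-([0-9]+)(_([0-9]+))?.*:"0";"\1\2\3";"\4";"\6":' \
        -e 't ok' \
        -e 'd' \
        -e ':ok' \
        -e 'G' \
        -e 's/\n(.*)/;"\1"/' \
        -e 'p'
}

printf '%s\n' \
    '/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg' \
    'notes/readme.txt' | to_csv
```

Only the first path produces a row; the second line is dropped silently.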
