bash script to navigate directory substructure and

Posted 2019-09-07 10:09

I tried this:

for dir in /home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2007/02/*/
    for f in *.xml ; do
        echo $f | grep -q '_output\.xml$' && continue # skip output files
        g="$(basename $f .xml)_output.xml"
        java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
    done
done

which is based on the answer to this question, but that didn't work.

I have a folder structure such that within the directory NYTimesCorpus there is a directory 2007, and within that a directory 01 and also 02, 03, and so on...

Then within 01 there are again 01, 02, 03, ...

In each of these terminal directories there are many .xml files to which I want to apply the script:

for f in *.xml ; do
    echo $f | grep -q '_output\.xml$' && continue # skip output files
    g="$(basename $f .xml)_output.xml"
    java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
done

But there are so many different directories that running it within each directory is a form of rare torture. Apart from 2007 I also have 2006 and 2005, so ideally I would like to run it once and have the program navigate that structure on its own.

My attempts thus far have not been successful; perhaps one among you would know how to achieve this?

Thank you for your consideration.

UPDATE

textFile=./scrypt.sh
outputFormat=inlineXML
Loading classifier from /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz ... done [2.2 sec].
CRFClassifier tagged 71 words in 5 documents at 959.46 words per second.
CRFClassifier invoked on Sun Apr 12 19:33:34 HKT 2015 with arguments:
   -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile ./scrypt.sh -outputFormat inlineXML
    loadClassifier=/home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz

2 Answers
[This account has been banned]
#2 · 2019-09-07 10:15

I would use find since it works recursively:

find /path/to/xmls -type f ! -name '*_output.xml' -name '*.xml' -exec ./script.sh {} \;

For better readability I would save the actions that should be executed on each file to script.sh:

#!/bin/bash

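f="$1"                      # input file passed in by find
g="${f%.xml}_output.xml"    # strip only the trailing .xml so dots elsewhere in the path are kept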
java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile "$f" -outputFormat inlineXML > "$g"

and make it executable:

chmod +x script.sh
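
Before running the classifier over the whole corpus, it may help to check which files find would pick up. The following is only a dry-run sketch: it builds a throwaway directory tree (the file names a.xml and b.xml are made up), applies the same -name tests as the find command above, and prints the output name each input would get instead of calling java. It assumes the output name is formed by stripping the trailing .xml with ${1%.xml}.

```bash
#!/bin/bash
# Dry run: build a throwaway tree that mimics the year/month/day nesting.
# All file names below are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/2007/01/01" "$demo/2007/02/03"
touch "$demo/2007/01/01/a.xml" \
      "$demo/2007/01/01/a_output.xml" \
      "$demo/2007/02/03/b.xml"

# Same -name tests as the find command above, but each match only prints
# the output file name it would get; the classifier is not invoked.
planned=$(find "$demo" -type f -name '*.xml' ! -name '*_output.xml' \
    -exec sh -c 'printf "%s\n" "${1%.xml}_output.xml"' sh {} \; | sort)

printf '%s\n' "$planned"
```

Once the list looks right, the printing step can be swapped back for ./script.sh {}.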
霸刀☆藐视天下
#3 · 2019-09-07 10:34

find is a good solution. It sounds like all the xml files are at the same directory depth, so try this:

dir=/home/matthias/Workbench/SUTD/nytimes_corpus
# Quote "$dir" so the loop still works if the path ever contains spaces.
for f in "$dir"/NYTimesCorpus/*/*/*/*.xml; do
    [[ $f == *_output.xml ]] && continue # skip output files
    g="${f%.xml}_output.xml"             # output is written next to the input file
    java -mx600m \
         -cp "$dir/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar" \
         edu.stanford.nlp.ie.crf.CRFClassifier \
         -loadClassifier "$dir/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz" \
         -textFile "$f" \
         -outputFormat inlineXML > "$g"
done

The glob pattern $dir/NYTimesCorpus/*/*/*/*.xml specifies that the wanted xml files are exactly 3 levels below NYTimesCorpus. If that is the wrong depth, alter the number of */ in the pattern.
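
One detail worth noting is where the output files end up. Because f now holds a full path, ${f%.xml} keeps the directory part, whereas the basename form from the question drops it. A small sketch of the difference, using a made-up path:

```bash
#!/bin/bash
# Hypothetical input path, used only for illustration.
f="NYTimesCorpus/2007/01/02/story.xml"

# Suffix removal keeps the directory part, so the output lands next to the input.
keep_dir="${f%.xml}_output.xml"

# basename drops the directory part, so the output lands in the current directory.
drop_dir="$(basename "$f" .xml)_output.xml"

printf '%s\n' "$keep_dir" "$drop_dir"
```

With the basename form, files with the same name in different day directories would overwrite each other's output when the loop is run from one place.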

If the xml files can appear at varying depths, use find, or in bash use:

shopt -s globstar nullglob
for f in $dir/NYTimesCorpus/**/*.xml; do
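
For completeness, here is a sketch of the whole globstar loop as a dry run. It assumes bash 4 or later (globstar is not available in older versions), uses a throwaway tree with made-up file names at two different depths, and only collects the matching inputs in an array named inputs (a name chosen for this example) where the java command would normally go.

```bash
#!/bin/bash
# globstar makes ** match any number of directory levels (bash 4+);
# nullglob makes the loop run zero times if nothing matches.
shopt -s globstar nullglob

# Throwaway tree with XML files at different depths (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/NYTimesCorpus/2007/01/01" "$demo/NYTimesCorpus/2006"
touch "$demo/NYTimesCorpus/2007/01/01/a.xml" \
      "$demo/NYTimesCorpus/2007/01/01/a_output.xml" \
      "$demo/NYTimesCorpus/2006/b.xml"

inputs=()
for f in "$demo"/NYTimesCorpus/**/*.xml; do
    [[ $f == *_output.xml ]] && continue   # skip files produced by earlier runs
    inputs+=("$f")                         # the java command would go here
done

printf '%s\n' "${inputs[@]}"
```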

