I tried this:
for dir in /home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2007/02/*/
for f in *.xml ; do
echo $f | grep -q '_output\.xml$' && continue # skip output files
g="$(basename $f .xml)_output.xml"
java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
done
done
which is based on the answer to this question, but that didn't work.
I have a folder structure such that within the directory NYTimesCorpus there is a directory 2007, and within that a directory 01 and also 02, 03, and so on. Then within 01 there is again 01, 02, 03, ...
In each of these terminal directories there are many .xml files to which I want to apply the script:
for f in *.xml ; do
echo $f | grep -q '_output\.xml$' && continue # skip output files
g="$(basename $f .xml)_output.xml"
java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
done
But there are so many different directories that running it within each directory is a form of rare torture. Apart from 2007 I also have 2006 and 2005, so ideally what I would like to do is run it once and have the program just navigate that structure on its own.
My attempts thus far have not been successful; perhaps one among you would know how to achieve this?
Thank you for your consideration.
UPDATE
CRFClassifier invoked on Sun Apr 12 19:33:34 HKT 2015 with arguments:
-loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile ./scrypt.sh -outputFormat inlineXML
loadClassifier=/home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz
textFile=./scrypt.sh
outputFormat=inlineXML
Loading classifier from /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz ... done [2.2 sec].
CRFClassifier tagged 71 words in 5 documents at 959.46 words per second.
I would use find, since it works recursively. For better readability, I would save the actions that should be executed on each file to script.sh and make it executable.
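The approach described above could look roughly like the following. This is only a sketch, not the original commands: the function names make_script and run_on_tree are made up for illustration, the paths are copied from the question, and the output file is written next to each input file (using ${f%.xml} instead of basename) so that results do not all land in the current directory.

```shell
#!/usr/bin/env bash
# Sketch only: make_script and run_on_tree are illustrative names,
# not part of any tool. Paths are taken from the question.

# Write the per-file actions to an executable script (default: ./script.sh).
make_script() {
    local script=${1:-./script.sh}
    cat > "$script" <<'EOF'
#!/usr/bin/env bash
# Usage: script.sh FILE.xml
# Tags one file and writes FILE_output.xml next to it.
NER_DIR=/home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30
f=$1
java -mx600m -cp "$NER_DIR/stanford-ner-3.5.1.jar" \
     edu.stanford.nlp.ie.crf.CRFClassifier \
     -loadClassifier "$NER_DIR/classifiers/english.all.3class.distsim.crf.ser.gz" \
     -textFile "$f" -outputFormat inlineXML > "${f%.xml}_output.xml"
EOF
    chmod +x "$script"
}

# Run SCRIPT once for every .xml file below ROOT, at any depth,
# skipping files that are already *_output.xml.
run_on_tree() {
    local root=$1 script=$2
    find "$root" -type f -name '*.xml' ! -name '*_output.xml' \
         -exec "$script" {} \;
}
```

With these defined, something like `make_script ./script.sh && run_on_tree /home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus ./script.sh` would cover 2005, 2006 and 2007 in one run. Note that the script path given to -exec should contain a slash (for example ./script.sh), otherwise find looks it up in PATH.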
find is a good solution. It sounds like all the XML files are at the same directory depth, so you could also loop over a glob pattern. The glob pattern $dir/NYTimesCorpus/*/*/*/*.xml specifies that the wanted XML files are exactly 3 levels below NYTimesCorpus. If that is the wrong depth, then alter the number of */ in the pattern. If the XML files can appear at varying depths, use find, or in bash use a recursive ** glob with the globstar shell option enabled.
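Both glob variants might be sketched as below. This is an illustration under assumptions, not the original code: tag_file, tag_fixed_depth and tag_any_depth are hypothetical helper names, the java command is the one from the question, and the second function assumes bash 4 or later, where the globstar option exists.

```shell
#!/usr/bin/env bash
# Sketch only: all function names here are illustrative.
NER_DIR=/home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30

# Tag one file; the result is written next to the input as *_output.xml.
tag_file() {
    local f=$1
    java -mx600m -cp "$NER_DIR/stanford-ner-3.5.1.jar" \
         edu.stanford.nlp.ie.crf.CRFClassifier \
         -loadClassifier "$NER_DIR/classifiers/english.all.3class.distsim.crf.ser.gz" \
         -textFile "$f" -outputFormat inlineXML > "${f%.xml}_output.xml"
}

# Fixed depth: XML files exactly 3 levels below NYTimesCorpus (year/month/day).
tag_fixed_depth() {
    local dir=$1 f
    for f in "$dir"/NYTimesCorpus/*/*/*/*.xml; do
        [[ -e $f ]] || continue               # the pattern matched nothing
        [[ $f == *_output.xml ]] && continue  # skip output files
        tag_file "$f"
    done
}

# Varying depth: ** matches any number of directory levels once globstar is on.
# The body runs in a subshell so the shopt setting does not leak to the caller.
tag_any_depth() (
    dir=$1
    shopt -s globstar nullglob
    for f in "$dir"/NYTimesCorpus/**/*.xml; do
        [[ $f == *_output.xml ]] && continue  # skip output files
        tag_file "$f"
    done
)
```

Here the argument plays the role of $dir in the pattern above, i.e. the parent of NYTimesCorpus, so a call such as `tag_fixed_depth /home/matthias/Workbench/SUTD/nytimes_corpus` would process every year in one pass.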