I just started with Python 3 and ran into the following problem:
I downloaded a large number of PDFs from different journals for my thesis, but they are all named after their DOI rather than in the format “Author (Year) - Title”.
The documents are saved in different directories, according to the journal's name and volume, e.g.:
/Journal 1/
    /Vol. 1/
        file1.pdf
        file1.txt
        file2.pdf
        file2.txt
        filen.pdf
        filen.txt
    /Vol. 2/
        file1.pdf
        file1.txt
/Journal 2/
    ...
Because I have no idea how to read the contents of a PDF with Python, I wrote a very short bash script that converts the PDFs to plain TXT files. Each PDF and its TXT file share the same name and differ only in the file extension.
I would like to rename all of the PDF files. Luckily, the running text of each file contains a string that I could use. This variable string lies between two static strings:
"Cite this article as: " AUTHOR/YEAR/TITLE ", Journal name".
How do I make Python go into each directory, read the contents of the TXT/PDF, extract the variable string between the two fixed strings, and then rename the corresponding PDF file?
If anyone knows how to do this with Python 3, I would be very thankful.
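For illustration, the extraction step on its own could look roughly like the sketch below. It is only a minimal example and not a complete solution: the function name `extract_between` and the sample text are made up, and it assumes the two fixed strings appear literally in the converted text.

```python
import re

def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape makes sure the fixed strings are matched literally.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # re.DOTALL lets the match run across line breaks in the converted text.
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces into single spaces.
    return ' '.join(match.group(1).split())

# Hypothetical sample text, not taken from a real article.
sample = 'Page 1\nCite this article as: Doe (2010) An Example\nTitle, Journal name 5(2)'
print(extract_between(sample, 'Cite this article as: ', ', Journal name'))
# prints: Doe (2010) An Example Title
```

The non-greedy `(.*?)` stops at the first occurrence of the closing string, so a title that itself contains ", Journal name" would be cut short.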
Finally got it to work:
#__author__ = 'Telefonmann'
# -*- coding: utf-8 -*-
import os, re, shutil

# finds the substring between "To cite this article: " and "Journal of Quantitative Linguistics,"
# \s* also tolerates missing spaces like "Journal ofQuantitative ..."
r = re.compile(r'To\s*cite\s*this\s*article:\s*(.*?),\s*Journal\s*of\s*Quantitative\s*Linguistics\s*,')

for root, dirs, files in os.walk(os.getcwd()):
    for file in files:  # loops through directories and files
        if file.endswith('.txt'):  # only processes txt files
            full_path = os.path.join(root, file)
            # os.path builds the correct path under Windows as well as on other systems
            with open(full_path, 'r', encoding='utf-8') as f:
                content = f.read().replace('\n', '')  # remove newlines
            m = r.search(content)
            if m:
                full_title = m.group(1)
                print("full_title: {0}".format(full_title))
                full_title = (full_title.replace('<', '')  # removes/replaces forbidden characters in Windows file names
                              .replace('>', '')
                              .replace(':', ' -')
                              .replace('"', '')
                              .replace('/', '')
                              .replace('\\', '')
                              .replace('|', '')
                              .replace('?', '')
                              .replace('*', ''))
                # txt and pdf files differ only in their extension, so swap the extension
                # instead of replacing every 'txt' that occurs anywhere in the path
                pdf_name = os.path.splitext(full_path)[0] + '.pdf'
                print('File: ' + file)
                print('Full Path: ' + full_path)
                print('Full Title: ' + full_title)
                print('PDF Name: ' + pdf_name)
                print('....................................')
                # for troubleshooting
                dirname = os.path.dirname(pdf_name)
                new_path = os.path.join(dirname, "{0}.pdf".format(full_title))
                if os.path.exists(pdf_name):  # check the pdf that is about to be copied
                    print("all paths found")
                    shutil.copy(pdf_name, new_path)
                    # makes a copy of the pdf file with the new name in the respective directory
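
As a side note, the chain of replace calls above could also be written as a single translation table. This is just a sketch of an alternative, not what I actually ran: the names `FORBIDDEN` and `safe_filename` are made up, and trimming trailing spaces and dots is an extra assumption that the script above does not make.

```python
# Characters Windows does not allow in file names, mapped to their replacements
# (same choices as the replace chain: ':' becomes ' -', the rest are dropped).
FORBIDDEN = str.maketrans({
    '<': '', '>': '', ':': ' -', '"': '', '/': '',
    '\\': '', '|': '', '?': '', '*': '',
})

def safe_filename(title):
    """Return `title` with characters that Windows rejects removed or replaced."""
    cleaned = title.translate(FORBIDDEN)
    # Assumption beyond the original script: also trim surrounding whitespace
    # and trailing dots, which Windows does not accept at the end of a name.
    return cleaned.strip().rstrip(' .')

# Hypothetical title, only to show the effect.
print(safe_filename('Doe (2010): What is a "Title"?'))
# prints: Doe (2010) - What is a Title
```

With this helper, the cleaned title could be built with `full_title = safe_filename(m.group(1))` before the new path is assembled.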