I am trying to check the format of paper references. For example, for an academic dissertation, the format is:
author. dissertation title[D]. place where it is stored: organization that holds the copy, year in which the dissertation was published.
Obviously, every item except the year may contain punctuation. For example:
Smith. The paper name. The subtitle of paper[D]. United States: MIT, 2011
Often the place where it is stored or the year is missing, for example:
Smith. The paper name. The subtitle of paper[D]. US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT
I want to check this with code like the following:
import re
reObj = re.compile(
r'.*\[D\]\. \s* ((?P<PLACE>[^:]*):){0,1} \s* (?P<HOLDER>[^:]*) (?P<YEAR>,\s*(1|2)\d{3}){0,1}',
re.VERBOSE
)
txt = '''Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
Smith. The paper name. The subtitle of paper[D]. US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT'''.split('\n')
for i in txt:
    if reObj.search(i):
        if reObj.search(i).group('PLACE') is None:
            print('missing place')
        if reObj.search(i).group('YEAR') is None:
            print('missing year')
    else:
        print('bad formation')
But I found that YEAR is never captured. Printing the HOLDER group:
for i in txt:
    print(i)
    print(reObj.search(i).group('HOLDER'))
outputs
Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
MIT, 2011
Smith. The paper name. The subtitle of paper[D]. US, 2011
US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT
MIT
Printing the YEAR group instead:
for i in txt:
    print(i)
    print(reObj.search(i).group('YEAR'))
outputs
Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
None
Smith. The paper name. The subtitle of paper[D]. US, 2011
None
Smith. The paper name. The subtitle of paper[D]. US: MIT
None
So why does my named group fail, and how can I fix it? Thanks.
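For reference, here is a minimal sketch of what seems to be going on, judging by the HOLDER output above. The greedy `[^:]*` in HOLDER also accepts commas and digits, so it consumes ", 2011" itself, and the optional YEAR group then matches zero times. The variant below assumes the holder name never contains a comma (that assumption is mine, not part of the reference format), and it moves the comma outside the YEAR group so that YEAR holds only the digits. The names `pattern` and `missing_fields` are made up for this illustration.

```python
import re

# Assumption: HOLDER never contains a comma, so excluding ',' from its
# character class stops it from swallowing the trailing ", 2011".
pattern = re.compile(
    r'''.*\[D\]\.\s*
        ((?P<PLACE>[^:]*):)?         # optional "place:" part
        \s*(?P<HOLDER>[^:,]*)        # holder, stops at a colon or a comma
        (,\s*(?P<YEAR>[12]\d{3}))?   # optional ", year" part
    ''',
    re.VERBOSE,
)


def missing_fields(line):
    """Return the lowercase names of the optional groups that did not match."""
    m = pattern.search(line)
    if m is None:
        return ['bad format']
    return [name.lower() for name in ('PLACE', 'YEAR') if m.group(name) is None]


txt = [
    'Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011',
    'Smith. The paper name. The subtitle of paper[D]. US, 2011',
    'Smith. The paper name. The subtitle of paper[D]. US: MIT',
]
for line in txt:
    print(line, missing_fields(line))
```

If a holder name can contain commas, this sketch would not be enough. A non-greedy HOLDER combined with a `$` anchor at the end of the pattern might be a better fit in that case.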