I want to extract every number (digits only, for example '0012--22') or alphanumeric code (for example 'RF332') that follows one of the keywords in a given list of strings ("my_list" in the code). The keyword and the number are separated by a space or two. A sample input file is provided for reference.
This is the input file:
$cat input_file
some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.:
tramite 1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content
The script so far is below. It currently identifies only one element, which is {'tramite': '1234567'}
import re
import glob
import os
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']
#open the file as input
with open('garb.txt','r') as infile:
    res = dict()
    for line in infile:
        elems = re.split('(?::)?\s+', line)
        #print(elems)
        if len(elems) >= 2 :
            contains = False
            tmp = ''
            for elem in elems:
                if contains:
                    res.update({tmp : elem})
                    print(res)
                    contains = False
                    break
                if elem in my_list:
                    contains = True
                    tmp = elem
#print(res)
This is the expected output (sample):
{'Expedien N°': '18-0022995'}
{'Expedien N°': '18-0022995'}
{'Expedien': '1-21-212-16-26'}
{'Reference' : 'RE9833'}
etc etc.
You may use
(?<!\w)(your|escaped|keywords|here)\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)
See the regex demo.
Pattern details
- (?<!\w) - left word boundary. It is unambiguous, whereas the meaning of \b is context dependent: if the next char is a non-word char, \b requires a word char on the left, and that is not something users usually expect
- (your|escaped|keywords|here) - Capturing group 1: your list of keywords. It can be easily built using '|'.join(map(re.escape,my_list)) (note re.escape is necessary to escape special regex metacharacters like ., +, (, [, etc.)
- \W* - zero or more non-word chars (chars other than letters, digits or _)
- ([A-Z]*\d+(?:-+[A-Z]*\d+)*) - Capturing group 2:
  - [A-Z]* - zero or more uppercase ASCII letters
  - \d+ - 1 or more digits
  - (?:-+[A-Z]*\d+)* - 0 or more repetitions of
    - -+ - one or more hyphens
    - [A-Z]*\d+ - zero or more uppercase ASCII letters, 1 or more digits
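To make the two remarks about re.escape and \b concrete, here is a small sketch. The test strings and the '#ref' keyword are made up for illustration and do not come from the question.

```python
import re

my_list = ['Ref.:', 'Exp.No']

unescaped = '|'.join(my_list)                # "." still means "any character"
escaped = '|'.join(map(re.escape, my_list))  # "." now matches only a literal dot

# Without escaping, "ExpXNo" is wrongly accepted as the keyword "Exp.No".
unescaped_hit = bool(re.search(unescaped, 'ExpXNo 42'))
escaped_hit = bool(re.search(escaped, 'ExpXNo 42'))
escaped_literal_hit = bool(re.search(escaped, 'Exp.No 42'))
print(unescaped_hit, escaped_hit, escaped_literal_hit)  # True False True

# Hypothetical keyword that starts with a non-word char: \b needs a word
# char on its left here, so it fails after a space, while (?<!\w) succeeds.
boundary_hit = bool(re.search(r'\b#ref', 'see #ref 7'))
lookbehind_hit = bool(re.search(r'(?<!\w)#ref', 'see #ref 7'))
print(boundary_hit, lookbehind_hit)  # False True
```

All keywords in my_list start with a letter, so \b would behave the same for them; the lookbehind only matters once a keyword starts with punctuation.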
See the Python demo:
import re
s="""your_text_here"""
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']
rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format('|'.join(map(re.escape,my_list)))
print(re.findall(rx, s))
Output:
[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('tramite', '1234567'), ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]
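If the one-entry dicts from the question are needed rather than tuples, the findall result can be reshaped. This is a minimal sketch that assumes the sample input from the question is held in a string; the name pairs is only illustrative.

```python
import re

s = """some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.:
tramite 1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content"""

my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No',
           'Expedien N°', 'Exp.No', 'Expedien']
rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format(
    '|'.join(map(re.escape, my_list)))

# One dict per match, keyed by the keyword that preceded the number.
pairs = [{keyword: value} for keyword, value in re.findall(rx, s)]
for pair in pairs:
    print(pair)
```

A list of small dicts is used instead of one merged dict because 'Expedien N°' occurs twice and a single dict would keep only the last value.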
There really needs to be a way for users with fewer than 50 rep points to comment. I'm curious about this thread and want to build on it, but I didn't want to post a full-fledged answer, because the answer I'm giving only covers a finite set of situations and isn't flexible.
@Wiktor Stribiżew
Your solution misses the "Ref." portion of the output, based on your demo. It looks like he wants to skip "tramite".
@checkmate
You need to edit your desired output, because "UV1234" doesn't show up anywhere in the string you posted.
Anyway, I found a solution but am really hoping someone can improve upon this.
>>> import re
>>> string = '''some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.:
tramite 1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content'''
>>> re.findall('(?:(Expedien[\s]+N\S|Ref\.(?!:[\S\s]{,11}Expedien)|Reference|Expedien))[\S\s]*?([A-Z\-]*(?:[\d]+)[\S]*)', string)
[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('Ref.', '1234567'), ('Expedien N\xb0', '18-00777'), ('Expedien N\xb0', '18-0022995')]
The Flaws:
- To capture correctly, it relies in part on "Ref\.(?!:[\S\s]{,11}Expedien)"
- First of all, that "11" needs to be edited to account for other lengths of text that may be present, so it is not flexible
- Secondly, if "Ref." is instead followed by "Reference" as opposed to "Expedien", then the third "Ref." will be captured incorrectly
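One possible way around both flaws, offered only as a sketch: instead of a fixed-length lookahead, let the gap between keyword and number consume any character as long as no other keyword starts there (a "tempered" dot). It assumes that "tramite" is not treated as a keyword, which is this answer's reading of the question, and the names keywords, kw and rx are illustrative.

```python
import re

string = '''some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.:
tramite 1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content'''

# Assumption: "tramite" is deliberately left out of the keyword list.
keywords = ['Expedien N°', 'Ref.:', 'Reference', 'Expedien']
kw = '|'.join(map(re.escape, keywords))

# (?:(?!kw).)*? lazily skips text but refuses to step onto the start of
# another keyword, so a "Ref.:" followed by "Expedien" finds no number.
rx = r'(?<!\w)({kw})(?:(?!{kw}).)*?([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format(kw=kw)

result = re.findall(rx, string, flags=re.DOTALL)
print(result)
# [('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'),
#  ('Ref.:', '1234567'), ('Expedien N°', '18-00777'),
#  ('Expedien N°', '18-0022995')]
```

This is still not fully general: the gap has no length limit, so a keyword with no number of its own will pick up the next number in the text if no other keyword sits in between, and the number may start in the middle of a token.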