I want to extract every number (digits only, for example '0012--22') or alphanumeric code (for example 'RF332') that follows one of the keywords in a given list of strings ("my_list" in the code). The keyword and its value are separated by a space or two. A sample input file is provided for reference.
This is the input file:
$cat input_file
some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
something blah blah Ref.:
tramite 1234567
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content
The script so far is below. It currently identifies only one element: {'tramite': '1234567'}
import re
import glob
import os

my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']

# open the file as input
with open('garb.txt', 'r') as infile:
    res = dict()
    for line in infile:
        elems = re.split(r'(?::)?\s+', line)
        if len(elems) >= 2:
            contains = False
            tmp = ''
            for elem in elems:
                if contains:
                    res.update({tmp: elem})
                    contains = False
                if elem in my_list:
                    contains = True
                    tmp = elem
This is the expected output:
Sample output:
{'Expedien N°': '18-0022995'}
{'Expedien N°': '18-0022995'}
{'Expedien': '1-21-212-16-26'}
{'Reference' : 'RE9833'}
etc etc.
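You may use
(?<!\w)(KEYWORDS)\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)
where KEYWORDS stands for the alternation built from your list of keywords.
See the regex demo.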
Pattern details
- (?<!\w): left word boundary (unambiguous; the meaning of \b is context dependent: if the next char is a non-word char, it requires a word char on the left, and that is not something users usually expect)
- (...): Capturing group 1: your list of keywords, which can easily be built using '|'.join(map(re.escape,my_list)) (note that re.escape is necessary to escape special regex metacharacters like ., +, (, [, etc.)
- \W*: 0+ non-word chars (chars other than letters, digits or _)
- ([A-Z]*\d+(?:-+[A-Z]*\d+)*): Capturing group 2:
  - [A-Z]*: zero or more uppercase ASCII letters
  - \d+: 1 or more digits
  - (?:-+[A-Z]*\d+)*: 0 or more repetitions of
    - -+: one or more hyphens
    - [A-Z]*\d+: zero or more uppercase ASCII letters, 1 or more digits
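To make the two notes above concrete, here is a small sketch of why re.escape and the (?<!\w) lookbehind matter. The test strings and the '#ref' keyword are made up for illustration and do not come from the question's data.

```python
import re

# Without re.escape, the dot in 'Exp.No' matches any character.
unescaped_hit = re.search('Exp.No', 'ExpXNo 123')
escaped_hit = re.search(re.escape('Exp.No'), 'ExpXNo 123')
print(unescaped_hit is not None)  # True: '.' matched the 'X'
print(escaped_hit is not None)    # False: a literal dot is required

# '#ref' is a hypothetical keyword that starts with a non-word char.
kw = re.escape('#ref')

# \b before '#' only matches when a word char sits on the left.
b_after_space = re.findall(r'\b' + kw, 'see #ref 12')
b_after_letter = re.findall(r'\b' + kw, 'a#ref 12')

# (?<!\w) only requires that no word char sits on the left.
lb_after_space = re.findall(r'(?<!\w)' + kw, 'see #ref 12')
lb_after_letter = re.findall(r'(?<!\w)' + kw, 'a#ref 12')

print(b_after_space, b_after_letter)    # [] ['#ref']
print(lb_after_space, lb_after_letter)  # ['#ref'] []
```

All keywords in my_list start with a letter, so \b would behave the same there; the lookbehind just stays safe if a keyword starting with punctuation is added later.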
See the Python demo:
import re

my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']
# s must hold the whole text to search; here it is read from the file opened in the question's script
with open('garb.txt', 'r') as infile:
    s = infile.read()
rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format('|'.join(map(re.escape,my_list)))
print(re.findall(rx, s))
Output:
[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('tramite', '1234567'), ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]
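If you need the one-dict-per-match shape shown in the question's sample output, the tuples returned by re.findall can be converted directly. This is only a sketch: the text is pasted inline from the question's input file, and the names results and item are mine.

```python
import re

my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No',
           'Expedien N°', 'Exp.No', 'Expedien']

# Sample text copied from the question's input file.
s = """some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
something blah blah Ref.:
tramite 1234567
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content"""

rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format(
    '|'.join(map(re.escape, my_list)))

# One single-entry dict per match, mirroring the question's sample output.
results = [{key: value} for key, value in re.findall(rx, s)]
for item in results:
    print(item)
```

A list of small dicts is used rather than one big dict because the same keyword ('Expedien N°') occurs twice, and a single dict would keep only the last value.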
There really needs to be a way for users with fewer than 50 rep points to comment. I'm really curious about this thread and want to build on it, but I didn't want to post a full-fledged answer, because the one I'm giving only handles a finite set of situations and isn't flexible.
@Wiktor Stribiżew
Based on your demo, your solution misses the "Ref." portion of the output. It looks like he wants to skip "tramite".
In your desired output, you need to make an edit, because "UV1234" doesn't show up anywhere in the string you posted.
Anyway, I found a solution but am really hoping someone can improve upon this.
>>> import re
>>> string = '''some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
something blah blah Ref.:
tramite 1234567
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content'''
>>> re.findall('(?:(Expedien[\s]+N\S|Ref\.(?!:[\S\s]{,11}Expedien)|Reference|Expedien))[\S\s]*?([A-Z\-]*(?:[\d]+)[\S]*)', string)
[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('Ref.', '1234567'), ('Expedien N\xb0', '18-00777'), ('Expedien N\xb0', '18-0022995')]
The Flaws:
- To capture correctly, it relies in part on "Ref\.(?!:[\S\s]{,11}Expedien)"
- First of all, that "11" needs to be edited to account for other lengths of text that may be present, so it is not flexible
- Secondly, if it is instead followed by "Reference" as opposed to "Expedien", then the third "Ref." will be captured incorrectly
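A quick sketch of the first flaw, isolating just the lookahead part of the pattern. The two test strings are invented for illustration; they only differ in how much text sits between "Ref.:" and "Expedien".

```python
import re

# Only the "Ref." piece of the pattern above, with its hard-coded window.
ref_part = re.compile(r'Ref\.(?!:[\S\s]{,11}Expedien)')

# "Expedien" starts 5 chars after the colon: inside the window, so "Ref." is rejected.
near = ref_part.search('Ref.: see Expedien 18-1')

# "Expedien" starts 18 chars after the colon: outside the window, so "Ref." is accepted.
far = ref_part.search('Ref.: see the attached Expedien 18-1')

print(near)         # None
print(far.group())  # Ref.
```

So whether "Ref." is accepted depends purely on the distance to the next "Expedien", which is why the 11 has to be retuned for every new input.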