Question:

I want to extract every identifier that follows one of the keywords in a given list of strings ("my_list" in the code). An identifier is either made of numbers only (example '0012--22') or of numbers combined with some text (example 'RF332'). The identifier is separated from its keyword by a space or two. A sample input file is provided for reference.

This is the input file:

$cat input_file
some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.: 
tramite  1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content

The script so far is below. It currently identifies only one element, which is {'tramite': '1234567'}

import re
import glob
import os

my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']

#open the file as input
with open('garb.txt','r') as infile:
  res = dict()
  for line in infile:  
    elems = re.split('(?::)?\s+', line)
    #print(elems)
    if len(elems) >= 2 :
      contains = False
      tmp = ''
      for elem in elems:  
        if contains:
          res.update({tmp : elem})
          print(res)
          contains = False
          break
        if elem in my_list:
          contains = True
          tmp = elem
  #print(res)

This is the expected output:

Sample output:

{'Expedien N°': '18-0022995'}
{'Expedien N°': '18-0022995'}
{'Expedien': '1-21-212-16-26'}
{'Reference' : 'RE9833'}

etc etc.

Answer 1:

You may use

(?<!\w)(your|escaped|keywords|here)\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)

See the regex demo.

Pattern details

(?<!\w) - an unambiguous left word boundary: the match must not be preceded by a word char. \b is context dependent: if the next char is a non-word char, \b requires a word char on its left, and that is not something users usually expect
(your|escaped|keywords|here) - Capturing group 1: your list of keywords, it can be easily built using '|'.join(map(re.escape,my_list)) (note re.escape is necessary to escape special regex metacharacters like ., +, (, [, etc.)
\W* - 0+ non-word chars (chars other than letters, digits or _)
([A-Z]*\d+(?:-+[A-Z]*\d+)*) - Capturing group 2:
- [A-Z]* - zero or more uppercase ASCII letters
- \d+ - 1 or more digits
- (?:-+[A-Z]*\d+)* - 0 or more repetitions of
  - -+ - one or more hyphens
  - [A-Z]*\d+ - zero or more uppercase ASCII letters, 1 or more digits
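
The two points in the pattern details that are easiest to trip over are the (?<!\w) boundary and the need for re.escape. A minimal sketch of both follows; the keyword '#ref' and the sample strings are hypothetical and only serve the illustration:

```python
import re

# Hypothetical keyword that starts with a non-word character.
keyword = '#ref'
text = 'see #ref 42'

# \b before '#' would need a word char on its left; there is a space, so no match.
with_b = re.findall(r'\b' + re.escape(keyword), text)
# (?<!\w) only requires that the char on the left is NOT a word char.
with_lookbehind = re.findall(r'(?<!\w)' + re.escape(keyword), text)

print(with_b)           # []
print(with_lookbehind)  # ['#ref']

# Without re.escape, the dot in 'Exp.No' matches any character.
unescaped_hit = bool(re.search('Exp.No', 'ExpXNo'))
escaped_hit = bool(re.search(re.escape('Exp.No'), 'ExpXNo'))
print(unescaped_hit)    # True
print(escaped_hit)      # False
```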

See the Python demo:

import re
s="""your_text_here"""
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']
rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format('|'.join(map(re.escape,my_list)))
print(re.findall(rx, s))

Output:

[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('tramite', '1234567'), ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]
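
The question asks for one dict per match, while re.findall returns tuples. A minimal sketch of that conversion with re.finditer is shown here, using a shortened copy of the sample input. Sorting the keywords longest first is an optional precaution added for this sketch, not part of the answer above:

```python
import re

# Shortened copy of the sample input from the question.
text = """some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
something blah blah Ref.: 
tramite  1234567
some text Expedien N°18-0022995 # some garbled content"""

my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No',
           'Expedien N°', 'Exp.No', 'Expedien']

# Assumption of this sketch: try longer keywords first,
# so 'Expedien N°' is attempted before 'Expedien'.
keywords = sorted(my_list, key=len, reverse=True)
rx = re.compile(r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format(
    '|'.join(map(re.escape, keywords))))

# One single-entry dict per match: keyword -> identifier.
results = [{m.group(1): m.group(2)} for m in rx.finditer(text)]
for item in results:
    print(item)
```

A list of single-entry dicts is used rather than one merged dict because the same keyword (for example 'Expedien N°') can occur several times, and a merged dict would keep only the last value.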

Answer 2:

There really needs to be a way for users with fewer than 50 rep points to comment. I'm really curious about this thread and want to build on it, but I didn't want to post a full-fledged answer, because the answer I'm giving only covers a finite set of situations and isn't flexible.

@Wiktor Stribiżew

Based on your demo, your solution misses the "Ref." portion of the output. It looks like the asker wants to skip "tramite".

@checkmate

Your desired output needs to be edited, because "UV1234" doesn't show up anywhere in the string you posted.

Anyway, I found a solution but am really hoping someone can improve upon this.

>>> import re

>>> string = '''some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.: 
tramite  1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content'''

>>> re.findall(r'(?:(Expedien[\s]+N\S|Ref\.(?!:[\S\s]{,11}Expedien)|Reference|Expedien))[\S\s]*?([A-Z\-]*(?:[\d]+)[\S]*)', string)

[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('Ref.', '1234567'), ('Expedien N\xb0', '18-00777'), ('Expedien N\xb0', '18-0022995')]

The Flaws:

To capture correctly, it relies in part on "Ref\.(?!:[\S\s]{,11}Expedien)".
First, that "11" has to be edited to account for other lengths of text that may sit between "Ref.:" and "Expedien", so it is not flexible.
Second, if "Ref.:" is followed by "Reference" instead of "Expedien", the third "Ref." will be captured incorrectly.
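
The first flaw can be reproduced with a small sketch. The pattern is the one from this answer; the second test string is a hypothetical variation of the sample input with a few extra characters between "Ref.:" and "Expedien":

```python
import re

# Pattern from the answer above, split over two raw-string literals.
rx = re.compile(
    r'(?:(Expedien[\s]+N\S|Ref\.(?!:[\S\s]{,11}Expedien)|Reference|Expedien))'
    r'[\S\s]*?([A-Z\-]*(?:[\d]+)[\S]*)')

# 11 characters between ':' and 'Expedien': the lookahead rejects 'Ref.'.
short_gap = 'Ref.:\nsome junk Expedien N° 18-00777'
# Hypothetical variation with 16 characters in between: the lookahead no longer fires.
long_gap = 'Ref.:\nsome more junk Expedien N° 18-00777'

short_result = rx.findall(short_gap)
long_result = rx.findall(long_gap)
print(short_result)  # [('Expedien N°', '18-00777')]
print(long_result)   # [('Ref.', '18-00777')]
```

With the longer gap, "Ref." swallows the number that belongs to "Expedien N°", which is exactly why the fixed window of 11 is fragile.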