Get number present after a particular pattern of a

I want to get all the matching numbers(only numbers example '0012--22') or numbers which contain some text (example 'RF332') corresponding to it which matches with a list of strings provided("my_list" in the code). The format in which the text with number will be present is like separated by a space or two. Providing sample input file for reference.

This is the input file:

$cat input_file
some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.: 
tramite  1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content

The script till now is attached below: It is currently only identifying one element which is {'tramite': '1234567'}

import re
import glob
import os

my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']

#open the file as input
with open('garb.txt','r') as infile:
  res = dict()
  for line in infile:  
    elems = re.split('(?::)?\s+', line)
    #print(elems)
    if len(elems) >= 2 :
      contains = False
      tmp = ''
      for elem in elems:  
        if contains:
          res.update({tmp : elem})
          print(res)
          contains = False
          break
        if elem in my_list:
          contains = True
          tmp = elem
  #print(res)

This is the expected output:

Sample output:

{'Expedien N°': '18-0022995'}
{'Expedien N°': '18-0022995'}
{'Expedien': '1-21-212-16-26'}
{'Reference' : 'RE9833'}

etc etc.

标签： python regex string-matching

2条回答

该账号已被封号

2楼-- · 2019-05-21 13:01

There really needs to be something that allows users with less than 50+ rep points to comment, because this thread is one I'm really curious about and want to fork off of, but didn't want to have to give a full fledged answer on, because the answer I'm giving involves finite situations and isn't flexible.

@Wiktor Stribiżew

Your solution misses the "Ref." portion of the output based on your demo. It looks like he wants to skip "tramite"

@checkmate

In your desired output you need to edit it because "UV1234" does't show up anywhere in the string you posted

Anyway, I found a solution but am really hoping someone can improve upon this.

>>> import re

>>> string = '''some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.: 
tramite  1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content'''

>>> re.findall('(?:(Expedien[\s]+N\S|Ref\.(?!:[\S\s]{,11}Expedien)|Reference|Expedien))[\S\s]*?([A-Z\-]*(?:[\d]+)[\S]*)', string)

[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('Ref.', '1234567'), ('Expedien N\xb0', '18-00777'), ('Expedien N\xb0', '18-0022995')]

The Flaws:

To capture correctly it relies in part on "Ref.(?!:[\S\s]{,11}Expedien)"
First of all that "11" needs to be edited to account for other lengths of info that may be present so it is not flexible
Secondly, if it is instead followed by "Reference" as opposed to then the third "Ref." will be captured incorrectly

0人赞添加讨论(0) 举报

一纸荒年 Trace。

3楼-- · 2019-05-21 13:06

You may use

(?<!\w)(your|escaped|keywords|here)\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)

See the regex demo.

Pattern details

(?<!\w) - left word boundary (unambiguous, \b meaning is context dependent and if the next char is a non-word char, it will require a word char on the left, and that is not something users usually expect)
(your|escaped|keywords|here) - Capturing group 1: your list of keywords, it can be easily built using '|'.join(map(re.escape,my_list)) (note re.escape is necessary to escape special regex metacharacters like ., +, (, [, etc.)
\W* - 0+ non-word chars (chars other than letters, digits or _)
([A-Z]*\d+(?:-+[A-Z]*\d+)*) - Capturing group 2:
- [A-Z]* - zero or more uppercase ASCII letters
- \d+ - 1 or more digits
- (?:-+[A-Z]*\d+)* - 0 or more repetitions of
  - -+ - one or more hyphens
  - [A-Z]*\d+ - zero or more uppercase ASCII letters, 1 or more digits

See the Python demo:

import re
s="""your_text_here"""
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']
rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format('|'.join(map(re.escape,my_list)))
print(re.findall(rx, s))

Output:

[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('tramite', '1234567'), ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]

0人赞添加讨论(0) 举报

Get number present after a particular pattern of a

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间