Regular expression extracting number dimension

2019-08-05 01:38发布

问题:

I'm using python regular expressions to extract dimensional information from a database. The entries in that column look like this:

23 cm
43 1/2 cm

20cm
15 cm x 30 cm

What I need from this is only the width of the entry (so for the entries with an 'x', only the first number), but as you can see the values are all over the place.

From what I understood in the documentation, you can access the groups in a match using their position, so I was thinking I could determine the type of the entry based on how many groups are returned and what is found at each index.

The expression I used so far is ^(\d{2})\s?(x\s?(\d{2}))?(\d+/\d+)?$, however it's not perfect and it returns a number of useless groups. Is there something more efficient and appropriate?

Edit: I need the number from every line. When there is only one number, it is implied that only the width was measured (including any fractional components such as line 2). When there are two numbers, the height was also measured, but I only need the width which is the first number (such as in the last line)

回答1:

try regex below, it will capture 1st digits and optional fractional come after it before the 1st 'cm'

import re
regex = re.compile('(\d+.*?)\s?cm') # this will works for all your example data
# or
# this asserted whatever come after the 1st digit group must be fractional number only
regex = re.compile('(\d+(?:\s+\d+\/\d+)?)\s?cm') 


>>> regex.match('23 cm').group(1)
>>> '23' 
>>> regex.match('43 1/2 cm').group(1)
>>> '43 1/2'
>>> regex.match('20cm').group(1)
>>> '20'
>>> regex.match('15 cm x 30 cm').group(1)
>>> '15'

regex101 demo



回答2:

Here's a sample of how to do it from a text file. It works for the provided data.

     f = open("textfile.txt",r')

     for line in f :
         if 'x'in line:
             iposition = line.find('x')
             print(line[:iposition])


回答3:

This regex should work (Live Demo)

^(\d+)(?:\s*cm\s+[xX])

Explanation

  • ^(\d+) - capture at least one digit at the beginning of the line
  • (?: - start non-capturing group
  • \s* - followed by at least zero whitespace characters
  • cm - followed by a literal c and m
  • \s+ - followed by at least one whitespace character
  • [xX] - followed by a literal x or X
  • ) - end non-capturing group

You shouldn't need to bother matching the rest of the line.