I'm using Python regular expressions to extract dimensional information from a database. The entries in that column look like this:
23 cm
43 1/2 cm
20cm
15 cm x 30 cm
What I need from this is only the width of the entry (so for the entries with an 'x', only the first number), but as you can see the values are all over the place.
From what I understood in the documentation, you can access the groups in a match using their position, so I was thinking I could determine the type of the entry based on how many groups are returned and what is found at each index.
The expression I have used so far is ^(\d{2})\s?(x\s?(\d{2}))?(\d+/\d+)?$ but it's not perfect and it returns a number of useless groups. Is there something more efficient and appropriate?
Edit: I need the number from every line. When there is only one number, it is implied that only the width was measured (including any fractional component, as in line 2). When there are two numbers, the height was also measured, but I only need the width, which is the first number (as in the last line).
Here's a sample of how to do it from a text file. It works for the provided data.
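For illustration, a minimal sketch of reading the entries line by line could look like the following. The pattern, the function names `extract_widths` and `widths_from_file`, and the file path argument are all assumptions made for this example, not necessarily what the answer originally used.

```python
import re

# Assumed pattern: leading integer, an optional "n/d" fraction, then "cm".
WIDTH_RE = re.compile(r"^\s*(\d+(?:\s+\d+/\d+)?)\s*cm")

def extract_widths(lines):
    """Return the width text of every line that matches, in order."""
    widths = []
    for line in lines:
        match = WIDTH_RE.match(line)
        if match:
            widths.append(match.group(1))
    return widths

def widths_from_file(path):
    # A file object iterates over its lines, so it can be passed directly.
    with open(path, encoding="utf-8") as handle:
        return extract_widths(handle)

sample = ["23 cm", "43 1/2 cm", "20cm", "15 cm x 30 cm"]
print(extract_widths(sample))  # ['23', '43 1/2', '20', '15']
```

Keeping the matching in `extract_widths` separate from the file handling means the same function works on rows fetched from a database cursor.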
This regex should work (Live Demo)
Explanation
^(\d+) - capture at least one digit at the beginning of the line
(?: - start non-capturing group
\s* - followed by zero or more whitespace characters
cm - followed by a literal c and m
\s+ - followed by at least one whitespace character
[xX] - followed by a literal x or X
) - end non-capturing group
You shouldn't need to bother matching the rest of the line.
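As a rough sketch, the pieces listed above can be joined in order and tried against the sample entries. The exact original pattern is not shown here, so both the joined form and the variant with a trailing `?` are assumptions; the helper `first_group` is hypothetical too.

```python
import re

entries = ["23 cm", "43 1/2 cm", "20cm", "15 cm x 30 cm"]

# The explained pieces joined in order (assumed to be the full pattern).
as_explained = re.compile(r"^(\d+)(?:\s*cm\s+[xX])")
# Assumed variant: the non-capturing group made optional with "?".
optional_tail = re.compile(r"^(\d+)(?:\s*cm\s+[xX])?")

def first_group(pattern, text):
    """Return capture group 1, or None when the pattern does not match."""
    m = pattern.match(text)
    return m.group(1) if m else None

strict = [first_group(as_explained, e) for e in entries]
loose = [first_group(optional_tail, e) for e in entries]
print(strict)  # [None, None, None, '15']
print(loose)   # ['23', '43', '20', '15']
```

Joined as-is, the pattern only matches entries that contain an x. The optional variant matches every line, but it drops the fraction in "43 1/2 cm".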
Try the regex below. It captures the leading digits and the optional fraction that follows them, before the first 'cm'.
regex101 demo
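A pattern along those lines might be sketched as follows. The named groups `whole` and `frac` and the helper `width_cm` are hypothetical, and converting the result to a `Fraction` is an extra step added for this example.

```python
import re
from fractions import Fraction

# Leading digits, then an optional "n/d" fraction, then the first "cm".
PATTERN = re.compile(r"^(?P<whole>\d+)(?:\s+(?P<frac>\d+/\d+))?\s*cm")

def width_cm(entry):
    """Return the width as a Fraction, or None if the entry does not match."""
    m = PATTERN.match(entry)
    if m is None:
        return None
    value = Fraction(m.group("whole"))
    if m.group("frac"):
        # Fraction accepts strings such as "1/2" directly.
        value += Fraction(m.group("frac"))
    return value

for entry in ["23 cm", "43 1/2 cm", "20cm", "15 cm x 30 cm"]:
    print(entry, "->", width_cm(entry))
```

Because the pattern stops at the first 'cm', anything after it (such as "x 30 cm") is ignored without needing a group for it.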