Python regular expression for Windows file path

Posted 2019-08-25 03:11

The problem, and it may not be easily solved with a regex, is that I want to be able to extract a Windows file path from an arbitrary string. The closest that I have been able to come (I've tried a bunch of others) is using the following regex:

[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*

It picks up the start of the path and is designed to match (after the initial drive letter) a sequence of names, each followed by a backslash, ending with a file name, an optional dot, and an optional extension.

The difficulty is what happens next. Since the maximum path length is 260 characters, I only need to look at 260 characters beyond the start. But since spaces (and other characters) are allowed in file names, I would need to make sure that there are no additional backslashes indicating that the preceding characters are the name of a folder and that what follows isn't the file name itself.

I am pretty certain that there isn't a perfect solution (the perfect being the enemy of the good), but I wondered if anyone could suggest a "best possible" solution?

1 answer
我想做一个坏孩纸
#2 · 2019-08-25 04:04

Here's the expression I came up with, based on yours, that allows me to get the path on Windows: [a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).* . An example of it being used is available here: https://regex101.com/r/SXUlVX/1

First, I changed the capture group from ([a-zA-Z0-9() ]*\\)* to ((?:[a-zA-Z0-9() ]*\\)*).
Your original expression captures each XXX\ one after another (e.g. Users\, then the next folder), so the group ends up holding only the last one.
Mine repeats a non-capturing group, (?:[a-zA-Z0-9() ]*\\)*, and wraps the whole repetition in a single capture group. This captures the concatenation XXX\YYYY\ZZZ\ as one piece, which gives me the full folder path.

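The second change I made is related to the file name: I just match whatever is left with .* . Because the preceding group is greedy, it has already consumed every folder it can match, so the remainder normally contains no \ . This takes care of strange file names, but note that .* runs to the end of the line, so any text following the path is included in the match.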
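To make the difference between the two capture groups concrete, here is a minimal Python sketch. The sample string and the variable names are made up for illustration; only the two patterns come from the question and the answer.

```python
import re

# Hypothetical sample text; the path is invented for this example.
text = r"Log entry: C:\Program Files (x86)\Adobe\readme.txt"

# Pattern from the question: the capturing group itself is repeated.
original = re.compile(r"[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*")
# Pattern from the answer: one capturing group around a repeated
# non-capturing group.
revised = re.compile(r"[a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).*")

m_orig = original.search(text)
m_rev = revised.search(text)

# A repeated capturing group only keeps its last iteration.
print(m_orig.group(1))  # Adobe\
# The outer group keeps every folder in one piece.
print(m_rev.group(1))   # Program Files (x86)\Adobe\
# The whole match also includes the drive letter and the file name.
print(m_rev.group(0))   # C:\Program Files (x86)\Adobe\readme.txt
```

Both patterns match the same overall text here; what changes is the content of group 1.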

Another regex that would work is [a-zA-Z]:\\((?:.*?\\)*).* as shown in this example: https://regex101.com/r/SXUlVX/2

This time, I used .*?\\ to match the XXX\ parts of the path.
.*? matches in a non-greedy way, so .*?\\ matches the bare minimum of text followed by a backslash.
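A short Python sketch of how the non-greedy variant behaves, using made-up sample paths. It also shows one caveat that follows from .*? accepting any character: a backslash later on the same line is pulled into the captured folder part.

```python
import re

lazy = re.compile(r"[a-zA-Z]:\\((?:.*?\\)*).*")
strict = re.compile(r"[a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).*")

# Hypothetical path with characters outside [a-zA-Z0-9() ].
text = r"C:\my-project\v1.2\notes_final.txt"

m_lazy = lazy.search(text)
m_strict = strict.search(text)

# The lazy version accepts any folder name.
print(m_lazy.group(1))          # my-project\v1.2\
# The character-class version stops at the hyphen and captures nothing.
print(repr(m_strict.group(1)))  # ''

# Caveat: trailing text containing a backslash is captured too.
noisy = r"C:\a\b.txt and D:\c"
m_noisy = lazy.search(noisy)
print(m_noisy.group(1))         # a\b.txt and D:\
```

So the lazy pattern is more permissive about folder names, at the cost of over-matching when the surrounding string contains other backslashes.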

Do not hesitate to ask if you have any questions regarding the expressions.
I'd also encourage you to check how well your expression works using https://regex101.com . The site also has a list of the different tokens you can use in your regex.

Edit: As my previous answer did not work (though I'll need to spend some time to find out exactly why), I looked for another way to do what you want, and I managed to do so using string splitting and joining.
The command is "\\".join(TARGETSTRING.split("\\")[1:-1]).
How this works: I split the original string into a list of substrings on the backslash. I then drop the first and last parts ([1:-1] keeps everything from the 2nd element to the one before the last) and join the resulting list back into a string.

This works whether the value given is a directory path or the full path of a file:
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred is a file path
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\ is a directory path
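The split-and-join command can be tried as follows. This is only a sketch: the helper name folder_part and the sample paths are assumptions made for illustration, and the input is assumed to be a string that already contains nothing but the path.

```python
def folder_part(target: str) -> str:
    """Drop the first and last backslash-separated pieces and rejoin the rest."""
    return "\\".join(target.split("\\")[1:-1])


# Full path of a file: the drive and the file name are dropped.
file_path = "C:\\Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe"
print(folder_part(file_path))  # Program Files (x86)\Adobe\Acrobat Distiller

# Directory path with a trailing backslash: the last piece is an empty
# string, so only the drive is really removed.
dir_path = "C:\\Program Files (x86)\\Adobe\\"
print(folder_part(dir_path))   # Program Files (x86)\Adobe
```

Note that this does not locate a path inside a longer string; it only trims a string that is already a path.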
