Python finding the common parts of a string throug

Posted 2020-07-27 05:23

I have a list of file directories that looks similar to this:

path/new/stuff/files/morefiles/A/file2.txt
path/new/stuff/files/morefiles/B/file7.txt
path/new/stuff/files/morefiles/A/file1.txt
path/new/stuff/files/morefiles/C/file5.txt

I am trying to find the leading part that is the same in every path in the list, and then remove it from each path.

The list can be any length, and in the example I would be trying to change the list into:

A/file2.txt
B/file7.txt
A/file1.txt
C/file5.txt

Methods like re.sub(r'.*I', 'I', filepath) and filepath.split('_', 1)[-1] can be used for the replacing, but I'm not sure how to find the common parts in the list of filepaths.

Note:

I am using Windows and Python 3.

4 answers
Viruses.
Reply #2 · 2020-07-27 06:05

The first part of the answer is here: Python: Determine prefix from a set of (similar) strings

Use os.path.commonprefix() to find the longest common prefix (the shared leading part) of the strings.

The code from that answer for finding the part that all list elements share is:

# Return the longest prefix of all list elements.
def commonprefix(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m: return ''
    s1 = min(m)
    s2 = max(m)
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1

Now all you have to do is use slicing to remove the resulting string from each item in the list

This results in:

# Remove the longest common prefix from all list elements (in place).
def commonprefix(m):
    "Given a list of pathnames, strips the longest common leading string from each"
    if not m: return m
    s1 = min(m)
    s2 = max(m)
    ans = s1  # if the loop finds no mismatch, s1 itself is the common prefix
    for i, c in enumerate(s1):
        if c != s2[i]:
            ans = s1[:i]
            break
    for each in range(len(m)):
        # slice off the prefix; split() would fail on an empty prefix
        m[each] = m[each][len(ans):]
    return m
劳资没心,怎么记你
Reply #3 · 2020-07-27 06:09

You can split the paths on '/', then transpose them with zip_longest, which pads the shorter paths with None instead of truncating the longer ones.

You can then remove the leading columns whose elements are all identical, zip again to transpose back, and join the parts with '/':

paths = ['path/new/stuff/files/morefiles/A/file2.txt',
'path/new/stuff/files/morefiles/B/file7.txt',
'path/new/stuff/files/morefiles/A/file1.txt',
'path/new/stuff/files/morefiles/A/file1/file2.txt',
'path/new/stuff/files/morefiles/C/file5.txt']

from itertools import zip_longest  # Python 3 name (izip_longest in Python 2)
transposed = list(zip_longest(*[path.split('/') for path in paths]))
print(transposed)
# [('path', 'path', 'path', 'path', 'path'), ('new', 'new', 'new', 'new', 'new'), ('stuff', 'stuff', 'stuff', 'stuff', 'stuff'), ('files', 'files', 'files', 'files', 'files'), ('morefiles', 'morefiles', 'morefiles', 'morefiles', 'morefiles'), ('A', 'B', 'A', 'A', 'C'), ('file2.txt', 'file7.txt', 'file1.txt', 'file1', 'file5.txt'), (None, None, None, 'file2.txt', None)]
while transposed and len(set(transposed[0])) == 1:  # stop if every column was common
    transposed.pop(0)

print(transposed)
# [('A', 'B', 'A', 'A', 'C'), ('file2.txt', 'file7.txt', 'file1.txt', 'file1', 'file5.txt'), (None, None, None, 'file2.txt', None)]
print(['/'.join(filter(None, path)) for path in zip(*transposed)])
# ['A/file2.txt', 'B/file7.txt', 'A/file1.txt', 'A/file1/file2.txt', 'C/file5.txt']
贪生不怕死
Reply #4 · 2020-07-27 06:10

Already answered here: Python: Determine prefix from a set of (similar) strings

"Never rewrite what is provided to you": Use os.path.commonprefix() to find the longest common prefix, and then slice your strings accordingly.
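A minimal sketch of that suggestion, using the paths from the question. The helper name `strip_common_prefix` is hypothetical, not part of the standard library. Keep in mind that os.path.commonprefix() compares character by character, so the prefix it returns may end in the middle of a directory name:

```python
import os.path

def strip_common_prefix(paths):
    """Remove the longest common leading string from every path."""
    prefix = os.path.commonprefix(paths)  # character-wise, not per directory
    return [p[len(prefix):] for p in paths]

paths = [
    'path/new/stuff/files/morefiles/A/file2.txt',
    'path/new/stuff/files/morefiles/B/file7.txt',
    'path/new/stuff/files/morefiles/A/file1.txt',
    'path/new/stuff/files/morefiles/C/file5.txt',
]
print(strip_common_prefix(paths))
# ['A/file2.txt', 'B/file7.txt', 'A/file1.txt', 'C/file5.txt']
```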

疯言疯语
Reply #5 · 2020-07-27 06:12

Since the input list contains filepaths rather than arbitrary strings, it seems reasonable to me to treat the common prefix as a sequence of whole path sections only.

Let's say one of the filepaths is path/new/stuff2/files/morefiles/C/file5.txt.
The common prefix is then determined as path/new/stuff, which breaks the 3rd section stuff2 just before its last character 2.
As a result, the commonprefix() implementation mentioned above will cut that filepath to 2/files/morefiles/C/file5.txt, making it broken and inaccessible (in terms of the filesystem). In such a case it would be reasonable to cut only the leading sections that are common as whole words (i.e. path/new/).
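The problem described above can be reproduced with a short sketch. It uses os.path.commonprefix() from the standard library and two of the paths discussed here; the variable names are only illustrative:

```python
import os.path

paths = [
    'path/new/stuff/files/morefiles/A/file2.txt',
    'path/new/stuff2/files/morefiles/C/file5.txt',
]

# The character-wise prefix stops in the middle of the section 'stuff2'.
prefix = os.path.commonprefix(paths)
print(prefix)
# path/new/stuff

# Slicing that prefix off leaves paths that no longer exist on disk.
stripped = [p[len(prefix):] for p in paths]
print(stripped)
# ['/files/morefiles/A/file2.txt', '2/files/morefiles/C/file5.txt']
```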


A solution using the zip() function and a set object:
The input list of filepaths was slightly modified for demonstration purposes: the last filepath differs in its 3rd section (.../stuffall/...):

paths = [
    'path/new/stuff/files/morefiles/A/file2.txt', 'path/new/stuff/files/morefiles/B/file7.txt',
    'path/new/stuff/files/morefiles/A/file1.txt', 'path/new/stuffall/files/morefiles/C/file5.txt'
]
c_prefix = ''  # common filepath prefix

for i in zip(*paths):
    s = set(i)
    if len(s) == 1:
        c_prefix += s.pop()
    else:
        break

# consider only whole-word sections as the common part:
# cut the prefix back to just after its last '/'
c_prefix = c_prefix[:c_prefix.rfind('/') + 1]
# slice instead of replace(), so only the leading occurrence is removed
paths = [p[len(c_prefix):] for p in paths]

print(paths)

The output:

['stuff/files/morefiles/A/file2.txt', 'stuff/files/morefiles/B/file7.txt', 'stuff/files/morefiles/A/file1.txt', 'stuffall/files/morefiles/C/file5.txt']
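As an alternative sketch of the same whole-section idea, the standard library's commonpath() (Python 3.5+) compares paths one component at a time. The version below uses posixpath.commonpath() so that '/' is always treated as the separator; on Windows, os.path.commonpath() would return the result with backslashes. The helper name `strip_common_dirs` is hypothetical, and the sketch assumes relative, already normalized, '/'-separated paths:

```python
import posixpath

def strip_common_dirs(paths):
    """Remove the directories shared by every path, whole components only."""
    # commonpath() raises ValueError for an empty list
    common = posixpath.commonpath(paths)  # e.g. 'path/new'
    if not common:
        return list(paths)  # nothing in common, nothing to strip
    # +1 also drops the '/' that follows the common directories
    cut = len(common.rstrip('/')) + 1
    return [p[cut:] for p in paths]

paths = [
    'path/new/stuff/files/morefiles/A/file2.txt',
    'path/new/stuff/files/morefiles/B/file7.txt',
    'path/new/stuff/files/morefiles/A/file1.txt',
    'path/new/stuffall/files/morefiles/C/file5.txt',
]
print(strip_common_dirs(paths))
# ['stuff/files/morefiles/A/file2.txt', 'stuff/files/morefiles/B/file7.txt', 'stuff/files/morefiles/A/file1.txt', 'stuffall/files/morefiles/C/file5.txt']
```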