Good Style in Python Objects

2020-04-10 01:33发布

问题:

Most of my programming prior to Python was in C++ or Matlab. I don't have a degree in CS (almost completed a PhD in physics), but have done some courses and a good amount of actual programming. Now, I'm taking an algorithms course on Coursera (excellent course, by the way, with a professor from Stanford). I decided to implement the homeworks in Python. However, sometimes I find myself wanting things the language does not so easily support. I'm very used to creating classes and objects for things in C++ just to group together data (i.e. when there are no methods). In Python however, where you can add fields on the fly, what I basically end up wanting all the time are Matlab structs. I think this is possibly a sign I am not using good style and doing things the "Pythonic" way.

Underneath is my implementation of a union-find data structure (for Kruskal's algorithm). Although the implementation is relatively short and works well (there isn't much error checking), there are a few odd points. For instance, my code assumes that the data originally passed in to the union-find is a list of objects. However, if a list of explicit pieces of data are passed in instead (i.e. a list of ints), the code fails. Is there some much clearer, more Pythonic way to implement this? I have tried to google this, but most examples are very simple and relate more to procedural code (i.e. the "proper" way to do a for loop in python).

class UnionFind:
    def __init__(self,data):
        self.data = data

        for d in self.data:
            d.size = 1
            d.leader = d
            d.next = None
            d.last = d

    def find(self,element):
        return element.leader

    def union(self,leader1,leader2):
        if leader1.size >= leader2.size:
            newleader = leader1
            oldleader = leader2
        else:
            newleader = leader2
            oldleader = leader1

        newleader.size = leader1.size + leader2.size

        d = oldleader
        while d != None:
            d.leader = newleader
            d = d.next

        newleader.last.next = oldleader
        newleader.last = oldleader.last

        del(oldleader.size)
        del(oldleader.last)    

回答1:

Generally speaking, doing this sort of thing Pythonically means that you try to make your code not care what is given to it, at least not any more than it really needs to.

Let's take your particular example of the union-find algorithm. The only thing that the union-find algorithm actually does with the values you pass to it is compare them for equality. So to make a generally useful UnionFind class, your code shouldn't rely on the values it receives having any behavior other than equality testing. In particular, you shouldn't rely on being able to assign arbitrary attributes to the values.

The way I would suggest getting around this is to have UnionFind use wrapper objects which hold the given values and any attributes you need to make the algorithm work. You can use namedtuple as suggested by another answer, or make a small wrapper class. When an element is added to the UnionFind, you first wrap it in one of these objects, and use the wrapper object to store the attributes leader, size, etc. The only time you access the thing being wrapped is to check whether it is equal to another value.

In practice, at least in this case, it should be safe to assume that your values are hashable, so that you can use them as keys in a Python dictionary to find the wrapper object corresponding to a given value. Of course, not all objects in Python are necessarily hashable, but those that are not are relatively rare and it's going to be a lot more work to make a data structure that is able to handle those.



回答2:

The more pythonic way is to avoid tedious objects if you don't have to.

class UnionFind(object):
    def __init__(self, members=10, data=None):
        """union-find data structure for Kruskal's algorithm
        members are ignored if data is provided
        """
        if not data:
            self.data = [self.default_data() for i in range(members)]
            for d in self.data:
                d.size   = 1
                d.leader = d
                d.next   = None
                d.last   = d
        else:
            self.data = data

    def default_data(self):
        """create a starting point for data"""
        return Data(**{'last': None, 'leader':None, 'next': None, 'size': 1})

    def find(self, element):
        return element.leader

    def union(self, leader1, leader2):
        if leader2.leader is leader1:
            return
        if leader1.size >= leader2.size:
            newleader = leader1
            oldleader = leader2
        else:
            newleader = leader2
            oldleader = leader1

        newleader.size = leader1.size + leader2.size

        d = oldleader
        while d is not None:
            d.leader = newleader
            d = d.next

        newleader.last.next = oldleader
        newleader.last = oldleader.last

        oldleader.size = 0
        oldleader.last = None

class Data(object):
    def __init__(self, **data_dict):
        """convert a data member dict into an object"""
        self.__dict__.update(**data_dict)


回答3:

One option is to use dictionaries to store the information you need about a data item, rather than attributes on the item directly. For instance, rather than referring to d.size you could refer to size[d] (where size is a dict instance). This requires that your data items be hashable, but they don't need to allow attributes to be assigned on them.

Here's a straightforward translation of your current code to use this style:

class UnionFind:
    def __init__(self,data):
        self.data = data
        self.size = {d:1 for d in data}
        self.leader = {d:d for d in data}
        self.next = {d:None for d in data}
        self.last = {d:d for d in data}

    def find(self,element):
        return self.leader[element]

    def union(self,leader1,leader2):
        if self.size[leader1] >= self.size[leader2]:
            newleader = leader1
            oldleader = leader2
        else:
            newleader = leader2
            oldleader = leader1

        self.size[newleader] = self.size[leader1] + self.size[leader2]

        d = oldleader
        while d != None:
            self.leader[d] = newleader
            d = self.next[d]

        self.next[self.last[newleader]] = oldleader
        self.last[newleader] = self.last[oldleader]

A minimal test case:

>>> uf = UnionFind(list(range(100)))
>>> uf.find(10)
10
>>> uf.find(20)
20
>>> uf.union(10,20)
>>> uf.find(10)
10
>>> uf.find(20)
10

Beyond this, you could also consider changing your implementation a bit to require less initialization. Here's a version that doesn't do any initialization (it doesn't even need to know the set of data it's going to work on). It uses path compression and union-by-rank rather than always maintaining an up-to-date leader value for all members of a set. It should be asymptotically faster than your current code, especially if you're doing a lot of unions:

class UnionFind:
    def __init__(self):
        self.rank = {}
        self.parent = {}

    def find(self, element):
        if element not in self.parent: # leader elements are not in `parent` dict
            return element
        leader = self.find(self.parent[element]) # search recursively
        self.parent[element] = leader # compress path by saving leader as parent
        return leader

    def union(self, leader1, leader2):
        rank1 = self.rank.get(leader1,1)
        rank2 = self.rank.get(leader2,1)

        if rank1 > rank2: # union by rank
            self.parent[leader2] = leader1
        elif rank2 > rank1:
            self.parent[leader1] = leader2
        else: # ranks are equal
            self.parent[leader2] = leader1 # favor leader1 arbitrarily
            self.rank[leader1] = rank1+1 # increment rank


回答4:

For checking if an argument is of the expected type, use the built-in isinstance() function:

if not isinstance(leader1, UnionFind):
    raise ValueError('leader1 must be a UnionFind instance')

Additionally, it is a good habit to add docstrings to functions, classes and member functions. Such a docstring for a function or method should describe what it does, what arguments are to be passed to it and if applicable what is returned and which exceptions can be raised.



回答5:

I'm guessing that the indentation issues here are just simple errors with inputting the code into SO. Could you possibly create a subclass of a simple, built in data type? For instance, you can create a sub-class of the list data type by putting the datatype in parenthesis:

class UnionFind(list):
'''extends list object'''