Hi guys, I have the following code:
from math import sqrt

array = [(1, 'a', 10), (2, 'a', 11), (3, 'c', 200), (60, 'a', 12), (70, 't', 13),
         (80, 'g', 300), (100, 'a', 305), (220, 'c', 307), (230, 't', 306), (250, 'g', 302)]

def stat(lst):
    """Calculate mean and std deviation from the input list."""
    n = float(len(lst))
    mean = sum([pair[0] for pair in lst]) / n
    ## mean2 = sum([pair[2] for pair in lst]) / n
    stdev = sqrt((sum(x[0]*x[0] for x in lst) / n) - (mean * mean))
    ## stdev2 = sqrt((sum(x[2]*x[2] for x in lst) / n) - (mean2 * mean2))
    return mean, stdev

def parse(lst, n):
    cluster = []
    for i in lst:
        if len(cluster) <= 1:  # the first two values are going directly in
            cluster.append(i)
            continue
        ###### add also the distance between lengths
        mean, stdev = stat(cluster)
        if abs(mean - i[0]) > n * stdev:  # check the "distance"
            yield cluster
            cluster[:] = []  # reset cluster to the empty list
        cluster.append(i)
    yield cluster  # yield the last cluster

for cluster in parse(array, 7):
    print(cluster)
What it does is cluster my list of tuples (array) by looking at i[0]. What I would also like to implement is to further cluster it by i[2] in each tuple.
Current output is:
[(1, 'a', 10), (2, 'a', 11), (3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13), (80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]
and I would like something like:
[(1, 'a', 10), (2, 'a', 11)]
[(3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13)]
[(80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]
So the values of i[0] are close to each other, and so are the values of i[2]. Any ideas how to crack it?
First of all, your way of computing variance is numerically unstable.

E(X^2) - E(X)^2

holds mathematically, but kills numerical precision. The worst case is that you get a negative value, and sqrt then fails. You really should look into numpy, which can compute this properly for you.

Conceptually, have you considered treating your data as a 2-dimensional data space? You could then whiten it, and run e.g. k-means or any other vector-based clustering algorithm.
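Not the answer's code, but a minimal sketch of both suggestions against the array from the question, assuming numpy and scipy are installed (choosing k=5 is an assumption that simply mirrors the five groups in the desired output):

    import numpy as np
    from scipy.cluster.vq import whiten, kmeans2

    # Stable mean/std: np.std computes the mean first and then the squared
    # deviations, so it avoids the E(X^2) - E(X)^2 cancellation.
    lengths = np.array([t[0] for t in array], dtype=float)
    print(lengths.mean(), lengths.std())

    # Treat (i[0], i[2]) as points in a 2-d space, whiten each column to unit
    # variance, then cluster with k-means.
    points = np.array([(t[0], t[2]) for t in array], dtype=float)
    centroids, labels = kmeans2(whiten(points), 5, minit='++')
    for label, row in zip(labels, array):
        print(label, row)

Whitening matters here because i[0] and i[2] live on different scales; without it, k-means would be dominated by whichever field has the larger variance.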
Standard deviation and mean are trivial to abstract to multiple attributes (look up "Mahalanobis distance").
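As an illustration of that point (again a sketch, not the answer's code), a Mahalanobis distance between two rows can be built from the sample covariance of the two numeric fields:

    import numpy as np

    points = np.array([(t[0], t[2]) for t in array], dtype=float)
    cov_inv = np.linalg.inv(np.cov(points, rowvar=False))  # inverse 2x2 covariance

    def mahalanobis(u, v):
        """Distance that rescales each field according to the data's covariance."""
        d = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
        return float(np.sqrt(d @ cov_inv @ d))

    print(mahalanobis(points[0], points[1]))  # distance between the first two rows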
You can use your parse method a second time on the results of the first run. In that case you won't get exactly what you asked for, but something very similar.
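A minimal sketch of that two-pass idea, assuming stat and parse are given a hypothetical index parameter that selects which tuple field to cluster on (this is not the answer's original code):

    from math import sqrt

    def stat(lst, index=0):
        """Mean and (population) std deviation of one tuple field."""
        n = float(len(lst))
        mean = sum(item[index] for item in lst) / n
        stdev = sqrt((sum(item[index] ** 2 for item in lst) / n) - mean * mean)
        return mean, stdev

    def parse(lst, n, index=0):
        cluster = []
        for i in lst:
            if len(cluster) <= 1:  # the first two values go in directly
                cluster.append(i)
                continue
            mean, stdev = stat(cluster, index)
            if abs(mean - i[index]) > n * stdev:
                yield cluster
                cluster = []  # fresh list, so previously yielded clusters are not mutated
            cluster.append(i)
        yield cluster

    # First pass clusters on i[0]; each resulting group is re-clustered on i[2].
    for group in parse(array, 7, index=0):  # array as defined in the question
        for subgroup in parse(group, 7, index=2):
            print(subgroup)

With the sample data this prints the first four groups exactly as requested; the last group [(220, 'c', 307), (230, 't', 306), (250, 'g', 302)] gets split once more, because 302 is more than 7 standard deviations from the mean of 307 and 306. That is the "not exactly the same, but very similar" caveat mentioned above.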