Algorithm to to Cluster Similar Strings in Python?

2019-07-30 16:47发布

问题:

I'm working on a script that currently contains multiple lists of DNA sequences (each list has a varying number of DNA sequences) and I need to cluster the sequences in each list based on Hamming Distance similarity. My current implementation of this (very crude at the moment) extracts the first sequence in the list and calculates the Hamming Distance of each subsequent sequence. If it's within a certain Hamming Distance, it appends it to a new list which later is used to remove sequences from the original list as well as store the similar sequences in a defaultdict. See my current implementation of my code below:

def hamming_dist(sequence1, sequence2):
"""
Calculates the hamming distance between 2 sequences
"""
    assert len(sequence1) == len(sequence2)
    return sum(sequence1 !=sequence2 for sequence1,sequence2 \
    in itertools.izip(sequence1,sequence2))


def group_sequences(sequences_list):
    trash_sequences = []
    main_sequence = sequences_list[0]
    clustered_sequence = defaultdict(list)
    while len(sequences_list) > 1:
        for sequence in sequences_list:
            ham_dist = hamming_dist(main_sequence,sequence)
            if hamming_dist < 30:
                trash_sequences.append(sequence)

        for similar_sequences in trash_sequences:
            sequences_list.remove(similar_sequences)
            clustered_sequence[main_tcr].append(similar_sequences)
    else:
        clustered_sequence[main_sequence].append(None)
    return clustered_sequence

Obviously this isn't the best or most efficient way to do it and even with this implementation, there are still some bugs in my script that I encountered. I read through over StackOverflow/StackExchange questions to see if other people have encountered my problem and of the similar questions I found, many other people have mentioned about using algorithms such as the K-Means algorithm, Markov Clustering method, hierarchy clustering, etc. I'm not too familiar with any of these methods except the K-means method which requires numbers, not strings.

Which clustering method(s) would you suggest I implement to cluster similar DNA sequences together as well as the best way to implement your preferred method of choice? Any suggestions would be much appreciated!

回答1:

The best choice to get started is hierarchical clustering.

It's easy to understand, it allows any distance, and it can visualize the clustering as a tree even when the data itself is hard to visualize.