Clustering text in MATLAB

2019-04-01 03:35发布

问题:

I want to do hierarchical agglomerative clustering on texts in MATLAB. Say, I have four sentences,

I have a pen.
I have a paper. 
I have a pencil.
I have a cat. 

I want to cluster the above four sentences to see which are more similar. I know Statistic toolbox has command like pdist to measure pair-wise distances, linkage to calculate the cluster similarity etc. A simple code like:

X=[1 2; 2 3; 1 4];
Y=pdist(X, 'euclidean');
Z=linkage(Y, 'single');
H=dendrogram(Z)

works fine and return a dendrogram.

I wonder can I use these command on the texts as I mentioned above. Any thoughts ?


UPDATES:

Thanks to Amro. Read Understood and computed the distance among strings. Code follows:

clc
S1='I have a pen'; % first String

f_id=fopen('events.txt','r'); %saved strings to compare with
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events); % selects one text randomly.
% store the texts in a cell array

for kk=1:ii

   S2=events(kk);
   S2=cell2mat(S2);
   Z=levenshtein_distance(S1,S2);
   X(kk)=Z;

end 

I input a string and I had 4 saved strings. Now I calculated the pairwise distance using levenshtein_distance function. It returns a matrix X=[ 17 0 16 18 16].

** I guess this is my pair wise distance matrix. Similar to what pdist does. Is it ?

** Now, I'm trying to input X to compute the linkage like

Z=linkage(X, 'single);

Output I'm getting is:

Error using ==> linkage at 93 Size of Y not compatible with the output of the PDIST function.

Error in ==> Untitled2 at 20 Z=linkage(X,'single') .

Why so ? Can use the linkage function at all ? Help appreciated.

UPDATE 2

clc
S1='I have a pen';

f_id=fopen('events.txt','r');
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events)+1; % total number of strings in the comparison

D=zeros(ii, ii); % initialized distance matrix;
for kk=1:ii 

    S2=events(kk);

    %S2=cell2mat(S2);

    for jk=kk+1:ii

  D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});

    end

end

D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T).

Error: ??? Cell contents reference from a non-cell array object. Error in ==> Untitled2 at 22 D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});

Also, Why am I reading the event from the file inside the first loop ? Doesn't seem logical. Bit confused, if I can work this way or only solution is to input all strings inside the code. Help much appreciated.

UPDATE

code to compare two sentences:

clc
    str1 = 'Fire in NY';
    str2= 'Jeff is sick';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default');  

Output D=18.

WITH Different strings:

clc
str1 = 'Fire in NY';
str2= 'NY catches fire';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default'); 

D=28.

Based on distance, a completely different sentence looks similar. What I'm trying to do, If I have stored Fire in NY, I wont store NY catches fire. However, for the first case, I would store as the information is new.

IS LD sufficient to do this ? Help appreciated.

回答1:

What you need is a distance function that can handle strings. Check out the Levenshtein distance (edit distance). There are plenty of implementations out there:

  • Wikibooks.org
  • "Calculation of distance between strings" on FEX

Alternatively, you should extract some interesting features (ex: number of vowels, length of string, etc..) to build a vector space representation, then you can apply any of the usual distance measures (euclidean, ...) on the new representation.


EDIT

The problem with your code is that LINKAGE expects the input distances format to match that of PDIST, namely a row vector corresponding to pairs of observations in the order 1-vs-2, 1-vs-3, 2-vs-3, etc.. which is basically the lower half of the complete distance matrix (since its supposed to be symmetric as dist(1,2) == dist(2,1))

%# instances
str = {'I have a pen.'
    'I have a paper.'
    'I have a pencil.'
    'I have a cat.'};
numStr = numel(str);

%# create and fill upper half only of distance matrix
D = zeros(numStr,numStr);
for i=1:numStr
    for j=i+1:numStr
        D(i,j) = levenshtein_distance(str{i},str{j});
    end
end
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T)

Please refer to the documentation of the functions in question for more information...