How to remove all cells which contain supersets of

Posted 2019-09-05 21:40

Question:

I am working in text mining. I have 23 sentences that I have extracted from a text file along with 6 frequent words extracted from the same text file.

For the frequent words, I created a 1D cell array that lists, for each word, the sentences in which it occurs. I then took intersections to find, for each pair of words, the sentences in which both occur:

OccursTogether = cell(length(Out1));
for ii=1:length(Out1)
    for jj=ii+1:length(Out1)
        OccursTogether{ii,jj} = intersect(Out1{ii},Out1{jj});
    end
end
celldisp(OccursTogether)

The output looks something like this:

OccursTogether[1,1]= 4 3
OccursTogether[1,2]= 1 4 3
OccursTogether[1,3]= 4 3

Above, [1,1] shows that word 1 occurs with word 1 in sentences 4 and 3, [1,2] shows that words 1 and 2 occur together in sentences 1, 4 and 3, and so on.

What I want to do is implement an element absorption technique that removes every cell which is a superset of another cell. As shown above, sentences 4 and 3 in [1,1] are a subset of [1,2], so the OccursTogether[1,2] entry should be deleted and the output should be as follows:

occurs[1,1]= 4 3
occurs[1,3]= 4 3

Remember that this should check all possible subsets among the entries.

Answer 1:

I think this does what you want:

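[ii, jj] = ndgrid(1:numel(OccursTogether)); %// all index pairs (ii,jj)
s = cellfun(@(x,y) all(ismember(x,y)), OccursTogether(ii), OccursTogether(jj)); %// s(i,j) is true if set i is a subset of set j
s = triu(s,1); %// count each pair just once, and remove self-pairs
result = OccursTogether(~any(s,1)); %// drop every set that contains an earlier set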
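
One caveat when applying this to the cell array built in the question: cell(length(Out1)) creates a square cell array whose diagonal and lower triangle stay empty, and an empty array passes the all(ismember(x,y)) test against any set, so it would count as a subset of every later cell. A minimal sketch, assuming empty cells should simply be ignored (the input values and the names nonEmpty and sets are illustrative, not from the question):

```matlab
% Hypothetical input shaped like the question's: only the upper triangle is filled
OccursTogether = cell(3);
OccursTogether{1,2} = [4 3];
OccursTogether{1,3} = [1 4 3];
OccursTogether{2,3} = [2 5];

% Keep only the non-empty cells (taken in linear, column-major order)
nonEmpty = ~cellfun(@isempty, OccursTogether);
sets = OccursTogether(nonEmpty);

% Same absorption step as above, applied to the filtered list
[ii, jj] = ndgrid(1:numel(sets));
s = cellfun(@(x,y) all(ismember(x,y)), sets(ii), sets(jj));
s = triu(s,1);
result = sets(~any(s,1));
```

With this input, result should hold [4 3] and [2 5]; [1 4 3] is absorbed by [4 3]. Note that the filtered list loses the original (row, column) positions, so the indices would need to be saved separately (for example with find(nonEmpty)) if the word pairs matter.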

Example 1:

OccursTogether{1,1} = [4 3];
OccursTogether{1,2} = [1 4 3];
OccursTogether{1,3} = [1 4 3 5];
OccursTogether{1,4} = [1 4 3 5];

gives

>> celldisp(result)
result{1} =
     4     3

OccursTogether{1,2} is removed because it's a superset of OccursTogether{1,1}. OccursTogether{1,3} is removed because it's a superset of OccursTogether{1,2}. OccursTogether{1,4} is removed because it's a superset of OccursTogether{1,3}.
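
The removals in Example 1 can also be traced with explicit loops. The following is only a sketch of an equivalent formulation, not a replacement for the vectorized code; the variable names n, keep, k and p are illustrative:

```matlab
% Sets from Example 1
OccursTogether = {[4 3], [1 4 3], [1 4 3 5], [1 4 3 5]};

n = numel(OccursTogether);
keep = true(1, n);               % keep(k) becomes false once set k is absorbed
for k = 2:n
    for p = 1:k-1                % compare only with earlier sets
        if all(ismember(OccursTogether{p}, OccursTogether{k}))
            keep(k) = false;     % set k contains set p, so drop set k
            fprintf('set %d removed: superset of set %d\n', k, p);
            break
        end
    end
end
result = OccursTogether(keep);
```

Each set is tested against all earlier sets, including ones that were themselves removed, which mirrors what triu(s,1) followed by any(s,1) does. The loop reports the first earlier subset it finds, so a set may be listed against an earlier set than the one named in the explanation above.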

Example 2:

OccursTogether{1,1} = [10 20 30];
OccursTogether{1,2} = [10 20 30];

gives

>> celldisp(result)
result{1} =
    10    20    30

OccursTogether{1,2} is removed because it's a superset of OccursTogether{1,1}, but OccursTogether{1,1} is not removed even though it's also a superset of OccursTogether{1,2}. Each set is compared only with the sets before it (third line of code).
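
The desired output in the question keeps both [1,1] and [1,3] although they list the same sentences, which suggests that equal sets should survive and only strict supersets should be dropped. A sketch of that variant, under that assumption (the helper name isStrictSubset is illustrative):

```matlab
% Sets modelled on the question's sample output
OccursTogether = {[4 3], [1 4 3], [4 3]};

% x is a strict subset of y: every element of x is in y, but not vice versa
isStrictSubset = @(x,y) all(ismember(x,y)) && ~all(ismember(y,x));

[ii, jj] = ndgrid(1:numel(OccursTogether));
s = cellfun(isStrictSubset, OccursTogether(ii), OccursTogether(jj));

% No triu here: a strict relation cannot hold in both directions, so every
% ordered pair can be checked without two equal sets deleting each other
result = OccursTogether(~any(s,1));
```

Here result should contain the two [4 3] entries, and [1 4 3] is removed. Because all ordered pairs are checked, a superset is also removed when its subset appears later in the cell array, which the version with triu(s,1) would not catch.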