Word search algorithm using an m-file

Posted 2019-09-08 15:49

I have already implemented my algorithm using cell arrays of strings in MATLAB, but I can't seem to do it by reading a file.

In MATLAB, I create a cell array of strings for each line; let's call each one line.

So I get

     line = {'string1', 'string2', ...}
     line = {'string5', 'string7', ...}
     line = ...

and so on. I have hundreds of lines to read.

What I'm trying to do is compare the words of the first line to itself. Then I combine the first and second lines and compare the words in the second line to the combined cell. I keep accumulating each cell I read and comparing the newest line against the accumulated cell.
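The accumulate-and-compare idea above can be sketched in Python (an assumed translation of the MATLAB logic, for illustration only):

```python
# Python sketch: accumulate the unique words seen so far, in first-seen
# order, and for each new line flag which accumulated words occur in it.
def accumulate_and_compare(lines):
    seen = []    # accumulated unique words, in order of first appearance
    flags = []   # one membership row per line, over the words seen so far
    for words in lines:
        for w in words:
            if w not in seen:
                seen.append(w)
        flags.append([w in words for w in seen])
    return seen, flags

seen, flags = accumulate_and_compare([["a", "b"], ["b", "c"]])
# seen == ["a", "b", "c"]; flags[1] == [False, True, True]
```

Each row of flags plays the role of the ismember result against the accumulated cell.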

Here is my code so far:

% For each line a, b, c, d, ...
AA = ismember(a, a);             % compare the first line to itself
                                 % (ismember is vectorized; no loops needed)

combine = [a, b];                % combine the first two lines
[unC, i] = unique(combine, 'first');
sorted = combine(sort(i));       % unique words, in first-seen order

AB = ismember(sorted, b);        % compare the second line to the combined cell

combine1 = [a, b, c];            % and so on for each further line

When I read my file, I use a while loop that reads the whole file to the end. How can I implement my algorithm if all my cells of strings have the same name?

    while ~feof(fid)
        out = fgetl(fid);
        if ~ischar(out) || isempty(out) || strncmp(out, '%', 1)
            continue                     % skip EOF markers, blanks, comments
        end
        line = regexp(out, ' ', 'split');
    end
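In Python the answer to "all my cells have the same name" is simply to append each parsed line to a growing list instead of overwriting one variable; a self-contained sketch (the sample file is created inline, and the filename data.txt is made up):

```python
# Build a small sample file so the sketch runs on its own.
sample = "string1 string2 string3\n\n% a comment\nstring2 string4\n"
with open("data.txt", "w") as f:
    f.write(sample)

# Read line by line, skip blanks and comment lines, split on whitespace,
# and append each line's words to a growing list.
lines = []
with open("data.txt") as fid:
    for raw in fid:
        raw = raw.strip()
        if not raw or raw.startswith("%"):
            continue
        lines.append(raw.split())
# lines == [['string1', 'string2', 'string3'], ['string2', 'string4']]
```

In MATLAB the analogous move is to index into a cell array, e.g. lines{end+1} = ..., inside the while loop.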

1 Answer
小情绪 Triste *
Answered 2019-09-08 16:43

Suppose your data file is called data.txt and its content is:

string1 string2 string3 string4
string2 string3 
string4 string5 string6

A very easy way to retain only the first unique occurrence is:

% Parse everything in one go
fid = fopen('C:\Users\ok1011\Desktop\data.txt');
out = textscan(fid,'%s');
fclose(fid);

unique(out{1})
ans = 
    'string1'
    'string2'
    'string3'
    'string4'
    'string5'
    'string6'

As already mentioned, this approach might not work if:

  • your data file has irregularities
  • you actually need the comparison indices

EDIT: solution for performance

% Parse in bulk and split (assuming you don't know the maximum
% number of strings in a line; otherwise textscan alone will do)

fid = fopen('C:\Users\ok1011\Desktop\data.txt');
out = textscan(fid,'%s','Delimiter','\n');
out = regexp(out{1},' ','split');
fclose(fid);

% Preallocate unique comb
comb = unique([out{:}]); % you might need to remove empty strings from here

% preallocate idx
m   = size(out,1);
idx = false(m,size(comb,2));

% Loop for number of lines (rows)
for ii = 1:m
    idx(ii,:) = ismember(comb,out{ii});
end

Note that the resulting idx is:

idx =
     1     1     1     1     0     0
     0     1     1     0     0     0
     0     0     0     1     1     1
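As a cross-check, the same membership matrix can be reproduced in Python with NumPy (an assumed translation of the MATLAB loop above):

```python
import numpy as np

# The three parsed lines from data.txt.
out = [["string1", "string2", "string3", "string4"],
       ["string2", "string3"],
       ["string4", "string5", "string6"]]

# Unique words in sorted order, mirroring MATLAB's unique().
comb = sorted(set(w for line in out for w in line))

# One row per line, one column per unique word: True where the word occurs.
idx = np.array([[w in line for w in comb] for line in out])
# idx.astype(int) gives the same 3x6 matrix of 0s and 1s shown above
```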

The advantage of keeping it in this form is that you save on space with respect to a cell array (which imposes 112 bytes of overhead per cell). You can also store it as a sparse array to potentially improve on storage costs.

Another thing to note is that a logical index can be longer than the array it indexes (e.g. a double array), as long as the exceeding elements are false; by construction of the above problem, idx satisfies this requirement. An example to clarify:

A = 1:3;
A([true false true false false])   % ans = [1 3]