How to remove trailing comments via regexp?

Posted 2019-01-18 06:13

For readers unfamiliar with MATLAB: I'm not sure what family they belong to, but MATLAB regexes are described here in full detail. MATLAB's comment character is % (percent) and its string delimiter is ' (apostrophe). A string delimiter inside a string is written as a doubled apostrophe ('this is how you write "it''s" in a string.'). To complicate matters further, the matrix transpose operators also use apostrophes: A' (Hermitian) or A.' (regular).

Now, for dark reasons (that I will not elaborate on :), I'm trying to interpret MATLAB code in MATLAB's own language.

Currently I'm trying to remove all trailing comments in a cell-array of strings, each containing a line of MATLAB code. At first glance, this might seem simple:

>> str = 'simpleCommand(); % simple trailing comment';
>> regexprep(str, '%.*$', '')
ans =
    simpleCommand(); 

But of course, something like this might come along:

>> str = ' fprintf(''%d%*c%3.0f\n'', value, args{:}); % Let''s do this! ';
>> regexprep(str, '%.*$', '') 
ans = 
    fprintf('        %//   <-- WRONG!

Obviously, we need to exclude from the match all comment characters that reside inside strings, while also taking into account that a single apostrophe (or a dot-apostrophe) directly following a statement is an operator, not a string delimiter.

Based on the assumption that the number of string opening/closing characters before the comment character must be even (which I know is incomplete, because of the matrix-transpose operator), I conjured up the following dynamic regex to handle this sort of case:

>> str = {
       'myFun( {''test'' ''%''}); % let''s '                 
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '        
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '       
       'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
       'A = A.'';%tight trailing comment'
   };
>> 
>> C = regexprep(str, '(^.*)(?@mod(sum(\1==''''''''),2)==0;)(%.*$)', '$1')

However,

C = 
    'myFun( {'test' '%'}); '              %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c'           %// FAIL
    'A = A.';'                            %// success (although I'm not sure why)

so I'm almost there, but not quite yet :)

Unfortunately I've exhausted the amount of time I can spend thinking about this and need to continue with other things, so perhaps someone else who has more time is friendly enough to think about these questions:

  1. Are comment characters inside strings the only exception I need to look out for?
  2. What is the correct and/or more efficient way to do this?

5 answers
Bombasti
Reply #2 · 2019-01-18 06:20

How about making sure all apostrophes before the comment come in pairs, like this:

>> str = {
       'myFun( {''test'' ''%''}); % let''s '                 
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '        
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '       
       'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
   };

>> C = regexprep(str, '^(([^'']*''[^'']*'')*[^'']*)%.*$', '$1')

C = 
    myFun( {'test' '%'}); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
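
One caveat, sketched here under the question's own observation about the transpose operator: this pattern requires every apostrophe before the % to be paired, so a line with a single transpose apostrophe does not match at all and its comment is left in place. The variable names below are illustrative only.

```matlab
% Pair-counting pattern from this answer (unchanged).
pat = '^(([^'']*''[^'']*'')*[^'']*)%.*$';

% A line with a string: apostrophes before the comment come in pairs.
withString    = 'disp(''100%''); % show it';
% A line with a transpose: one unpaired apostrophe before the comment.
withTranspose = 'A = A.'';%tight trailing comment';

outString    = regexprep(withString,    pat, '$1');
outTranspose = regexprep(withTranspose, pat, '$1');

% The transpose line is returned unchanged because the pattern never matches it.
disp(outString)
disp(outTranspose)
disp(isequal(outTranspose, withTranspose))
```

So the pairing idea covers strings that contain %, but transposes need separate handling.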
Anthone
Reply #3 · 2019-01-18 06:25

I prefer to abuse checkcode (the replacement for the old mlint) to do the parsing. Here is a suggestion:

function strNC = removeComments(str)
if iscell(str)
    strNC = cellfun(@removeComments, str, 'UniformOutput', false);
elseif regexp(str, '%', 'once')
    err = getCheckCodeId(str);
    strNC = regexprep(str, '%[^%]*$', '');
    errNC = getCheckCodeId(strNC);
    if strcmp(err, errNC),
        strNC = removeComments(strNC);
    else
        strNC = str;
    end
else
    strNC = str;
end
end

function errid = getCheckCodeId(line)
fName = 'someTempFileName.m';
fh = fopen(fName, 'w');
fprintf(fh, '%s\n', line);
fclose(fh);
if exist('checkcode')
    structRep = checkcode(fName, '-id');
else
    structRep = mlint(fName, '-id');
end
delete(fName);
if isempty(structRep)
    errid = '';
else
    errid = structRep.id;
end
end

For each line, it checks if we introduce an error by trimming the line from last % to the end of line.

For your example it returns:

>> removeComments(str)

ans = 

    'myFun( {'test' '%'}); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n');  '
    'A = A.';'

It does not remove the suppression directive, %#ok, so you get:

>> removeComments('a=1; %#ok')

ans =

a=1; %#ok

Which probably is a good thing.

叛逆
Reply #4 · 2019-01-18 06:26

Look what I found! :)

The comment stripping toolbox, by Peter J. Acklam.

For m-code, it contains the following regex:

mainregex = [ ...
     ' (                   ' ... % Grouping parenthesis (content goes to $1).
     '   ( ^ | \n )        ' ... % Beginning of string or beginning of line.
     '   (                 ' ... % Non-capturing grouping parenthesis.
     '                     ' ...
     '' ... % Match anything that is neither a comment nor a string...
     '       (             ' ... % Non-capturing grouping parenthesis.
     '           [\]\)}\w.]' ... % Either a character followed by
     '           ''+       ' ... %    one or more transpose operators
     '         |           ' ... % or else
     '           [^''%]    ' ... %   any character except single quote (which
     '                     ' ... %   starts a string) or a percent sign (which
     '                     ' ... %   starts a comment).
     '       )+            ' ... % Match one or more times.
     '                     ' ...
     '' ...  % ...or...
     '     |               ' ...
     '                     ' ...
     '' ...  % ...match a string.
     '       ''            ' ... % Opening single quote that starts the string.
     '         [^''\n]*    ' ... % Zero or more chars that are neither single
     '                     ' ... %   quotes (special) nor newlines (illegal).
     '         (           ' ... % Non-capturing grouping parenthesis.
     '           ''''      ' ... % An embedded (literal) single quote character.
     '           [^''\n]*  ' ... % Again, zero or more chars that are neither
     '                     ' ... %   single quotes nor newlines.
     '         )*          ' ... % Match zero or more times.
     '       ''            ' ... % Closing single quote that ends the string.
     '                     ' ...
     '   )*                ' ... % Match zero or more times.
     ' )                   ' ...
     ' [^\n]*              ' ... % What remains must be a comment.
              ];

  % Remove all the blanks from the regex.
  mainregex = mainregex(~isspace(mainregex));

Which becomes

mainregex  = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*'

and should be used as

C = regexprep(str, mainregex, '$1')

So far, it's withstood all of my tests, so I think this should solve my problem quite nicely :)
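
As a quick check, the collapsed pattern can be run against a few of the lines from the question. This is only a sketch: the deblank call is an addition here to trim the blanks left in front of the removed comment, and is not part of the toolbox.

```matlab
% Collapsed pattern quoted above, written as a MATLAB character vector.
mainregex = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*';

% Test lines taken from the question, including the transpose case.
str = {
    'myFun( {''test'' ''%''}); % let''s '
    'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '
    'A = A.'';%tight trailing comment'
    };

% regexprep accepts a cell array and processes each element in turn;
% deblank (an addition, see above) then removes trailing whitespace.
C = deblank(regexprep(str, mainregex, '$1'));
disp(C)
```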

一纸荒年 Trace。
Reply #5 · 2019-01-18 06:30

This handles the conjugate-transpose case by checking which characters are allowed immediately before the apostrophe:

  1. Numbers 2'
  2. Letters A'
  3. Dot A.'
  4. Closing parenthesis, brace and bracket: A(1)', A{1}' and [1 2 3]'

These are the only cases I can think of now.

C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

On your example it returns:

>> C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

C = 

    'myFun( {'test' '%'}); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n');  '
    'A = A.';'
Reply #6 · 2019-01-18 06:33

How do you feel about using undocumented features? If you don't object, you can use the mtree function to parse the code and strip the comments. No regexps involved, and we all know that we shouldn't try to parse context-free grammars using regular expressions.

This function is a full parser of MATLAB code written in pure M-code. As far as I can tell, it is an experimental implementation, but it's already used by MathWorks in a few places (this is the same function used by MATLAB Cody and Contests to measure code length), and can be used for other useful things.

If the input is a cell array of strings, we do:

>> str = {..};
>> C = deblank(cellfun(@(s) tree2str(mtree(s)), str, 'UniformOutput',false))
C = 
    'myFun( { 'test', '%' } );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'A = A.';'

If you already have an M-file stored on disk, you can strip the comments simply as:

s = tree2str(mtree('myfile.m', '-file'))

If you want to keep the comments in the output, add the '-comments' option: mtree(.., '-comments')
