For non-MATLAB-savvy readers: I'm not sure what family they belong to, but the MATLAB regexes are described here in full detail. MATLAB's comment character is % (percent) and its string delimiter is ' (apostrophe). A string delimiter inside a string is written as a double apostrophe ('this is how you write "it''s" in a string.'). To complicate matters more, the matrix transpose operators are also apostrophes (A' (Hermitian) or A.' (regular)).
Now, for dark reasons (that I will not elaborate on :), I'm trying to interpret MATLAB code in MATLAB's own language.
Currently I'm trying to remove all trailing comments in a cell-array of strings, each containing a line of MATLAB code. At first glance, this might seem simple:
>> str = 'simpleCommand(); % simple trailing comment';
>> regexprep(str, '%.*$', '')
ans =
simpleCommand();
But of course, something like this might come along:
>> str = ' fprintf(''%d%*c%3.0f\n'', value, args{:}); % Let''s do this! ';
>> regexprep(str, '%.*$', '')
ans =
fprintf(' %// <-- WRONG!
Obviously, we need to exclude all comment characters that reside inside strings from the match, while also taking into account that a single apostrophe (or a dot-apostrophe) directly following a statement is an operator, not a string delimiter.
Based on the assumption that the number of string opening/closing characters before the comment character must be even (which I know is incomplete, because of the matrix-transpose operator), I conjured up the following dynamic regex to handle this sort of case:
>> str = {
'myFun( {''test'' ''%''}); % let''s '
'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '
'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '
'sprintf(str, ''%*8.0f%*s%c%3d\n''); '
'A = A.'';%tight trailing comment'
};
>>
>> C = regexprep(str, '(^.*)(?@mod(sum(\1==''''''''),2)==0;)(%.*$)', '$1')
However,
C =
'myFun( {'test' '%'}); ' %// success
'sprintf(str, '%*8.0f%*s%c%3d\n'); ' %// success
'sprintf(str, '%*8.0f%*s%c%3d\n'); ' %// success
'sprintf(str, '%*8.0f%*s%c' %// FAIL
'A = A.';' %// success (although I'm not sure why)
so I'm almost there, but not quite yet :)
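The even-count assumption can also be checked directly, without a dynamic regex. The following is a minimal sketch, assuming a hypothetical helper named stripByParity saved in its own file. It is not part of the original attempt, and it has the same blind spot for transpose operators:

```matlab
function out = stripByParity(line)
% STRIPBYPARITY  Hypothetical helper, for illustration only.
% Cut LINE at the first '%' that is preceded by an even number of
% apostrophes. Transpose operators are NOT handled.
out = line;
for k = find(line == '%')
    if mod(sum(line(1:k-1) == ''''), 2) == 0
        out = line(1:k-1);   % drop the comment, keep blanks before it
        return
    end
end
end
```

It could be applied to the whole cell array with cellfun(@stripByParity, str, 'UniformOutput', false). A line without a comment comes back unchanged, but a comment that follows a lone transpose apostrophe is not stripped.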
Unfortunately I've exhausted the amount of time I can spend thinking about this and need to continue with other things, so perhaps someone else who has more time is friendly enough to think about these questions:
- Are comment characters inside strings the only exception I need to look out for?
- What is the correct and/or more efficient way to do this?
How about making sure all apostrophes before the comment come in pairs, like this:
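One way to express the paired-apostrophe idea is sketched below. The pattern is my own illustration rather than the answerer's exact expression; pat and C are just illustrative names. It consumes plain characters or complete '...' pairs, and only then treats the next % as the start of a comment:

```matlab
% Sketch only: consume plain characters or complete '...' pairs, then
% treat the first remaining % as the start of the comment.
% Limitation: a transpose apostrophe is mistaken for a string opener,
% so such lines are simply left unchanged.
pat = '^((?:[^''%]|''[^'']*'')*)%.*$';

str = {
    'myFun( {''test'' ''%''}); % let''s '
    'sprintf(str, ''%*8.0f%*s%c%3d\n''); '
    };
C = regexprep(str, pat, '$1');
```

Because every apostrophe must open a complete pair, a doubled apostrophe inside a string is read as two adjacent strings, which is harmless here. A line with no comment does not match the pattern at all and is returned untouched.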
I prefer to abuse checkcode (the replacement for the old mlint) to do the parsing. Here is a suggestion:
For each line, it checks whether trimming the line from the last % to the end of the line introduces an error. For your example it returns:
It does not remove the suppression directive, %#ok, so you get:
Which is probably a good thing.
Look what I found! :)
The comment stripping toolbox, by Peter J. Acklam.
For m-code, it contains the following regex:
Which becomes:
and should be used as:
So far, it's withstood all of my tests, so I think this should solve my problem quite nicely :)
This matches the conjugate transpose case by checking which characters are allowed before one: 2', A', A.', A(1)', A{1}' and [1 2 3]'.
These are the only cases I can think of for now.
On your example it returns:
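To illustrate the "allowed preceding character" idea, here is a hedged sketch. It is my own pattern, not necessarily the one used in this answer. It treats an apostrophe as a transpose when it directly follows a letter, digit, underscore, dot, or closing bracket, and as a string delimiter otherwise. It assumes MATLAB's atomic grouping (?>...) and lookbehind (?<=...) behave as documented:

```matlab
% Sketch only. Each iteration consumes one of:
%   a plain character, a transpose apostrophe, or a whole '...' string
% (with '' allowed inside). The atomic group stops the engine from
% re-reading a transpose as a string opener when it backtracks.
pat = ['^((?>[^''%]|(?<=[a-zA-Z0-9_.)}\]])''|''(?:[^'']|'''')*'')*)' ...
       '%.*$'];

str = {
    'A = A.'';%tight trailing comment'
    'x = [1 2 3]''; % it''s transposed'
    'sprintf(str, ''%*8.0f%*s%c%3d\n''); '
    };
C = regexprep(str, pat, '$1');
```

Edge cases such as a string directly after a keyword (for example case'abc') would still be misread, so this is a heuristic rather than a parser.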
How do you feel about using undocumented features? If you don't object, you can use the mtree function to parse the code and strip the comments. No regexps involved, and we all know that we shouldn't try to parse context-free grammars using regular expressions.
This function is a full parser of MATLAB code written in pure M-code. As far as I can tell, it is an experimental implementation, but it's already used by MathWorks in a few places (this is the same function used by MATLAB Cody and Contests to measure code length), and can be used for other useful things.
If the input is a cell array of strings, we do:
If you already have an M-file stored on disk, you can strip the comments simply as:
If you want to see the comments back, add:
mtree(.., '-comments')
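As a rough sketch of how these calls might fit together: mtree is undocumented, so the tree2str method and the '-file' option used below are assumptions that may differ between MATLAB releases, and myfile.m is a hypothetical file name.

```matlab
% Sketch only: relies on undocumented behaviour of mtree.
str = {
    'x = 1; % trailing comment'
    'y = x'';   % transpose, then comment'
    };
src  = strjoin(str', sprintf('\n'));  % one char vector, newline-separated
tree = mtree(src);                    % parse the text; comments are assumed dropped by default
s    = tree2str(tree);                % regenerate source text from the parse tree
C    = regexp(s, '\n', 'split')';     % back to one line per cell

% For a file on disk (hypothetical name), assuming the '-file' option:
% s = tree2str(mtree('myfile.m', '-file'));
```

The regenerated text may not preserve the original whitespace exactly, so this fits best when only the code itself matters.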