I am trying to create a file listing all function/enum/struct/etc. names from a source file. For that, I am currently trying to use sed to accomplish something like this:
(original file)
function add1 (int i) {
return i+1;
}
(output of sed)
function add1 (int i) {
}
In other words, I want to remove the actual contents of the function's body. So far I have not been able to get it to work. Any suggestions?
EDIT: I tried something like this, with no success (for now I am trying to only make the lines on the function's body blank):
sed '/{/,/}/ s/.*//'
Instead of sed, you could always use awk in per-character field mode (FS=""):
awk 'BEGIN {
    RS = "\n" ;
    FS = "" ;
    d = 0 ;
}
{
    for (i=1; i<=NF; i++)
        if ($i == "{") {
            d++ ;
            if (d == 1) printf "{\n"
        } else
        if ($i == "}") {
            d-- ;
            if (d == 0) printf "}"
        } else
        if (d == 0)
            printf "%s", $i ;
    if (d == 0) printf "\n"
}' INPUT-FILE(s)...
The above will skip the contents of any paired curly braces, i.e. function and structure bodies, array initializations, and so on, and output the result to standard output. You can specify one or more files. (If you don't specify any files, it'll expect input from standard input.)
As it is now, it will get confused about braces within quotes or comments. That could be fixed in the same way, but it does get quite complicated fast. This is just a hack to get you most of the way.
I added the semicolons (;) so you can just stuff everything in the above snippet on one long command line.
The logic of the script is very simple. It uses the empty field separator (FS), so that every character in the input becomes its own field. The BEGIN rule is run once before any input is processed, and sets this up. For the benefit of human readers, I also initialize d = 0, although this is not necessary in awk, since it treats uninitialized variables as empty or zero as appropriate. d tracks the current brace depth as each input character is processed.
The second braced block is executed once for every record. Since I set RS = "\n", each line is a separate record, so the block is executed once per input line. Due to FS = "", each character on that line is a separate field. There are NF fields in the record: $1, $2, ..., $(NF-1), and $NF. The three-part if clause simply outputs the outermost braces, and everything not within braces (i.e. when d == 0).
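To make the per-character splitting and depth tracking described above more concrete, here is a small sketch that prints each character together with the brace depth at that point. The function name show_depth is made up for illustration, and the sketch assumes an awk that splits a record into single characters when FS is empty (gawk and mawk do; POSIX leaves the behavior unspecified).

```shell
# Hypothetical helper: print every input character with its brace depth.
# Assumes an awk that treats FS = "" as "one field per character".
show_depth() {
    awk 'BEGIN { FS = "" }
    {
        for (i = 1; i <= NF; i++) {
            if ($i == "{") d++                       # entering a block
            printf "%s%s:%d", (i > 1 ? " " : ""), $i, d
            if ($i == "}") d--                       # leaving a block
        }
        printf "\n"
    }'
}

printf 'a{b}\n' | show_depth    # prints: a:0 {:1 b:1 }:1
```

Only characters reported at depth 0, plus the braces that open and close depth 1, would survive in the output of the real script.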
It is possible to extend this awk scriptlet to encompass comments, strings, character constants (use \047 to refer to a single quote, unless you put the script into a separate file with #!/usr/bin/awk -f), and to process or ignore preprocessor macros.
It does get a bit complicated, and you'll end up with a couple of hundred lines of awk script, but it should be quite reliable and reasonably fast. This is possible because the tokenization rules of C are easy to follow in this particular case; in all other use cases I personally would use a full-blown C lexer (lexical analyzer or scanner). And probably for this one, too.
If you want to use a full-blown C lexer, there are a number of them available freely on the net, but you'll have to use a higher level language like C or C++. If you wish to handle all the corner cases, it'll need to incorporate a C/C++ preprocessor, too, but those rules are easy (even with awk).
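As a rough illustration of the awk extension mentioned above, the following sketch ignores braces that appear inside string literals, character constants, and comments. It is only a partial sketch under stated assumptions: the function name strip_bodies and all variable names are invented here, preprocessor macros are not handled at all, and it again relies on an awk that splits on an empty FS (gawk, mawk).

```shell
# Hypothetical helper: like the short script, but braces inside strings,
# character constants and comments do not change the depth.
strip_bodies() {
    awk '
    function emit(s) { if (depth == 0) printf "%s", s }   # print only outside braces
    BEGIN { FS = "" }
    {
        for (i = 1; i <= NF; i++) {
            c = $i; nxt = $(i + 1)                # current character and lookahead
            if (block) {                          # inside a block comment
                emit(c)
                if (c == "*" && nxt == "/") { emit(nxt); i++; block = 0 }
            } else if (line) {                    # inside a // comment
                emit(c)
            } else if (quote != "") {             # inside a string or char constant
                emit(c)
                if (c == "\\") { emit(nxt); i++ } else if (c == quote) quote = ""
            } else if (c == "/" && nxt == "*") {
                block = 1; emit(c nxt); i++
            } else if (c == "/" && nxt == "/") {
                line = 1; emit(c)
            } else if (c == "\"" || c == "\047") { # \047 is the single quote
                quote = c; emit(c)
            } else if (c == "{") {
                if (++depth == 1) printf "{\n"
            } else if (c == "}") {
                if (--depth == 0) printf "}"
            } else {
                emit(c)
            }
        }
        line = 0                                  # a // comment ends with its line
        if (depth == 0) printf "\n"
    }'
}
```

It reads standard input, so something like strip_bodies < file.c (file name hypothetical) would print the stripped source. A complete version would also need to deal with preprocessor directives and line continuations inside // comments.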
On a consistently-formatted file, you can do something like
sed '/{$/ {:r;/\n}/!{N;br}; s/\n.*\n/\n/}'
reading the function body at once and deleting everything between curly brackets:
$ echo 'function add1 (int i) {
if (i == 1) {return i+1;}
}' | sed '/{$/ {:r;/\n}/!{N;br}; s/\n.*\n/\n/}'
function add1 (int i) {
}
The command only works on blocks whose opening { comes directly before a newline and whose closing } comes directly after a newline.
In the :r;/\n}/!{N;br} part, :r defines a label named r. As long as the pattern space does not contain \n}, another line is appended to the pattern space from the input (N), and the execution flow jumps back to r (br). This repeats until \n} is encountered. So when we are out of that "loop", we have the whole function body in the pattern space, and then we apply the s command.
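For readers who find the one-liner hard to parse, here is the same sequence of commands spread over several lines with comments. This is a sketch, not a different technique; the wrapper name strip_block_bodies is made up, and like the one-liner it assumes GNU sed (in particular for \n in the replacement part of s).

```shell
# Hypothetical wrapper around the same sed commands, one per line.
strip_block_bodies() {
    sed '
    # Only start on lines that end with an opening brace.
    /{$/ {
        :r
        # Until a line starting with a closing brace has been read,
        # append the next input line and jump back to label r.
        /\n}/!{
            N
            b r
        }
        # Drop everything between the first and the last newline.
        s/\n.*\n/\n/
    }'
}

printf '%s\n' 'function add1 (int i) {' 'return i+1;' '}' | strip_block_bodies
```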
I would first suggest making sure that your C source file is properly indented. You could use indent -gnu for that.
Then you could use some sed tricks. With properly indented code, you only need to care about braces (opening or closing) as the first character of their lines.
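One possible form of such a trick, as a sketch only: assuming the source has already been reformatted so that the outermost braces of each function or struct sit in the first column (which is how the GNU style lays them out), a sed range from a line starting with { to a line starting with } can delete everything in between. The function name strip_toplevel_bodies is invented for this example, and the formatting step itself is not part of it.

```shell
# Hypothetical helper: expects input whose outermost braces are in column 0.
strip_toplevel_bodies() {
    # Inside each range from a line starting with an opening brace to a line
    # starting with a closing brace, delete lines that do not start with one.
    sed '/^{/,/^}/{/^[{}]/!d;}'
}

printf '%s\n' 'int' 'add1 (int i)' '{' '  return i + 1;' '}' | strip_toplevel_bodies
```

Nested blocks are removed along with the rest of the body because their braces are indented and so never match at the start of a line.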
I'm not sure I understand why you want to do that. In particular, struct declarations can be, and sometimes really are, nested. And there are pathological cases, e.g. preprocessor macros defining stuff with braces.
A better way might be to operate on the compiler internals (but then you have to deal with stuff coming from #include-d headers). You could use MELT for that purpose (MELT is a high-level domain-specific language for extending GCC, and it works on GCC internals).