I'm building an application that receives source code as input and analyzes several aspects of the code. It can accept code from many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (however many languages are unsupported, e.g. Ada, Cobol, Fortran). Once the language is known, my application knows what to do (I have different handlers for different languages).
Currently I'm asking the user to input the programming language the code is written in, and this is error-prone: although users know the programming languages, a small percentage of them (on rare occasions) click the wrong option just due to recklessness, and that breaks the system (i.e. my analysis fails).
It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:
- I'm receiving pure text and not file names, so I can't use the extension as a hint.
- The user is not required to input complete source codes, and can also input code snippets (i.e. the include/import part may not be included).
- it's clear to me that any algorithm I choose will not be 100% proof, certainly for very short input codes (e.g. that could be accepted by both Python and Ruby), in which cases I will still need the user's assistance, however I would like to minimize user involvement in the process to minimize mistakes.
Examples:
- If the text contains "x->y()", I may know for sure it's C++ (?)
- If the text contains "public static void main", I may know for sure it's Java (?)
- If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)
My question:
- Are you familiar with any standard library/method for figuring out automatically what the language of an input source code is?
- What are the unique code "tokens" with which I could certainly differentiate one language from another?
I'm writing my code in Python but I believe the question to be language agnostic.
Thanks
One program I know which even can distinguish several different languages within the same file is ohcount. You might get some ideas there, although I don't really know how they do it.
In general you can look for distinctive patterns:
:=
for Pascal/Modula/Oberon,=>
or the whole of LINQ in C#You may create a set of rules, each of which indicates a possible set of languages if it matches. Intersecting the resulting lists will hopefully get you only one language.
The problem with this approach however, is that you need to do tokenizing and compare tokens (otherwise you can't really know what operators are or whether something you found was inside a comment or string). Tokenizing rules are different for each language as well, though; just splitting everything at whitespace and punctuation will probably not yield a very useful sequence of tokens. You can try several different tokenizing rules (each of which would indicate a certain set of languages as well) and have your rules match to a specified tokenization. For example, trying to find a single-quoted string (for trying out Pascal) in a VB snippet with one comment will probably fail, but another tokenizer might have more luck.
But since you want to perform analysis anyway you probably have parsers for the languages you support, so you can just try running the snippet through each parser and take that as indicator which language it would be (as suggested by OregonGhost as well).
Some thoughts:
$x->y() would be valid in PHP, so ensure that there's no $ symbol if you think C++ (though I think you can store function pointers in a C struct, so this could also be C).
public static void main
is Java if it is cased properly - write Main and it's C#. This gets complicated if you take case-insensitive languages like many scripting languages or Pascal into account. The [] attribute syntax in C# on the other hand seems to be rather unique.You can also try to use the keywords of a language - for example,
Option Strict
orEnd Sub
are typical for VB and the like, whileyield
is likely C# andinitialization
/implementation
are Object Pascal / Delphi.If your application is analyzing the source code anyway, you code try to throw your analysis code at it for every language and if it fails really bad, it was the wrong language :)
build a generic tokenizer and then use a Bayesian filter on them. Use the existing "user checks a box" system to train it.
There is no way of making this foolproof, but I would personally start with operators, since they are in most cases "set in stone" (I can't say this holds true to every language since I know only a limited set). This would narrow it down quite considerably, but not nearly enough. For instance "->" is used in many languages (at least C, C++ and Perl).
I would go for something like this:
Create a list of features for each language, these could be operators, commenting style (since most use some sort of easily detectable character or character combination).
For instance: Some languages have lines that start with the character "#", these include C, C++ and Perl. Do others than the first two use #include and #define in their vocabulary? If you detect this character at the beginning of line, the language is probably one of those. If the character is in the middle of the line, the language is most likely Perl.
Also, if you find the pattern := this would narrow it down to some likely languages.
Etc.
I would have a two-dimensional table with languages and patterns found and after analysis I would simply count which language had most "hits". If I wanted it to be really clever I would give each feature a weight which would signify how likely or unlikely it is that this feature is included in a snippet of this language. For instance if you can find a snippet that starts with /* and ends with */ it is more than likely that this is either C or C++.
The problem with keywords is someone might use it as a normal variable or even inside comments. They can be used as a decider (e.g. the word "class" is much more likely in C++ than C if everything else is equal), but you can't rely on them.
After the analysis I would offer the most likely language as the choice for the user with the rest ordered which would also be selectable. So the user would accept your guess by simply clicking a button, or he can switch it easily.
Vim has a autodetect filetype feature. If you download vim sourcecode you will find a /vim/runtime/filetype.vim file.
For each language it checks the extension of the file and also, for some of them (most common), it has a function that can get the filetype from the source code. You can check that out. The code is pretty easy to understand and there are some very useful comments there.
Very interesting question, I don't know if it is possible to be able to distinguish languages by code snippets, but here are some ideas:
:=
. Pascal has many unique keywords, too (begin, sub, end, ...)max()
, for example)#define boolean int
) So you can never guarantee, that you found the correct language.token_get_all()
for PHP - or third-party tools - like pychecker for python - to check the syntaxSumming it up: This project would make an interesting research paper (IMHO) and if you want it to work well, be prepared to put a lot of effort into it.