Parsing Source Code - Unique Identifiers for Diffe

Posted 2019-03-25 06:34

I'm building an application that receives source code as input and analyzes several aspects of the code. It can accept code from many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (however many languages are unsupported, e.g. Ada, Cobol, Fortran). Once the language is known, my application knows what to do (I have different handlers for different languages).

Currently I ask the user to select the programming language the code is written in, and this is error-prone: although users know their programming languages, a small percentage of them occasionally click the wrong option out of carelessness, and that breaks the system (i.e. my analysis fails).

It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:

  • I'm receiving pure text and not file names, so I can't use the extension as a hint.
  • The user is not required to input complete source files and can also input code snippets (i.e. the include/import part may not be included).
  • It's clear to me that no algorithm I choose will be 100% foolproof, certainly not for very short inputs (e.g. ones that could be accepted by both Python and Ruby). In those cases I will still need the user's assistance, but I would like to minimize user involvement in the process to minimize mistakes.

Examples:

  • If the text contains "x->y()", I may know for sure it's C++ (?)
  • If the text contains "public static void main", I may know for sure it's Java (?)
  • If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)

My question:

  1. Are you familiar with any standard library/method for figuring out automatically what the language of an input source code is?
  2. What are the unique code "tokens" with which I could certainly differentiate one language from another?

I'm writing my code in Python but I believe the question to be language agnostic.

Thanks

14 Answers
男人必须洒脱
#2 · 2019-03-25 06:39

One program I know which even can distinguish several different languages within the same file is ohcount. You might get some ideas there, although I don't really know how they do it.

In general you can look for distinctive patterns:

  • Operators might be an indicator, such as := for Pascal/Modula/Oberon, => or the whole of LINQ in C#
  • Keywords would be another one as probably no two languages have the same set of keywords
  • Casing rules for identifiers, assuming the piece of code was written in conformance with best practices. This is probably a very weak rule.
  • Standard library functions or types, especially for languages that rely heavily on them. For PHP, for example, you might just use a long list of standard library functions.

You may create a set of rules, each of which indicates a possible set of languages if it matches. Intersecting the resulting lists will hopefully get you only one language.
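The rule-intersection idea above could be sketched roughly as follows. This is only a minimal illustration: the `RULES` table, its patterns and the language sets attached to them are assumptions made up for the example, and the patterns are matched against raw text, so they ignore the tokenizing problem discussed next.

```python
import re

# Hypothetical rules: each pattern maps to the set of languages it is
# compatible with. Patterns and sets are illustrative, not exhaustive.
RULES = [
    (re.compile(r":="), {"Pascal", "Modula", "Oberon"}),
    (re.compile(r"\bbegin\b", re.IGNORECASE), {"Pascal", "Modula", "Oberon", "Ruby"}),
    (re.compile(r"=>"), {"C#", "PHP", "Perl", "Ruby"}),
    (re.compile(r"\$\w+"), {"PHP", "Perl"}),
    (re.compile(r"<\?php"), {"PHP"}),
]

# Every language mentioned by at least one rule.
ALL_LANGUAGES = set().union(*(langs for _, langs in RULES))


def candidate_languages(snippet):
    """Intersect the language sets of every rule that matches the snippet."""
    candidates = set(ALL_LANGUAGES)
    for pattern, langs in RULES:
        if pattern.search(snippet):
            candidates &= langs
    # If no rule matched, all languages remain: the rules gave no information.
    return candidates
```

An empty result means the matched rules contradict each other, and a result with several languages means the snippet is still ambiguous; in both cases you would fall back to asking the user.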

The problem with this approach however, is that you need to do tokenizing and compare tokens (otherwise you can't really know what operators are or whether something you found was inside a comment or string). Tokenizing rules are different for each language as well, though; just splitting everything at whitespace and punctuation will probably not yield a very useful sequence of tokens. You can try several different tokenizing rules (each of which would indicate a certain set of languages as well) and have your rules match to a specified tokenization. For example, trying to find a single-quoted string (for trying out Pascal) in a VB snippet with one comment will probably fail, but another tokenizer might have more luck.

But since you want to perform analysis anyway, you probably have parsers for the languages you support, so you can just try running the snippet through each parser and take the result as an indicator of which language it is (as suggested by OregonGhost as well).
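As a rough sketch of the "try every parser" approach: the `PARSERS` registry below is hypothetical, and only the Python entry is backed by a real parser (the standard library's `ast.parse`). The handlers for other languages would have to come from your own application.

```python
import ast


def parses_as_python(snippet):
    """Return True if Python's own parser accepts the snippet."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        return False


# Hypothetical registry: language name -> function returning True when that
# language's parser accepts the text. Only the Python handler is real here.
PARSERS = {
    "Python": parses_as_python,
}


def languages_that_parse(snippet, parsers=PARSERS):
    """List every registered language whose parser accepts the snippet."""
    return [name for name, accepts in parsers.items() if accepts(snippet)]
```

Keep in mind that a partial snippet can be rejected by the parser of its own language, so a failed parse is evidence against a language rather than proof.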

ゆ 、 Hurt°
#3 · 2019-03-25 06:39

Some thoughts:

$x->y() would be valid in PHP, so ensure that there's no $ symbol if you think C++ (though I think you can store function pointers in a C struct, so this could also be C).

public static void main is Java if it is cased properly - write Main and it's C#. This gets complicated if you take case-insensitive languages like many scripting languages or Pascal into account. The [] attribute syntax in C# on the other hand seems to be rather unique.

You can also try to use the keywords of a language - for example, Option Strict or End Sub are typical for VB and the like, while yield is likely C# and initialization/implementation are Object Pascal / Delphi.
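A few of these hints can be written as regular expressions. The sketch below is only an illustration: the `SIGNATURES` list and its labels are assumptions, the patterns run on raw text (so they also fire inside comments and strings), and no single match is conclusive.

```python
import re

# Illustrative signatures based on the hints above; none is conclusive.
SIGNATURES = [
    # $x->y : the leading $ points to PHP
    ("PHP", re.compile(r"\$\w+->\w+")),
    # x->y with no leading $ : could be C or C++
    ("C or C++", re.compile(r"(?<![$\w])[A-Za-z_]\w*->\w+")),
    # lower-case main, matched case-sensitively
    ("Java", re.compile(r"\bpublic static void main\b")),
    # capitalised Main, matched case-sensitively
    ("C#", re.compile(r"\bstatic void Main\b")),
    # VB is case-insensitive, so the pattern is as well
    ("VB", re.compile(r"\b(Option Strict|End Sub)\b", re.IGNORECASE)),
]


def signature_hints(snippet):
    """Return the label of every signature found in the snippet."""
    return [lang for lang, pattern in SIGNATURES if pattern.search(snippet)]
```

For example, `signature_hints("$x->y()")` reports only PHP, because the negative lookbehind stops the C/C++ pattern from matching an identifier that follows a `$`.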

If your application is analyzing the source code anyway, you could try to throw your analysis code at it for every language, and if it fails really badly, it was the wrong language :)

时光不老,我们不散
#4 · 2019-03-25 06:41

Build a generic tokenizer and then use a Bayesian filter on the tokens. Use the existing "user checks a box" system to train it.
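A minimal sketch of that idea, using only the standard library: a language-neutral tokenizer feeding a multinomial naive Bayes classifier with add-one smoothing. The class name `NaiveBayesLanguageGuesser`, the token regex and the smoothing choice are all assumptions made for illustration; the training samples would come from snippets whose language the user has already confirmed.

```python
import math
import re
from collections import Counter, defaultdict

# Identifiers, digit runs, and runs of punctuation (so "->" and ":=" stay whole).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]+")


def tokenize(text):
    """Generic tokenizer that knows nothing about any particular language."""
    return TOKEN_RE.findall(text)


class NaiveBayesLanguageGuesser:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token frequencies
        self.sample_counts = Counter()            # language -> number of samples
        self.vocabulary = set()

    def train(self, snippet, language):
        """Record one snippet whose language is already known."""
        tokens = tokenize(snippet)
        self.token_counts[language].update(tokens)
        self.sample_counts[language] += 1
        self.vocabulary.update(tokens)

    def log_scores(self, snippet):
        """Return log P(language) + sum of log P(token | language)."""
        tokens = tokenize(snippet)
        total_samples = sum(self.sample_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for language, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            score = math.log(self.sample_counts[language] / total_samples)
            for token in tokens:
                # Add-one smoothing: unseen tokens must not zero out a language.
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size + 1))
            scores[language] = score
        return scores

    def guess(self, snippet):
        """Return the highest-scoring language (requires prior training)."""
        scores = self.log_scores(snippet)
        return max(scores, key=scores.get)
```

Usage would look like `guesser.train(snippet, language)` for every confirmed submission, then `guesser.guess(new_snippet)` for a new one. With only a handful of training samples the guesses will be unreliable.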

Fickle 薄情
#5 · 2019-03-25 06:41

There is no way of making this foolproof, but I would personally start with operators, since they are in most cases "set in stone" (I can't say this holds true for every language, since I know only a limited set). This would narrow it down quite considerably, but not nearly enough. For instance, "->" is used in many languages (at least C, C++ and Perl).

I would go for something like this:

Create a list of features for each language. These could be operators or commenting style (since most languages use some sort of easily detectable character or character combination).

For instance, some languages have lines that start with the character "#"; these include C, C++ and Perl. Do any languages other than the first two use #include and #define in their vocabulary? If you detect this character at the beginning of a line, the language is probably one of those. If the character is in the middle of the line, the language is most likely Perl.
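The "#" heuristic could be sketched like this. The function name `hash_profile` and the three categories are assumptions for illustration, and the check is naive: it does not notice when a "#" sits inside a string literal.

```python
import re

# A line such as "#include <stdio.h>" or "  # define N 3".
PREPROCESSOR_RE = re.compile(r"^\s*#\s*(include|define)\b")


def hash_profile(snippet):
    """Count how '#' is used: preprocessor lines, line-start, or mid-line."""
    profile = {"preprocessor": 0, "line_start": 0, "mid_line": 0}
    for line in snippet.splitlines():
        if PREPROCESSOR_RE.match(line):
            profile["preprocessor"] += 1
        elif line.lstrip().startswith("#"):
            profile["line_start"] += 1
        elif "#" in line:
            profile["mid_line"] += 1
    return profile
```

A high "preprocessor" count would point towards C or C++, while "mid_line" hits would point towards a language with "#" comments, such as Perl.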

Also, if you find the pattern := this would narrow it down to some likely languages.

Etc.

I would have a two-dimensional table with languages and patterns found and after analysis I would simply count which language had most "hits". If I wanted it to be really clever I would give each feature a weight which would signify how likely or unlikely it is that this feature is included in a snippet of this language. For instance if you can find a snippet that starts with /* and ends with */ it is more than likely that this is either C or C++.
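The weighted table could look something like the sketch below. The `FEATURES` list and every weight in it are made-up numbers for illustration only; in practice you would tune them against real snippets.

```python
import re

# Hypothetical feature table: (pattern, {language: weight}).
# The weights are invented for this example.
FEATURES = [
    (re.compile(r":="), {"Pascal": 3.0}),
    (re.compile(r"->"), {"C": 1.0, "C++": 1.0, "Perl": 1.0}),
    (re.compile(r"/\*.*?\*/", re.DOTALL), {"C": 1.0, "C++": 1.0}),
    (re.compile(r"^\s*#\s*(include|define)\b", re.MULTILINE), {"C": 2.0, "C++": 2.0}),
    (re.compile(r"\bclass\b"), {"C++": 1.5}),
]


def score_languages(snippet):
    """Sum the weights of all matching features, best-scoring language first."""
    scores = {}
    for pattern, weights in FEATURES:
        if pattern.search(snippet):
            for language, weight in weights.items():
                scores[language] = scores.get(language, 0.0) + weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the result is sorted by score, the first entry can be used as the pre-selected guess and the remaining entries as the ordered alternatives.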

The problem with keywords is that someone might use one as an ordinary identifier or even inside a comment. They can be used as a decider (e.g. the word "class" is much more likely in C++ than in C if everything else is equal), but you can't rely on them.

After the analysis I would offer the most likely language as the default choice, with the remaining languages listed in order of likelihood and still selectable. The user could then accept your guess by simply clicking a button, or switch it easily.

forever°为你锁心
#6 · 2019-03-25 06:43

Vim has a filetype autodetection feature. If you download the Vim source code, you will find a /vim/runtime/filetype.vim file.

For each language it checks the extension of the file and also, for some of them (most common), it has a function that can get the filetype from the source code. You can check that out. The code is pretty easy to understand and there are some very useful comments there.

手持菜刀,她持情操
#7 · 2019-03-25 06:44

Very interesting question. I don't know whether it is possible to distinguish languages by code snippets, but here are some ideas:

  • One simple way is to watch out for single quotes: in some languages they delimit a single character, whereas in others they can contain a whole string.
  • A unary asterisk or a unary ampersand operator is a strong indication that it's one of C, C++ or C#.
  • Pascal is the only language (of the ones given) to use the two-character operator := for assignment. Pascal has many distinctive keywords, too (begin, end, procedure, ...)
  • The class initialization with a function could be a nice hint for Java.
  • Functions that do not belong to a class eliminate Java (there is no free-standing max(), for example)
  • Naming of basic types (bool vs boolean)
  • Which reminds me: C++ can look very different across projects (#define boolean int), so you can never guarantee that you found the correct language.
  • If you run the source code through a hashing algorithm and it looks the same, you're most likely analyzing Perl
  • Indentation is a good hint for Python
  • You could use functions provided by the languages themselves, like token_get_all() for PHP, or third-party tools, like pychecker for Python, to check the syntax
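As one example, the indentation hint for Python from the list above could be approximated as follows. The function name `python_indentation_hint` and the scoring are assumptions for illustration. The check is deliberately naive: it only looks at whether a line ending in ":" is followed by a more deeply indented line.

```python
def python_indentation_hint(snippet):
    """Fraction of colon-terminated lines followed by a deeper-indented line."""
    # Blank lines carry no indentation information, so drop them first.
    lines = [line for line in snippet.splitlines() if line.strip()]
    openers = hits = 0
    for current, following in zip(lines, lines[1:]):
        if current.rstrip().endswith(":"):
            openers += 1
            indent_now = len(current) - len(current.lstrip())
            indent_next = len(following) - len(following.lstrip())
            if indent_next > indent_now:
                hits += 1
    return hits / openers if openers else 0.0
```

A value close to 1.0 is a hint towards Python, and 0.0 means the snippet gave no such evidence. It should be combined with other features rather than trusted on its own.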

Summing it up: This project would make an interesting research paper (IMHO) and if you want it to work well, be prepared to put a lot of effort into it.
