Detecting programming language from a snippet

Posted 2019-01-03 11:41

What would be the best way to detect what programming language is used in a snippet of code?

17 answers
ら.Afraid
#2 · 2019-01-03 12:14

First, I would try to find keywords that are specific to a language, e.g.

"package", "class", "implements" => Java
"<?php" => PHP
"include", "main", "fopen", "strcmp", "stdout" => C
"cout" => C++
etc.
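
A minimal sketch of that lookup, assuming a hand-written marker table (the table, the function name guess_by_markers, and the "most markers wins" rule are illustrative choices, not something the answer specifies):

```python
import re

# Hypothetical marker table built from the examples above; extend as needed.
MARKERS = [
    ("PHP", ["<?php"]),
    ("Java", ["package", "class", "implements"]),
    ("C++", ["cout"]),
    ("C", ["include", "main", "fopen", "strcmp", "stdout"]),
]

def guess_by_markers(snippet):
    """Return the language with the most distinct markers present, or None."""
    # Collect "<?php" plus every identifier-like word in the snippet.
    words = set(re.findall(r"<\?php|[A-Za-z_]\w*", snippet))
    best_lang, best_hits = None, 0
    for lang, markers in MARKERS:
        hits = sum(1 for marker in markers if marker in words)
        if hits > best_hits:
            best_lang, best_hits = lang, hits
    return best_lang
```

Counting alone is fragile: a C++ snippet that contains include, main and cout matches two C markers and only one C++ marker, so this sketch would report C.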
神经病院院长
#3 · 2019-01-03 12:15

Language detection solved by others:

Ohloh's approach: https://github.com/blackducksw/ohcount/

Github's approach: https://github.com/github/linguist

唯我独甜
#4 · 2019-01-03 12:15

The best solution I have come across is using the linguist gem in a Ruby on Rails app. It's a rather specific way to do it, but it works. This was mentioned above by @nisc, but I will give my exact steps for using it. (Some of the following commands are specific to Ubuntu but should translate easily to other operating systems.)

If you have a Rails app that you don't mind temporarily modifying, create a new file in it to hold the code snippet in question. (If you don't have Rails installed, there's a good guide here, although for Ubuntu I recommend this. Then run rails new <name-your-app-dir> and cd into that directory. Everything you need to run a Rails app is already there.)

Once you have a Rails app to use, add gem 'github-linguist' to your Gemfile (the file is literally just called Gemfile, with no extension, in your app directory).

Then install ruby-dev (sudo apt-get install ruby-dev)

Then install cmake (sudo apt-get install cmake)

Now you can run gem install github-linguist (if you get an error that says icu required, do sudo apt-get install libicu-dev and try again)

(You may need to do a sudo apt-get update or sudo apt-get install make or sudo apt-get install build-essential if the above did not work)

Now everything is set up, and you can use this any time you want to check a code snippet. In a text editor, open the file you created for your snippet (let's say it's app/test.tpl, but if you know the extension of your snippet, use that instead of .tpl; if you don't know the extension, don't use one). Paste your code snippet into this file. Go to the command line and run bundle install (you must be in your application's directory). Then run linguist app/test.tpl (more generally, linguist <path-to-code-snippet-file>). It will tell you the type, MIME type, and language. For multiple files (or for general use with a Ruby/Rails app) you can run bundle exec linguist --breakdown in your application's directory.

It seems like a lot of extra work, especially if you don't already have Rails, but you don't need to know anything about Rails if you follow these steps, and I haven't found a better way to detect the language of a file or code snippet.

等我变得足够好
#5 · 2019-01-03 12:16

I believe that no single solution can reliably identify what language a snippet is in based on that snippet alone. Take the keyword print. It appears in any number of languages, each of which serves a different purpose and has a different syntax.

I do have some advice. I'm currently writing a small piece of code for my website that can be used to identify programming languages. As most of the other posts point out, there is a huge range of programming languages, including many you simply haven't heard of, so you can't account for them all.

In my approach, each language is identified by a selection of keywords. Python, for example, could be identified in a number of ways. It's probably easier if you pick 'traits' that are almost certainly unique to the language. For Python, I chose the trait of using a colon to start a block of statements, which I believe is fairly unique (correct me if I'm wrong).

If, in my example, you can't find a colon starting a block of statements, move on to another possible trait, say the def keyword used to define a function. This can cause some problems, because Ruby also uses def to define a function. The key to telling the two (Python and Ruby) apart is to use several levels of filtering to get the best match. Ruby uses the keyword end to finish a function, whereas Python has nothing to finish a function, just a de-indent, and you don't want to go there. But then again, end could also be Lua, yet another programming language to add to the mix.

You can see that programming languages simply overlap too much. A keyword in one language may well be a keyword in another. Using a combination of keywords that often go together, like Java's public static void main(String[] args), helps to eliminate those problems.

Like I've already said, your best chance is looking for relatively unique keywords or sets of keywords to separate one from the other. And, if you get it wrong, at least you had a go.
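
The layered filtering described above could be sketched roughly as follows. The trait patterns, the weights, and the names TRAITS, score_languages and best_match are all made up for illustration; this is not the code from my website:

```python
import re

# Hypothetical traits and weights, chosen only to illustrate the idea.
# Shared traits (def, end) get a low weight, distinctive ones a higher weight.
TRAITS = {
    "Python": [
        (r":\s*$", 2),           # a colon ending a line opens a block
        (r"^\s*def\s+\w+", 1),   # def is shared with Ruby
    ],
    "Ruby": [
        (r"^\s*def\s+\w+", 1),
        (r"^\s*end\s*$", 2),     # end closes a block (Lua does this too)
    ],
    "Lua": [
        (r"^\s*(local\s+)?function\b", 2),
        (r"^\s*end\s*$", 1),
    ],
    "Java": [
        (r"public\s+static\s+void\s+main\s*\(\s*String", 5),
    ],
}

def score_languages(snippet):
    """Sum the weights of every trait pattern found in the snippet."""
    scores = {}
    for lang, traits in TRAITS.items():
        scores[lang] = sum(
            weight
            for pattern, weight in traits
            if re.search(pattern, snippet, re.MULTILINE)
        )
    return scores

def best_match(snippet):
    """Return the highest-scoring language, or None if nothing matched."""
    scores = score_languages(snippet)
    lang = max(scores, key=scores.get)
    return lang if scores[lang] > 0 else None
```

With these weights, a def ... end snippet scores higher for Ruby than for Python or Lua, while a function ... end snippet tips towards Lua. Ties are broken by table order, which is arbitrary.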

成全新的幸福
#6 · 2019-01-03 12:19

It's very hard and sometimes impossible. Which language is this short snippet from?

int i = 5;
int k = 0;
for (int j = 100 ; j > i ; i++) {
    j = j + 1000 / i;
    k = k + i * j;
}

(Hint: It could be any one out of several.)

You can analyze various languages and try to decide using frequency analysis of keywords. If certain sets of keywords occur with certain frequencies in a text, it's likely that the language is Java, and so on. But I don't think you will get anything completely foolproof: you could, for example, give a variable in C the same name as a keyword in Java, and the frequency analysis would be fooled.

If you take it up a notch in complexity, you could look for structures: if a certain keyword always comes after another one, that gives you more clues. But it will also be much harder to design and implement.
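
As a rough sketch of the keyword-frequency idea only (not the structural analysis), assuming small hand-picked keyword sets for Java and C. Those two sets are abridged assumptions; the Python set comes from the standard library's keyword.kwlist:

```python
import keyword
import re
from collections import Counter

# Abridged keyword sets; the Java and C sets are hand-picked for illustration.
KEYWORDS = {
    "Python": set(keyword.kwlist),
    "Java": {"class", "public", "private", "static", "void", "int", "for",
             "while", "if", "else", "return", "new", "import", "extends",
             "implements", "package"},
    "C": {"int", "char", "void", "for", "while", "if", "else", "return",
          "struct", "typedef", "sizeof", "static", "unsigned"},
}

def keyword_frequencies(snippet):
    """Fraction of identifier-like tokens that are keywords, per language."""
    counts = Counter(re.findall(r"[A-Za-z_]\w*", snippet))
    total = sum(counts.values())
    if total == 0:
        return {lang: 0.0 for lang in KEYWORDS}
    return {
        lang: sum(counts[word] for word in words) / total
        for lang, words in KEYWORDS.items()
    }
```

Run on the loop snippet above, this gives Java and C exactly the same score (only int and for count for either), which is the ambiguity the snippet was meant to show.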

Viruses.
#7 · 2019-01-03 12:19

It would depend on what type of snippet you have, but I would run it through a series of tokenizers and parsers and see which language's grammar (BNF) accepts it as valid.
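
A minimal sketch of the "try each parser" idea, limited to the two grammars that Python's standard library can parse out of the box (ast for Python, json for JSON). For other languages you would have to plug in real parsers; the PARSERS table and function names here are hypothetical:

```python
import ast
import json

def parses_as_python(snippet):
    """True if the snippet is syntactically valid Python source."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        return False

def parses_as_json(snippet):
    """True if the snippet is a valid JSON document."""
    try:
        json.loads(snippet)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

# Hypothetical registry; a real detector would add one entry per language.
PARSERS = [("JSON", parses_as_json), ("Python", parses_as_python)]

def valid_languages(snippet):
    """Return every language whose parser accepts the snippet."""
    return [name for name, accepts in PARSERS if accepts(snippet)]
```

Note that a snippet can be valid in more than one grammar: {"a": 1} is accepted as both JSON and Python, so this approach narrows the candidates rather than always picking one.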
