Detecting programming language from a snippet

Posted 2019-01-03 11:41

What would be the best way to detect what programming language is used in a snippet of code?

17 answers
【Aperson】
#2 · 2019-01-03 12:06

Prettify is a JavaScript package that does a reasonable job of detecting programming languages:

http://code.google.com/p/google-code-prettify/

It is mainly a syntax highlighter, but its detection logic could probably be extracted and used on its own to identify the language of a snippet.

我欲成王,谁敢阻挡
#3 · 2019-01-03 12:07

Nice puzzle.

I think it is impossible to detect all languages. But you could trigger on key tokens (certain reserved words and frequently used character combinations).

But there are a lot of languages with similar syntax, so the accuracy depends on the size of the snippet.
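To make the key-token idea concrete, here is a minimal Python sketch. The token tables, the function names (`score_languages`, `guess_language`) and the three languages covered are all hypothetical and far too small for real use; they only illustrate counting reserved words and character combinations per language and picking the highest score.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny tables; a usable detector needs far more entries.
RESERVED_WORDS = {
    "python": {"def", "elif", "import", "None", "lambda", "self"},
    "javascript": {"function", "var", "let", "const", "typeof", "undefined"},
    "c": {"int", "char", "void", "struct", "sizeof", "typedef"},
}
CHARACTER_COMBINATIONS = {
    "python": ["):", "'''", '"""', "**"],
    "javascript": ["===", "!==", "=>", "});"],
    "c": ["#include", "->", "*)"],
}


def score_languages(snippet):
    """Count reserved-word and character-combination hits per language."""
    # Split into identifier-like words so matching does not depend on spacing.
    words = Counter(re.findall(r"[A-Za-z_]\w*", snippet))
    scores = {}
    for language in RESERVED_WORDS:
        word_hits = sum(words[w] for w in RESERVED_WORDS[language])
        combo_hits = sum(snippet.count(c) for c in CHARACTER_COMBINATIONS[language])
        scores[language] = word_hits + combo_hits
    return scores


def guess_language(snippet):
    """Return the single top-scoring language, or None on no hits or a tie."""
    scores = score_languages(snippet)
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None
```

With these toy tables, `guess_language("const add = (a, b) => a + b;")` picks `"javascript"`, while a very short snippet such as `x = 1` scores zero everywhere and returns `None`, which is the snippet-size problem in practice.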

淡お忘
#4 · 2019-01-03 12:07

Set up the random scrambler like this:

    S = matrix(GF(2), k, [random() < 0.5 for _ in range(k^2)])
    while rank(S) < k:
        S[floor(k*random()), floor(k*random())] += 1
Rolldiameter
#5 · 2019-01-03 12:08

Guesslang is a possible solution:

http://guesslang.readthedocs.io/en/latest/index.html

There's also SourceClassifier:

https://github.com/chrislo/sourceclassifier/tree/master

I became interested in this problem after finding some code in a blog article which I couldn't identify. Adding this answer since this question was the first search hit for "identify programming language".

Juvenile、少年°
#6 · 2019-01-03 12:10

You might find some useful material here: http://alexgorbatchev.com/wiki/SyntaxHighlighter. Alex has spent a lot of time figuring out how to parse a large number of different languages, and what the key syntax elements are.

对你真心纯属浪费
#7 · 2019-01-03 12:10

An alternative is to use highlight.js, which performs syntax highlighting but also uses the success rate of the highlighting process to identify the language. In principle, any syntax highlighter codebase could be used in the same way, but the nice thing about highlight.js is that language detection is treated as a feature and is used for testing purposes.

UPDATE: I tried this and it didn't work that well. Compressed JavaScript completely confused it, which suggests the tokenizer is whitespace-sensitive. More generally, just counting highlight hits does not seem very reliable. A stronger parser, or perhaps counting unmatched sections, might work better.
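As a rough illustration of the "stronger parser" idea (this is not how highlight.js works), the Python sketch below swaps highlight-hit counting for real parsers from the Python standard library: a snippet is attributed to the first language whose parser accepts it. The name `detect_by_parsing` and the choice of three candidates (JSON, XML, Python) are assumptions made for the example; anything without a parser at hand is simply not covered.

```python
import ast
import json
import xml.etree.ElementTree as ET


def _parses(parser, snippet):
    """Return True if the parser accepts the snippet without a parse error."""
    try:
        parser(snippet)
        return True
    except (SyntaxError, ValueError):
        # json.JSONDecodeError is a ValueError; ET.ParseError is a SyntaxError.
        return False


# Strictest grammar first: most JSON documents are also valid Python expressions.
CANDIDATES = [
    ("json", json.loads),
    ("xml", ET.fromstring),
    ("python", ast.parse),
]


def detect_by_parsing(snippet):
    """Return the first candidate language whose parser accepts the snippet."""
    for language, parser in CANDIDATES:
        if _parses(parser, snippet):
            return language
    return None
```

A real parser does not care about whitespace the way a highlighter's tokenizer might, so compressed input like `def f(x):return x+1` is still recognised as Python. The trade-off is that acceptance is only evidence: a bare word such as `hello` also parses as Python, so very short snippets remain ambiguous.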
