Detecting programming language from a snippet

2019-01-03 11:41发布

What would be the best way to detect what programming language is used in a snippet of code?

17条回答
Anthone
2楼-- · 2019-01-03 12:19

I needed this so i created my own. https://github.com/bertyhell/CodeClassifier

It's very easily extendable by adding a training file in the correct folder. Written in c#. But i imagine the code is easily converted to any other language.

查看更多
Luminary・发光体
3楼-- · 2019-01-03 12:21

Interesting. I have a similar task to recognize text in different formats. YAML, JSON, XML, or Java properties? Even with syntax errors, for example, I should tell apart JSON from XML with confidence.

I figure how we model the problem is critical. As Mark said, single-word tokenization is necessary but likely not enough. We will need bigrams, or even trigrams. But I think we can go further from there knowing that we are looking at programming languages. I notice that almost any programming language has two unique types of tokens -- symbols and keywords. Symbols are relatively easy (some symbols might be literals not part of the language) to recognize. Then bigrams or trigrams of symbols will pick up unique syntax structures around symbols. Keywords is another easy target if the training set is big and diverse enough. A useful feature could be bigrams around possible keywords. Another interesting type of token is whitespace. Actually if we tokenize in the usual way by white space, we will loose this information. I'd say, for analyzing programming languages, we keep the whitespace tokens as this may carry useful information about the syntax structure.

Finally if I choose a classifier like random forest, I will crawl github and gather all the public source code. Most of the source code file can be labeled by file suffix. For each file, I will randomly split it at empty lines into snippets of various sizes. I will then extract the features and train the classifier using the labeled snippets. After training is done, the classifier can be tested for precision and recall.

查看更多
Summer. ? 凉城
4楼-- · 2019-01-03 12:23

I think the biggest distinction between languages is its structure. So my idea would be to look at certain common elements across all languages and see how they differ. For example, you could use regexes to pick out things such as:

  • function definitions
  • variable declarations
  • class declarations
  • comments
  • for loops
  • while loops
  • print statements

And maybe a few other things that most languages should have. Then use a point system. Award at most 1 point for each element if the regex is found. Obviously, some languages will use the exact same syntax (for loops are often written like for(int i=0; i<x; ++i) so multiple languages could each score a point for the same thing, but at least you're reducing the likelihood of it being an entirely different language). Some of them might scores 0s across the board (the snippet doesnt contain a function at all, for example) but thats perfectly fine.

Combine this with Jules' solution, and it should work pretty well. Maybe also look for frequencies of keywords for an extra point.

查看更多
看我几分像从前
5楼-- · 2019-01-03 12:26

I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurences of these words with known snippets, and compute the probability that this snippet is written in language X for every language you're interested in.

http://en.wikipedia.org/wiki/Bayesian_spam_filtering

If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.

I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:

print "Hello"

Let me find the code.

I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:

def foo
   puts "hi"
end

is Python code (although it really is Ruby). This is because Python has a def keyword too. So if it has seen 1000x def in Python and 100x def in Ruby then it may still say Python even though puts and end is Ruby-specific. You could fix this by keeping track of the words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).

I hope it helps you:

class Classifier
  def initialize
    @data = {}
    @totals = Hash.new(1)
  end

  def words(code)
    code.split(/[^a-z]/).reject{|w| w.empty?}
  end

  def train(code,lang)
    @totals[lang] += 1
    @data[lang] ||= Hash.new(1)
    words(code).each {|w| @data[lang][w] += 1 }
  end

  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # We really want to multiply here but I use logs 
      # to avoid floating point underflow
      # (adding logs is equivalent to multiplication)
      Math.log(@totals[lang]) +
      ws.map{|w| Math.log(@data[lang][w])}.reduce(:+)
    end
  end
end

# Example usage

c = Classifier.new

# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)

# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)
查看更多
贪生不怕死
6楼-- · 2019-01-03 12:28

I wouldn't think there would be an easy way of accomplishing this. I would probably generate lists of symbols/common keywords unique to certain languages/classes of languages (e.g. curly brackets for C-style language, the Dim and Sub keywords for BASIC languages, the def keyword for Python, the let keyword for functional languages). You then might be able to use basic syntax features to narrow it down even further.

查看更多
登录 后发表回答