I was recently working on a GitHub project in both JavaScript and C++, and noticed that GitHub tagged the project as C++. If you have to pick a single language, this is probably the correct designation since the C++ code is compiled as a JavaScript library, but this made me wonder... how does GitHub figure out which language to tag each project with?
File extensions are the first thing that comes to my mind.
Currently, GitHub's Linguist project is used to determine language statistics, as described in this GitHub blog post (which came out a few months after this question was originally asked).
Update April 2013, by nuclearsandwich (GitHub support team or "supportocat"):
the help page "My repository is marked as the wrong language" now mentions using the Linguist library to determine file language for syntax highlighting and repo statistics. Linguist excludes certain file names and paths from the statistics, including certain vendor files and directories.
the help page "Why isn't my favorite language recognized?" adds:
(Original answer, Oct. 2012)
This thread on GitHub support explains it:
Since this is not 100% accurate, that has led some to add:
Note: as Mark Rushakoff mentions in his answer (upvoted), the guessing has improved since then with the Linguist project (open-sourced in June 2011).
You can see there are still issues though: GitHub Linguist Issues.
See here for more details:
And you can add linguist directives in a .gitattributes file.
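As an illustration, a .gitattributes file using such directives might look like the sketch below. The attribute names (linguist-language, linguist-vendored, linguist-documentation) are the ones Linguist documents for overrides; the paths and the chosen language are hypothetical and only there to show the shape of the file.

```gitattributes
# Hypothetical paths; adjust them to your own repository layout.

# Count .h files as C++ instead of letting Linguist guess.
*.h linguist-language=C++

# Keep third-party code out of the language statistics.
third_party/* linguist-vendored

# Count a directory that would otherwise be treated as vendored.
vendor/mylib/* linguist-vendored=false

# Treat these files as documentation (excluded from statistics).
manual/* linguist-documentation
```

The statistics are only recomputed after the .gitattributes change is committed and pushed.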
First, know that you can override the language detected for files in your repository using Linguist overrides.
Now, in a nutshell,
How does Linguist detect languages?
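Linguist relies on the following strategies, in order, and returns the language as soon as it finds a perfect match (a strategy that returns a single language).
- Known filenames (think Makefile).
- Shebang: a file with a #!/bin/bash shebang will be classified as Shell.
- File extensions: conflicting results (think .h) are refined by the subsequent strategies.
- Heuristics based on the content of the file, such as regular expressions (e.g., ^[^#]+:- for Prolog).
What are unvendored and documentation files?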
Linguist considers some files as vendored, meaning they are not included in language statistics. These include third-party libraries such as jQuery and are defined in the vendor.yml configuration file. You can also vendor or unvendor files in your repository using Linguist overrides.
Similarly, documentation files are defined in documentation.yml and can be changed using Linguist overrides.
How are generated files detected?
Linguist relies on simple rules to detect generated files, using both the paths and the content of files. Generated files are not counted in language statistics and are not displayed in diffs on github.com.
What about programming and markup languages?
In Linguist, each language is given a type. These types can be found in the main configuration file, languages.yml. Only the programming and markup languages are counted in statistics.
After some tinkering with linguist I have noticed this.
For files with a Shebang, the Shebang is considered when determining the language but seems to be evenly weighted against other tokens. This seems to be a big error because the Shebang should definitively define the language of the file.
This can cause issues with highlighting.
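To make the point concrete, here is a minimal Python sketch of a detection cascade in which the shebang is consulted before the extension and settles the language outright when it names a known interpreter. This is not Linguist's code (Linguist is written in Ruby); the lookup tables, function names, and the candidate lists for each extension are hypothetical and exist only for illustration.

```python
import os
import re

# Hypothetical lookup tables, for illustration only.
# Linguist's real data lives in its languages.yml file.
FILENAMES = {"Makefile": "Makefile"}
INTERPRETERS = {"bash": "Shell", "sh": "Shell", "python3": "Python"}
EXTENSIONS = {".h": ["C", "C++", "Objective-C"], ".pl": ["Perl", "Prolog"]}
HEURISTICS = {".pl": [(re.compile(r"^[^#]+:-", re.MULTILINE), "Prolog")]}


def by_filename(path, text):
    """Match well-known file names such as Makefile."""
    name = os.path.basename(path)
    return [FILENAMES[name]] if name in FILENAMES else []


def by_shebang(path, text):
    """Read the interpreter from a leading '#!' line, if there is one."""
    first_line = text.split("\n", 1)[0]
    if not first_line.startswith("#!"):
        return []
    parts = first_line[2:].split()
    if not parts:
        return []
    interpreter = os.path.basename(parts[0])
    # '#!/usr/bin/env python3' names the real interpreter as an argument.
    if interpreter == "env" and len(parts) > 1:
        interpreter = os.path.basename(parts[1])
    language = INTERPRETERS.get(interpreter)
    return [language] if language else []


def by_extension(path, text):
    """Return every language associated with the file extension."""
    extension = os.path.splitext(path)[1]
    return list(EXTENSIONS.get(extension, []))


def by_heuristics(path, text):
    """Apply content-based regular expressions for ambiguous extensions."""
    extension = os.path.splitext(path)[1]
    for pattern, language in HEURISTICS.get(extension, []):
        if pattern.search(text):
            return [language]
    return []


# Order matters: the shebang is checked before the extension.
STRATEGIES = [by_filename, by_shebang, by_extension, by_heuristics]


def detect(path, text):
    """Return the first unambiguous answer, or None if nothing decides."""
    for strategy in STRATEGIES:
        found = strategy(path, text)
        if len(found) == 1:
            return found[0]
    return None


print(detect("script.pl", "#!/usr/bin/env python3\nprint(1)\n"))
print(detect("facts.pl", "parent(X, Y) :- father(X, Y).\n"))
```

In this sketch a file named script.pl with a Python shebang is reported as Python, because the shebang strategy answers before the extension is ever looked at, while facts.pl falls through to the Prolog heuristic.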