How can I use the languages YAML file to determine

Posted 2019-08-02 18:44

Question:

There is a page on GitHub Help describing how to use syntax-highlighted code blocks. On that page there are instructions describing how to match languages to their keywords for this purpose:

We use Linguist to perform language detection and syntax highlighting. You can find out which keywords are valid in the languages YAML file.

However, there's a lot of data in that YAML and I don't find it very clear how exactly one can use it to determine which keywords work for any given language.

I wrote a simple Boot script to attempt to parse this YAML to a more readable JSON file mapping from each language to its list of valid keywords:

curl https://raw.githubusercontent.com/github/linguist/f75c5707a62a3d66501993116826f4e64c3ca4dd/lib/linguist/languages.yml | ./languages.boot > languages.json

But I'm not at all convinced that this is correct. For instance, many of the keywords that my script produces include spaces, and I was under the impression that those would not work:

The content of a code fence is treated as literal text, not parsed as inlines. The first word of the info string is typically used to specify the language of the code sample, and rendered in the class attribute of the code tag.

What I'm looking for is an understanding of the "schema" of this YAML file, insofar as it relates to syntax highlighting in GitHub Markdown. Ideally I'd like to use this understanding to write a program that takes in a languages YAML file and generates something like the list of language codes for Stack Exchange syntax highlighting, but for Markdown on GitHub. How can I write such a program?

Answer 1:

What I'm looking for is an understanding of the "schema" of this YAML file.

For each language in the languages.yml file, you can use as specifiers:

  1. the language name;
  2. any of the language aliases;
  3. any of the language interpreters;
  4. any of the file extensions, with or without a leading dot (.).

Spaces must be replaced by dashes (e.g., emacs-lisp is one specifier for Emacs Lisp). Languages with a tm_scope: none entry don't have a grammar defined and won't be highlighted on github.com.
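The rules above can be sketched in Python. This is only an illustration, not Linguist's own logic. The function name `specifiers` and the sample data are made up for the example. It assumes the YAML has already been parsed into a dict (for instance with PyYAML's `yaml.safe_load`) and that each entry uses the keys `aliases`, `interpreters`, `extensions` and `tm_scope`. Lowercasing every specifier is an assumption based on the emacs-lisp example, not something stated above.

```python
def specifiers(languages):
    """Map each highlightable language name to its sorted candidate specifiers.

    `languages` is the already-parsed content of languages.yml:
    a dict of {language name: {attribute: value}}.
    """
    result = {}
    for name, attrs in languages.items():
        attrs = attrs or {}
        # Languages marked `tm_scope: none` have no grammar, so skip them.
        if attrs.get("tm_scope") == "none":
            continue
        candidates = {name}
        candidates.update(attrs.get("aliases", []))
        candidates.update(attrs.get("interpreters", []))
        for ext in attrs.get("extensions", []):
            candidates.add(ext)              # with the leading dot
            candidates.add(ext.lstrip("."))  # without the leading dot
        # Assumption: specifiers are lowercased; spaces become dashes.
        normalized = {c.lower().replace(" ", "-") for c in candidates if c}
        result[name] = sorted(normalized)
    return result


# Hypothetical sample data shaped like two entries of languages.yml.
sample = {
    "Emacs Lisp": {
        "tm_scope": "source.emacs.lisp",
        "aliases": ["elisp", "emacs"],
        "extensions": [".el", ".emacs"],
    },
    "Hypothetical Lang": {"tm_scope": "none", "extensions": [".hyp"]},
}

print(specifiers(sample))
```

With the real file, the parsed dict would replace `sample`. Dumping the result with `json.dumps` would give a language-to-keywords mapping like the one the question asks for.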

How can I write such a program?

Actually, someone already wrote such a program. In github/linguist#2278, jmm details the results of his investigation, which one of GitHub's engineers confirmed in the same thread. He also links to his own program for computing the identifiers and to a wiki page with the results (which might not be up to date).