gitignore rules (applied to a codeigniter stack)

Posted 2019-08-18 07:00

Question:

I'm adding a CodeIgniter website into a new repo (as an initial add - the repo is currently empty).

I've moved the application and system directories, and index.php from the CodeIgniter root into a subdirectory, www. I wanted to ignore the development and production directories I'd created in application\config. https://www.gitignore.io/?templates=codeigniter uses */config/development/*; this didn't work for me, but

**/config/development/*

did, which has led to a number of questions:

  1. am I right in thinking that the */ at the beginning of */config/development/* only refers to any immediate subfolder?
  2. If so, what's the likely rationale for gitignore.io including */ instead of explicitly using /application/config/development/*?
  3. Because(?) I'd moved application to a subfolder, I found that I needed to change */config/development/* to **/config/development/*
    • however, https://stackoverflow.com/a/41761521/889604 implies that **/ is redundant but when I tried changing it to config/development/* the files were still added. Do you only need to exclude *[*]/ when the directory is in the same directory as the .gitignore?

Answer 1:

TL;DR

In a .gitignore, patterns with leading or embedded slashes are treated specially. They are different from patterns that have no leading or embedded slash. So you may need **/config/development/* here because of the two embedded slashes.

Summary

To answer your questions in order:

  1. Yes.

  2. You'll have to ask whoever wrote those ignore files.

  3. As I noted in a comment, the suppositions in "Are leading asterisks '**/' redundant in .gitignore path matching syntax?" are wrong; the accepted answer there is not applicable to this case.

An explanation of this last item seems appropriate here.

Long

For no obvious reason, Git's rule about whether a .gitignore pattern matches some file name found during some directory-tree-walk has a peculiar wrinkle. If a pattern does not have an embedded slash, it's treated one way. If it does have an embedded slash, it's treated a different, second way. To really understand this, we need to define a few terms.

As a side note, I tend to like the term directory when talking about the OS-provided entities. If you prefer the term folder, you can substitute that in mentally—they're essentially the same thing. If you're familiar with C or Python or related languages, though, you'll know about opendir and/or readdir and/or os.listdir, and functions like os.walk, most of which also use the word "directory" to describe these things.

Defining glob patterns

Let's start with the .gitignore entries, which are made up of extended glob patterns. The term glob pattern is pretty well defined at this Wikipedia page, but we could use a bit more.

The most basic form of glob has just the *, ?, and [...] meta-characters. A single question mark matches one character in a file name. An asterisk matches any number of characters (including zero characters), and a square-bracketed string matches any one of the characters inside the brackets.[1] Note that this kind of simple, basic glob is applied only to files within a single directory. Whatever entity is working with this kind of glob, that entity reads a list of file names (probably from an actual directory) and then selects those names within that directory that match that glob expression.
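As a rough illustration, Python's standard fnmatch module implements this basic kind of glob. The file names below are made up for the example, and the select helper is only a hypothetical stand-in for "whatever entity is working with this kind of glob":

```python
from fnmatch import fnmatchcase

# Hypothetical names, as if read from one single directory.
names = ["main.c", "main.o", "a.c", "file1", "file2", "fileA"]

def select(pattern, names):
    """Return the names from one directory listing that match a basic glob."""
    return [name for name in names if fnmatchcase(name, pattern)]

objects = select("*.o", names)          # * matches any run of characters
one_char = select("?.c", names)         # ? matches exactly one character
numbered = select("file[0-9]", names)   # [...] matches one character from the set

print(objects, one_char, numbered)
```

Note that the helper only ever sees bare names from one listing. Nothing here knows about directories yet.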

Obviously, the next level up is to add directories to this kind of simple glob. For instance, we might write dir/* to mean all files within the directory named dir. This is not very complicated, though it brings up a question we ignored with the simplest case: does a glob pattern match a directory name? That is, what if dir/sub is itself a directory—does dir/* match it? For that matter, does * match dir? The typical answer is that, yes, this does match, and as long as we're sticking with dir/* that just means that dir/sub gets matched (as a directory).

Extended globs vary a lot. Bash has its own special extended glob syntax, using globstar to enable ** and extglob to enable even more. What ** itself means varies: some implementations require it to match at least one directory, but allow any number of directory levels. Other implementations allow ** to match no directories, so that **/sub matches dir/sub but also just plain sub. Git's ** largely behaves this last way: it matches zero or more directories, according to the gitignore documentation.
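To make the "zero or more directories" reading concrete, here is a small sketch of my own, not Git's actual code. It assumes the pattern is compared against the whole path from the top of the tree, one component at a time, and it only models ** as a complete path component (as in a leading **/). The file name database.php is a hypothetical stand-in for whatever lives in the asker's config/development directory.

```python
from fnmatch import fnmatchcase

def match_parts(pattern_parts, path_parts):
    """Match path components against pattern components.

    A '**' component may swallow zero or more whole path components;
    every other component is a basic glob for exactly one component.
    """
    if not pattern_parts:
        return not path_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        # Try letting '**' consume 0, 1, 2, ... leading components.
        return any(match_parts(rest, path_parts[i:])
                   for i in range(len(path_parts) + 1))
    return (bool(path_parts)
            and fnmatchcase(path_parts[0], head)
            and match_parts(rest, path_parts[1:]))

def matches(pattern, path):
    return match_parts(pattern.split("/"), path.split("/"))

stock = "application/config/development/database.php"
moved = "www/application/config/development/database.php"

print(matches("**/sub", "sub"), matches("*/sub", "sub"))
print(matches("*/config/development/*", stock))
print(matches("*/config/development/*", moved))
print(matches("**/config/development/*", moved))
```

Under these assumptions, */ stands for exactly one directory level, which is why the gitignore.io pattern fits the stock CodeIgniter layout but stops matching once application moves one level down into www.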


[1] Note that despite the resemblance, glob patterns are not at all the same as regular expressions, where typically . means any single character (the equivalent of ? in glob) and * is a suffix operator, meaning zero-or-more of whatever came before. Hence in regular expressions, .* means zero-or-more of any character. R.E. square brackets usually allow for both ranges and inversion, e.g., [^a-z] means anything not in a through z. Shell glob patterns allow ranges too, and spell inversion with a leading !, as in [!a-z]; many implementations also accept ^ there.


Git stores files via Git's index

In an important way, Git doesn't care about directories. In particular, Git commits store files, rather than storing directories full of files. The files simply have path names that look like they occupy directories. The OS demands that the directory dir exist, so that dir/sub can exist; dir/sub in turn must be a directory so that dir/sub/file can exist. But as far as Git is concerned, Git just needs to store content to go into a file that will be named dir/sub/file. When it comes time to write that content into that file, Git will simply create dir and dir/sub if needed, at that time. The presence or absence of the directories is irrelevant.

This is why you can't store an empty directory in a Git repository: Git stores the contents of files under file names in each commit. With no files, there's nothing to store, so empty directories just are not present in a commit.

Nonetheless, while Git stores only files, Git must use the OS-provided directory reading services in order to find the files you have put into your work-tree. Git will then copy those files (or more precisely, their contents, associated with their full names such as dir/sub/file) into Git's index when you prepare a new commit. The index holds each file's name, mode (100644 or 100755), and the hash ID of the Git-ified content. That's what will go into the next commit you make. (When you git checkout some existing commit, Git fills the index from that commit, so that the index initially matches the commit.)
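A toy model may help here. The sketch below is not Git's index format (the real index also carries stat data, stage numbers, and more); it only shows the idea of full path names mapped to a mode and a content hash, with no directory entries anywhere. It assumes a SHA-1 repository, where a blob's hash ID is the SHA-1 of a "blob <size>" header, a NUL byte, and the content. The stage helper and the index.php content are hypothetical.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Hash content as a Git blob: SHA-1 over 'blob <size>', a NUL byte, then the bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Toy index: full path name -> (mode, blob hash). Directories never appear as keys.
index = {}

def stage(path: str, content: bytes, executable: bool = False) -> None:
    """Record a file's full name, mode, and content hash, roughly as 'git add' would."""
    mode = "100755" if executable else "100644"
    index[path] = (mode, blob_id(content))

stage("dir/sub/file", b"")           # an empty file
stage("index.php", b"<?php\n")       # hypothetical content

for path in sorted(index):
    mode, oid = index[path]
    print(mode, oid, path)
```

There is an entry for dir/sub/file but none for dir or dir/sub, which is the sense in which Git "doesn't care about directories".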

Walking a directory tree

As we just saw, Git has to open and read each directory in your work-tree, starting with the top level of the work-tree itself. The result of calling os.listdir (Python) or opendir and readdir (C) is a list of names: file and sub-directory names within the directory that Git just told the OS to enumerate. A bit more work (calling lstat) gets the rest of the information required, and now Git knows whether the name dir refers to an ordinary file, or to a directory.

Given the name of a directory, Git is generally going to have to open and read that directory as well. So Git will open and read dir and find the name sub, and discover that sub is a directory. Git will then open and read dir/sub and find the name file, and that file names a file. This process of opening and reading, recursively, each directory within a directory, is called walking the directory tree. That's what the Python os.walk function does, for instance.

Standard C does not have a function for walking a tree, so Git implements it all by hand, as it were. This starts to matter in a moment, but for now, think of it this way: by walking the tree, Git finds all the directories and all the files in the work-tree. Absent .gitignore, Git throws away all the directory names, keeps all the file names (using their full paths from the top) and then, at least for an "all" add operation, puts all those names and updated contents into the index, ready for the next commit.

There are several things to know about this:

  • The walking process is inherently recursive. That is, upon finding a directory, we must open and read the directory, handling each entry. If the entry is itself a directory, we must open and read that directory, and so on.

  • Meanwhile, each entry in a directory is just a name: we—or Git—must assemble the path as we go. If we're working on dir and come across sub, the full name is now dir/sub. If we're working on dir/sub and come across file, the full name is now dir/sub/file. But dir just lists sub, and sub itself just lists file. It's up to us / Git to remember the path.

  • The walking process is slow, relatively speaking. Git wants to be fast!

All of these introduce some of the complexities in .gitignore rules.
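The first two points can be sketched in a few lines of Python. This is only an illustration of a hand-rolled walk, not a description of Git's internals; walk_files is a name I made up, and the tiny tree is created in a temporary directory just for the demonstration.

```python
import os
import tempfile

def walk_files(top, prefix=""):
    """Recursively list files under `top`, assembling each relative path by hand."""
    found = []
    for entry in sorted(os.listdir(top)):       # a directory lists only bare names
        full = os.path.join(top, entry)
        rel = prefix + entry                     # we must remember the path ourselves
        if os.path.isdir(full) and not os.path.islink(full):
            found.extend(walk_files(full, rel + "/"))   # recurse into the directory
        else:
            found.append(rel)                    # keep file names, drop directory names
    return found

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "dir", "sub"))
    for rel in ("top.txt", "dir/sub/file"):
        with open(os.path.join(top, *rel.split("/")), "w") as fh:
            fh.write("example\n")
    files = walk_files(top)

print(files)
```

Every directory costs at least one listing call plus a check per entry, which is where the "slow, relatively speaking" part comes from on large trees.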

Gitignore files may exist at each level and list names and/or glob patterns

At the top level you can have a very simple .gitignore file:

# ignore files named *.o and *.pyc
*.o
*.pyc

Now Git can walk through your work-tree, finding files in each level of directory. If the file's name—as expressed in that directory, at whatever level—matches any of these simple glob patterns, and the full path name of that file is not already in the index, Git will pretend that the file does not exist: it won't get automatically added, and git status won't complain about it being untracked.

But what if we want to prevent dir/foo and dir/sub/foo from being added, while not protecting against foo in the top level? Then we can tell Git: only ignore foo when it's contained within dir. There is an easy way to express this: create the file dir/.gitignore. File names listed here are ignored when they're found by reading dir or any of its sub-directories:

.gitignore:
    *.o
    *.pyc
dir/.gitignore:
    foo

Now, during the walk, when Git opens and reads dir, it notices that there is a dir/.gitignore. It applies the rules there to all files found during this recursive traversal: they apply to files in dir and files in dir/sub, but not to files in the top level, nor to files in a separate top-level directory such as other/, if there is one.

Leading and embedded slashes avoid recursive matches

But what if we want to ignore only dir/foo, not dir/sub/foo, and not other/foo or /foo? Now we have a different problem, and Git provides two solutions. One of them is to write /foo as the entry in dir/.gitignore:

.gitignore:
    *.o
    *.pyc
dir/.gitignore:
    /foo

This ignores only dir/foo, not dir/sub/foo. It contains a leading slash, which tells Git: Don't apply this to sub-directories.

Another way to express this is to put this right into the top level .gitignore, which removes the need to have a dir/.gitignore at all:

*.o
*.pyc
dir/foo

This contains an embedded slash. When Git is doing a directory walk, it naturally finds file names stripped of their paths—it finds foo, not dir/foo, inside dir when walking through dir. So this kind of pattern is handled separately, after putting together the full path name.

So, this is the source of the first two special rules about slashes in names or patterns in .gitignore files:

  • A leading slash means match only this name or simple glob in this directory.
  • An embedded slash means match only this full path name or (extended) glob relative to this directory.

Note that the second case covers the first one: both will work correctly, matching only paths within this directory, once the relative path names are put together (i.e., after sub's foo is turned into dir/sub/foo). But we need the first case because a bare name or glob pattern, such as foo or *.pyc, would apply to this directory and all of its sub-directories. We could handle dir/foo by moving up to the top level and ignoring dir/foo directly, but if we want to ignore /bar without ignoring dir/bar and dir/sub/bar, we have only the top level .gitignore for this path.

This means you can invoke the full-path match—well, "full" relative to the directory in which the .gitignore itself lives—using a leading slash, an embedded slash, or both. In general, if you create the .gitignore file as close as possible to the file, you'll need the leading slash rule. If you use higher level .gitignore files, the embedded-slash rule suffices.
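Here is a toy model of the two slash rules, written from the behavior described above rather than from Git's source. The function name is_ignored and its arguments are my own invention: ignore_dir is the directory holding the .gitignore ("" for the top level), and path is relative to the top level. Trailing slashes, ** and ! rules are deliberately left out.

```python
from fnmatch import fnmatchcase

def is_ignored(pattern, ignore_dir, path):
    """Toy model: does `pattern`, read from <ignore_dir>/.gitignore, ignore `path`?"""
    prefix = ignore_dir + "/" if ignore_dir else ""
    if not path.startswith(prefix):
        return False                  # rules never reach outside their own directory
    parts = path[len(prefix):].split("/")
    if "/" in pattern:
        # Leading or embedded slash: anchored at the .gitignore's own directory.
        wanted = pattern.lstrip("/").split("/")
        return len(wanted) <= len(parts) and all(
            fnmatchcase(part, glob) for part, glob in zip(parts, wanted)
        )
    # No slash at all: the name may match at this level or any level below it.
    return any(fnmatchcase(part, pattern) for part in parts)

# A bare name in dir/.gitignore floats through dir and its sub-directories.
print(is_ignored("foo", "dir", "dir/foo"), is_ignored("foo", "dir", "dir/sub/foo"))
# A leading slash pins it to dir itself.
print(is_ignored("/foo", "dir", "dir/foo"), is_ignored("/foo", "dir", "dir/sub/foo"))
# An embedded slash in the top-level file is anchored too.
print(is_ignored("dir/sub", "", "dir/sub/file"), is_ignored("dir/sub", "", "a/dir/sub/file2"))
```

The anchored branch compares the pattern against the leading components of the path, so a match on a directory also covers everything beneath it.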

(The embedded slash rule might actually be a bug. The wording in the gitignore documentation suggests that dir/sub is meant to ignore a/dir/sub as well, and that you would have to write /dir/sub to not ignore a/dir/sub. But testing shows that it behaves the way I describe here:

$ git status -s -uall
?? a/dir/sub/file2
?? dir/sub/file
$ echo dir/sub > .gitignore
$ git status -s -uall
?? .gitignore
?? a/dir/sub/file2
$ git --version
git version 2.20.1

Note that ignoring dir/sub made file disappear, but a/dir/sub/file2 is still reported as untracked.)

Trailing slashes are different

Remember that we said that the tree-walk is slow, relatively speaking. It's pretty common to find a Git repository where, in the work-tree, we deliberately add an entire vendor SDK or other packaged thing—maybe taken from the repository as a single tarball, or maybe extracted in some method outside Git entirely—and never want to commit any of the files from inside this packaged thing, whatever it is. Having Git walk through every level of that package, once it's unarchived, is just a waste of time.

To this end, if Git doesn't already have an index entry listing, say, dir/sub/vendor/file, and (during one of its ambles through directory trees) comes across the directory named vendor in dir/sub, you can tell Git: Don't bother to look inside this vendor/ directory at all. One way to express this is to use what we already know:

.gitignore:
    *.o
    *.pyc
    dir/sub/vendor

or:

.gitignore:
    *.o
    *.pyc
dir/sub/.gitignore:
    /vendor

We already know what the leading slash is for here: it makes sure we only ignore vendor in dir/sub. That's also already the case for the top level .gitignore.

However, what if we want to skip all directories named vendor, without skipping any files named vendor? Here, we can use the trailing slash syntax:

.gitignore:
    *.o
    *.pyc
    vendor/

This vendor/ looks like dir/sub in some ways. But the slash here is not embedded, it's trailing. So this slash does not turn on the full-path-only code. Instead, it tells Git: During your tree-walk, when you come across something named vendor, and it's a directory, don't bother recursing into it. The trailing slash is first removed from this string, leaving vendor as the item to match. That has neither a leading slash nor an embedded one, so it's matched at any sub-level of this level of the walk. But the pattern did originally have a trailing slash, so it's matched only if what's actually in the tree is a directory.

Of course, we can also just say vendor, or v*r, or any other thing that matches vendor, if we're willing to ignore files as well. Or we can write v*r/ if we want to ignore all directories whose base-name—the part without the full path—matches v*r.
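The order of operations (strip the trailing slash first, then look for leading or embedded slashes) can be sketched like this. It is a simplified model for patterns in the top-level .gitignore only; parse and matches_entry are hypothetical helper names, and each call asks about one entry met during the walk, given its path from the top and whether it is a directory.

```python
from fnmatch import fnmatchcase

def parse(pattern):
    """Split a pattern into (glob, dir_only, anchored)."""
    dir_only = pattern.endswith("/")
    glob = pattern.rstrip("/")        # the trailing slash is removed first
    anchored = "/" in glob            # only leading or embedded slashes anchor
    return glob.lstrip("/"), dir_only, anchored

def matches_entry(pattern, rel_path, is_dir):
    """Does `pattern` match this single entry found during the tree-walk?"""
    glob, dir_only, anchored = parse(pattern)
    if dir_only and not is_dir:
        return False                  # a trailing slash means directories only
    if anchored:
        wanted, parts = glob.split("/"), rel_path.split("/")
        return len(wanted) == len(parts) and all(
            fnmatchcase(part, g) for part, g in zip(parts, wanted)
        )
    # Unanchored: compare against the base name, at whatever level we are.
    return fnmatchcase(rel_path.rsplit("/", 1)[-1], glob)

print(parse("vendor/"))
print(matches_entry("vendor/", "dir/sub/vendor", is_dir=True))
print(matches_entry("vendor/", "docs/vendor", is_dir=False))
print(matches_entry("vendor", "docs/vendor", is_dir=False))
```

When matches_entry says yes for a directory, the walk would simply not descend into it, which is the time-saving part.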

Un-ignoring a previous ignore rule, and the problem with ignored directories

Any entry in .gitignore that starts with ! overrides a previous ignore rule that also matched the same name. Note, however, that for this to occur, Git needs to have opened and read the directory during its tree-walk. If an earlier ignore rule allows Git to ignore a directory, Git does that during the tree-walk phase.

That is, if there's any rule that matches vendor at any point, and that rule says "do ignore this", and vendor is a directory, Git won't open vendor and read its contents. It won't see vendor/file1, vendor/file2, and so on. Those names will never be brought under the "should we ignore this name?" microscope, neither in their base-name file1 format, nor in their dir/sub/vendor/file1 full-path format.
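A small simulation shows why the un-ignore rule never gets its chance. This is a sketch under simplifying assumptions: the work-tree is a nested dict instead of a real file system, the rules use bare names only (plus an optional trailing slash), and the last matching rule wins. The tree contents and the names verdict and walk are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical work-tree: dicts are directories, None marks a file.
tree = {"vendor": {"keep.txt": None, "lib.c": None}, "src": {"main.c": None}}

# Rules in file order; the last matching rule wins. A leading "!" un-ignores.
rules = ["vendor/", "!keep.txt"]

def verdict(name, is_dir):
    """Return True if the last rule matching this entry says 'ignore'."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        glob = rule.lstrip("!")
        if glob.endswith("/") and not is_dir:
            continue                      # directory-only rule, but this is a file
        if fnmatchcase(name, glob.rstrip("/")):
            ignored = not negated
    return ignored

def walk(node, prefix="", seen=None):
    """Collect the files that the walk actually examines and keeps."""
    seen = [] if seen is None else seen
    for name, child in sorted(node.items()):
        is_dir = child is not None
        if verdict(name, is_dir):
            continue                      # an ignored directory is never opened
        if is_dir:
            walk(child, prefix + name + "/", seen)
        else:
            seen.append(prefix + name)
    return seen

kept = walk(tree)
print(kept)
print(verdict("keep.txt", False))
```

Asked directly, the rules would not ignore keep.txt, yet the file never shows up in the result, because the walk skipped vendor before it could read that name.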

Conclusions: what you should know about .gitignore

  • A leading slash has an anchoring effect. The anchor is at the same level as the .gitignore file. (If the ignore file is outside the work-tree—e.g., is in $HOME/.gitignore or .git/info/exclude—the anchoring level is the top level of the work-tree.)

  • Embedded slashes—but not a trailing slash—turn on the anchoring effect too, despite the documentation's vague implied hint otherwise. This might be a bug, but Git has behaved this way consistently through many releases (so maybe it's a documentation bug).

  • Double-star glob matching (**/whatever) contains an embedded slash, almost by definition. The only two double-star globs that do not have an embedded slash are **/ and **, neither of which is likely to be used in practice. Embedded slashes anchor names, but the double-star allows zero or more directory levels here, so that the anchoring has no inhibitory effect. The leading double-star is required if you want this kind of free-floating match behavior on a name that, without the leading **/, would also contain an embedded slash.

  • Un-ignore rules require that Git open and read a directory. If you want to un-ignore some file deep in the directory tree, either none of its containing directories may be ignored, or something else must force Git to scan that deep subdirectory. That is, if you have a file named long/path/to/important/file and you want that file to be stored in each commit, you'll need that name to get into Git's index, so that Git will store it in the next commit.

  • Files that exist in the index are, by definition, not ignored. Ignore entries apply only to files that aren't in the index, but are in the work-tree.

  • The index (always) exists, and it holds file names that, because the OS insists, actually appear inside directories. So if the index has a long/path/to/important/file, Git will check to see if long/path/to/important/file is still there and has or has not been modified. But if you've ignored long, or long/path/to/important, or something along the way here, Git won't read the directory.[2] If you somehow accidentally remove long/path/to/important/file from the index while ignoring the directory long/path/to/important, Git won't add the file back again by itself, nor will it warn you that the work-tree file has become an untracked file.


[2] You can add a file that would otherwise be ignored using git add -f, and you can have a set of files in directories that aren't ignored, add some of those files to the index, then modify .gitignore to ignore their containing directories. Both of these result in files in the index that would not have gotten there by a more direct, or less forceful (add -f), method. These are the cases I am concerned about here: they are not wrong, but they fall afoul of this last bullet-point.