In recent conversations with fellow students, I have been advocating for avoiding globals except to store constants. This is a sort of typical applied statistics-type program where everyone writes their own code and project sizes are on the small side, so it can be hard for people to see the trouble caused by sloppy habits.
In talking about avoiding globals, I'm focusing on the following reasons why they might cause trouble. I'd like some examples in R and/or Stata to go with these principles (and any other principles you might find important), but I'm having a hard time coming up with believable ones.
- Non-locality: Globals make debugging harder because they make understanding the flow of code harder
- Implicit coupling: Globals break the simplicity of functional programming by allowing complex interactions between distant segments of code
- Namespace collisions: Common names (x, i, and so forth) get re-used, causing namespace collisions
A useful answer to this question would be a reproducible and self-contained code snippet in which globals cause a specific type of trouble, ideally with another code snippet in which the problem is corrected. I can generate the corrected solutions if necessary, so the example of the problem is more important.
One quick but convincing example in R is to run the line like:
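One possibility (a sketch of mine, assuming the point is R's own global RNG state; the value 'normal' is just something a user might plausibly store):

```r
## .Random.seed is an ordinary object in the global workspace,
## and R's random-number functions rely on it
.Random.seed <- 'normal'
```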
I chose 'normal' as something someone might choose, but you could use anything there.
Now run any code that uses generated random numbers, for example:
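For instance (any function that draws random numbers would do):

```r
rnorm(10)
## still returns numbers, but R warns that the corrupted .Random.seed was ignored
## and quietly re-seeds: the global state has changed behind your back
```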
Then you can point out that the same thing could happen for any global variable.
I also use the example of:
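A minimal sketch of this kind of example, with `f()` standing in for somebody else's code:

```r
f <- function() {
  ## imagine a page of analysis code here, somewhere in which lurks...
  x <<- rpois(1, 10)
  invisible(NULL)
}

x <- 5
f()
x   # what is x now?
```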
Then ask the students what the value of `x` is; the answer is that we don't know.

Here's one attempt at an answer that would make sense to statisticsy types.
First we define a log likelihood function,
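A plausible sketch (the normal model and the names `loglik` and `y` are assumptions of mine):

```r
## log-likelihood for an i.i.d. normal sample; everything it needs is an argument
loglik <- function(mu, sigma, y) {
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
```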
Now we write an unrelated function to return the sum of squares of an input. Because we're lazy, we'll do this by taking `y` as a global variable instead of passing it in,
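Continuing the sketch:

```r
## sum of squares, but the input is never passed in: y is read from the workspace
sumsq <- function() sum(y^2)
```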
Our log likelihood function seems to behave exactly as we'd expect, taking an argument and returning a value,
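Along these lines:

```r
y <- rnorm(100)
loglik(mu = 0, sigma = 1, y = y)   # identical calls give identical answers
loglik(mu = 0, sigma = 1, y = y)
```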
But what's up with our other function?
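Its result depends on whatever `y` happens to be at the moment of the call:

```r
sumsq()           # returns one value...
y <- rnorm(1000)  # ...then some distant piece of code changes the global y...
sumsq()           # ...and the very same call now returns something else
```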
Of course, this is a trivial example, as will be any example that doesn't exist in a complex program. But hopefully it'll spark a discussion about how much harder it is to keep track of globals than locals.
An example sketch that came up while trying to teach this today. Specifically, this focuses on trying to give intuition as to why globals can cause problems, so it abstracts away as much as possible in an attempt to state what can and cannot be concluded just from the code (leaving the function as a black box).
The setup
Here is some code. Decide whether it will return an error or not based on only the criteria given.
The code
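A reconstruction of roughly the right shape (three lines, matching the answer below; `x` and `f()` are assumed to exist already):

```r
stopifnot(all(x != 0))   # line 1: check that no element of x equals zero
f()                      # line 2: call f(), treated as a black box
out <- 1 / x             # line 3: divide by x
```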
The criteria
Case 1: `f()` is a properly-behaved function, which uses only local variables.

Case 2: `f()` is not necessarily a properly-behaved function, which could potentially use global assignment.

The answer
Case 1: The code will not return an error, since line one checks that there are no `x`'s equal to zero and line three divides by `x`.

Case 2: The code could potentially return an error, since `f()` could, e.g., subtract 1 from `x` and assign it back to the `x` in the parent environment, where any `x` element equal to 1 could then be set to zero and the third line would return a division-by-zero error.

Oh, the wonderful smell of globals...
All of the answers in this post gave R examples, and the OP wanted some Stata examples, as well. So let me chime in with these.
Unlike R, Stata does take care of the locality of its local macros (the ones that you create with the `local` command), so the issue of "Is this a global z or a local z that is being returned?" never comes up. (Gosh... how can you R guys write any code at all if locality is not enforced???) Stata has a different quirk, though, namely that a non-existent local or global macro is evaluated as an empty string, which may or may not be desirable.

I have seen globals used for several main reasons:
Globals are often used as shortcuts for variable lists, as in
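For example (variable names taken from Stata's `auto` toy data set):

```stata
sysuse auto, clear
global myvars weight displacement foreign
summarize $myvars
```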
I suspect that the main usage of such a construct is by someone who switches between interactive typing and storing the code in a do-file as they try multiple specifications. Say they try regression with homoskedastic standard errors, heteroskedastic standard errors, and median regression:
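Say (with `mpg` as the response, consistent with what happens later in the story):

```stata
regress mpg $myvars
regress mpg $myvars, vce(robust)
qreg mpg $myvars
```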
And then they run these regressions with another set of variables, then with yet another one, and finally they give up and set this up as a do-file `myreg.do` with
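(a guess at the do-file's contents)

```stata
* myreg.do
regress mpg $myvars
regress mpg $myvars, vce(robust)
qreg mpg $myvars
```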
to be accompanied by an appropriate setting of the global macro. So far so good; the snippet
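(again sketched)

```stata
global myvars weight displacement foreign
do myreg
```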
produces the desired results. Now let's say they email their famous do-file, which claims to produce very good regression results, to collaborators, and instruct them to type
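(presumably just)

```stata
do myreg.do
```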
What will their collaborators see? In the best case, the mean and the median of `mpg`, if they started a new instance of Stata (failed coupling: `myreg.do` did not really know you meant to run this with a non-empty variable list). But if the collaborators had something in the works, and also had a global `myvars` defined (name collision)... man, would that be a disaster.

Globals are used for directory or file names, as in:
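(path and file name invented for illustration)

```stata
use "$mydir/data/wave1.dta", clear
```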
God only knows what will be loaded. In large projects, though, it does come in handy. You would want to define `global mydir` somewhere in your master do-file, maybe even as
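something like the following (a guess of mine), derived from the current working directory:

```stata
global mydir "`c(pwd)'"
```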
Globals can be used to store unpredictable crap, like a whole command:
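Sketching the pattern:

```stata
capture $RunThis
```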
God only knows what will be executed. This is the worst case of implicit strong coupling, but since I am not even sure that `RunThis` will contain anything meaningful, I put a `capture` in front of it, and will be prepared to treat the non-zero return code `_rc`. (See, however, my example below.)

Stata's own use of globals is for God settings, like the type I error probability/confidence level: the global `$S_level` is always defined (and you must be a total idiot to redefine this global, although of course it is technically doable). This is, however, mostly a legacy issue with code of version 5 and below (roughly), as the same information can be obtained from the less fragile system constant:
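(presumably `c(level)`; that is my inference from the context)

```stata
display $S_level   // the legacy global
display c(level)   // the same confidence level from the c() system values
```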
Thankfully, globals are quite explicit in Stata, and hence are easy to debug and remove. In some of the above situations, and certainly in the first one, you'd want to pass parameters to do-files, which are seen as the local `` `0' `` inside the do-file. Instead of using globals in the `myreg.do` file, I would probably code it as
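A sketch of that rewritten `myreg.do` (the details are mine):

```stata
* myreg.do, run as, e.g.:  do myreg weight displacement foreign
unab myvars : `0'
regress mpg `myvars'
regress mpg `myvars', vce(robust)
qreg mpg `myvars'
```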
The `unab` thing will serve as an element of protection: if the input is not a legal varlist, the program will stop with an error message.

In the worst cases I've seen, the global was used only once after having been defined.
There are occasions when you do want to use globals, because otherwise you'd have to pass the bloody thing to every other do-file or a program. One example where I found the globals pretty much unavoidable was coding a maximum likelihood estimator where I did not know in advance how many equations and parameters I would have. Stata insists that the (user-supplied) likelihood evaluator will have specific equations. So I had to accumulate my equations in the globals, and then call my evaluator with the globals in the descriptions of the syntax that Stata would need to parse:
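Schematically, something like this (equation names and contents invented for illustration):

```stata
* the globals $MLeq1, $MLeq2, ... were assembled earlier, one per equation, e.g.
* global MLeq1 (lambda1: y1 = x1 x2)
* global MLeq2 (lambda2: y2 = x1 x3)
ml model lf my_loglik $MLeq1 $MLeq2
ml maximize
```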
where `lf` was the objective function (the log-likelihood). I encountered this at least twice, in the normal mixture package (`denormix`) and the confirmatory factor analysis package (`confa`); you can `findit` both of them, of course.

One R example of a global variable that divides opinion is the `stringsAsFactors` issue on reading data into R or creating a data frame.
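A sketch of the issue (since R 4.0.0 the default is `stringsAsFactors = FALSE`, so the global option matters mainly for older code):

```r
## does a character column become a factor or stay character?
d1 <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
d2 <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
str(d1$x)   # factor
str(d2$x)   # character

## historically the default was read from a global option, so a line such as
##   options(stringsAsFactors = FALSE)
## anywhere in a session changed what data.frame() and read.csv() returned
```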
This can't really be corrected because of the way options are implemented in R: anything could change them without you knowing it, and thus the same chunk of code is not guaranteed to return exactly the same object. John Chambers bemoans this feature in his recent book.
In R you may also try to show them that there is often no need to use globals, since you can access variables defined in one function's scope from within another function simply by changing that function's environment. For example, the code below:
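One way such a demonstration might look (a sketch of mine, not necessarily the original):

```r
## helper() refers to 'total', which is neither an argument nor a global
helper <- function() total + 1

f <- function() {
  total <- 41
  environment(helper) <- environment()  # let helper() see f()'s local variables
  helper()
}

f()               # 42
exists("total")   # FALSE: nothing was written to the global workspace
```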