I got a strange result while doing my homework in R. Can anyone explain what's going on?
The instructions told me to set the seed to 1 for consistency.
At first, I called set.seed(1) twice:
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
set.seed(1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)
y <- 0.5 * x + epsilon - 1
plot(x, y, main = "Scatter plot between X and Y", xlab = "X", ylab = "Y")
I get a scatter plot like this: (figure: the plot with two set.seed calls)
When I call set.seed only once, the code is:
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)
y <- 0.5 * x + epsilon - 1
plot(x, y, main = "Scatter plot between X and Y", xlab = "X", ylab = "Y")
The plot becomes reasonable: (figure: the plot with one set.seed call)
Can anyone explain why adding an extra "set.seed(1)" makes the two results different?
set.seed() determines the random numbers that will be generated afterwards. In general, it is used to create reproducible examples, so that if we both run the same code, we get the same results. To illustrate:
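A minimal sketch of that behaviour; the variable names first, second and third are made up for this illustration and do not come from the question:

```r
set.seed(1)
first <- rnorm(3)    # three draws right after seeding

set.seed(1)
second <- rnorm(3)   # same seed again, so the same three draws
third <- rnorm(3)    # no reseed: continues further along the stream

identical(first, second)  # TRUE
identical(first, third)   # FALSE
```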
So when you call set.seed(x) twice with the same number, you generate the same random numbers from that point on. (This holds for variables with the same distribution; for others, see the elaboration below.) The reason you get a straight line in the first plot is that
y <- 0.5 * x + epsilon - 1
actually becomes
y <- 0.5 * x + x - 1
because you are using the same sequence of random numbers twice. That reduces to
y <- 1.5 * x - 1
And that is a simple linear equation.
So in general, you should only perform
set.seed(x)
once, at the beginning of your script.

Elaboration on the comment: "But I generated the Epsilon with different sd, why would that still be the same x, although the plot seems to agree with the explanation?"
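That's actually a really interesting question. Random numbers with distribution
~N(mean,sd)
are usually generated as follows:
1. Generate uniform random numbers on [0, 1].
2. Transform them into standard normal random numbers X ~ N(0,1).
3. Rescale them as sd * X + mean.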
When you run this twice with the same seed but a different mean and sd, the first two steps will create exactly the same results, since the random numbers generated are the same, and the mean and sd are not used yet. Only in the third step do the mean and sd come into play. We can easily verify this:
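One way to run that check, as a minimal sketch that reuses the x and epsilon definitions from the question:

```r
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)

set.seed(1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)

# Same underlying standard normal draws, only rescaled by sd
isTRUE(all.equal(epsilon, 0.25 * x))  # TRUE
```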
Indeed, the random numbers generated the second time are exactly 0.25 times the numbers generated the first time.
So in my explanation above, epsilon is actually 0.25*x, and your resulting function is
y <- 0.75 * x - 1
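which is still just a linear function.

Why the results were different: when set.seed is set once and rnorm is run twice, the second call returns different numbers from the first, whereas when set.seed is set again before the second call, the second call repeats the same underlying numbers as the first.
So, when the seed is set only once, the generator continues from where it left off and uses the next available numbers in the sequence for the next set of random numbers.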
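The two cases can be compared side by side. This is only a sketch: the names x1, eps1, y1, x2, eps2 and y2 are made up for the comparison, while the model itself is the one from the question.

```r
# Case 1: seed once, so epsilon comes from later in the random stream
set.seed(1)
x1 <- rnorm(100, mean = 0, sd = 1)
eps1 <- rnorm(100, mean = 0, sd = 0.25)
y1 <- 0.5 * x1 + eps1 - 1

# Case 2: reseed before epsilon, so epsilon reuses the draws behind x
set.seed(1)
x2 <- rnorm(100, mean = 0, sd = 1)
set.seed(1)
eps2 <- rnorm(100, mean = 0, sd = 0.25)
y2 <- 0.5 * x2 + eps2 - 1

cor(x1, y1)                            # below 1: points scatter around a line
isTRUE(all.equal(y2, 0.75 * x2 - 1))   # TRUE: points lie exactly on a line
```

In the reseeded case the noise term carries no independent information, which is why the first plot in the question collapses onto a straight line.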