Consider the following Matlab code, in which I generate some data using a pseudo-random number generator. I would like your help to understand "how" random these numbers are from a statistical point of view, in the terms I explain below.
I first set some parameters
%%%%%%%%Parameters
clear
rng default
Xsup=-1:6;
Zsup=1:10;
n_m=200;
n_w=200;
R=n_m;
Then I generate the data
%%%%%%%%Creation of data [XZ,etapair,zetapair,etasingle,zetasingle]
%Vector X of dimension n_mx1
idX=randi(size(Xsup,2),n_m,1); %n_mx1
X=Xsup(idX).'; %n_mx1
%Vector Z of dimension n_wx1
idZ=randi(size(Zsup,2),n_w,1);
Z=Zsup(idZ).'; %n_wx1
%Combine X and Z in a matrix XZ of dimension (n_m*n_w)x2,
%which lists all possible combinations of values in X and Z
[cX, cZ] = ndgrid(X,Z);
XZ = [cX(:), cZ(:)]; %(n_m*n_w)x2
%Vector etapair of dimension (n_m*n_w)x1
etapair=randn(n_m*n_w,1); %(n_m*n_w)x1
%Vector zetapair of dimension (n_m*n_w)x1
zetapair=randn(n_m*n_w,1); %(n_m*n_w)x1
%Vector etasingle of dimension (n_m*n_w)x1
etasingle=max(randn(n_m,R),[],2); %n_mx1
etasingle=repmat(etasingle, n_w,1); %(n_m*n_w)x1
%Vector zetasingle of dimension (n_m*n_w)x1
zetasingle=max(randn(n_w,R),[],2); %n_wx1
zetasingle=kron(zetasingle, ones(n_m,1)); %(n_m*n_w)x1
Let me now translate these draws into statistical terms:
For t=1,...,n_w*n_m, X(t) can be thought of as a realisation of a random variable X_t.
For t=1,...,n_w*n_m, Z(t) can be thought of as a realisation of a random variable Z_t.
For t=1,...,n_w*n_m, etapair(t) can be thought of as a realisation of a random variable E_t.
For t=1,...,n_w*n_m, zetapair(t) can be thought of as a realisation of a random variable Q_t.
For t=1,...,n_w*n_m, etasingle(t) can be thought of as a realisation of a random variable Y_t.
For t=1,...,n_w*n_m, zetasingle(t) can be thought of as a realisation of a random variable S_t.
My belief was that the pseudo-random number generator in Matlab allows one to claim that
(X_1,X_2,..., Z_1,Z_2,..., E_1,E_2,..., Q_1,Q_2,..., Y_1,Y_2,..., S_1,S_2,...)
are mutually independent, as explained here.
As a check of this hypothetical claim, I define W_t := -E_t - Q_t + Y_t + S_t and empirically compute Pr(W_t<=1 | X_t=5, Z_t=1). If mutual independence holds, then W_t is independent of (X_t, Z_t), so Pr(W_t<=1 | X_t=5, Z_t=1) = Pr(W_t<=1), and their empirical counterparts below, named option1 and option2, should be ALMOST the same.
%option 1
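%empirical counterpart of Pr(W_t<=1 | X_t=5, Z_t=1)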
num1=zeros(n_m*n_w,1);
for h=1:n_m*n_w
if -etapair(h)-zetapair(h)+etasingle(h)+zetasingle(h)<=1 && XZ(h,1)==5 && XZ(h,2)==1
num1(h)=1;
end
end
den1=zeros(n_m*n_w,1);
for h=1:n_m*n_w
if XZ(h,1)==5 && XZ(h,2)==1
den1(h)=1;
end
end
option1=sum(num1)/sum(den1);
%option 2
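%empirical counterpart of Pr(W_t<=1)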
num2=zeros(n_m*n_w,1);
for h=1:n_m*n_w
if -etapair(h)-zetapair(h)+etasingle(h)+zetasingle(h)<=1
num2(h)=1;
end
end
option2=sum(num2)/(n_m*n_w);
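As a side remark, the same two quantities can be computed without loops via logical indexing. The following is a minimal sketch (the names option1_vec and option2_vec are mine) that assumes the vectors created above are still in the workspace; it should reproduce option1 and option2 exactly.
%Vectorised equivalent of the two loops above
W=-etapair-zetapair+etasingle+zetasingle; %(n_m*n_w)x1
cond=XZ(:,1)==5 & XZ(:,2)==1; %indicator of the event {X_t=5, Z_t=1}
option1_vec=sum(W<=1 & cond)/sum(cond); %empirical Pr(W_t<=1 | X_t=5, Z_t=1)
option2_vec=mean(W<=1); %empirical Pr(W_t<=1)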
Question: is the difference between option1 (=0.0021) and option2 (=0.0012) covered by the "ALMOST", or am I doing something wrong?
By the very nature of observing random events, you cannot guarantee theoretically accurate results for a given empirical trial.
You have set rng default at the start of your script, which means you will always get the same result (option1 = 0.0021, option2 = 0.0012). Running your script many times and averaging the results, we should approach theoretical accuracy.
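As a minimal sketch of such a repetition: suppose the data generation and the option1/option2 computation above are wrapped in a function, hypothetically named runTrial, that redraws the data and returns both estimates. The trials can then be averaged as below; note that the fixed seed has to be varied across trials, otherwise every run is identical.
nTrials=1000; %number of repetitions (arbitrary choice)
results=zeros(nTrials,2); %column 1: option1, column 2: option2
for k=1:nTrials
    rng(k); %vary the seed per trial instead of rng default
    [opt1,opt2]=runTrial(); %hypothetical wrapper around the script above
    results(k,:)=[opt1,opt2];
end
mean(results) %the two column averages should be close to each other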
Results:
The averaged values of option1 and option2 agree to some degree of accuracy, which can be made arbitrarily high by running enough trials. This is as expected for independent variables.
Note, if you have the Parallel Computing Toolbox, this for loop can easily be swapped for a parfor, and you can run the trials many times faster.
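For illustration, and assuming results has been preallocated as in the sketch above, the swap would look like this:
parfor k=1:nTrials %requires the Parallel Computing Toolbox
    rng(k); %each iteration sets its own seed, so iterations remain independent
    [opt1,opt2]=runTrial(); %same hypothetical wrapper as before
    results(k,:)=[opt1,opt2];
end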