-->

PyMC regression of many regressions?

2019-09-16 19:24发布

问题:

I haven't been using PyMC for long, but I was pleased at how quickly I was able to get a linear regression off the ground (this code should run without modification in IPython):

import pandas as pd
from numpy import *
import pymc

data=pd.DataFrame(rand(40))
predictors=pd.DataFrame(rand(40,5))
sigma = pymc.Uniform('sigma', 0.0, 200.0, value=20)
params= array([pymc.Normal('%s_coef' % (c), mu=0, tau=1e-3,value=0) for c in predictors.columns])

@pymc.deterministic(plot=False)
def linear_regression_model(x=predictors,beta=params):
    return dot(x,beta)

ynode = pymc.Normal('ynode', mu=linear_regression_model, tau=1.0/sigma**2,value=data,observed=True)


m = pymc.Model(concatenate((params,array([sigma,ynode]))))

%time pymc.MCMC(m).sample(10000, 5000, 5, progress_bar=True)

In this model there are 40 subjects (observations) and 5 covariates for each subject. The model won't converge because of the random data, but it samples with no error (and my real data does converge to an accurate result).

The model I am having a problem with is an extension of this. Each subject actually has 3 (or N) observations and so I need to fit a line to these observations and then use the intercept of the line as the "data" or input for the final regression node. I think this a classical hierarchical model, but correct me if I'm thinking about it in the wrong way.

My solution was to set up a series of 40 linear regressions (one for each subject) and then use the vector of intercept parameters as the data for the final regression.

#nodes for fitting 3 values for each of 40 subjects with a line
#40 subjects, 3 data points each
data=pd.DataFrame(rand(40,3))
datax=arange(3)

"""
to fit a line to each subject's data we need:
    (1) a slope and offset parameter
    (2) a stochastic node for the data
    (3) a sigma parameter for the stochastic node

Initialize them all as object arrays
"""
sigmaArr=empty((len(data.index)),dtype=object)
ynodeArr=empty((len(data.index)),dtype=object)
slopeArr=empty((len(data.index)),dtype=object)
offsetArr=empty((len(data.index)),dtype=object)

#Fill in the empty arrays
for i,ID in enumerate(data.index):
    sigmaArr[i]=pymc.Uniform('sigma_%s' % (ID) , 0.0, 200.0, value=20)
    slopeArr[i]=pymc.Normal('%s_slope' % (ID), mu=0, tau=1e-3,value=0)
    offsetArr[i]=pymc.Normal('%s_avg' % (ID), mu=0, tau=1e-3,value=data.ix[ID].mean())

    #each model fits a line to the three data points
    @pymc.deterministic(name='time_model_%s' % ID,plot=False)
    def line_model(xx=datax,slope=slopeArr[i],avg=offsetArr[i]):
        return slope*xx + avg

    ynodeArr[i]=pymc.Normal('ynode_%s' % (ID), mu = line_model, tau = 1/sigmaArr[i]**2,value=data.ix[ID],observed=True)


#nodes for final regression (there are 5 covariates in this regression)
predictors=pd.DataFrame(rand(40,5))
sigma = pymc.Uniform('sigma', 0.0, 200.0, value=20)
params= array([pymc.Normal('%s_coef' % (c), mu=0, tau=1e-3,value=0) for c in predictors.columns])


@pymc.deterministic(plot=False)
def linear_regression_model(x=predictors,beta=params):
    return dot(x,beta)

ynode = pymc.Normal('ynode', mu=linear_regression_model, tau=1.0/sigma**2,value=offsetArr)

nodes=concatenate((sigmaArr,ynodeArr,slopeArr,offsetArr,params,array([sigma, ynode])))

m = pymc.Model(nodes)

%time pymc.MCMC(m).sample(10000, 5000, 5, progress_bar=True)

This model fails at the sampling step. The error appears to be in trying to cast offsetArr as dtype=float64 when instead its dtype=object. What is the correct way to do this? Do I need another deterministic node to help cast my offsetArr to float64? Do I need a special kind of pymc.Container? Thanks for your help!

回答1:

Have you tried using a simple list instead of numpy arrays to store the PyMC objects?