PyMC: Estimating population parameters where each

2019-05-01 23:25发布

问题:

I have a list of n observations, each of which is the sum of two Weibull-distributed variables:

x[i] = t1[i] + t2[i]
t1[i] ~ Weibull(shape1, scale1)
t2[i] ~ Weibull(shape2, scale2)

My goal is:

1) Estimate the shape and scale parameters for both Weibull distributions (shape1, scale1, shape2, scale2),

2) For each observation x[i], estimate t1[i] (and t2[i] follows from this).

(Aside: Each observation x[i] is the age of cancer diagnosis, and t1[i] and t2[i] are two different time periods in the development of the tumor. The actual model involves mutation data as well, but before I try that out, I want to make sure that I can use PyMC for this simpler problem.)

I am using PyMC2 to make these estimates, and it looks like the run converges, but to incorrect results. I do not know whether there is a problem with my PyMC model syntax, with the MCMC settings, or both. I tried adapting this advice on using Potentials to model latent variables. First I define x[i] and t1[i] for each observation:

for i in xrange(n):
    x[i] = pm.Index('x_%i'%i, x=data, index=i) # data is a list of observations
    t1[i] = pm.Weibull('t1_%i'%i, alpha=shape1, beta=scale1)
    # Ensure that initial guess for t1 is not more than the observed sum:
    if t1[i].value >= x[i].value:
        t1[i].value = 0.95 * x[i].value

Then I define a Deterministic for t2[i] = x[i] - t1[i]:

for i in xrange(n):
    def subtractfunc(t1=t1, x=x, ii=i):
        return x[ii] - t1[ii]
    t2[i] = pm.Lambda('t2_%i'%i, subtractfunc)

And last I define the Potential for t2[i]:

t2dist = np.empty(n, dtype=object)
for i in xrange(n):
    def weibfunc(t2=t2, shape2=shape2, scale2=scale2, ii=i):
        return pm.weibull_like(t2[ii], alpha=shape2, beta=scale2)
    t2dist[i] = pm.Potential(logp = weibfunc,
                               name = 't2dist_%i'%i,
                               parents = {'shape2':shape2, 'scale2':scale2, 't2':t2},
                               doc = 'weibull potential for t2',
                               verbose = 0,
                               cache_depth = 2)

You can see my full code here. I test by simulating 60 independent observations, with shape1 = 1, scale1 = 30, shape2 = 6.5, scale2 = 10, and I run 1e5 iterations of AdaptiveMetropolis. The results converge to a mean of shape1=1.94, scale1=37.9, shape2=0.55, scale2=36.1, and the 95% HPDs do not include the true values. This resulting distribution is not even in the right ballpark, as this histogram shows. (Blue shows the simulated data x[i] that I used, while the red shows the completely different inferred distribution from a representative iteration in the MCMC run.)

Running again with a different random seed, I get shape1=4.65, scale1=23.3, shape2=0.83, scale2=21.3. This distribution is somewhat closer to the truth. Is there some way to change the MCMC settings to consistently get decent results for this sort of problem? Any advice about using PyMC more effectively is much appreciated.

Update -- tried an "assisted" MCMC run:

I also tried assisting the MCMC run by initializing population-level parameters with values close to the truth. The results are somewhat better, but I now find a systematic bias. The histogram below shows the true distribution of observations (blue) against the fitted distribution (red). The right tail fits nicely, but the fit fails to capture the sharp peak at the left side. This bias occurs consistently, for population sizes n = 60 and 100. I am not sure if this is more of a PyMC question or a general MCMC algorithm issue.

标签: python pymc mcmc