Comparing a model distribution with observations containing detections and upper limits

 Posted by Gaspar, Andras at July 09. 2013

Dear Experts,

I am currently working on a sample with a limited number of detections and many upper limits.
I also have Monte Carlo code output to compare the observations with. I have been using ASURV,
with some interesting results, and I would like to ask for some assistance.

I have a sample of X sources; about half of the sample has upper limits, while the other half
has detections. I would like to compare the observed sample with the modeled distribution from the
MC code.

I have been using ASURV to do this, as I have a censored dataset. The biggest issue is that the
model data itself is not censored, so the two-sample tests are not a perfect fit. I have tried artificially
censoring the model data with censoring patterns similar to those of the observations, but I do not think
that is the right way to go.

I have tried the following scenarios:

1) Setting all observations below a limit to an upper-limit value equal to the cutoff, and doing the same
with the model data. This constrains the data too well, and the constraint is one I imposed artificially.

2) Setting all observations as upper limits and not censoring the model data at all.

3) Separating the observations into detections and upper limits, as they originally were, and not censoring
the model data at all.

4) Separating the observations into detections and upper limits, as they originally were, and mimicking the
censoring pattern in the model data. Only the Peto & Peto generalized Wilcoxon test gave a solution with this
method; I don’t understand why the others did not work.

——–

I am currently considering using the Kaplan-Meier estimator, since with it you can compare non-censored
distributions with censored data. Using ASURV, I have been able to plot the KM estimator of both the observed
dataset and the model distribution; however, I am having a hard time quantifying the probability that the
observed KM distribution is drawn from the model KM distribution. I have seen on Wikipedia that a logrank test
can do this, but I am quite certain that the logrank test in ASURV is for two observed samples with censored data.
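For concreteness, here is a minimal sketch of the KM step in Python; all flux values are made-up illustrations, and upper limits (left-censoring) are handled via the usual sign-flip trick that turns them into right-censoring:

```python
import numpy as np

def km_survival(times, observed):
    """Kaplan-Meier survival curve for right-censored data.

    times    : values (events and censoring times)
    observed : boolean array, True where the value is a real event
    Returns (unique event times, S(t) just after each event time).
    """
    order = np.argsort(times)
    times = np.asarray(times, dtype=float)[order]
    observed = np.asarray(observed)[order]
    event_times = np.unique(times[observed])
    surv, s = [], 1.0
    for t in event_times:
        n_at_risk = np.sum(times >= t)          # still "alive" at t
        d = np.sum((times == t) & observed)     # events exactly at t
        s *= 1.0 - d / n_at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Upper limits are *left*-censoring: flip the sign so they become
# right-censoring in y = -x, then fit the usual KM estimator.
fluxes = np.array([1.2, 0.8, 2.5, 0.5, 3.1])               # hypothetical fluxes
is_detection = np.array([True, False, True, False, True])  # False = upper limit
t, s = km_survival(-fluxes, is_detection)

# S_y just after each event in y corresponds to the CDF of x evaluated
# just *below* the matching detected value (one tie convention among several).
x_vals = -t[::-1]
cdf_x = s[::-1]
```

With these toy numbers the two upper limits get redistributed below the detections, giving CDF steps of 0.4, 0.6, 0.8 just below the detected fluxes 1.2, 2.5, 3.1.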

Any help is appreciated.

 Posted by Cameron, Ewan at July 14. 2013

Hi Gaspar,

Goodness-of-fit type tests are not easy to devise for censored data, and in particular there is no result as ‘general’ as the ordinary K-S test for non-censored data.  So I am not surprised that you find some conflicting results from your initial investigations.

A couple of points are worth clarifying.  (1) The idea of “quantifying the probability that the
observed KM distribution is drawn from the model KM distribution” is ill-defined, in the sense that such a statement can only be made relative to some alternative distribution or family of distributions. The usual direction of the K-S test is to establish the confidence level at which we can *reject* the model distribution as having generated the observed data.  In this sense you could well make up your own test statistic for the problem at hand (e.g. the K-S statistic computed from only the non-censored data; or the above added to the fraction of censored datapoints; or something else just as odd) … the point being to then derive a rejection region for this statistic by sampling and censoring from your model distribution.
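The "make up your own statistic and calibrate it by simulation" idea might be sketched like this; the exponential model, the sample size, and the threshold playing the role of the upper limits are all illustrative stand-ins, not anyone's actual data or model:

```python
import numpy as np

rng = np.random.default_rng(0)

def statistic(sample, censored, model_cdf):
    """Hypothetical statistic: K-S-like distance of the detections from the model CDF."""
    det = np.sort(sample[~censored])
    ecdf = np.arange(1, det.size + 1) / det.size
    return np.max(np.abs(ecdf - model_cdf(det)))

model_cdf = lambda x: 1.0 - np.exp(-x)   # stand-in for the MC model's CDF
obs = rng.exponential(1.0, 40)           # stand-in observations
t_obs = statistic(obs, obs < 0.5, model_cdf)   # values below 0.5 play the upper limits

# Rejection region: simulate from the model, censor the same way as the
# observations, and see how often the statistic exceeds the observed value.
null = []
for _ in range(2000):
    sim = rng.exponential(1.0, 40)
    null.append(statistic(sim, sim < 0.5, model_cdf))
p_value = np.mean(np.asarray(null) >= t_obs)
```

The statistic itself can be "odd" (here the detections' ECDF is compared to the unconditional model CDF, which on its own would be biased); the Monte Carlo calibration is what makes the resulting rejection region valid.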

(2) I have the impression that you are simultaneously trying to constrain some parameters for your model here too.  In this case I would recommend Approximate Bayesian Computation (cf. my paper on this; or the better-written one by Weyant et al.), which is designed for cases where you do not know the analytic form of the likelihood function for the observed data given your model, but you can simulate mock datasets from that model for comparison to the observations.
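As a rough illustration of the rejection-ABC idea, here is a sketch in which everything, the prior range, the forward model, the summary statistic, and the tolerance, is a made-up stand-in; a real application would simulate mock datasets from the MC code and use a summary suited to censored data:

```python
import numpy as np

rng = np.random.default_rng(1)

obs = rng.exponential(2.0, 50)       # stand-in for the observed sample
obs_summary = np.median(obs)         # illustrative summary statistic

def simulate(theta, n=50):
    """Hypothetical forward model: here just an exponential with scale theta."""
    return rng.exponential(theta, n)

# Rejection ABC: keep parameter draws whose mock data land close to the
# observations under the chosen summary distance.
accepted = []
for _ in range(5000):
    theta = rng.uniform(0.1, 5.0)    # draw from a flat prior
    mock = simulate(theta)
    if abs(np.median(mock) - obs_summary) < 0.2:   # tolerance epsilon
        accepted.append(theta)
posterior = np.array(accepted)       # approximate posterior sample for theta
```

Shrinking the tolerance tightens the approximation to the true posterior at the cost of a lower acceptance rate, which is the basic trade-off in ABC.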

Hope that helps,

Ewan.

 Posted by Gaspar, Andras at July 15. 2013

Hi Ewan,

Thank you for your reply and for elaborating on these points. What you write is exactly what I have been converging toward: goodness-of-fit is not easy for model vs. observed censored data.

I think I have settled on doing exactly what you recommend, which is basically a KS statistic on the non-censored part of the distribution, combined with the knowledge of the number of points below the detection threshold. The answers I have been getting this way are reasonable.

Although the model has multiple variables, I am really only trying to constrain one of them, though we do have one computation where we look at two variables. I’ll look into your Approximate Bayesian Computation paper!

Thank you for your help,

Andras

 Posted by Gaspar, Andras at July 15. 2013

… slight clarification …

What I settled on was a KS test on the KM-corrected dataset, looking only at points above the detection threshold. The KM estimate yields a sort of “completeness-corrected” cumulative distribution function, so performing the KS test on it should be OK. What are your thoughts on this?

Thank you

 Posted by Cameron, Ewan at July 15. 2013

Sounds fine: it’s basically doing a K-S test of x ~[iid] f_obs(x|x>x_lim) against f_mod(x|x>x_lim) … you are still able to reject the hypothesis that f_obs is equivalent to f_mod, just at lower power than we would have had from the full dataset without truncation.  Someone else might still be able to suggest a more powerful test that also uses the truncated datapoints … but I would imagine it would carry some stronger assumptions about f_obs and f_mod.
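A minimal sketch of that truncated comparison, using scipy's one-sample K-S test against the conditional CDF F_mod(x | x > x_lim); the exponential model, sample size, and threshold are placeholders:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x_lim = 0.5                          # illustrative detection threshold
model = stats.expon(scale=1.0)       # stand-in for the MC model

obs = rng.exponential(1.0, 200)      # stand-in observations
detections = obs[obs > x_lim]        # keep only the points above the threshold

def conditional_cdf(x):
    """F_mod(x | x > x_lim) = (F(x) - F(x_lim)) / (1 - F(x_lim))."""
    return (model.cdf(x) - model.cdf(x_lim)) / (1.0 - model.cdf(x_lim))

# One-sample K-S test of the detections against the truncated model CDF.
d_stat, p = stats.kstest(detections, conditional_cdf)
```

A small p-value rejects the hypothesis that the detections were drawn from the model above the threshold; as noted, discarding the truncated points costs power but requires no extra assumptions.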

 Posted by Gaspar, Andras at August 21. 2013

Hi Cameron,

Thank you for your assistance. I just wanted to let you know that we acknowledge your help in our submitted paper (http://lanl.arxiv.org/abs/1308.1954).

-Andras

 Posted by Gaspar, Andras at August 21. 2013

Sorry, I meant Ewan.