Censoring in Bimodality Tests

 Posted by Desjardins, Tyler at February 07. 2014

I have a distribution of galaxy masses that is suggestive of bimodality. I would like to test whether this is statistically significant using, e.g., the Hartigan dip test in R, but I would also like to include the effects of censoring on the data. Specifically, I have a number of lower limits that are, with one exception, on the high-mass side of the dip. Accounting for these limits would clearly shift the result of the test in favor of bimodality, but because the diptest package in R cannot handle censored data, the result is only marginally significant (the result of the dip test is 0.07). Is there any way to include censoring with this test, or is there another test that can accept censoring?

Tyler D. Desjardins

Ph.D. Candidate,

The University of Western Ontario

 Posted by Zelterman, Daniel at February 07. 2014

Yes, this is a good thesis topic.

By itself, you probably can’t test for bimodality with censored data. But you might make some progress if you can make some assumptions about the censoring mechanisms. Is censoring more likely with larger values, for example, or is censoring a random phenomenon, unrelated to anything else? Are samples from one population more likely to be censored than samples from the other?

Can you say anything about the underlying, uncensored population distributions?  Are the modes sufficiently far apart, for example, so that testing complete data is a reasonable task?

Another suggestion is to skip the test and estimate the parameters of the separate, mixed populations. Consider an EM algorithm that alternates between estimating the parameters of each population and estimating the probability that each observation belongs to each population. With a few additional assumptions, you might even be able to estimate the true values of those that were censored.

Dan Zelterman
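
The EM idea above might be sketched as follows, in Python with only the standard library. Everything specific here is an assumption, not something stated in the thread: two Gaussian components (e.g. in log mass), lower limits treated as right-censored values, censoring unrelated to the true mass, and the hypothetical function name `em_censored_mixture`.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_sf(z):
    """Standard normal survival function 1 - Phi(z), floored to avoid division by zero."""
    return max(0.5 * math.erfc(z / math.sqrt(2.0)), 1e-300)

def em_censored_mixture(values, is_lower_limit, n_iter=200):
    """EM for a two-Gaussian mixture where is_lower_limit[i] marks values[i] as a lower limit.

    Returns (weights, means, sigmas). Assumes censoring is independent of the true value.
    """
    srt = sorted(values)
    n = len(values)
    mu = [srt[n // 4], srt[(3 * n) // 4]]      # crude starting guesses from the quartiles
    spread = (srt[-1] - srt[0]) / 4.0 or 1.0
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        resp_sum, s1, s2 = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
        # E-step: membership probabilities, plus expected x and x^2 for censored points
        for x, cens in zip(values, is_lower_limit):
            like, m1, m2 = [], [], []
            for k in range(2):
                z = (x - mu[k]) / sigma[k]
                if cens:
                    sf = norm_sf(z)
                    lam = norm_pdf(z) / sf                 # hazard of the truncated normal
                    ex = mu[k] + sigma[k] * lam            # E[true value | value > limit]
                    var = sigma[k] ** 2 * (1.0 + z * lam - lam * lam)
                    like.append(weight[k] * sf)
                    m1.append(ex)
                    m2.append(var + ex * ex)
                else:
                    like.append(weight[k] * norm_pdf(z) / sigma[k])
                    m1.append(x)
                    m2.append(x * x)
            total = sum(like) or 1e-300
            for k in range(2):
                r = like[k] / total
                resp_sum[k] += r
                s1[k] += r * m1[k]
                s2[k] += r * m2[k]
        # M-step: update weights, means, and standard deviations
        for k in range(2):
            if resp_sum[k] <= 0.0:
                continue
            weight[k] = resp_sum[k] / n
            mu[k] = s1[k] / resp_sum[k]
            sigma[k] = math.sqrt(max(s2[k] / resp_sum[k] - mu[k] ** 2, 1e-12))
    return weight, mu, sigma
```

The quantity `ex` computed for each lower limit is the conditional expectation of the true value given that it exceeds the limit, which is one way to read "estimate the true values of those that were censored". If censoring depends on the mass itself, this sketch would need a model for the censoring mechanism as well.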

 

 Posted by Cameron, Ewan at February 08. 2014

Hi Tyler,

I’m more of a Bayesian so my first instinct would be to treat the problem as one of Bayesian model selection; similar in spirit to Dan’s suggestion of estimating the parameters of the separate, mixed populations.

But one idea I had was that you could *estimate* the p-value for the complete sample by treating the non-censored observations as an empirical distribution approximation to the true underlying distribution. E.g., for each censored galaxy, draw an estimated (mock) mass from amongst the uncensored galaxies with masses greater than its censoring limit, and compute the test statistic on the full sample using the uncensored masses as usual and these mock masses for the censored galaxies. This wouldn’t work for something like the K-S test, where ties are a problem, but it might work for the Hartigan dip test.

cheers,

Ewan.
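
A minimal sketch of the mock-mass resampling described above, in Python with only the standard library. The function name `imputed_statistics`, the `statistic` callable (any function mapping a list of masses to a number, such as a dip-test p-value from whatever package is in use), and the fallback when no uncensored mass lies above a limit are all assumptions made for illustration.

```python
import random

def imputed_statistics(masses, is_lower_limit, statistic, n_draws=1000, seed=0):
    """Recompute `statistic` on samples where each lower limit is replaced by a mock mass.

    Each mock mass is drawn at random from the uncensored masses that exceed the limit.
    Returns the list of statistic values, one per draw.
    """
    rng = random.Random(seed)
    observed = [m for m, c in zip(masses, is_lower_limit) if not c]
    limits = [m for m, c in zip(masses, is_lower_limit) if c]
    # For each lower limit, the pool of uncensored masses above it
    pools = [[m for m in observed if m > lim] for lim in limits]
    results = []
    for _ in range(n_draws):
        mock = [
            # Assumption: if nothing uncensored lies above the limit, keep the limit itself
            rng.choice(pool) if pool else lim
            for lim, pool in zip(limits, pools)
        ]
        results.append(statistic(observed + mock))
    return results
```

The spread of the returned values gives a rough sense of how sensitive the statistic is to the unknown true masses. Because every mock mass duplicates an observed one, the completed samples contain ties, so whether the result is meaningful depends on how the chosen statistic treats repeated values.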

 Posted by Cameron, Ewan at February 10. 2014

In fact, I just had a read of some details of the Hartigan method and its statistic is based on a maximum difference over the observed datapoints, so I suspect that bootstrapping will unfortunately not solve this problem.