Doubly Censored Data Regression – Astrostatistics and Astroinformatics Portal

Posted by Shivaei, Irene at November 14. 2013

Hello everyone,

I am working with a sample of galaxies with multi-wavelength data and I am trying to find a relation between quantities that we measure based on different luminosities (say H-alpha, UV, IR). In this dataset I have undetected objects for both IR and H-alpha.

Now I want to find the relation between a quantity measured from Ha (such as star formation rate) and the same quantity measured from IR, by fitting a line to the logarithmic values. Since I have doubly censored data (both y and x – IR and Ha – are subject to censoring), the best way to calculate the fit parameters is to use the Akritas–Theil–Sen (ATS) line, which is provided by the cenken function (NADA library).

I have a few questions in this regard:

1) When I run the cenken function on my data, I sometimes get a NULL value for the intercept. Why is that, and how can I avoid it? Is it something I should be worried about?

2) The cenken function in the NADA library only provides the slope and intercept, not the uncertainties in those parameters. Is there any package that provides a line-fit function for doubly censored data with uncertainties?

3) Is there a way to force the intercept to be zero?

Thanks a lot in advance for your help. It is much appreciated.

Posted by Cameron, Ewan at November 20. 2013

Hi Irene,

I’ve never used this estimator or R package before, but since no one better qualified has answered yet I can give my two cents. For uncertainties it seems you will need to do your own bootstrap: the way to do this is slightly more complicated/subtle for the ATS estimator than I would have expected but shouldn’t take long to code up (see, e.g., page 208 of Wilcox http://books.google.com.au/books?id=YSFb4QX2UIoC&pg=PA207&redir_esc=y#v=onepage&q&f=false ).

Apparently Wilcox’s method also gives an uncertainty on the intercept (which is promising because my first thought was that perhaps the ATS estimator is only consistent for the slope but not necessarily the intercept?).
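As a rough illustration of the do-it-yourself bootstrap, here is a minimal Python sketch of a plain percentile bootstrap that resamples whole galaxies (x, y and both censoring flags together). This is the generic pairs bootstrap and not necessarily the variant Wilcox describes. All function names are hypothetical, and the stand-in `theil_sen_slope` ignores the censoring flags entirely; it is there only so the sketch runs, and should be replaced by whatever censored slope estimator is actually in use.

```python
import random
from statistics import median


def theil_sen_slope(x, y, xcen, ycen):
    """Stand-in estimator (assumption): plain Theil-Sen slope.

    It IGNORES xcen and ycen. Swap in a censored estimator with the
    same call signature for real use.
    """
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n)
              if x[j] != x[i]]
    return median(slopes) if slopes else None


def bootstrap_slope_interval(x, y, xcen, ycen, estimator,
                             n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the slope.

    Each replicate draws n objects with replacement, keeping every
    object's values and censoring flags together.
    """
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        b = estimator([x[i] for i in idx], [y[i] for i in idx],
                      [xcen[i] for i in idx], [ycen[i] for i in idx])
        if b is not None:  # skip degenerate resamples with no usable pair
            slopes.append(b)
    if not slopes:
        raise ValueError("no bootstrap replicate produced a slope")
    slopes.sort()
    alpha = (1.0 - level) / 2.0
    lo = slopes[int(alpha * (len(slopes) - 1))]
    hi = slopes[int((1.0 - alpha) * (len(slopes) - 1))]
    return lo, hi
```

Calling `bootstrap_slope_interval(x, y, xcen, ycen, theil_sen_slope)` returns a (lower, upper) pair for the slope; the same loop can collect intercepts if the estimator returns them.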

These NULL values mostly arise when the intercept tends towards infinity, like in logistic regression when the data are completely separable. So I would imagine this NULL is only returned when the slope is something near 0. If not, then you’ll have to dig deeper into the Icens package (called by NADA) to find out how the slope is being computed …

Good luck!

Ewan

Posted by Feigelson, Eric at December 04. 2013

Irene,

There is a small research literature on doubly censored linear regression (below). But you should check: ‘double censorship’ can either mean that both variables are right(left)-censored, or that the dependent variable is both right and left censored. I do not know if any of these have been implemented in the R/CRAN system.

Linear regression with doubly censored data Cun-Hui Zhang and Xin Li Annals of Statistics 24, 2720-2743, 1996

Regression M-estimators with doubly censored data Minggao Gu and Jian-Jian Ren Ann. Statist. Volume 25, Number 6 (1997), 2638-2664. [M-estimators are robust against outliers]

Regression M-estimators with non-i.i.d. doubly censored data Jian-Jian Ren Ann. Statist. Volume 31, Number 4 (2003), 1186-1219.

Quantile regression with doubly censored data Guixian Lin, Xuming He, Stephen Portnoy Computational Statistics & Data Analysis Volume 56, Issue 4, 1 April 2012, Pages 797–812 [This gives regression lines for the median and other quantiles, good for large datasets with non-Gaussian scatter]

Quantile Regression for Doubly Censored Data Shuang Ji, Limin Peng, Yu Cheng, HuiChuan Lai, Biometrics Volume 68, Issue 1, pages 101–112, March 2012

There is a trick that you might consider. I am not sure that it gives correct confidence intervals, as it involves multiple hypothesis tests on a single dataset. Let me know if you want advice from a statistician on this point. Here goes:

1) Your data are of the form (x, y), where both x and y are left-censored.
2) Transform your data using a trial linear slope beta_0: y’ = y – beta_0 x.
3) Apply a generalized Kendall’s tau test (e.g. cenken in CRAN’s NADA package) to x and y’, and check whether the hypothesis of no correlation is rejected at (say) 95% confidence.
4) Repeat this for a range of beta_0 values to find the range of slopes for which no correlation is detected. This range gives an estimate of the confidence interval of beta, although it does not give a best-fit value for beta.
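The slope scan can be sketched in Python as follows. This is only an illustration under several assumptions, not what cenken does internally. Each value is treated as an interval: a detection is a point, and a left-censored value is (-inf, limit]. Two items are ordered only when their intervals do not overlap. The residual interval for y – beta_0 x is derived from the censoring flags and the sign of beta_0. Significance uses the normal approximation with the permutation variance for antisymmetric pair scores. All function names are hypothetical.

```python
import math


def sign_score(lo_i, hi_i, lo_j, hi_j):
    """+1 if item i is definitely above item j, -1 if definitely below, else 0."""
    if lo_i > hi_j:
        return 1
    if hi_i < lo_j:
        return -1
    return 0


def generalized_kendall_z(x_int, r_int):
    """z-score of the generalized Kendall S for interval-valued data (n >= 3)."""
    n = len(x_int)
    a = [[sign_score(*x_int[i], *x_int[j]) for j in range(n)] for i in range(n)]
    b = [[sign_score(*r_int[i], *r_int[j]) for j in range(n)] for i in range(n)]
    # S summed over ordered pairs, so each unordered pair is counted twice.
    s = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))

    def parts(m):
        sq = sum(v * v for row in m for v in row)   # sum of squared scores
        rows = sum(sum(row) ** 2 for row in m)      # sum of squared row totals
        return rows - sq, sq

    a_core, a_sq = parts(a)
    b_core, b_sq = parts(b)
    var = (4.0 * a_core * b_core / (n * (n - 1) * (n - 2))
           + 2.0 * a_sq * b_sq / (n * (n - 1)))
    return s / math.sqrt(var) if var > 0 else 0.0


def to_interval(value, censored):
    """Left-censored value -> (-inf, limit]; detection -> a single point."""
    return (-math.inf, value) if censored else (value, value)


def residual_interval(x, y, xcen, ycen, beta):
    """Interval for y - beta*x given upper limits on x and/or y."""
    r = y - beta * x
    lo = -math.inf if (ycen or (xcen and beta < 0)) else r
    hi = math.inf if (xcen and beta > 0) else r
    return lo, hi


def slopes_consistent_with_no_trend(x, y, xcen, ycen, betas, z_crit=1.96):
    """Return the trial slopes for which no residual correlation is detected."""
    x_int = [to_interval(v, c) for v, c in zip(x, xcen)]
    kept = []
    for beta in betas:
        r_int = [residual_interval(*row, beta) for row in zip(x, y, xcen, ycen)]
        if abs(generalized_kendall_z(x_int, r_int)) < z_crit:
            kept.append(beta)
    return kept
```

The smallest and largest values returned by `slopes_consistent_with_no_trend` bracket the approximate interval for beta; the caveat above about repeated tests on a single dataset applies to this sketch as well.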

Posted by Shivaei, Irene at December 04. 2013

Thanks Ewan and Eric for your helpful comments. I should work on it.