Past Seminar Series - McGill Statistics Seminars
  • Regularized semiparametric functional linear regression

    Date: 2012-09-21

    Time: 14:30-15:30

    Location: McGill, Burnside Hall 1214

    Abstract:

    In many scientific experiments we must analyze functional data, where the observations are sampled from a random process, together with a potentially large number of non-functional covariates. The complex nature of functional data makes it difficult to apply existing model selection and estimation methods directly. We propose and study a new class of penalized semiparametric functional linear regression models to characterize the regression relation between a scalar response and multiple covariates, including both functional and scalar covariates. The resulting method provides a unified and flexible framework to jointly model functional and non-functional predictors, identify important covariates, and improve the efficiency and interpretability of the estimates. Featuring two types of regularization, shrinkage on the effects of the scalar covariates and truncation on the principal components of the functional predictor, the new approach is flexible and effective for dimension reduction. One key contribution of this paper is the study of the theoretical properties of the regularized semiparametric functional linear model. We establish oracle and consistency properties under mild conditions, allowing a possibly diverging number of scalar covariates while simultaneously taking the infinite-dimensional functional predictor into account. We illustrate the new estimator with extensive simulation studies and then apply it to an image data analysis.
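
    As a toy sketch of the two regularization steps described above (truncating the functional predictor to a few principal component scores, then applying lasso shrinkage), the following uses simulated curves on a common grid; the data, tuning value, and off-the-shelf lasso fit are illustrative stand-ins, not the authors' estimator.

      # Illustrative only: represent the functional predictor by its first K
      # principal component scores, then apply lasso shrinkage to the combined
      # design of scores and scalar covariates.
      import numpy as np
      from sklearn.linear_model import Lasso

      rng = np.random.default_rng(0)
      n, n_grid, p_scalar, K = 200, 101, 10, 4        # K = retained components
      t = np.linspace(0, 1, n_grid)

      X_fun = rng.normal(size=(n, n_grid)).cumsum(axis=1) / np.sqrt(n_grid)  # rough random curves
      Z = rng.normal(size=(n, p_scalar))                                     # scalar covariates
      gamma = np.r_[1.5, -1.0, np.zeros(p_scalar - 2)]                       # sparse scalar effects
      y = X_fun @ np.sin(2 * np.pi * t) / n_grid + Z @ gamma + rng.normal(scale=0.5, size=n)

      # Truncation: project the centred curves onto their leading principal components.
      Xc = X_fun - X_fun.mean(axis=0)
      _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
      scores = Xc @ Vt[:K].T

      # Shrinkage: lasso on [FPC scores, scalar covariates].
      fit = Lasso(alpha=0.05).fit(np.hstack([scores, Z]), y)
      print("selected scalar covariates:", np.flatnonzero(fit.coef_[K:]))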

  • Li: High-dimensional feature selection using hierarchical Bayesian logistic regression with heavy-tailed priors | Rao: Best predictive estimation for linear mixed models with applications to small area estimation

    Date: 2012-04-13

    Time: 14:00-16:30

    Location: MAASS 217

    Abstract:

    Li: The problem of selecting the most useful features from a great many (e.g., thousands of) candidates arises in many areas of modern science. An interesting problem from genomic research is that, from thousands of genes that are active (expressed) in certain tissue cells, we want to find the genes that can be used to separate tissues of different classes (e.g., cancer and normal). In this paper, we report a Bayesian logistic regression method based on heavy-tailed priors with moderately small degrees of freedom (such as 1) and small scale (such as 0.01), using Gibbs sampling to do the computation. We show that it can distinctly separate a couple of useful features from a large number of useless ones and discriminate among many redundant correlated features. We also show that this method is very stable with respect to the choice of scale. We apply our method to a microarray data set related to prostate cancer, and identify only 3 genes out of 6033 candidates that separate cancer and normal tissues very well in leave-one-out cross-validation.
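
    A purely illustrative sketch of this kind of model: logistic regression with Student-t priors of small degrees of freedom (1) and small scale (0.01) on the coefficients, sampled here by a simple random-walk Metropolis algorithm standing in for the authors' Gibbs sampler; the simulated data and tuning constants are invented for the example.

      # Toy example, not the authors' implementation: heavy-tailed t(df=1, scale=0.01)
      # priors on logistic regression coefficients, sampled by random-walk Metropolis.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      n, p = 100, 20
      X = rng.normal(size=(n, p))
      beta_true = np.r_[3.0, -3.0, np.zeros(p - 2)]          # only two useful features
      y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

      def log_post(beta):
          eta = X @ beta
          loglik = np.sum(y * eta - np.log1p(np.exp(eta)))   # Bernoulli log-likelihood
          logprior = stats.t.logpdf(beta, df=1, scale=0.01).sum()
          return loglik + logprior

      beta, draws = np.zeros(p), []
      for _ in range(5000):
          prop = beta + 0.05 * rng.normal(size=p)            # random-walk proposal
          if np.log(rng.uniform()) < log_post(prop) - log_post(beta):
              beta = prop
          draws.append(beta.copy())
      post_mean = np.mean(draws[2500:], axis=0)              # discard burn-in
      print("largest posterior means:", np.argsort(-np.abs(post_mean))[:3])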

  • Hypothesis testing in finite mixture models: from the likelihood ratio test to EM-test

    Date: 2012-04-05

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In the presence of heterogeneity, a mixture model is the most natural way to characterize the random behavior of samples taken from such populations. This strategy has been widely employed in applications ranging from genetics and information technology to marketing and finance. Studying the mixing structure behind a random sample from the population allows us to infer the degree of heterogeneity, with important implications in applications such as detecting the presence of disease subgroups in genetics. The statistical problem is to test hypotheses about the order of the finite mixture model. There has been continued interest in the limiting behavior of the corresponding likelihood ratio tests. The non-regularity of finite mixture models has provided statisticians with ample examples of unusual limiting distributions, yet many of these results are not convenient for conducting hypothesis tests. Motivated by the need to overcome such difficulties, we have developed a number of strategies to obtain tests that are highly efficient yet have easy-to-use limiting distributions. The latest development is a class of EM-tests which are advantageous in many respects: their limiting distributions are easier to derive mathematically, simple to implement in data analysis, and valid for a more general class of mixture models without restrictions on the space of the mixing distribution. Simulations indicate that the limiting distributions approximate the finite-sample distributions well in the examples investigated.
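
    The sketch below is not the EM-test itself; it only computes the basic ingredient discussed above, the likelihood ratio statistic comparing one- and two-component normal mixtures, whose non-standard null behavior motivates the work. The data and settings are invented.

      # Toy computation of the mixture likelihood ratio statistic (not the EM-test).
      import numpy as np
      from sklearn.mixture import GaussianMixture

      rng = np.random.default_rng(2)
      x = rng.normal(size=(300, 1))                # generated under the null: one component

      loglik = {k: GaussianMixture(n_components=k, n_init=5, random_state=0)
                       .fit(x).score(x) * len(x)   # score() = mean log-likelihood per point
                for k in (1, 2)}
      lrt = 2 * (loglik[2] - loglik[1])
      print(f"likelihood ratio statistic: {lrt:.2f}")   # its null distribution is non-standard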

  • A matching-based approach to assessing the surrogate value of a biomarker

    Date: 2012-03-30

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Statisticians have developed a number of frameworks that can be used to assess the surrogate value of a biomarker, i.e., to establish whether treatment effects on a biological quantity measured shortly after administration of treatment predict treatment effects on the clinical endpoint of interest. The most commonly applied of these frameworks is due to Prentice (1989), who proposed a set of criteria that a surrogate marker should satisfy. However, verifying these criteria from observed data can be challenging because of unmeasured simultaneous predictors (i.e., confounders) which influence both the potential surrogate and the outcome. In this work, we adapt a technique proposed by Rosenbaum (2002) for observational studies, in which observations are matched and the odds of treatment within each matched pair are bounded. This yields a straightforward and interpretable sensitivity analysis which can be performed particularly efficiently for certain types of test statistics. In this talk, I will introduce the surrogate endpoint problem, discuss the details of my proposed technique for assessing surrogate value, and illustrate it with some simulated examples inspired by the problem of identifying immune surrogates in HIV vaccine trials.
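
    To make the matched-pair idea concrete, here is a hedged sketch of a Rosenbaum-type sensitivity bound for a simple sign test on matched-pair differences: bounding the within-pair odds of treatment by Gamma bounds, under the null, the chance that the treated unit shows the larger response, which gives a worst-case p-value. The pairing, test statistic, and data are invented for illustration and are not the speaker's exact procedure.

      # Illustration only: Rosenbaum-style worst-case p-values for a matched-pair
      # sign test, over several values of the sensitivity parameter Gamma.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(3)
      n_pairs = 60
      diffs = rng.normal(loc=0.4, size=n_pairs)      # treated-minus-control differences (made up)
      t_plus = int(np.sum(diffs > 0))                # pairs where the treated unit is larger

      for gamma in (1.0, 1.5, 2.0):
          p_upper = gamma / (1 + gamma)              # worst-case within-pair success probability
          worst_p = stats.binom.sf(t_plus - 1, n_pairs, p_upper)   # one-sided upper bound
          print(f"Gamma = {gamma:.1f}: worst-case p-value = {worst_p:.3f}")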

  • Model selection principles in misspecified models

    Date: 2012-03-23

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Model selection is of fundamental importance to high-dimensional modeling, featured in many contemporary applications. Classical principles of model selection include the Bayesian principle and the Kullback-Leibler divergence principle, which lead to the Bayesian information criterion and the Akaike information criterion, respectively, when models are correctly specified. Yet model misspecification is unavoidable in practice. We derive novel asymptotic expansions of the two well-known principles in misspecified generalized linear models, which give the generalized BIC (GBIC) and the generalized AIC. A specific form of prior probabilities motivated by the Kullback-Leibler divergence principle leads to the generalized BIC with prior probability ($\mbox{GBIC}_p$), which can be naturally decomposed as the sum of the negative maximum quasi-log-likelihood, a penalty on model dimensionality, and a penalty on model misspecification. Numerical studies demonstrate the advantage of the new methods for model selection in both correctly specified and misspecified models.
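
    Schematically, and only to record the three-part structure stated above (the precise penalty terms and any scaling constants are derived in the talk), the criterion for a candidate model $\mathcal{M}$ has the form

      $$ \mbox{GBIC}_p(\mathcal{M}) \;=\; -\,\ell_n\big(\hat{\theta}_{\mathcal{M}}\big) \;+\; \mathrm{pen}_{\dim}(\mathcal{M}) \;+\; \mathrm{pen}_{\mathrm{misspec}}(\mathcal{M}), $$

    where $\ell_n(\hat{\theta}_{\mathcal{M}})$ is the maximum quasi-log-likelihood of $\mathcal{M}$, the second term penalizes model dimensionality, and the third penalizes model misspecification.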

  • Variable selection in longitudinal data with a change-point

    Date: 2012-03-16

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Follow-up studies are frequently carried out to investigate the evolution through time of measurements taken on a set of subjects. These measurements (responses) are bound to be influenced by subject-specific covariates, and if a regression model is used, the data analyst faces the problem of selecting the covariates that “best explain” the data. For example, in a clinical trial, subjects may be monitored for a response following the administration of a treatment with a view to selecting the covariates that are most predictive of a treatment response. This variable selection setting is standard. More realistically, however, there will often be an unknown delay between the administration of a treatment and any measurable effect. This delay is not directly observable, since it is a property of the distribution of responses rather than of any particular trajectory of responses. Briefly, each subject has an unobservable change-point. With a change-point component added, the variable selection problem necessitates penalized likelihood methods, because the number of putative covariates for the responses, as well as for the change-point distribution, can be large relative to the follow-up time and/or the number of subjects; moreover, variable selection in a change-point setting does not appear to have been studied in the literature. In this talk I will briefly introduce the multi-path change-point problem. I will show how variable selection for the covariates before the change, after the change, and for the change-point distribution reduces to variable selection for a finite mixture of multivariate distributions. I will discuss the performance of my model selection methods using an example on cognitive decline in subjects with Alzheimer’s disease and through simulations.
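
    A small numerical sketch of the reduction mentioned above, with invented numbers: when each subject's change-point is unobserved, the marginal density of that subject's response vector is a finite mixture over the possible change-point positions, which is the object the penalized variable selection then works on.

      # Toy illustration: marginal likelihood of one subject's responses as a finite
      # mixture over unobserved change-point positions (all values are made up).
      import numpy as np
      from scipy import stats

      T = 6                                         # number of follow-up times
      cps = (2, 3, 4)                               # candidate change-point positions
      pi = np.array([0.2, 0.5, 0.3])                # change-point distribution
      mu_pre, mu_post, sigma = 0.0, 1.0, 1.0

      def marginal_loglik(y):
          """Sum over possible change-points, weighted by the change-point distribution."""
          comps = []
          for w, k in zip(pi, cps):
              mean = np.r_[np.full(k, mu_pre), np.full(T - k, mu_post)]
              comps.append(np.log(w) + stats.norm.logpdf(y, mean, sigma).sum())
          return np.logaddexp.reduce(comps)

      rng = np.random.default_rng(4)
      y = rng.normal(np.r_[np.zeros(3), np.ones(3)], sigma)   # a subject changing after time 3
      print(f"marginal log-likelihood: {marginal_loglik(y):.2f}")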

  • Using tests of homoscedasticity to test missing completely at random | Hugh Chipman: Sequential optimization of a computer model and other Active Learning problems

    Date: 2012-03-09

    Time: 14:00-16:30

    Location: UQAM, 201 ave. du Président-Kennedy, salle 5115

    Abstract:


  • Estimating a variance-covariance surface for functional and longitudinal data

    Date: 2012-03-02

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In functional data analysis, as in its multivariate counterpart, estimates of the bivariate covariance kernel σ(s,t) and its inverse are useful for many purposes, and the inverse of a covariance matrix or kernel is needed especially often. However, the dimensionality of functional observations often exceeds the sample size available to estimate σ(s,t), and then the analogue S of the multivariate sample covariance estimate is singular and non-invertible. Even when this is not the case, the high dimensionality of S often implies unacceptable sample variability and loss of degrees of freedom for model fitting. The common practice of employing low-dimensional principal component approximations to σ(s,t) to achieve invertibility also raises serious issues.
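
    A small numerical illustration of the singularity point made above, under an assumed discretized setting with more grid points than curves; the simulated data are arbitrary.

      # When curves are observed on a grid with more points than there are curves,
      # the discretized sample covariance S has rank at most n - 1, so it cannot be inverted.
      import numpy as np

      rng = np.random.default_rng(5)
      n, n_grid = 20, 50                                # fewer curves than grid points
      X = rng.normal(size=(n, n_grid)).cumsum(axis=1)   # crude random curves
      Xc = X - X.mean(axis=0)
      S = Xc.T @ Xc / (n - 1)                           # discretized estimate of sigma(s, t)
      print("rank of S:", np.linalg.matrix_rank(S), "out of", n_grid)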

  • McGillivray: A penalized quasi-likelihood approach for estimating the number of states in a hidden Markov model | Best: Risk-set sampling and left truncation in survival analysis

    Date: 2012-02-17

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    McGillivray: In statistical applications of hidden Markov models (HMMs), one may have no knowledge of the number of hidden states (the order) of the model needed to represent the underlying process accurately. The problem of estimating the number of hidden states of the HMM is thus brought to the forefront. In this talk, we present a penalized quasi-likelihood approach for order estimation in HMMs which makes use of the fact that the marginal distribution of the observations from an HMM is a finite mixture model. The method starts with an HMM with a large number of states and obtains a model of lower order by clustering and combining similar states through two penalty functions. We assess the performance of the new method via extensive simulation studies for Normal and Poisson HMMs.
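
    A toy sketch of the quasi-likelihood ingredient described above, with invented parameters: the observations of a Gaussian HMM are treated as if they were i.i.d. draws from the marginal finite mixture implied by the stationary distribution of the chain. The penalty step that merges similar states is not implemented here.

      # Sketch of the quasi-likelihood idea only (no penalization step): evaluate the
      # i.i.d. finite-mixture log-likelihood of data simulated from a 2-state Gaussian HMM.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(6)
      P = np.array([[0.9, 0.1], [0.2, 0.8]])            # transition matrix
      means = np.array([-1.0, 2.0])                     # state-dependent means
      states = [0]
      for _ in range(499):
          states.append(rng.choice(2, p=P[states[-1]]))
      y = rng.normal(means[np.array(states)], 1.0)

      # The stationary distribution of the chain gives the marginal mixture weights.
      evals, evecs = np.linalg.eig(P.T)
      w = np.real(evecs[:, np.argmax(np.real(evals))])
      w = w / w.sum()

      quasi_ll = np.sum(np.log(w[0] * stats.norm.pdf(y, means[0], 1.0)
                               + w[1] * stats.norm.pdf(y, means[1], 1.0)))
      print(f"quasi-log-likelihood at the true 2-state parameters: {quasi_ll:.1f}")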

  • Stute: Principal component analysis of the Poisson Process | Blath: Long-term properties of the symbiotic branching model

    Date: 2012-02-10

    Time: 14:00-16:30

    Location: Concordia

    Abstract:

    Stute: The Poisson process constitutes a well-known model for describing random events over time. It has many applications in marketing research, insurance mathematics and finance. Though it has been studied for decades, not much is known about how to check (in a non-asymptotic way) the validity of the Poisson process. In this talk we present the principal component decomposition of the Poisson process, which enables us to derive finite-sample properties of the associated goodness-of-fit tests. In the first step we show that the Fourier transforms of the components contain Bessel and Struve functions. Inversion leads to densities which are modified arcsine distributions.
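
    For orientation only, the generic form of such a principal-component (Karhunen-Loève) decomposition of a second-order process $Z$ is

      $$ Z(t) \;=\; m(t) \;+\; \sum_{k \ge 1} \sqrt{\lambda_k}\,\xi_k\, e_k(t), $$

    where $(\lambda_k, e_k)$ are the eigenvalue-eigenfunction pairs of the covariance kernel of $Z$, $m$ is the mean function, and the $\xi_k$ are uncorrelated coefficients with mean zero and unit variance. The talk's contribution lies in working this out explicitly for the Poisson process, which is where the Bessel and Struve functions and the modified arcsine densities arise.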