2012 Winter - McGill Statistics Seminars
  • Li: High-dimensional feature selection using hierarchical Bayesian logistic regression with heavy-tailed priors | Rao: Best predictive estimation for linear mixed models with applications to small area estimation

    Date: 2012-04-13

    Time: 14:00-16:30

    Location: MAASS 217

    Abstract:

    Li: The problem of selecting the most useful features from a great many (e.g., thousands of) candidates arises in many areas of modern science. An interesting problem from genomic research is that, from thousands of genes that are active (expressed) in certain tissue cells, we want to find the genes that can be used to separate tissues of different classes (e.g., cancer and normal). In this paper, we report a Bayesian logistic regression method based on heavy-tailed priors with moderately small degrees of freedom (such as 1) and small scale (such as 0.01), using Gibbs sampling for the computation. We show that it can distinctly separate a couple of useful features from a large number of useless ones, and discriminate among many redundant correlated features. We also show that this method is very stable with respect to the choice of scale. We apply our method to a microarray data set related to prostate cancer, and identify only 3 genes out of 6033 candidates that can separate cancer and normal tissues very well in leave-one-out cross-validation.
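    As a rough illustration of the kind of model the abstract describes, the sketch below evaluates the unnormalized log posterior of a logistic regression with independent Student-t priors on the coefficients. It assumes the heavy-tailed prior is a Student-t with the quoted example values (1 degree of freedom, scale 0.01); the function names are hypothetical, and the Gibbs sampler used in the talk is not reproduced here.

```python
import math
import numpy as np

def log_student_t_prior(beta, df=1.0, scale=0.01):
    """Sum of independent Student-t log densities over the coefficients.

    df=1 and scale=0.01 mirror the example values quoted in the abstract;
    treating the prior as Student-t is an assumption of this sketch.
    """
    beta = np.asarray(beta, dtype=float)
    z = beta / scale
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return float(np.sum(log_norm - (df + 1) / 2 * np.log1p(z ** 2 / df)))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood with a logit link; y holds 0/1 labels."""
    eta = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    # log(1 + exp(eta)) is computed stably with logaddexp.
    return float(np.sum(np.asarray(y) * eta - np.logaddexp(0.0, eta)))

def log_posterior(beta, X, y, df=1.0, scale=0.01):
    """Unnormalized log posterior: log-likelihood plus log prior."""
    return log_likelihood(beta, X, y) + log_student_t_prior(beta, df, scale)
```

    With a small scale, the prior concentrates most coefficients near zero, while the heavy tails leave room for a few large ones, which is the behaviour the abstract relies on for feature selection.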

  • Hypothesis testing in finite mixture models: from the likelihood ratio test to EM-test

    Date: 2012-04-05

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In the presence of heterogeneity, a mixture model is the most natural way to characterize the random behavior of samples taken from such populations. This strategy has been widely employed in applications ranging from genetics and information technology to marketing and finance. Studying the mixing structure behind a random sample from the population allows us to infer the degree of heterogeneity, with important implications in applications such as the presence of disease subgroups in genetics. The statistical problem is to test hypotheses on the order of finite mixture models. There has been continued interest in the limiting behavior of likelihood ratio tests. The non-regularity of finite mixture models has provided statisticians with ample examples of unusual limiting distributions. Yet many of these results are not convenient for conducting hypothesis tests. Motivated by the need to overcome such difficulties, we have developed a number of strategies to obtain tests with high efficiency yet easy-to-use limiting distributions. The latest development is a class of EM-tests, which are advantageous in many respects. Their limiting distributions are easier to derive mathematically, simple to implement in data analysis, and valid for a more general class of mixture models without restrictions on the space of the mixing distribution. Simulations indicate that the limiting distributions approximate the finite-sample distributions with good precision in the examples investigated.
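    The order-testing problem described above can be written compactly. The formulation below is a generic one; the symbols $f$, $\theta_j$, $\pi_j$, $m_0$, $\ell_n$ and $R_n$ are notation assumed here for illustration and are not taken from the talk.

```latex
\[
  f(x; G) = \sum_{j=1}^{m} \pi_j \, f(x; \theta_j),
  \qquad \pi_j > 0, \quad \sum_{j=1}^{m} \pi_j = 1,
\]
\[
  H_0 : m = m_0 \quad \text{versus} \quad H_1 : m > m_0,
\]
\[
  R_n = 2 \Bigl\{ \sup_{G \,:\, \text{order} \le m} \ell_n(G)
        \; - \sup_{G \,:\, \text{order} \le m_0} \ell_n(G) \Bigr\},
  \qquad \ell_n(G) = \sum_{i=1}^{n} \log f(x_i; G).
\]
```

    Here $R_n$ is the likelihood ratio statistic whose limiting behavior is non-standard in mixture models; the EM-test statistics mentioned in the abstract are not reproduced, since the abstract does not define them.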

  • A matching-based approach to assessing the surrogate value of a biomarker

    Date: 2012-03-30

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Statisticians have developed a number of frameworks which can be used to assess the surrogate value of a biomarker, i.e., to establish whether treatment effects on a biological quantity measured shortly after administration of treatment predict treatment effects on the clinical endpoint of interest. The most commonly applied of these frameworks is due to Prentice (1989), who proposed a set of criteria which a surrogate marker should satisfy. However, verifying these criteria using observed data can be challenging due to the presence of unmeasured simultaneous predictors (i.e., confounders) which influence both the potential surrogate and the outcome. In this work, we adapt a technique proposed by Rosenbaum (2002) for observational studies, in which observations are matched and the odds of treatment within each matched pair are bounded. This yields a straightforward and interpretable sensitivity analysis which can be performed particularly efficiently for certain types of test statistics. In this talk, I will introduce the surrogate endpoint problem, discuss the details of my proposed technique for assessing surrogate value, and illustrate with some simulated examples inspired by the problem of identifying immune surrogates in HIV vaccine trials.
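    To make the idea of bounding the odds of treatment within matched pairs concrete, here is a minimal sketch of the standard sensitivity bounds for a one-sided sign test on matched pairs. It illustrates the generic observational-study technique only, not the speaker's adaptation to surrogate markers; the function names and the parameter `gamma` (the assumed bound on the within-pair odds of treatment) are hypothetical.

```python
from math import comb

def binom_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def sign_test_pvalue_bounds(n_pairs, n_positive, gamma):
    """Bounds on the one-sided sign-test p-value under hidden bias.

    n_positive counts pairs in which the treated unit has the larger response.
    gamma >= 1 bounds the within-pair odds of treatment; gamma = 1 means
    no hidden bias, and the two bounds then coincide.
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    p_low = 1.0 / (1.0 + gamma)
    p_high = gamma / (1.0 + gamma)
    return (binom_upper_tail(n_pairs, n_positive, p_low),
            binom_upper_tail(n_pairs, n_positive, p_high))

# Example: 10 pairs, 8 favouring treatment, for increasing hidden bias.
for g in (1.0, 1.5, 2.0):
    lower, upper = sign_test_pvalue_bounds(10, 8, g)
    print(f"gamma={g}: p-value in [{lower:.4f}, {upper:.4f}]")
```

    The analysis is interpretable because one simply reports the largest `gamma` at which the upper bound on the p-value stays below the chosen significance level.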

  • Model selection principles in misspecified models

    Date: 2012-03-23

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Model selection is of fundamental importance to high-dimensional modeling featured in many contemporary applications. Classical principles of model selection include the Bayesian principle and the Kullback-Leibler divergence principle, which lead to the Bayesian information criterion and Akaike information criterion, respectively, when models are correctly specified. Yet model misspecification is unavoidable in practice. We derive novel asymptotic expansions of the two well-known principles in misspecified generalized linear models, which give the generalized BIC (GBIC) and generalized AIC. A specific form of prior probabilities motivated by the Kullback-Leibler divergence principle leads to the generalized BIC with prior probability ($\mbox{GBIC}_p$), which can be naturally decomposed as the sum of the negative maximum quasi-log-likelihood, a penalty on model dimensionality, and a penalty that acts directly on model misspecification. Numerical studies demonstrate the advantage of the new methods for model selection in both correctly specified and misspecified models.
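    For reference, the two classical criteria named above take the following standard forms under correct specification. The notation is assumed here for illustration: $\ell_n(\hat\theta_M)$ is the maximized log-likelihood of candidate model $M$, $|M|$ its number of parameters, and $n$ the sample size. The generalized criteria from the talk are not reproduced, since the abstract does not state them.

```latex
\[
  \mathrm{AIC}(M) = -2\,\ell_n(\hat\theta_M) + 2\,|M|,
  \qquad
  \mathrm{BIC}(M) = -2\,\ell_n(\hat\theta_M) + |M| \log n .
\]
```

    Both criteria share the goodness-of-fit term and differ only in the dimensionality penalty; the abstract's $\mbox{GBIC}_p$ adds a further term that penalizes misspecification.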

  • Variable selection in longitudinal data with a change-point

    Date: 2012-03-16

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Follow-up studies are frequently carried out to investigate the evolution over time of measurements taken on a set of subjects. These measurements (responses) are bound to be influenced by subject-specific covariates and, if a regression model is used, the data analyst is faced with the problem of selecting those covariates that “best explain” the data. For example, in a clinical trial, subjects may be monitored for a response following the administration of a treatment with a view to selecting the covariates that are most predictive of a treatment response. This variable selection setting is standard. However, more realistically, there will often be an unknown delay from the administration of a treatment before it has a measurable effect. This delay will not be directly observable since it is a property of the distribution of responses rather than of any particular trajectory of responses. Briefly, each subject will have an unobservable change-point. With a change-point component added, the variable selection problem necessitates the use of penalized likelihood methods. This is because the number of putative covariates for the responses, as well as for the change-point distribution, could be large relative to the follow-up time and/or the number of subjects; variable selection in a change-point setting does not appear to have been studied in the literature. In this talk I will briefly introduce the multi-path change-point problem. I will show how variable selection for the covariates before the change, after the change, as well as for the change-point distribution, reduces to variable selection for a finite mixture of multivariate distributions. I will discuss the performance of my model selection methods using an example on cognitive decline in subjects with Alzheimer’s disease and through simulations.

  • Using tests of homoscedasticity to test missing completely at random | Hugh Chipman: Sequential optimization of a computer model and other Active Learning problems

    Date: 2012-03-09

    Time: 14:00-16:30

    Location: UQAM, 201 ave. du Président-Kennedy, salle 5115

  • Estimating a variance-covariance surface for functional and longitudinal data

    Date: 2012-03-02

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In functional data analysis, as in its multivariate counterpart, estimates of the bivariate covariance kernel σ(s,t) and its inverse are useful in many settings, and the inverse of a covariance matrix or kernel is needed especially often. However, the dimensionality of functional observations often exceeds the sample size available to estimate σ(s,t), and then the analogue S of the multivariate sample estimate is singular and non-invertible. Even when this is not the case, the high dimensionality of S often implies unacceptable sample variability and loss of degrees of freedom for model fitting. The common practice of employing low-dimensional principal component approximations to σ(s,t) to achieve invertibility also raises serious issues.
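    The singularity mentioned above is easy to see numerically. The sketch below is an illustration on simulated Gaussian noise, not on functional data from the talk: it discretizes each "curve" on a grid with more points than there are curves and checks the rank of the sample covariance matrix S. The function name and the default sizes are hypothetical.

```python
import numpy as np

def sample_covariance_rank(n_curves=10, n_grid=50, seed=0):
    """Return the shape and numerical rank of the sample covariance S.

    Each row of X is one observation evaluated at n_grid points, so S is
    n_grid x n_grid. After centering, its rank is at most n_curves - 1.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_curves, n_grid))
    S = np.cov(X, rowvar=False)  # columns are treated as variables
    return S.shape, int(np.linalg.matrix_rank(S))

shape, rank = sample_covariance_rank()
print(shape, rank)  # rank is far below the matrix dimension, so S is singular
```

    Because the rank cannot exceed the number of curves minus one, S has no inverse whenever the grid is finer than the sample is large, which is why some form of regularization or structured estimation is needed.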

  • McGillivray: A penalized quasi-likelihood approach for estimating the number of states in a hidden Markov model | Best: Risk-set sampling and left truncation in survival analysis

    Date: 2012-02-17

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    McGillivray: In statistical applications of hidden Markov models (HMMs), one may have no knowledge of the number of hidden states (or order) of the model needed to represent accurately the underlying process of the data. The problem of estimating the number of hidden states of the HMM is thus brought to the forefront. In this talk, we present a penalized quasi-likelihood approach for order estimation in HMMs which makes use of the fact that the marginal distribution of the observations from an HMM is a finite mixture model. The method starts with an HMM with a large number of states and obtains a model of lower order by clustering and combining similar states of the model through two penalty functions. We assess the performance of the new method via extensive simulation studies for Normal and Poisson HMMs.
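    The fact that the marginal distribution of an HMM observation is a finite mixture can be sketched for a Poisson HMM as follows. This assumes a stationary, ergodic hidden chain, so that the mixing weights are the stationary probabilities of the transition matrix; the function names, transition matrix and rates are hypothetical, and the penalized quasi-likelihood procedure itself is not implemented.

```python
import math
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 for an ergodic transition matrix P."""
    P = np.asarray(P, dtype=float)
    k = P.shape[0]
    # Stack the balance equations with the normalization constraint.
    A = np.vstack([P.T - np.eye(k), np.ones((1, k))])
    b = np.concatenate([np.zeros(k), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def poisson_pmf(x, rate):
    """Poisson probability mass at the non-negative integer x."""
    return math.exp(-rate) * rate ** x / math.factorial(x)

def marginal_pmf(x, P, rates):
    """Marginal pmf of one observation: a mixture over the hidden states."""
    weights = stationary_distribution(P)
    return float(sum(w * poisson_pmf(x, r) for w, r in zip(weights, rates)))

# Two-state example with hypothetical parameters.
P = [[0.9, 0.1], [0.2, 0.8]]
rates = [1.0, 5.0]
print(stationary_distribution(P))
print(marginal_pmf(3, P, rates))
```

    States with nearly equal rates contribute nearly identical mixture components, which gives some intuition for why an order-estimation method might merge similar states.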

  • Stute: Principal component analysis of the Poisson Process | Blath: Longterm properties of the symbiotic branching model

    Date: 2012-02-10

    Time: 14:00-16:30

    Location: Concordia

    Abstract:

    Stute: The Poisson Process constitutes a well-known model for describing random events over time. It has many applications in marketing research, insurance mathematics and finance. Though it has been studied for decades, not much is known about how to check (in a non-asymptotic way) the validity of the Poisson Process. In this talk we present the principal component decomposition of the Poisson Process, which enables us to derive finite sample properties of associated goodness-of-fit tests. In the first step we show that the Fourier transforms of the components contain Bessel and Struve functions. Inversion leads to densities which are modified arcsine distributions.

  • Du: Simultaneous fixed and random effects selection in finite mixtures of linear mixed-effects models | Harel: Measuring fatigue in systemic sclerosis: a comparison of the SF-36 vitality subscale and FACIT fatigue scale using item response theory

    Date: 2012-02-03

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Du: Linear mixed-effects (LME) models are frequently used for modeling longitudinal data. One complicating factor in the analysis of such data is that samples are sometimes obtained from a population with significant underlying heterogeneity, which would be hard to capture with a single LME model. Such problems may be addressed by a finite mixture of linear mixed-effects (FMLME) models, which segments the population into subpopulations and models each subpopulation by a distinct LME model. Often in the initial stage of a study, a large number of predictors are introduced. However, their associations with the response variable vary from one component of the FMLME model to another. To enhance predictability and to obtain a parsimonious model, it is of great practical interest to identify the important effects, both fixed and random, in the model. Traditional variable selection techniques such as stepwise deletion and subset selection become computationally expensive as the number of covariates and the number of components in the mixture model increase. In this talk, we introduce a penalized likelihood approach and propose a nested EM algorithm for efficient numerical computations. Our estimators are shown to possess desirable properties such as consistency, sparsity and asymptotic normality. We illustrate the performance of our method through simulations and a systemic sclerosis data example.