Monday, February 28, 2011

Bayesian Memory


In conventional statistical analysis, sample statistics are used to estimate unknown but fixed population parameters. In Bayesian inference, on the other hand, the parameter is treated as a random quantity with its own distribution. This distribution, called the prior, models a priori knowledge about the parameter. Combining the information in the data with the prior yields a new distribution for the parameter, called the posterior, which describes how the parameter varies conditionally on what we have observed.
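
As a concrete illustration of the prior-to-posterior update (not part of the memory study; the prior and counts here are made up), consider estimating a success probability with a Beta prior and binomial data. Because the Beta prior is conjugate to the binomial likelihood, the posterior is available in closed form:

```python
# A minimal illustration of Bayesian updating with made-up numbers:
# a Beta prior on a success probability p, combined with binomial data,
# gives a Beta posterior (conjugacy), so no sampling is needed.
from scipy import stats

a_prior, b_prior = 2, 2          # Beta(2, 2) prior: mild belief that p is near 0.5
successes, trials = 27, 40       # hypothetical observed data

# Posterior is Beta(a + successes, b + failures)
a_post = a_prior + successes
b_post = b_prior + (trials - successes)
posterior = stats.beta(a_post, b_post)

print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```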

Compared to conventional statistical methods, Bayesian methods are very flexible for building complex hierarchical models and often yield more intuitive interpretations in data analysis.

One of my research projects applies Bayesian methods to the study of human memory. Modern psychological theories suggest that human memory has both a conscious component (Recollection: one knows one remembers) and an unconscious component (Automatic Activation: one does not know one remembers). Together with my collaborators, I proposed a Bayesian model that describes and accurately evaluates these different components. The model was used to analyze data from a psychology experiment involving word-completion tasks under three study durations: no study time, 1 second per word, and 10 seconds per word. The posterior distributions of the parameters measuring the main effects (on the probit scale) of Recollection and Automatic Activation under the different study durations are shown below. From the plots, we see that conscious memory (Recollection) improved as the subject had more time to study the words, while unconscious memory (Automatic Activation) was in place once the subject had studied the words at all but was not affected by the length of exposure.
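
The model itself is hierarchical and beyond the scope of this post, but the idea of a posterior for a probit-scale effect can be sketched with a much simpler stand-alone example. Below, a single completion probability is given a normal prior on the probit scale and combined with hypothetical binomial counts by a brute-force grid approximation; the counts and prior are invented for illustration only and are not from the experiment.

```python
# A simplified, stand-alone sketch (not the hierarchical model from the study):
# posterior of a single probit-scale parameter theta, where the completion
# probability is Phi(theta), using a grid approximation with made-up counts.
import numpy as np
from scipy.stats import norm, binom

completions, words = 31, 60          # hypothetical word-completion counts
theta = np.linspace(-3, 3, 2001)     # grid for the probit-scale parameter
dtheta = theta[1] - theta[0]

prior = norm.pdf(theta)                              # N(0, 1) prior on theta
likelihood = binom.pmf(completions, words, norm.cdf(theta))
posterior = prior * likelihood
posterior /= posterior.sum() * dtheta                # normalize to a density

post_mean = (theta * posterior).sum() * dtheta
cdf = np.cumsum(posterior) * dtheta
lo, hi = theta[np.searchsorted(cdf, 0.025)], theta[np.searchsorted(cdf, 0.975)]
print("posterior mean of theta:", round(post_mean, 3))
print("approx. 95% credible interval:", (round(lo, 3), round(hi, 3)))
```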

Friday, July 2, 2010

What is spatial statistics?



Spatial data analysis involves data that represent points (e.g. soil core samples) or regions (e.g. counties), and the locations can be regularly or irregularly spaced. The variable of interest may be discrete or continuous. The objective of a spatial data analysis may be to capture the trends apparent in the data set, e.g. showing that HIV mortality rates decline as one moves from south to north, as shown in the figure above.

Most of the existing literature in spatial data analysis presents studies with data that are regularly spaced, point-referenced, and continuous. In my work, I focus on discrete data collected at irregularly spaced locations representing regions. In region-based, or lattice, models, regions are typically linked or correlated with their neighbors. When the regions are irregular, the neighbors are not immediately apparent, and it may be necessary to define what constitutes a neighbor, for example any region less than a fixed distance away. In my models, the locations are fixed or determined by the investigator, and the interest may be in the number of cancer cases occurring in a region. In other models, the locations themselves are the primary interest, e.g. the locations of cancer cases, in order to see whether the cases are spatially clustered, as shown in the figure above.
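
As an illustration of the distance-based neighbor definition, and of checking whether a regional count is spatially clustered, the sketch below builds a binary neighbor matrix from hypothetical region centroids and computes Moran's I. The centroids, counts, and distance cutoff are all invented, and Moran's I is a standard clustering statistic rather than a method specific to my models.

```python
# A small sketch: define neighbors as regions whose centroids lie within a
# fixed distance of each other, then compute Moran's I for a regional count.
# Centroids, counts, and the cutoff are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
centroids = rng.uniform(0, 10, size=(25, 2))   # hypothetical region centroids
counts = rng.poisson(5, size=25)               # hypothetical cancer counts

# Binary neighbor (adjacency) matrix: 1 if within the cutoff, 0 otherwise.
cutoff = 3.0
dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
W = ((dists > 0) & (dists < cutoff)).astype(float)

# Moran's I: (n / S0) * (z' W z) / (z' z), where z are deviations from the mean.
z = counts - counts.mean()
n, S0 = len(counts), W.sum()
moran_I = (n / S0) * (z @ W @ z) / (z @ z)
print("Moran's I:", round(moran_I, 3))    # near -1/(n-1) suggests no clustering
```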


Spatial models on a lattice are analogous to time-series models in the sense that no realization occurs between locations or regions, just as in a time series indexed by month no realization occurs between June and July. The main idea of my research is to develop and evaluate methods that model the distribution of discrete outcomes when each location depends on its neighbors. Unlike time series, spatial data have no unidirectional flow of time, so spatial models are instead built on nearest neighbors.
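
One standard way to let a discrete outcome at a location depend on its neighbors is an auto-model such as the autologistic. The sketch below simulates binary outcomes on a small grid by Gibbs sampling, where each site's conditional probability depends on the sum of its four nearest neighbors. This is a generic illustration of neighbor-dependent lattice models, not the specific methods under development; the grid size and coefficients are arbitrary.

```python
# A generic sketch of a neighbor-dependent lattice model (autologistic):
# each site's conditional probability of being 1 depends on its four
# nearest neighbors. Grid size and coefficients are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
nrow, ncol = 20, 20
alpha, beta = -1.0, 0.6                      # intercept and neighbor-dependence strength
y = rng.integers(0, 2, size=(nrow, ncol))    # random starting configuration

def neighbor_sum(y, i, j):
    """Sum of the four nearest neighbors, ignoring sites off the edge."""
    total = 0
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        if 0 <= i + di < nrow and 0 <= j + dj < ncol:
            total += y[i + di, j + dj]
    return total

# Gibbs sampling: repeatedly update each site from its conditional distribution.
for sweep in range(200):
    for i in range(nrow):
        for j in range(ncol):
            eta = alpha + beta * neighbor_sum(y, i, j)
            p = 1.0 / (1.0 + np.exp(-eta))
            y[i, j] = rng.random() < p

print("proportion of 1s after sampling:", y.mean())
```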

Friday, March 19, 2010

Statistics of Genetic Hybrid Zones (posted by Robert W. Jernigan)


Part of my research is with biologists at the Smithsonian's Laboratory of Molecular Systematics and a team of biology students and postdocs studying the evolutionary biology of hybrid zones of Central American birds. Hybrid zones are regions where two genetically divergent forms meet and hybridize in a narrow zone of contact. Hybridizing forms that appear morphologically quite different may actually be exchanging genes that move freely across the zone. Our work is in documenting and quantifying the extent of this hybridization. I provide statistical expertise and methods for these studies.

Genetic data are collected on the different species: individuals on one side of the zone possess a particular allele, while those on the other side typically lack it. I employ a non-parametric regression procedure using smoothing splines to quantify and graph the changing allele frequency across the zone. This graph is referred to as a cline.
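
A minimal sketch of this kind of cline fit is shown below, using made-up allele frequencies and scipy's smoothing spline rather than the exact procedure from our work: observed frequencies at sampling sites are smoothed as a function of distance across the zone.

```python
# A minimal sketch of a smoothing-spline cline fit with made-up data:
# allele frequency at sampling sites as a smooth function of distance
# across the hybrid zone.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
distance = np.linspace(0, 100, 30)                        # km across the zone (hypothetical)
true_freq = 1.0 / (1.0 + np.exp(-(distance - 50) / 8))    # underlying cline shape
freq = np.clip(true_freq + rng.normal(0, 0.05, distance.size), 0, 1)

# s controls the amount of smoothing; larger s gives a smoother cline.
cline = UnivariateSpline(distance, freq, s=0.1)

grid = np.linspace(0, 100, 201)
fitted = cline(grid)
center = grid[np.argmin(np.abs(fitted - 0.5))]            # rough estimate of the cline center
print("estimated cline center (km):", round(center, 1))
```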

Of interest to the biologists are the width and location of this zone and how they differ statistically among different genetic markers. The findings lead to a better understanding of evolutionary processes and gene flow.

Deciding how much to smooth a fitted curve to data has been the subject of much research. A common approach is to use generalized cross-validation. This procedure, although more general, is conceptually similar to omitting each data point in turn and choosing the smoothing parameter that best predicts the omitted points. Smooth functions chosen in this way are not necessarily monotonic.
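
The sketch below illustrates the leave-one-out idea described above (ordinary cross-validation rather than the generalized version): for each candidate smoothing value, each point is omitted in turn, the spline is refit, and the value with the smallest prediction error is chosen. The data and the candidate grid are made up.

```python
# Ordinary leave-one-out cross-validation for the smoothing parameter,
# illustrating the idea behind GCV; data and candidate values are made up.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(3)
x = np.linspace(0, 100, 30)
y = 1.0 / (1.0 + np.exp(-(x - 50) / 8)) + rng.normal(0, 0.05, x.size)

def loocv_error(s):
    """Mean squared error of predicting each omitted point in turn."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        fit = UnivariateSpline(x[keep], y[keep], s=s)
        errors.append((fit(x[i]) - y[i]) ** 2)
    return np.mean(errors)

candidates = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
scores = [loocv_error(s) for s in candidates]
best = candidates[int(np.argmin(scores))]
print("smoothing value with smallest leave-one-out error:", best)
```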

Because of the strong monotonic pattern of many of our markers, and the belief that a smooth monotonic function best describes the cline, we choose the smoothing parameter as the minimal amount of smoothing that guarantees monotonicity of the fit. For each smoothing parameter the resulting cline has a unique measure of roughness, indexed either by the smoothing parameter itself or by the equivalent degrees of freedom of the fitted function. The equivalent degrees of freedom indicate the number of parameters needed to specify a function of the desired smoothness: as the smoothing parameter gets large, the degrees of freedom approach 2, indicating that two parameters (a slope and intercept in a logistic sense) suffice to fit the data; as the smoothing parameter approaches zero, the equivalent degrees of freedom grow until a non-monotonic interpolating cubic spline is reached.

Algorithmically, as the equivalent degrees of freedom are reduced from a large value down to 2, the resulting cubic spline fits go from a non-monotonic interpolating function to a monotonic logistic fit (specified by a slope and intercept). The largest degrees of freedom that yields a monotonic fitting function is the value we investigate; in the graph above, this value is somewhere between 3 and 4. The result is a smooth monotonic compromise between a two-parameter (logistic) fit and a rougher, many-parameter, non-monotonic interpolating fit.
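
The selection step can be sketched in the same spirit, though here in terms of scipy's smoothing parameter s rather than equivalent degrees of freedom, and with made-up data: scan candidate values from roughest to smoothest and keep the first fit whose fitted values are monotone across the zone.

```python
# A sketch of choosing the least smoothing that still yields a monotone cline:
# scan candidate smoothing values from rough to smooth and keep the roughest
# fit whose fitted values never decrease across the zone. Data and the
# candidate grid are made up; scipy's s plays the role of the smoothing
# parameter rather than equivalent degrees of freedom.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(4)
x = np.linspace(0, 100, 30)
y = 1.0 / (1.0 + np.exp(-(x - 50) / 8)) + rng.normal(0, 0.05, x.size)
grid = np.linspace(0, 100, 501)

def is_monotone(s):
    """Check whether the fitted cline is non-decreasing across the zone."""
    fitted = UnivariateSpline(x, y, s=s)(grid)
    return np.all(np.diff(fitted) >= 0)

# Candidate smoothing values, ordered from roughest (smallest s) to smoothest.
candidates = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
chosen = next((s for s in candidates if is_monotone(s)), candidates[-1])
print("least smoothing giving a monotone fit: s =", chosen)
```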

View our latest efforts in Molecular Ecology.

Thursday, March 18, 2010

Welcome


This is a blog to publish information about research in Statistics at the American University in Washington, DC.