Friday, March 19, 2010

Statistics of Genetic Hybrid Zones (posted by Robert W. Jernigan)


Part of my research is with biologists at the Smithsonian's Laboratory of Molecular Systematics and a team of biology students and postdocs studying the evolutionary biology of hybrid zones of Central American birds. Hybrid zones are regions where two genetically divergent forms meet and hybridize in a narrow zone of contact. Hybridizing forms that appear morphologically quite different may actually be exchanging genes that move freely across the zone. Our work is in documenting and quantifying the extent of this hybridization. I provide statistical expertise and methods for these studies.

Genetic data are collected on the different species. Those on one side of the zone possess a particular genetic allele and those on the other side typically lack that allele. I employ a non-parametric regression procedure using smoothing splines to quantify and graph this changing frequency across the zone. This graph is referred to as a cline.

Of interest to the biologists are the width and location of this zone and how they differ statistically among genetic markers. The findings lead to a better understanding of evolutionary actions and gene flow.

Deciding how much to smooth a fitted curve to data has been the subject of much research. A common approach is to use generalized cross-validation. This procedure, although more general, is conceptually similar to omitting each data point in turn and choosing the smoothing parameter that best predicts the omitted points. Smooth functions chosen in this way are not necessarily monotonic.
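To make the leave-one-out idea concrete, here is a minimal sketch in Python. It is not the procedure used in our analyses: it uses plain leave-one-out cross-validation rather than generalized cross-validation, and it swaps the cubic smoothing spline for a simpler discrete analogue (a second-difference penalized smoother on equally spaced sites). The allele frequencies, the grid of smoothing parameters, and all function names are made up for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def penalty_matrix(n):
    """Return D'D, where D takes second differences of n equally spaced values."""
    P = [[0.0] * n for _ in range(n)]
    for k in range(n - 2):
        d = {k: 1.0, k + 1: -2.0, k + 2: 1.0}
        for i, di in d.items():
            for j, dj in d.items():
                P[i][j] += di * dj
    return P


def fit_weighted(y, w, lam):
    """Minimize sum w_i (y_i - f_i)^2 + lam * sum (second differences of f)^2."""
    n = len(y)
    P = penalty_matrix(n)
    A = [[(w[i] if i == j else 0.0) + lam * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve(A, [w[i] * y[i] for i in range(n)])


def loo_score(y, lam):
    """Mean squared error when each point is predicted from all the others."""
    n = len(y)
    total = 0.0
    for i in range(n):
        w = [1.0] * n
        w[i] = 0.0  # weight zero omits point i from the fit
        total += (y[i] - fit_weighted(y, w, lam)[i]) ** 2
    return total / n


# Hypothetical allele frequencies at 11 equally spaced sites across a zone.
freqs = [0.02, 0.05, 0.03, 0.10, 0.30, 0.55, 0.50, 0.85, 0.95, 0.93, 1.00]
lams = [10 ** (k / 2) for k in range(-4, 7)]  # assumed grid, 0.01 to 1000
best_lam = min(lams, key=lambda lam: loo_score(freqs, lam))
print("smoothing parameter chosen by leave-one-out:", best_lam)
```

Nothing in this criterion looks at the shape of the fitted curve, which is why the chosen fit can still wiggle up and down.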

Because many of our markers show a strong monotonic pattern, and because we believe a smooth monotonic function best describes the cline, we choose the smoothing parameter as the minimal amount of smoothing that guarantees a monotonic fit. Each smoothing parameter yields a cline with its own measure of roughness, which can be indexed either by the smoothing parameter itself or by the equivalent degrees of freedom of the fitted function. The equivalent degrees of freedom indicate the number of parameters needed to specify a function of the desired smoothness. For example, as the smoothing parameter gets large, the degrees of freedom approach 2, indicating that two parameters (slope and intercept in a logistic sense) are needed to fit the data. As the smoothing parameter approaches zero, the equivalent degrees of freedom grow until a non-monotonic interpolating cubic spline is reached.

Algorithmically, as the equivalent degrees of freedom are reduced from a large value down to 2, the cubic spline fits go from a non-monotonic interpolating function to a monotonic logistic fit (specified by a slope and an intercept). The largest degrees of freedom that results in a monotonic fitted function is the value we investigate. In the graph above, this value is somewhere between 3 and 4. The result is a smooth monotonic compromise between a two-parameter (logistic) fit and a rougher, many-parameter, non-monotonic interpolating fit.
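The search just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our actual code: a discrete second-difference penalized smoother stands in for the cubic smoothing spline, the fit is made directly to the frequencies rather than on a logistic scale (so the smoothest limit here is a straight line, still with 2 degrees of freedom), and the data, the grid of smoothing parameters, and the function names are all hypothetical. The equivalent degrees of freedom are computed as the trace of the matrix that maps the data to the fitted values.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def system(n, lam):
    """Return I + lam * D'D, with D the second-difference operator."""
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        d = {k: 1.0, k + 1: -2.0, k + 2: 1.0}
        for i, di in d.items():
            for j, dj in d.items():
                A[i][j] += lam * di * dj
    return A


def smooth(y, lam):
    """Fitted values minimizing sum (y - f)^2 + lam * sum (second diffs of f)^2."""
    return solve(system(len(y), lam), y)


def equivalent_df(n, lam):
    """Trace of the smoother matrix: n when lam = 0, approaching 2 as lam grows."""
    A = system(n, lam)
    return sum(solve(A, [1.0 if j == i else 0.0 for j in range(n)])[i]
               for i in range(n))


def is_monotone(f, tol=1e-9):
    """True if the fitted values never decrease, or never increase."""
    up = all(b >= a - tol for a, b in zip(f, f[1:]))
    down = all(b <= a + tol for a, b in zip(f, f[1:]))
    return up or down


def least_smoothing_monotone(y, lams):
    """Scan from least to most smoothing; return the first monotone fit found."""
    for lam in sorted(lams):
        fit = smooth(y, lam)
        if is_monotone(fit):
            return lam, equivalent_df(len(y), lam), fit
    return None


# Hypothetical allele frequencies at 11 equally spaced sites across a zone.
freqs = [0.02, 0.05, 0.03, 0.10, 0.30, 0.55, 0.50, 0.85, 0.95, 0.93, 1.00]
lams = [10 ** (k / 4) for k in range(-12, 17)]  # assumed grid, 0.001 to 10000
result = least_smoothing_monotone(freqs, lams)
if result is not None:
    lam, df, fit = result
    print("smoothing parameter:", lam, "equivalent degrees of freedom:", df)
```

The answer is only as fine as the grid of smoothing parameters, so in practice one would refine the grid around the first monotone fit. Scanning from the rough end and stopping at the first monotone fit corresponds to the largest equivalent degrees of freedom on the grid that give a monotone curve.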

View our latest efforts in Molecular Ecology.
