A Summary of "Dependence Estimation and Visualization in Multivariate Extremes with Applications to Financial Data" by Hsing et al. (2004)

Hsing et al. set out to develop a new function for measuring extremal dependence nonparametrically. They begin by reviewing some of the goals of extreme value analysis; one of these is to understand the dependence of a random process at extreme levels. If we let \mathbf{X}_i = (X_{i,1}, \ldots, X_{i,m}), i = 1, 2, \ldots, be an iid sequence of random vectors, then the interest is in making inferences about the dependence structure of the componentwise maxima M_{n,j} = \vee_{i=1}^{n} X_{i,j}, 1 \leq j \leq m.
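
As a minimal illustration of this quantity (a sketch only; the simulated data and array shapes are assumptions, not taken from the paper), the componentwise maxima of an iid sample can be computed directly:

  # Sketch: componentwise maxima M_{n,j} of an iid sample of m-dimensional vectors.
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.standard_normal(size=(1000, 3))  # n = 1000 observations, m = 3 components
  M_n = X.max(axis=0)                      # M_{n,j} = max over i of X_{i,j}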

The authors motivate the topic with common examples where understanding the dependence structure is important. For example, when designing an investment portfolio it is necessary to understand the dependence of losses among the individual assets that make up the portfolio. One can also imagine that it would be critical to understand how extreme rainfall is temporally dependent in order to design infrastructure that can mitigate possible floods.

They note that some existing methods for measuring extremal dependence assume a parametric form for the joint distribution F and can be misleading if the model is incorrect. Before describing their method, Hsing et al. review how the distribution of a random vector can be equivalently formulated using copulas, so that the dependence structure of a random vector is also captured by its copula. A copula is a (multivariate) cumulative distribution function with uniform margins, defined by

  C_F(u_1, \ldots, u_m) = P(F_1(X_1) \leq u_1, \ldots, F_m(X_m) \leq u_m)

for (u_1, \ldots, u_m) \in [0,1]^m, where F_j denotes the jth marginal distribution function.
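
In practice the marginals are unknown; a common nonparametric device (a generic sketch, not the authors' procedure) is to replace F_j(X_{i,j}) by rescaled ranks, producing pseudo-observations on [0,1]^m whose joint behavior reflects the copula:

  # Sketch: pseudo-observations U_{i,j} approximating F_j(X_{i,j}) via ranks.
  import numpy as np

  def pseudo_observations(X):
      """Map each column of X to (0,1) using its empirical distribution (rank / (n+1))."""
      n = X.shape[0]
      ranks = X.argsort(axis=0).argsort(axis=0) + 1  # within-column ranks 1..n
      return ranks / (n + 1.0)

  rng = np.random.default_rng(1)
  X = rng.standard_normal(size=(500, 2))
  U = pseudo_observations(X)  # margins ~ uniform; joint law approximates the copula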

An extreme value copula, which is the focus of their method, is one that satisfies
  C^t(u_1, \ldots, u_m) = C(u_1^t, \ldots, u_m^t)
for all t > 0.
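
For a concrete bivariate example, the Gumbel copula C(u, v) = \exp\{-[(-\log u)^a + (-\log v)^a]^{1/a}\} is a well-known extreme value copula; the short check below (with an arbitrary parameter a = 2, chosen only for illustration) verifies the max-stability relation numerically:

  # Sketch: numerical check that the Gumbel copula satisfies C(u, v)**t == C(u**t, v**t).
  import numpy as np

  def gumbel_copula(u, v, a=2.0):
      return np.exp(-(((-np.log(u)) ** a + (-np.log(v)) ** a) ** (1.0 / a)))

  u, v, t = 0.7, 0.4, 3.5
  print(np.isclose(gumbel_copula(u, v) ** t, gumbel_copula(u ** t, v ** t)))  # True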

Their approach uses the Pickands representation of a copula, which I will not include here, but it essentially uses a change of measure to a spectral measure \Phi on [0, \pi/2]^{m-1}. For data with m dimensions, let \theta_2, \ldots, \theta_m be the angles in [0, \pi/2] after this change of measure. They define the tail dependence function as

  \rho(\theta_2, \ldots, \theta_m) = \frac{(1 + \cot\theta_2 + \cdots + \cot\theta_m) - \psi(\theta_2, \ldots, \theta_m)}{(1 + \cot\theta_2 + \cdots + \cot\theta_m) - (1 \vee \cot\theta_2 \vee \cdots \vee \cot\theta_m)}

where \psi is a function defined through the spectral measure on [0, \pi/2]. Values of \rho near 0 represent weak extremal dependence and values near 1 represent strong dependence. Since the spectral measure is not known, \psi is not known either, so it must be estimated nonparametrically.
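
To make the definition concrete, the sketch below evaluates \rho for a user-supplied estimate of \psi; the placeholder psi_hat passed in the example is arbitrary and stands in for the paper's nonparametric estimator:

  # Sketch: evaluate rho(theta_2, ..., theta_m) for angles in (0, pi/2] given an
  # estimate psi_hat of psi (here a stand-in, not the authors' estimator).
  import numpy as np

  def rho(thetas, psi_hat):
      thetas = np.asarray(thetas, dtype=float)
      cots = 1.0 / np.tan(thetas)
      upper = 1.0 + cots.sum()       # 1 + cot(theta_2) + ... + cot(theta_m)
      lower = max(1.0, cots.max())   # 1 v cot(theta_2) v ... v cot(theta_m)
      return (upper - psi_hat(thetas)) / (upper - lower)

  print(rho([np.pi / 4, np.pi / 3], lambda th: 2.0))  # illustrative placeholder for psi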

The method was applied to returns of annually compounded zero-coupon swap rates with different maturities and currencies. The tail dependence was estimated for pairs of maturities. The results showed strong dependence between the 7-day and 30-day maturities, and between 10 years and each of 15, 20, and 30 years.

There is moderate dependence between 7 days and 6 months, and weak dependence between 7 days and 30 years and between 30 days and both 1 and 30 years.

The dependence between 30 days and 60 days varies with \theta: for values of \theta close to \pi/4 the dependence is moderate, and for values far from \pi/4 it becomes strong.

Besides the bivariate analyses, two trivariate examples were considered. The first is the dependence of swap rates with 7-day maturity across the currencies USD, EUR, and GBP. These data showed low tail dependence, with values increasing toward the edge of the parameter space but remaining below 0.5.

The second trivariate example shows the dependence between EUR swap rates with maturities of 5, 6, and 7 years. Almost all values are close to 1; the dependence decreases slightly for values of \theta_2, \theta_3 close to \pi/4, but remains above 0.8.

Based on the simulation study in the paper the method seems to work well: the estimator was able to recover the correct dependence structure. One thing missing from the paper is a discussion of when the method may fail; a problem that did appear in the analysis of the real data is that a lack of observations in some parts of the parameter space prevents estimation there. No other potential problems or extensions were discussed.

Reference:

Hsing T, Klüppelberg C, Kuhn G. Dependence Estimation and Visualization in Multivariate Extremes with Applications to Financial Data. Extremes. 2004;7(2):99-121.

Written by:
Greg, Mauricio

Summary of “Genomic Scans for Selective Sweeps Using SNP Data”

The main objective of this paper is to detect selective sweeps. A selective sweep is "the reduction or elimination of variation among the nucleotides in the neighboring DNA of a mutation as the result of recent and strong positive natural selection" [2]. Identifying selective sweeps is important because it strengthens the evidence for links between selection and disease genes [1].

The most widely used method to find these sweeps is based on the composite likelihood*. It uses the following ratio:

  CL = \frac{CL_{WS}}{CL_{AS}}

  • CL_{WS} is the composite likelihood under a model that does not allow selective sweeps
  • CL_{AS} is the composite likelihood under a model that allows selective sweeps

This test may be slow when applied to real data and is sensitive to assumptions regarding mutation rates and recombination rates.

*Note: the composite likelihood is calculated as the product of the marginal likelihoods of the individual sites.

The authors therefore propose two new methods to detect selective sweeps. Both are also based on composite likelihoods, but they differ from the previous ones in the null hypothesis: the proposed tests do not assume a specific population genetic model and instead use the variation in the data itself.

The first test “Aberrant site frequency spectrum”

The objective of this first test is to detect the regions where the allele frequency spectrum differs from that of the data as a whole. To do so, we first define the following.

  • Let X_i be the frequency of the derived allele* at SNP i in a sample of n chromosomes, so 1 \leq X_i \leq n-1.
  • The probability that a derived allele has frequency j in the sample is p_j; assuming homogeneity along the genome, P(X_i = j) = p_j.
  • \mathbf{p} = (p_1, p_2, \ldots, p_{n-1}).

For k SNPs** the composite likelihood is written as

  CL(\mathbf{p}) = \prod_{i=1}^{k} p_{X_i} = \prod_{j=1}^{n-1} p_j^{Y_j}

where Y_j is the number of SNPs with derived allele frequency j in the sample. We can also calculate the composite likelihood for a subset of the genomic data, e.g. for the SNPs in a window from a to b, written CL_{a,b}(\mathbf{p}).

*Note: a derived allele is an allele that arose during evolution due to a mutation. [3]

**A SNP (single nucleotide polymorphism) is a variation occurring commonly within a population, in which a single nucleotide in the genome differs between members of a biological species. [4]
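
A minimal sketch of the composite likelihood computation above (the function and variable names are mine, and any input frequency counts are purely illustrative):

  # Sketch: log CL(p) = sum_i log p_{X_i} = sum_j Y_j log p_j, where Y_j is the number
  # of SNPs whose derived allele frequency is j in a sample of n chromosomes.
  import numpy as np

  def log_composite_likelihood(freqs, p):
      """freqs: derived allele counts x_i in 1..n-1; p: vector (p_1, ..., p_{n-1})."""
      p = np.asarray(p, dtype=float)
      return np.sum(np.log(p[np.asarray(freqs) - 1]))  # p[j-1] corresponds to p_j

  def estimate_p(freqs, n):
      """Estimate p by the observed proportion of SNPs at each frequency 1..n-1."""
      counts = np.bincount(np.asarray(freqs), minlength=n)[1:n]  # Y_1, ..., Y_{n-1}
      return counts / counts.sum()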

With this information we calculate the standard likelihood ratio for each window, on the log scale:

  \Lambda_{a,b} = 2\left(\log CL_{a,b}(\widehat{\mathbf{p}}_{a,b}) - \log CL_{a,b}(\widehat{\mathbf{p}})\right)

  • CL_{a,b}(\widehat{\mathbf{p}}) is the composite likelihood of the window evaluated at \widehat{\mathbf{p}}, the frequency spectrum estimated from all the data.
  • CL_{a,b}(\widehat{\mathbf{p}}_{a,b}) is the composite likelihood of the window evaluated at the spectrum estimated only from the window from a to b.

This test checks whether the allele frequency spectrum in the window differs from the global one.
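
Putting these pieces together, a sliding-window scan could look like the sketch below (it reuses log_composite_likelihood and estimate_p from the sketch above; the window size and step are arbitrary choices for illustration):

  # Sketch: compare each window's composite likelihood under its own spectrum estimate
  # versus the spectrum estimated from all the data; large values flag aberrant windows.
  def window_scan(freqs, n, window=50, step=10):
      freqs = np.asarray(freqs)
      p_global = estimate_p(freqs, n)
      stats = []
      for a in range(0, len(freqs) - window + 1, step):
          w = freqs[a:a + window]
          lam = 2.0 * (log_composite_likelihood(w, estimate_p(w, n))
                       - log_composite_likelihood(w, p_global))
          stats.append(lam)
      return stats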

 

The second test “Parametric Approach”

The test detects selective sweeps based on the way the spatial distribution of frequency spectra is affected by a selective sweep.

  • The allele frequency spectrum in the population before the selective sweep is \mathbf{p} = (p_1, p_2, \ldots, p_{n-1}).
  • Each ancestral lineage independently escapes the effect of the selective sweep with probability P_e = 1 - e^{-\alpha d}, where d is the distance from the sweep and \alpha is a parameter that depends on the recombination rate, the population size, and the selection coefficient of the selected mutation.

Under some assumptions the number of lineages that escape the sweep, k, follows a binomial distribution with parameters n and P_e.
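
As a small numerical illustration of these two ingredients (the values of \alpha, n, and d below are arbitrary):

  # Sketch: escape probability P_e = 1 - exp(-alpha * d) and the binomial distribution
  # of the number k of ancestral lineages (out of n) that escape the sweep.
  import numpy as np
  from math import comb

  def escape_prob(alpha, d):
      return 1.0 - np.exp(-alpha * d)

  def prob_k_escape(k, n, alpha, d):
      pe = escape_prob(alpha, d)
      return comb(n, k) * pe ** k * (1 - pe) ** (n - k)

  alpha, n = 5e-4, 20
  for d in (100, 1000, 10000):
      print(d, escape_prob(alpha, d), prob_k_escape(n, n, alpha, d))  # P_e grows with distance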

From \mathbf{p} one obtains the probability of observing j mutant lineages in an ancestral sample of size H = \min\{n, k+1\}; combining this with the binomial distribution of k gives the probability of observing a mutant allele of frequency B out of n after the selective sweep.

With this expression the composite likelihood for a selective sweep of intensity \alpha is calculated. To assess the significance of the test it is necessary to use coalescent simulations.

Other tests

Two other tests were also considered in the analysis:

  • The Mann-Whitney U test (MWU), used to test for an excess or deficiency of low-frequency derived alleles.
  • Tajima's D test.

Both had their critical values determined by simulation.

Corrections

Because of the way the data were collected there is a bias called "ascertainment bias", so the likelihood was modified to include this information. Consider \theta to be the vector of parameters (\mathbf{p}, \alpha). The likelihood is modified in the following way:

  L(\theta) \propto P(X_i = x_i \mid \theta, Asc_i) = \frac{P(Asc_i \mid X_i = x_i, \theta)\, P(X_i = x_i \mid \theta)}{P(Asc_i \mid \theta)}

Let us consider that the ascertainment happened in a subsample of size d. Then

  P(Asc_i \mid X_i = x_i, \theta) = 1 - \frac{\binom{x_i}{d} + \binom{n - x_i}{d}}{\binom{n}{d}}
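
A sketch of this correction term, assuming that "ascertained" means the SNP was observed as polymorphic in the subsample of size d (the numbers in the example are arbitrary):

  # Sketch: probability that a SNP with derived allele count x out of n chromosomes is
  # polymorphic (hence ascertained) in a subsample of size d.
  from math import comb

  def prob_ascertained(x, n, d):
      # 1 minus the probability that the subsample is all-derived or all-ancestral
      return 1.0 - (comb(x, d) + comb(n - x, d)) / comb(n, d)

  n, d = 40, 4
  for x in (1, 5, 20):
      print(x, round(prob_ascertained(x, n, d), 3))  # rare variants are ascertained less often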

Evaluations

Some of the results found using simulated data:

  • When evaluating the power of the tests applied to the unfolded frequency spectrum, the MWU test had little power. Tests 1 and 2 had higher power than Tajima's D, and test 2 was more powerful than test 1.
  • When applied to the folded frequency spectrum, the MWU test performed better than before and is comparable to Tajima's D. The power of the other two tests is unaffected.
  • When evaluating the power and robustness of the tests under a demographic model, Tajima's D rejected in 100% of the cases, even when there was no selective sweep. The other tests were more robust and conservative under models of population growth.

Analysis of Seattle SNP data

The gene that showed the strongest evidence of a selective sweep was C3. The implicated intron contains many differences when compared to the chimpanzee sequence (which is considered the ancestral sequence).

Analysis of HapMap data

The analysis was done both with and without the ascertainment correction. The main effect of the correction was to reduce the composite likelihood ratio for several significant peaks, which suggests that ignoring ascertainment may lead to an excess of false positives.

The major peak was at position 1.36 \times 10^8 and is centered on the lactase locus. Besides this gene, many others, several of them linked to disease factors, were found.

Conclusion

The test that showed the best results was the “Parametric Approach”.

  • It is robust to demographic factors (changes in population size and realistic models of human demography) and to assumptions regarding the recombination rate.
  • It has power to detect recent selective sweeps.
  • It is computationally fast.
  • It can incorporate ascertainment bias.

Some problems with this test are:

  • It may not be robust to demographic factors not studied in the paper.
  • Some of the results may have been caused by types of selection other than selective sweeps.
  • It may not have power to detect other types of selection.

 

Summarized by Mauricio

 

[1]Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, et al. (2005) Genomic scans for selective sweeps using SNP data. Genome Research 15: 1566–1575

[2] http://en.wikipedia.org/wiki/Selective_sweep

[3] https://www.biostars.org/p/61267/

[4] http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism
