Uncategorized | Mauricio Nascimento

The main objective of this paper is to detect selective sweeps. A selective sweep is “the reduction or elimination of variation among the nucleotides in the neighboring DNA of a mutation as the result of recent and strong positive natural selection” [2]. Identifying selective sweeps is important because it adds to the evidence of links between selection and disease genes [1].

The most widely used method for finding these sweeps is based on the composite likelihood*. It uses the following ratio:

$CL=\frac{CLWS}{CLAS}$

CLWS is the composite likelihood under a model that does not allow selective sweeps.
CLAS is the composite likelihood under a model that allows selective sweeps.

This test may be slow when applied to real data, and it is sensitive to assumptions regarding mutation and recombination rates.

*Obs. The composite likelihood is calculated as the product of the marginal likelihoods for each site.

The authors therefore propose two new methods to detect selective sweeps. Both are also based on the composite likelihood, but they differ from the previous methods in the null hypothesis: the proposed tests do not assume a specific population genetic model. Instead, they derive the null hypothesis from the variation in the data itself.

The first test “Aberrant site frequency spectrum”

The objective of this first test is to detect regions where the allele frequency spectrum differs from the spectrum of the whole data set. To do so, some definitions are needed first.

Let $X_i$ be the frequency of the derived allele* at SNP i in a sample of n chromosomes, so $1 \leq X_i \leq n-1$.
The probability of a derived allele of frequency j in the sample is $p_j$; assuming homogeneity along the genome, $P(X_i=j)=p_j$.
$\mathbf{p}=(p_1,p_2,\ldots,p_{n-1})$

For k SNPs** the composite likelihood is written as:

$CL_1(\mathbf{p})=\prod_{i=1}^k p_{x_i}=\prod_{j=1}^{n-1}p_j^{k_j}$

$k_j$ is the number of SNPs with derived allele frequency j in the sample. We can also calculate the composite likelihood for a subset of the genomic data, from SNP v to SNP b, written $v \leftrightarrow b$.

$CL_1(\mathbf{p}; v \leftrightarrow b )=\prod_{i=v}^b p_{x_i}$
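As an illustration, the counts $k_j$ and the logarithm of $CL_1$ can be computed in a few lines of Python. This is only a sketch: the function names and the toy data are invented for the example, and the estimate $k_j/k$ used for $p_j$ is simply the value that maximizes a product of this form, not something quoted from the paper.

```python
import math
from collections import Counter

def estimate_spectrum(freqs, n):
    """Estimate p_j as k_j / k for j = 1..n-1 (index 0 is unused)."""
    counts = Counter(freqs)   # k_j: number of SNPs with derived allele frequency j
    k = len(freqs)            # total number of SNPs
    return [0.0] + [counts[j] / k for j in range(1, n)]

def log_cl1(freqs, p):
    """log CL_1 = sum over the SNPs of log p_{x_i}."""
    return sum(math.log(p[x]) for x in freqs)

# Toy data (hypothetical): derived allele counts at 8 SNPs, n = 5 chromosomes.
n = 5
freqs = [1, 1, 2, 1, 4, 3, 1, 2]
p_hat = estimate_spectrum(freqs, n)

whole = log_cl1(freqs, p_hat)        # composite log-likelihood of all SNPs
window = log_cl1(freqs[2:6], p_hat)  # only the SNPs at 0-based positions 2..5
print(p_hat, whole, window)
```

Working with the logarithm avoids numerical underflow, since the product of many small probabilities quickly becomes too small to represent.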

*Obs. A derived allele is an allele that arises during evolution due to a mutation. [3]

**Obs. A SNP (Single Nucleotide Polymorphism) is a variation occurring commonly within a population, in which a single nucleotide in the genome differs between members of a biological species. [4]

With this information we calculate the standard likelihood ratio for the window. The ratio is calculated on the log scale.

$T_1=2\{ \log CL_1(\mathbf{\widehat{p}}_{v \leftrightarrow b}; v \leftrightarrow b)-\log CL_1(\mathbf{\widehat{p}}; v \leftrightarrow b) \}$

$\mathbf{\widehat{p}}$ is the estimate of $\mathbf{p}$ that maximizes the composite likelihood based on all the data.
$\mathbf{\widehat{p}}_{v \leftrightarrow b}$ is the estimate of $\mathbf{p}$ that maximizes the composite likelihood based on the window from v to b.

This test checks whether the allele frequency spectrum in the window differs from the spectrum of the global data set.
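The window statistic $T_1$ can be sketched as follows. The code assumes that both estimates of $\mathbf{p}$ are the relative frequencies of each class (the maximizers of the composite likelihood), and it uses 0-based, inclusive window bounds; the function names and the data are hypothetical.

```python
import math
from collections import Counter

def spectrum_mle(freqs, n):
    """Relative frequency of each derived allele class seen in freqs."""
    counts = Counter(freqs)
    k = len(freqs)
    return {j: counts[j] / k for j in range(1, n) if counts[j] > 0}

def log_cl(freqs, p):
    """Composite log-likelihood of the SNPs in freqs under spectrum p."""
    return sum(math.log(p[x]) for x in freqs)

def t1_statistic(all_freqs, v, b, n):
    """T_1 for the window of SNPs v..b (0-based, inclusive)."""
    window = all_freqs[v:b + 1]
    p_global = spectrum_mle(all_freqs, n)  # estimated from all the data
    p_window = spectrum_mle(window, n)     # estimated from the window only
    return 2.0 * (log_cl(window, p_window) - log_cl(window, p_global))

# Toy data (hypothetical): SNPs 6..9 all carry high-frequency derived alleles.
all_freqs = [1, 2, 3, 1, 2, 3, 4, 4, 4, 4, 1, 2]
print(t1_statistic(all_freqs, 6, 9, n=5))   # large: the window looks unusual
print(t1_statistic(all_freqs, 0, 11, n=5))  # 0.0: the window is the whole data set
```

Because the window estimate maximizes the window likelihood, the statistic is never negative; larger values indicate a window whose spectrum departs more strongly from the global one.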

The second test “Parametric Approach”

The test will detect selective sweeps based on the way the spatial distribution of frequency spectra* is affected by a selective sweep.

The allele frequency spectrum in the population before the selective sweep is $\mathbf{p}=(p_1,p_2,\ldots,p_{n-1})$.
Each ancestral lineage has an independent and identically distributed probability $P_e$ of escaping the selective sweep, that is, of not being changed by it. This probability is defined as $P_e=1-e^{-\alpha d}$, where d is the distance from the sweep and $\alpha$ is a parameter that depends on the recombination rate, the population size and the selection coefficient of the selected mutation.

Under some assumptions, the probability $P_e(k)$ that k of the n lineages escape the sweep, where $0 \leq k \leq n$, follows a binomial distribution with parameters n and $P_e$: $P_e(k)=\binom{n}{k}P_e^k(1-P_e)^{n-k}$.
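A small Python sketch of these two quantities may help. The values of $\alpha$ and d below are arbitrary and chosen only for illustration, and the function names are not from the paper.

```python
import math

def escape_probability(alpha, d):
    """P_e = 1 - exp(-alpha * d): chance that one lineage escapes the sweep."""
    return 1.0 - math.exp(-alpha * d)

def prob_k_escape(k, n, alpha, d):
    """Binomial probability that exactly k of the n lineages escape."""
    pe = escape_probability(alpha, d)
    return math.comb(n, k) * pe**k * (1.0 - pe)**(n - k)

n = 10
for d in (0.0, 50.0, 500.0):   # hypothetical distances from the selected site
    dist = [prob_k_escape(k, n, alpha=0.01, d=d) for k in range(n + 1)]
    expected = sum(k * q for k, q in enumerate(dist))
    print(d, round(expected, 3))  # expected number of escaping lineages
```

At the selected site itself (d = 0) no lineage escapes, and the expected number of escaping lineages grows with the distance, which is why the effect of a sweep fades away from the selected mutation.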

The probability of observing j mutant lineages in an ancestral sample of size H is:

$p_{j,H}=\sum_{i=j}^{n-1}p_i \frac{\binom{i}{j}\binom{n-i}{H-j}}{\binom{n}{H}}$

$H=\min\{n,k+1\}$
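The sum over i is a hypergeometric subsampling of the spectrum, which can be written directly in Python. In this sketch the spectrum is stored as a list whose index 0 is unused, so that `p[i]` is $p_i$ and `len(p)` is n; this layout and the example values are assumptions made for the illustration.

```python
import math

def subsample_prob(p, j, H):
    """p_{j,H}: probability of j mutant lineages in a subsample of size H."""
    n = len(p)                  # p[1..n-1] holds the spectrum, p[0] is unused
    if j < 0 or j > H:
        return 0.0              # impossible counts have probability zero
    total = 0.0
    for i in range(max(j, 1), n):
        # Hypergeometric term: j mutants among H draws when i of n are mutant.
        total += p[i] * math.comb(i, j) * math.comb(n - i, H - j)
    return total / math.comb(n, H)

p = [0.0, 0.5, 0.25, 0.15, 0.10]   # hypothetical spectrum for n = 5
print([subsample_prob(p, j, 3) for j in range(4)])
```

Two sanity checks follow from the formula: with H = n the subsample is the whole sample, so $p_{j,n}=p_j$, and for any H the probabilities over j = 0, ..., H add up to one.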

With this, the probability of observing a mutant allele of frequency B out of n after the selective sweep is:

$p_B^*=P_e(n)p_B+\sum_{k=0}^{n-1}P_e(k) \left( p_{B+1-n+k,k+1}\frac{B+1-n+k}{k+1}+p_{B,k+1}\frac{k+1-B}{k+1} \right)$
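The whole expression can be put together in a short Python sketch. It is a direct transcription of the formula under two assumptions that are mine and not the paper's: terms whose indices fall outside their valid range are treated as zero, and the spectrum is stored as a list with an unused index 0. The parameter values are arbitrary.

```python
import math

def subsample_prob(p, j, H):
    """p_{j,H}: j mutant lineages in an ancestral sample of size H."""
    n = len(p)
    if j < 0 or j > H:
        return 0.0   # assumption: out-of-range terms contribute nothing
    return sum(
        p[i] * math.comb(i, j) * math.comb(n - i, H - j)
        for i in range(max(j, 1), n)
    ) / math.comb(n, H)

def escape_k(k, n, pe):
    """Binomial probability that k of the n lineages escape the sweep."""
    return math.comb(n, k) * pe**k * (1.0 - pe)**(n - k)

def post_sweep_prob(p, B, alpha, d):
    """p*_B: probability of derived allele frequency B after the sweep."""
    n = len(p)
    pe = 1.0 - math.exp(-alpha * d)
    p_B = p[B] if 1 <= B <= n - 1 else 0.0
    total = escape_k(n, n, pe) * p_B          # every lineage escaped
    for k in range(n):                        # k = 0 .. n-1 lineages escaped
        H = k + 1
        j = B + 1 - n + k
        with_mutant = subsample_prob(p, j, H) * j / H           # first term
        without_mutant = subsample_prob(p, B, H) * (H - B) / H  # second term
        total += escape_k(k, n, pe) * (with_mutant + without_mutant)
    return total

p = [0.0, 0.5, 0.25, 0.15, 0.10]   # hypothetical spectrum for n = 5
for d in (0.0, 100.0, 5000.0):
    print(d, [round(post_sweep_prob(p, B, 0.01, d), 4) for B in range(6)])
```

At d = 0 all the probability sits on B = 0 and B = n, meaning that the variation is eliminated at the selected site, while far from the sweep the original spectrum is recovered.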

With this expression, the composite likelihood for a selective sweep of intensity $\alpha$ is calculated. To assess the significance of the test, it is necessary to use coalescent simulations.

Other tests

Two other tests were also considered in the analysis.

Mann-Whitney U-test (MWU), used to test for an excess or deficiency of low-frequency derived alleles.
Tajima's test

The critical values of both were determined using simulations.

Corrections

Because of the way the data were collected, there is a bias called “ascertainment bias”, so the likelihood was modified to include this information. Consider $\theta$ as the vector of parameters $(\mathbf{p},\alpha)$. The likelihood is modified in the following way.

$L(\theta) \propto P(X_i=x_i \vert \theta;Asc_i)=\frac{P(Asc_i \vert X_i=x_i,\theta)P(X_i=x_i\vert \theta)}{P(Asc_i\vert \theta)}$

Let us consider that the ascertainment happened in a subsample of size d. A SNP is ascertained when the subsample contains both alleles, that is, when it is neither all derived nor all ancestral.

$P(Asc_i \vert X_i=x_i,\theta)=1-\frac{\binom{x_i}{d}+\binom{n-x_i}{d}}{\binom{n}{d}}$

$P(Asc_i\vert \theta)=\sum_{j=1}^{n-1}P(X_i=j\vert \theta)P(Asc_i\vert X_i=j)$
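The correction can be sketched in Python as follows. The code assumes that a SNP is ascertained when a random subsample of size d (with d at least 2) contains both alleles, so the ascertainment probability is one minus the hypergeometric probability of drawing only derived or only ancestral copies. The function names and the spectrum are made up for the example.

```python
import math

def prob_ascertained(x, n, d):
    """P(Asc | X = x): a subsample of size d contains both alleles."""
    only_derived = math.comb(x, d)        # ways to draw d derived copies
    only_ancestral = math.comb(n - x, d)  # ways to draw d ancestral copies
    return 1.0 - (only_derived + only_ancestral) / math.comb(n, d)

def corrected_spectrum(p, d):
    """P(X = j | Asc) for j = 1..n-1, following Bayes' rule (index 0 unused)."""
    n = len(p)
    weights = [0.0] + [p[j] * prob_ascertained(j, n, d) for j in range(1, n)]
    norm = sum(weights)                   # P(Asc | theta)
    return [w / norm for w in weights]

p = [0.0, 0.5, 0.25, 0.15, 0.10]          # hypothetical spectrum for n = 5
print(prob_ascertained(1, 5, 2))          # rare alleles are often missed
print(corrected_spectrum(p, d=2))
```

Rare alleles are less likely to appear in a small discovery subsample, so the corrected spectrum gives less weight to the lowest and highest frequency classes than the uncorrected one.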

Evaluations

Some of the results found using simulated data:

When evaluating the power of the tests applied to the unfolded frequency spectrum, the MWU test had little power. Test 1 and test 2 had more power than Tajima's test, and the power of test 2 was higher than the power of test 1.
When evaluating the power of the tests applied to the folded frequency spectrum, the MWU test performed better than before and became comparable to Tajima's test. The power of the other two tests was unaffected.
When evaluating the power and robustness of the tests under a demographic model, Tajima's test rejected the null hypothesis in 100% of the cases, even when there was no selective sweep. The other tests were more robust and conservative under models of population growth.

Analysis of Seattle SNP data

The gene that showed the strongest evidence of a selective sweep was C3. This intron contains many differences when compared with the chimpanzee sequence (the one considered ancestral).

Analysis of HapMap data

The analysis was performed both with and without the ascertainment correction. The main effect of the correction was that it reduced the composite likelihood ratio for several significant peaks. This suggests that omitting it may lead to an excess of false positives.

The major peak was at position $1.36 \times 10^8$, centered on the lactase locus. Besides this gene, many others were found, several of them linked to disease factors.

Conclusion

The test that showed the best results was the “Parametric Approach”.

It is robust to demographic factors (changes in population size and realistic models of human demography) and to assumptions regarding the recombination rate.
It has power to detect recent selective sweeps.
It is computationally fast.
It can incorporate ascertainment bias.

Some problems with this test are:

It may not be robust to some demographic factors not studied in the paper.
Some of the results may have been caused by types of selection other than a selective sweep.
It may not have power to detect other types of selection.

Summarized by Mauricio

[1] Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, et al. (2005) Genomic scans for selective sweeps using SNP data. Genome Research 15: 1566–1575

[2] http://en.wikipedia.org/wiki/Selective_sweep

[3] https://www.biostars.org/p/61267/

[4] http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism


Summary of “Genomic Scans for Selective Sweeps Using SNP Data”
