hapFabia Identification of very short segments of

  • Date:26 Nov 2019
  • Pages:36
  • Size:881.77 KB

Transcription:

Contents

1 Introduction
2 Getting Started
  2.1 Typical Analysis Pipeline
  2.2 Examples
3 hapFabia Method
  3.1 FABIA for genotype data
    3.1.1 FABIA describes genotype data by IBD segments
    3.1.2 Adaptation of FABIA for IBD detection
  3.2 Extraction of IBD segments from FABIA models
  3.3 Further Advantages of HapFABIA
4 Tools to Analyze fabia Results

1 Introduction

This package, hapFabia, provides software for the method HapFABIA, which identifies short identity by descent (IBD) segments that are tagged by rare variants in large sequencing data.

Two haplotypes are identical by descent (IBD) if they share a segment that both inherited from a common ancestor. Current IBD methods reliably detect long IBD segments because many minor alleles in the segment are concordant between the two haplotypes. However, many cohort studies contain unrelated individuals which share only short IBD segments. Short IBD segments contain too few minor alleles to distinguish IBD from random allele sharing by recurrent mutations. New sequencing techniques improve the situation by providing rare variants, which convey more information on IBD than common variants, because random minor allele sharing of rare variants is less likely than for common variants.

Short IBD segments are of interest because (i) they resolve the genetic structure on a fine scale and (ii) they can be assumed to be old. In order to detect short IBD segments, both the information supplied by rare variants and information from more than two individuals should be utilized. The probability of a segment being IBD is typically computed via the probabilities of randomly sharing single alleles within the segment. The probability of randomly sharing a single allele depends (1) on the allele frequency, where lower frequency means lower probability of random sharing, and (2) on the number of individuals that share the allele, where more individuals means lower probability of random sharing. Therefore, a segment that contains rare variants and is shared by more individuals has higher significance of being IBD. These two characteristics are the basis for detecting short IBD segments by HapFABIA.

We propose biclustering (Hochreiter et al., 2010) to detect very short IBD segments that are shared among multiple individuals. Biclustering simultaneously clusters rows and columns of a matrix. In particular, it clusters row elements that are similar to each other on a subset of column elements. A genotype matrix has individuals (unphased) or chromosomes (phased) as row elements and SNVs as column elements. Entries in the genotype matrix usually count how often the minor allele of a particular SNV is present in a particular individual. Alternatively, minor allele likelihoods or dosages may be used. Individuals that share an IBD segment are similar to each other at minor alleles of SNVs (tagSNVs) which tag the IBD segment (see Fig. 2). Therefore, an IBD segment that is shared among individuals corresponds to a bicluster, because these individuals are similar to one another at this segment. Identifying a bicluster means identifying tagSNVs (column bicluster elements) that tag an IBD segment and simultaneously identifying individuals (row bicluster elements) that possess the IBD segment.

In contrast to standard IBD detection methods, biclustering considers multiple individuals. In contrast to standard clustering, biclustering allows for SNVs or individuals that do not belong to any cluster or to more than one bicluster. Multiple cluster membership suits IBD detection because diploid individuals can have two IBD segments at one locus and an SNV may tag more than one IBD segment. FABIA is able to represent homozygous regions (the same IBD segment in both chromosomes) by means of its factors. At a locus, overlapping IBD segments in one diploid individual (a different IBD segment in each of the two chromosomes) are represented through additivity of biclusters in the FABIA model. Examples of short IBD segments found by hapFabia in chromosome 1 data from the 1000 Genomes Project are given in Fig. 3 and Fig. 4.

Figure 1: The IBD segment (marked in yellow) descended from a founder to different individuals.
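The dependence of random sharing on allele frequency and on the number of sharing individuals can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that chromosomes carry a minor allele independently with probability equal to its frequency, so that k given chromosomes share it by chance with probability maf^k. This independence model and the helper names random_sharing_prob and segment_log10_prob are assumptions of the sketch, not the statistic that HapFABIA computes.

```r
# Toy model (an assumption for illustration, not HapFABIA's statistic):
# each chromosome carries the minor allele independently with probability maf.
random_sharing_prob <- function(maf, k) {
  stopifnot(all(maf > 0 & maf < 1), k >= 2)
  maf^k  # probability that k given chromosomes all carry the minor allele
}

# log10 probability that k chromosomes randomly share every one of several
# SNVs, additionally assuming independence between the SNVs.
segment_log10_prob <- function(mafs, k) {
  sum(log10(random_sharing_prob(mafs, k)))
}

p_common <- random_sharing_prob(maf = 0.30, k = 2)  # common variant, 2 sharers
p_rare   <- random_sharing_prob(maf = 0.01, k = 2)  # rare variant, 2 sharers
p_rare4  <- random_sharing_prob(maf = 0.01, k = 4)  # rare variant, 4 sharers

# Six rare SNVs (MAF 0.01 each) shared by three chromosomes.
seg <- segment_log10_prob(rep(0.01, 6), k = 3)
```

Under these assumptions both a lower allele frequency and a larger number of sharers reduce the chance of random sharing, which is the qualitative behaviour the text relies on.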
[Figure 2 plot: panels "Original Data" and "Bicluster"; rows: individuals (Id 01 to Id 16); columns: SNVs]

Figure 2: Biclustering of a genotyping matrix. Left: original genotyping data matrix with individuals as row elements and SNVs as column elements. Minor alleles are indicated by violet bars and major alleles by yellow bars for each individual-SNV pair. Right: after sorting the rows, the detected bicluster can be seen in the top three individuals. They contain the same IBD segment, which is marked in gold. Biclustering simultaneously clusters rows and columns of a matrix so that row elements (here individuals) are similar to each other on a subset of column elements (here the tagSNVs).
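The idea that an IBD segment appears as a bicluster of individuals and tagSNVs can be illustrated on a small synthetic genotype matrix. The following is a minimal sketch in base R: it implants one shared segment into a 16 x 80 matrix and recovers it with a naive pairwise-sharing heuristic. The heuristic, the thresholds, and all variable names are assumptions chosen for illustration; this is not the FABIA model and not how hapFabia extracts IBD segments.

```r
n_ind <- 16
n_snv <- 80

# Genotype matrix: rows are individuals, columns are SNVs, 1 = minor allele.
G <- matrix(0L, nrow = n_ind, ncol = n_snv,
            dimnames = list(sprintf("Id%02d", 1:n_ind), NULL))

# Background: every individual gets two private minor alleles
# (columns 1..16 and 65..80), which nobody else carries.
for (i in seq_len(n_ind)) {
  G[i, i] <- 1L
  G[i, n_snv - i + 1] <- 1L
}

# Implanted segment: three carriers share the minor allele at six tagSNVs.
carriers <- c(1, 8, 12)
tag_cols <- c(30, 33, 35, 38, 41, 44)
G[carriers, tag_cols] <- 1L

# Naive detection (assumed heuristic): count minor alleles shared by each
# pair of individuals and keep individuals that share at least 3 with someone.
shared <- tcrossprod(G)
diag(shared) <- 0
members <- unname(which(apply(shared, 1, max) >= 3))

# Columns of the bicluster: SNVs whose minor allele occurs in >= 2 members.
tag_found <- which(colSums(G[members, , drop = FALSE]) >= 2)

# Sort rows so that the bicluster members come first, as in the right panel.
ord <- c(members, setdiff(seq_len(n_ind), members))
G_sorted <- G[ord, ]
```

Because the background alleles in this toy matrix are private, any pair that shares several minor alleles must carry the implanted segment; real data need a model that tolerates noise, which is the role of biclustering in the text.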
[Figure 3 plot: chr 1, pos 8,698,269, length 57 kbp, SNPs 126, Samples 30; y axis: chromosomes of YRI, LWK and ASW samples; x axis: SNV positions]

Figure 3: Example of an IBD segment in chromosome 1 found in the 1000 Genomes Project data. The y axis gives chromosomes and the x axis consecutive SNVs. Yellow indicates major alleles, violet minor alleles of tagSNVs, and blue minor alleles of other SNVs. "model L" indicates tagSNVs identified by hapFabia in violet. A probable phasing error can be seen in lines 3 and 4 at individual NA18522. Another phasing error can be seen in the last-but-four and the last-but-five lines at individual NA19435.

[Figure 4 plot: chr 1, pos 51,721,665, length 52 kbp, SNPs 160, Samples 48; y axis: chromosomes of GBR, FIN, CEU, YRI, LWK and ASW samples; x axis: SNV positions]
Figure 4: Another example of an IBD segment from chromosome 1 of the 1000 Genomes Project. See Fig. 3 for a description. Again, probable phasing errors can be seen at individuals NA18516, NA19310, and NA19384.

2 Getting Started

2.1 Typical Analysis Pipeline

First we briefly describe a typical analysis pipeline. Assume we have the genotype data of chromosome 1 in the file filename.vcf.gz in compressed vcf format. To prepare the data for hapFabia, we have to perform preprocessing steps. The file filename.vcf.gz must be (1) uncompressed, then (2) converted to the sparse matrix format; then (3) the genotype matrix is copied to the matrix that is processed, which is finally (4) split into intervals. The following command line commands perform these steps:

1. gunzip filename.vcf.gz
2. vcftoFABIA filename
3. cp filename_matG.txt filename_mat.txt
4. split_sparse_matrix filename _mat.txt 10000 5000 1

In inst/commandline/<arch>/, command line tools for steps 2 to 4 are provided by the package hapFabia. However, steps 2 to 4 can be performed in R as well (see below).

The command line parameters for vcftoFABIA are:

1. filename without ".vcf"
2. path to the file, e.g.
3. optional "-s snps", where snps gives the number of SNVs in the input data file
4. optional "-o outputFileName", which gives the prefix of the output files

The command line parameters for split_sparse_matrix are:

1. filename without ".vcf"
2. extension (default "_mat.txt")
3. interval length
4. shift size
5. indicator whether annotation is present (it is generated by vcftoFABIA by default)

The data is split into intervals of 10,000 SNVs, where the distance between the starts of adjacent intervals is 5,000 SNVs; thus adjacent intervals overlap by 5,000 SNVs.

After providing the file filename.vcf, the following steps constitute a typical analysis pipeline:

R> ## define intervals, overlap, filename
R> shiftSize <- 5000
R> intervalSize <- 10000
R> fileName <- "filename" # without type
R> ## load library
R> library(hapFabia)
R> ## convert from .vcf to _mat.txt: step 2 above
R> vcftoFABIA(fileName=fileName)
R> ## copy genotype matrix to matrix: step 3 above
R> file.copy(paste(fileName,"_matG.txt",sep=""),
+    paste(fileName,"_mat.txt",sep=""))
R> ## split / generate intervals: step 4 above
R> split_sparse_matrix(fileName=fileName,intervalSize=intervalSize,
+    shiftSize=shiftSize,annotation=TRUE)
R> ## compute how many intervals we have
R> ina <- as.numeric(readLines(paste(fileName,"_mat.txt",sep=""),n=2))
R> noSNVs <- ina[2]
R> over <- intervalSize%/%shiftSize
R> N1 <- noSNVs%/%shiftSize
R> endRunA <- (N1-over+2)
R> ## analyze each interval
R> ## may be done by parallel runs
R> iterateIntervals(startRun=1,endRun=endRunA,shift=shiftSize,
+    intervalSize=intervalSize,fileName=fileName,individuals=0,
+    upperBP=0.05,p=10,iter=40,alpha=0.03,cyc=50,IBDsegmentLength=50,
+    Lt=0.1,Zt=0.2,thresCount=1e-5,mintagSNVsFactor=3/4,
+    pMAF=0.035,haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
+    simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)
R> ## identify duplicates
R> identifyDuplicates(fileName=fileName,startRun=1,endRun=endRunA,
+    shift=shiftSize,intervalSize=intervalSize)
R> ## analyze results; parallel
R> anaRes <- analyzeIBDsegments(fileName=fileName,startRun=1,endRun=endRunA,
+    shift=shiftSize,intervalSize=intervalSize)
R> print("Number IBD segments:")
R> print(anaRes$noIBDsegments)
R> print("Statistics on IBD segment lengths in SNVs (all SNVs in the IBD segment):")
R> print(anaRes$avIBDsegmentLengthSNVS)
R> print("Statistics on IBD segment lengths in bp:")
R> print(anaRes$avIBDsegmentLengthS)
R> print("Statistics on number of individuals that share an IBD segment:")
R> print(anaRes$avnoIndividS)
R> print("Statistics on number of IBD segment tagSNVs:")
R> print(anaRes$avnoTagSNVsS)
R> print("Statistics on MAF of IBD segment tagSNVs:")
R> print(anaRes$avnoFreqS)
R> print("Statistics on MAF within the group of IBD segment tagSNVs:")
R> print(anaRes$avnoGroupFreqS)
R> print("Statistics on number of changes between major and minor allele frequency:")
R> print(anaRes$avnotagSNVChangeS)
R> print("Statistics on tagSNVs per individual that shares an IBD segment:")
R> print(anaRes$avnotagSNVsPerIndividualS)
R> print("Statistics on number of individuals that have the minor allele of tagSNVs:")
R> print(anaRes$avnoindividualPerTagSNVS)
R> ## load result for interval 50
R> posAll <- 50 # (50-1)*5000 = 245000: segment 245000 to 255000
R> start <- (posAll-1)*shiftSize
R> end <- start + intervalSize
R> pRange <- paste("_",format(start,scientific=FALSE),"_",
+    format(end,scientific=FALSE),sep="")
R> load(file=paste(fileName,pRange,"_resAnno",".Rda",sep=""))
R> IBDsegmentList <- resHapFabia$mergedIBDsegmentList
R> summary(IBDsegmentList)
R> ## plot IBD segments in interval 50
R> plot(IBDsegmentList,filename=paste(fileName,pRange,"_mat",sep=""))
R> ## attention: filename without type ".txt"
R> ## plot the first IBD segment in interval 50
R> IBDsegment <- IBDsegmentList[[1]]
R> plot(IBDsegment,filename=paste(fileName,pRange,"_mat",sep=""))
R> ## attention: filename without type ".txt"

First the packages hapFabia and fabia are loaded. Then vcftoFABIA converts filename.vcf to sparse matrix format, giving

filename_matH.txt (haplotype data),
filename_matG.txt (genotype data),
filename_matD.txt (dosage data),

together with the SNV annotation file and the individuals' label file

filename_annot.txt and
filename_individuals.txt.

The function split_sparse_matrix splits the data into intervals. The function iterateIntervals identifies IBD segments in these intervals and stores the results in an EXCEL-like csv format and as an R data object. The function identifyDuplicates marks and memorizes duplicates of IBD segments, which occur because the intervals overlap. Next, the function analyzeIBDsegments analyzes the results, where duplicates (as marked in the previous step) are not considered. Results are listed by anaRes.

The next example shows how to view all IBD segments of an interval, for which we chose interval 50, which corresponds to the chromosome 1 range from 245,000 to 255,000 ((50-1)*5000 = 245000). Then we plot a specific IBD segment, in this case the first (IBDsegmentList[[1]]). The plot function can also be used to store a pdf or a fig for editing with Xfig. Examples of this plot function are given in Fig. 3 and Fig. 4.

An R source file pipeline.R of the above pipeline can be created and executed as follows.
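As a separate aside on the interval bookkeeping used in the pipeline of Section 2.1 (interval size 10,000, shift 5,000, and the expressions for over, N1, endRunA, start and end), the arithmetic can be tabulated in base R without the package. The following is a minimal sketch: the helper name interval_table and the example value of 300,000 SNVs are hypothetical, and how hapFabia itself treats a final interval that extends past the last SNV is not addressed here.

```r
# Tabulate the overlapping intervals implied by the pipeline's arithmetic.
# interval_table is a hypothetical helper, not a hapFabia function.
interval_table <- function(noSNVs, intervalSize = 10000, shiftSize = 5000) {
  over   <- intervalSize %/% shiftSize   # shifts per interval
  N1     <- noSNVs %/% shiftSize         # complete shifts in the data
  endRun <- N1 - over + 2                # same expression as endRunA above
  run    <- seq_len(endRun)
  start  <- (run - 1) * shiftSize        # same expression as for posAll above
  data.frame(run = run, start = start, end = start + intervalSize)
}

# Example with an assumed number of SNVs.
iv <- interval_table(noSNVs = 300000)

# Interval 50 and the overlap between adjacent intervals.
iv50    <- iv[iv$run == 50, ]
overlap <- iv$end[-nrow(iv)] - iv$start[-1]
```

With these settings, interval 50 starts at 245,000 and ends at 255,000, matching the range quoted in the text, and every pair of adjacent intervals overlaps by 5,000 SNVs, which is why duplicate IBD segments have to be identified afterwards.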
