Get Number of Reads at Every Base From Bam File
how to obtain base counts from BAM file at specific genomic coordinates 2 @herve-pages-1542 Concluding seen 5 hours ago Seattle, WA, Us Howdy Kim, On 10/xviii/2013 11:04 PM, Kim Siegmund wrote: > Dearest Herve, > > I piece of work with Tim Triche Jr. at USC and he told me you lot were the best person to contact about a particular programming question I have. I would like to estimate the SNP variant frequencies in a BAM file at a list of specific locations. Specifically, I take a file with a list of somatic SNP variant locations, e.g. > chr1 31184771 > chr1 40229367 > chr1 48699354 > ?. (~500 rows), > and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd like to know the base calls in the BAM file at the specific locations from the file of somatic variant positions. Is there an easy way to extract this information from the BAM file? If yeah, can I besides go the quality score for each base of operations? > > I am familiar with GenomicRanges & ReadGappedAlignment, but perhaps at only an intermediate level. I was told this is easy in samtools (is information technology the mpileup command?) only cannot find any examples showing how this might be done in Bioconductor. Although it seems the package VariantTools must have this capability, I did non see from the vignette how I might do this targeted information summarization. Maybe you could ask Michael about VariantTools. Here beneath is a pure Biostrings + GenomicRanges solution. Information technology uses the pileLettersAt() office, which isn't in any of those packages yet, but volition be included in a parcel to be added soon to Bioc-devel. Here is how to use pileLettersAt(): Input: - A BAM file: bamfile <- BamFile(arrangement.file("extdata", "ex1.bam", packet="Rsamtools")) seqinfo(bamfile) # to meet the seqlevels and seqlengths stackStringsFromBam(bamfile, param="seq1:1-21") # a quick look at # the reads - A GRanges object containing single nucleotide positions: snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)), IRanges(c(1:5, 21, 1575, 1513:1514), width=one)) Some preliminary massage on 'snpos': seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile)) seqlevels(snpos) <- seqlevelsInUse(snpos) Load the BAM file in a GAlignments object. We load simply the reads aligned to the sequences in 'seqlevels(snpos)' and we filter out reads not passing quality controls (flag bit 0x200) and PCR or optical duplicates (flag chip 0x400). Run across ?ScanBamParam and the SAM Spec for more information. which <- as(seqinfo(snpos), "GRanges") flag <- scanBamFlag(isNotPassingQualityControls=FALSE, isDuplicate=FALSE) what <- c("seq", "qual") param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which) gal <- readGAlignmentsFromBam(bamfile, param=param) seqlevels(gal) <- seqlevels(snpos) Excerpt the read sequences (a.k.a. query sequences) and quality strings. Both are reported with respect to the + strand. qseq <- mcols(gal)$seq qual <- mcols(gal)$qual nucl_piles <- pileLettersAt(qseq, seqnames(gal), beginning(gal), cigar(gal), snpos) qual_piles <- pileLettersAt(qual, seqnames(gal), beginning(gal), cigar(gal), snpos) mcols(snpos)$nucl_piles <- nucl_piles mcols(snpos)$qual_piles <- qual_piles Then: > snpos GRanges with 9 ranges and 2 metadata columns: seqnames ranges strand | nucl_piles <rle> <iranges> <rle> | <dnastringset> [i] seq1 [ i, 1] * | C [2] seq1 [ 2, 2] * | A [3] seq1 [ iii, three] * | CC [4] seq1 [ 4, four] * | TT [five] seq1 [ 5, 5] * | AAA [6] seq1 [ 21, 21] * | TTTTTTTTT [7] seq1 [1575, 1575] * | [8] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG [9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT qual_piles <bstringset> [ane] < [two] < [iii] << [4] << [5] <<< [half-dozen] <<<-<;<<= [7] [8] <<<<;vi:<=4<;16<<'+:. [9] <<<;;=<<=<<<<<<:;'84;;<<;+1 --- seqlengths: seq1 seq2 1575 1584 Finally, to summarize A/C/G/T frequencies at each position: > alphabetFrequency(nucl_piles, baseOnly=True) A C G T other [one,] 0 ane 0 0 0 [2,] i 0 0 0 0 [iii,] 0 2 0 0 0 [4,] 0 0 0 2 0 [5,] 3 0 0 0 0 [6,] 0 0 0 ix 0 [7,] 0 0 0 0 0 [eight,] 0 0 18 2 0 [nine,] 1 0 0 26 0 Encounter below for pileLettersAt's code. Cheers, H. ---------------------------------------------------------------------- - ### Conceptually equivalent to: ### sapply(x, function(twenty) paste(xx, collapse="")) collapse_XStringSetList <- function(10) { unlisted_x <- unlist(x, use.names=Imitation) unlisted_ans <- unlist(unlisted_x, employ.names=FALSE) ans_width <- sum(relist(width(unlisted_x), x)) extract_at <- equally(PartitioningByEnd(cumsum(ans_width)), "IRanges") extractAt(unlisted_ans, extract_at) } ### pileLettersOnSingleRefAt() is the workhorse behind pileLettersAt(). ### 'x', 'pos', 'cigar': 3 parallel vectors describing N strings aligned ### to the same reference sequence. 'x' must exist an XStringSet (typically ### DNAStringSet) object containing the unaligned strings (a.thousand.a. the ### query sequences) reported with respect to the + strand. 'pos' must ### be an integer vector where 'pos[i]' is the 1-based position on the ### reference sequence of the commencement aligned letter in 'x[[i]]'. 'cigar' ### must be a character vector containing the extended CIGAR strings. ### 'at': must be an integer vector containing single nucleotide ### positions on the reference sequence. ### Returns an XStringSet (typically DNAStringSet) object parallel to ### 'at' (i.e. with 1 string per unmarried nucleotide position). pileLettersOnSingleRefAt <- part(x, pos, cigar, at) { stopifnot(is(x, "XStringSet")) N <- length(ten) # nb of alignments stopifnot(is.integer(pos) && length(pos) == Due north) stopifnot(is.graphic symbol(cigar) && length(cigar) == N) stopifnot(is.integer(at)) ops <- c("Chiliad", "=", "X") ranges_on_ref <- cigarRangesAlongReferenceSpace(cigar, pos=pos, ops=ops) ranges_on_query <- cigarRangesAlongQuerySpace(cigar, ops=ops) ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects parallel ## to 'x', 'pos', and 'cigar'. In addition, the ii IRangesList objects ## have the same "shape" (i.eastward. same elementLengths()), so, after ## unlisting, the 2 unlisted objects are parallel IRanges objects. unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=False) unlisted_ranges_on_query <- unlist(ranges_on_query, utilize.names=FALSE) ## 2 integer vectors parallel to IRanges objects 'unlisted_ranges_on_ref' ## and 'unlisted_ranges_on_query' above. range_group <- togroup(ranges_on_ref) query2ref_shift <- kickoff(unlisted_ranges_on_ref) - start(unlisted_ranges_on_query) hits <- findOverlaps(at, unlisted_ranges_on_ref) hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(hits)] hits_group <- range_group[subjectHits(hits)] unlisted_piles <- subseq(x[hits_group], offset=hits_at_in_x, width=1L) piles_skeleton <- PartitioningByEnd(queryHits(hits), NG=length(at), names=names(at)) piles <- relist(unlisted_piles, piles_skeleton) collapse_XStringSetList(piles) } ### '10', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing N ### aligned strings. 'x', 'pos', and 'cigar' every bit in a higher place. 'seqnames' must ### be a gene-Rle where 'seqnames[i]' is the name of the reference ### sequence of the i-th alignment. ### 'at': must exist a GRanges object containing single nucleotide ### positions. 'seqlevels(at)' must be identical to 'levels(seqnames)'. ### Returns an XStringSet (typically DNAStringSet) object parallel to ### 'at' (i.e. with 1 string per single nucleotide position). pileLettersAt <- function(x, seqnames, pos, cigar, at) { stopifnot(is(ten, "XStringSet")) N <- length(x) # nb of alignments stopifnot(is(seqnames, "Rle")) stopifnot(is.gene(runValue(seqnames))) stopifnot(length(seqnames) == N) stopifnot(is.integer(pos) && length(pos) == Northward) stopifnot(is.character(cigar) && length(cigar) == N) stopifnot(is(at, "GRanges")) stopifnot(all(width(at) == 1L)) stopifnot(identical(seqlevels(at), levels(seqnames))) ## We process 1 chromosome at a time. Then we start by splitting ## 'x', 'pos', 'cigar', and 'start(at)' by chromosome. The 4 ## resulting listing-like objects have 1 list chemical element per chromosome ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical ## to 'seqlevels(at)'). x_by_chrom <- split(x, seqnames) # XStringSetList pos_by_chrom <- carve up(pos, seqnames) # IntegerList cigar_by_chrom <- divide(cigar, seqnames) # CharacterList at_by_chrom <- split(start(at), seqnames(at)) # IntegerList ## Unsplit index. split_idx <- unlist(split(seq_along(at), seqnames(at)), apply.names=False) unsplit_idx <- integer(length(at)) unsplit_idx[split_idx] <- seq_along(at) do.call("c", lapply(seq_along(seqlevels(at)), function(i) pileLettersOnSingleRefAt( x_by_chrom[[i]], pos_by_chrom[[i]], cigar_by_chrom[[i]], at_by_chrom[[i]])))[unsplit_idx] } > > Thank you in advance for any advice. > > Kim Siegmund > -- Hervé Pagès Program in Computational Biology Segmentation of Public Health Sciences Fred Hutchinson Cancer Enquiry Centre 1100 Fairview Ave. Due north, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 @martin-morgan-1513 Last seen three days ago United States On 10/22/2013 02:58 AM, Hervé Pagès wrote: > Hi Kim, > > On 10/18/2013 11:04 PM, Kim Siegmund wrote: >> Dear Herve, >> >> I work with Tim Triche Jr. at USC and he told me you were the best person to >> contact well-nigh a particular programming question I have. I would like to >> estimate the SNP variant frequencies in a BAM file at a list of specific >> locations. Specifically, I have a file with a list of somatic SNP variant >> locations, e.g. >> chr1 31184771 >> chr1 40229367 >> chr1 48699354 >> ?. (~500 rows), If yous're interested in nucleotide counts, then ?applyPileups will provide this. Martin >> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd similar to >> know the base calls in the BAM file at the specific locations from the file of >> somatic variant positions. Is in that location an easy style to extract this data >> from the BAM file? If yes, can I also become the quality score for each base? >> >> I am familiar with GenomicRanges & ReadGappedAlignment, merely perhaps at only an >> intermediate level. I was told this is easy in samtools (is it the mpileup >> command?) but cannot observe any examples showing how this might be done in >> Bioconductor. Although information technology seems the packet VariantTools must accept this >> adequacy, I did non encounter from the vignette how I might do this targeted data >> summarization. > > Maybe you could enquire Michael about VariantTools. > > Hither beneath is a pure Biostrings + GenomicRanges solution. It uses > the pileLettersAt() part, which isn't in any of those packages > notwithstanding, just will be included in a package to be added presently to Bioc- devel. > > Here is how to use pileLettersAt(): > > Input: > > - A BAM file: > > bamfile <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools")) > > seqinfo(bamfile) # to run into the seqlevels and seqlengths > > stackStringsFromBam(bamfile, param="seq1:1-21") # a quick look at > # the reads > > - A GRanges object containing single nucleotide positions: > > snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)), > IRanges(c(ane:5, 21, 1575, 1513:1514), width=one)) > > Some preliminary massage on 'snpos': > > seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile)) > seqlevels(snpos) <- seqlevelsInUse(snpos) > > Load the BAM file in a GAlignments object. Nosotros load only the reads > aligned to the sequences in 'seqlevels(snpos)' and we filter out > reads not passing quality controls (flag bit 0x200) and PCR or > optical duplicates (flag flake 0x400). See ?ScanBamParam and the SAM > Spec for more than information. > > which <- equally(seqinfo(snpos), "GRanges") > flag <- scanBamFlag(isNotPassingQualityControls=Simulated, > isDuplicate=FALSE) > what <- c("seq", "qual") > param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which) > gal <- readGAlignmentsFromBam(bamfile, param=param) > seqlevels(gal) <- seqlevels(snpos) > > Extract the read sequences (a.m.a. query sequences) and quality > strings. Both are reported with respect to the + strand. > > qseq <- mcols(gal)$seq > qual <- mcols(gal)$qual > > nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), snpos) > qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal), snpos) > mcols(snpos)$nucl_piles <- nucl_piles > mcols(snpos)$qual_piles <- qual_piles > > Then: > > > snpos > GRanges with 9 ranges and ii metadata columns: > seqnames ranges strand | nucl_piles > <rle> <iranges> <rle> | <dnastringset> > [1] seq1 [ 1, i] * | C > [two] seq1 [ 2, ii] * | A > [3] seq1 [ 3, 3] * | CC > [4] seq1 [ 4, 4] * | TT > [5] seq1 [ five, 5] * | AAA > [6] seq1 [ 21, 21] * | TTTTTTTTT > [seven] seq1 [1575, 1575] * | > [8] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG > [9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT > qual_piles > <bstringset> > [1] < > [2] < > [3] << > [iv] << > [v] <<< > [6] <<<-<;<<= > [vii] > [viii] <<<<;6:<=iv<;16<<'+:. > [9] <<<;;=<<=<<<<<<:;'84;;<<;+i > --- > seqlengths: > seq1 seq2 > 1575 1584 > > Finally, to summarize A/C/Thou/T frequencies at each position: > > > alphabetFrequency(nucl_piles, baseOnly=Truthful) > A C G T other > [1,] 0 1 0 0 0 > [ii,] 1 0 0 0 0 > [iii,] 0 2 0 0 0 > [4,] 0 0 0 two 0 > [v,] 3 0 0 0 0 > [6,] 0 0 0 9 0 > [seven,] 0 0 0 0 0 > [viii,] 0 0 18 2 0 > [9,] 1 0 0 26 0 > > Encounter below for pileLettersAt's code. > > Cheers, > H. > > > -------------------------------------------------------------------- --- > > ### Conceptually equivalent to: > ### sapply(10, role(xx) paste(xx, collapse="")) > > collapse_XStringSetList <- function(10) > { > unlisted_x <- unlist(x, apply.names=FALSE) > unlisted_ans <- unlist(unlisted_x, use.names=FALSE) > ans_width <- sum(relist(width(unlisted_x), x)) > extract_at <- as(PartitioningByEnd(cumsum(ans_width)), "IRanges") > extractAt(unlisted_ans, extract_at) > } > > ### pileLettersOnSingleRefAt() is the workhorse behind pileLettersAt(). > ### 'ten', 'pos', 'cigar': 3 parallel vectors describing N strings aligned > ### to the same reference sequence. '10' must be an XStringSet (typically > ### DNAStringSet) object containing the unaligned strings (a.k.a. the > ### query sequences) reported with respect to the + strand. 'pos' must > ### be an integer vector where 'pos[i]' is the 1-based position on the > ### reference sequence of the first aligned alphabetic character in 'ten[[i]]'. 'cigar' > ### must be a character vector containing the extended CIGAR strings. > ### 'at': must be an integer vector containing single nucleotide > ### positions on the reference sequence. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.eastward. with 1 string per single nucleotide position). > > pileLettersOnSingleRefAt <- function(x, pos, cigar, at) > { > stopifnot(is(ten, "XStringSet")) > Due north <- length(ten) # nb of alignments > stopifnot(is.integer(pos) && length(pos) == Northward) > stopifnot(is.character(cigar) && length(cigar) == N) > stopifnot(is.integer(at)) > > ops <- c("Thou", "=", "X") > ranges_on_ref <- cigarRangesAlongReferenceSpace(cigar, pos=pos, ops=ops) > ranges_on_query <- cigarRangesAlongQuerySpace(cigar, ops=ops) > > ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects parallel > ## to 'x', 'pos', and 'cigar'. In addition, the 2 IRangesList objects > ## have the same "shape" (i.eastward. aforementioned elementLengths()), so, after > ## unlisting, the 2 unlisted objects are parallel IRanges objects. > unlisted_ranges_on_ref <- unlist(ranges_on_ref, apply.names=Faux) > unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=Faux) > > ## 2 integer vectors parallel to IRanges objects 'unlisted_ranges_on_ref' > ## and 'unlisted_ranges_on_query' higher up. > range_group <- togroup(ranges_on_ref) > query2ref_shift <- start(unlisted_ranges_on_ref) - > start(unlisted_ranges_on_query) > > hits <- findOverlaps(at, unlisted_ranges_on_ref) > hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(hits)] > hits_group <- range_group[subjectHits(hits)] > unlisted_piles <- subseq(x[hits_group], commencement=hits_at_in_x, width=1L) > piles_skeleton <- PartitioningByEnd(queryHits(hits), NG=length(at), > names=names(at)) > piles <- relist(unlisted_piles, piles_skeleton) > collapse_XStringSetList(piles) > } > > ### 'x', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing Northward > ### aligned strings. 'ten', 'pos', and 'cigar' every bit above. 'seqnames' must > ### be a gene-Rle where 'seqnames[i]' is the name of the reference > ### sequence of the i-th alignment. > ### 'at': must be a GRanges object containing single nucleotide > ### positions. 'seqlevels(at)' must exist identical to 'levels(seqnames)'. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.e. with ane string per single nucleotide position). > > pileLettersAt <- role(x, seqnames, pos, cigar, at) > { > stopifnot(is(ten, "XStringSet")) > N <- length(x) # nb of alignments > stopifnot(is(seqnames, "Rle")) > stopifnot(is.factor(runValue(seqnames))) > stopifnot(length(seqnames) == Due north) > stopifnot(is.integer(pos) && length(pos) == Due north) > stopifnot(is.grapheme(cigar) && length(cigar) == N) > stopifnot(is(at, "GRanges")) > stopifnot(all(width(at) == 1L)) > stopifnot(identical(seqlevels(at), levels(seqnames))) > > ## We process 1 chromosome at a time. And then we first past splitting > ## 'x', 'pos', 'cigar', and 'first(at)' by chromosome. The iv > ## resulting list-like objects accept 1 list chemical element per chromosome > ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical > ## to 'seqlevels(at)'). > x_by_chrom <- divide(x, seqnames) # XStringSetList > pos_by_chrom <- split(pos, seqnames) # IntegerList > cigar_by_chrom <- split(cigar, seqnames) # CharacterList > at_by_chrom <- split(start(at), seqnames(at)) # IntegerList > > ## Unsplit index. > split_idx <- unlist(split(seq_along(at), seqnames(at)), > employ.names=FALSE) > unsplit_idx <- integer(length(at)) > unsplit_idx[split_idx] <- seq_along(at) > > do.telephone call("c", lapply(seq_along(seqlevels(at)), > function(i) > pileLettersOnSingleRefAt( > x_by_chrom[[i]], > pos_by_chrom[[i]], > cigar_by_chrom[[i]], > at_by_chrom[[i]])))[unsplit_idx] > } > > >> >> Thank you in advance for whatsoever communication. >> >> Kim Siegmund >> > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 @michael-lawrence-3846 Terminal seen 12 weeks ago United States On Tue, October 22, 2013 at 2:58 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Kim, > > On x/18/2013 11:04 PM, Kim Siegmund wrote: > >> Dear Herve, >> >> I piece of work with Tim Triche Jr. at USC and he told me yous were the all-time person >> to contact about a particular programming question I take. I would like to >> estimate the SNP variant frequencies in a BAM file at a list of specific >> locations. Specifically, I accept a file with a list of somatic SNP variant >> locations, e.1000. >> chr1 31184771 >> chr1 40229367 >> chr1 48699354 >> . (~500 rows), >> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd like >> to know the base of operations calls in the BAM file at the specific locations from the >> file of somatic variant positions. Is there an easy way to extract this >> information from the BAM file? If yes, can I also get the quality score >> for each base of operations? >> >> I am familiar with GenomicRanges & ReadGappedAlignment, only perchance at >> only an intermediate level. I was told this is piece of cake in samtools (is it the >> mpileup command?) just cannot find any examples showing how this might be >> done in Bioconductor. Although it seems the package VariantTools must have >> this capability, I did not meet from the vignette how I might do this >> targeted data summarization. >> > > Possibly yous could ask Michael nigh VariantTools. > > For using VariantTools, you'll need a genome indexed for GMAP/GSNAP. It tin can be created from e.k. a BSgenome object: genome <- GmapGenome(BSgenome.Hsapiens.UCSC.hg19, create=TRUE) That will probably require more 8 GB of RAM, so be careful, but it will only generate the index once (information technology is cached on deejay). Then: param <- TallyVariantsParam(genome, which = your_snp_locations) vr <- tallyVariants(bam, param) The returned value is a VRanges, which is an extension of GRanges for working with variants. Here beneath is a pure Biostrings + GenomicRanges solution. It uses > the pileLettersAt() function, which isn't in whatsoever of those packages > yet, but will be included in a package to be added soon to Bioc- devel. > > Here is how to utilise pileLettersAt(): > > Input: > > - A BAM file: > > bamfile <- BamFile(system.file("extdata", "ex1.bam", > package="Rsamtools")) > > seqinfo(bamfile) # to encounter the seqlevels and seqlengths > > stackStringsFromBam(bamfile, param="seq1:i-21") # a quick wait at > # the reads > > - A GRanges object containing unmarried nucleotide positions: > > snpos <- GRanges(Rle(c("seq1", "seq2"), c(seven, 2)), > IRanges(c(1:5, 21, 1575, 1513:1514), width=1)) > > Some preliminary massage on 'snpos': > > seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile)) > seqlevels(snpos) <- seqlevelsInUse(snpos) > > Load the BAM file in a GAlignments object. We load but the reads > aligned to the sequences in 'seqlevels(snpos)' and we filter out > reads non passing quality controls (flag bit 0x200) and PCR or > optical duplicates (flag bit 0x400). Run into ?ScanBamParam and the SAM > Spec for more information. > > which <- as(seqinfo(snpos), "GRanges") > flag <- scanBamFlag(**isNotPassingQualityControls=**FALSE, > isDuplicate=Simulated) > what <- c("seq", "qual") > param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which) > gal <- readGAlignmentsFromBam(**bamfile, param=param) > seqlevels(gal) <- seqlevels(snpos) > > Extract the read sequences (a.m.a. query sequences) and quality > strings. Both are reported with respect to the + strand. > > qseq <- mcols(gal)$seq > qual <- mcols(gal)$qual > > nucl_piles <- pileLettersAt(qseq, seqnames(gal), outset(gal), cigar(gal), > snpos) > qual_piles <- pileLettersAt(qual, seqnames(gal), first(gal), cigar(gal), > snpos) > mcols(snpos)$nucl_piles <- nucl_piles > mcols(snpos)$qual_piles <- qual_piles > > And then: > > > snpos > GRanges with ix ranges and 2 metadata columns: > seqnames ranges strand | nucl_piles > <rle> <iranges> <rle> | <dnastringset> > [1] seq1 [ 1, 1] * | C > [ii] seq1 [ 2, 2] * | A > [iii] seq1 [ 3, three] * | CC > [four] seq1 [ 4, 4] * | TT > [5] seq1 [ 5, 5] * | AAA > [6] seq1 [ 21, 21] * | TTTTTTTTT > [seven] seq1 [1575, 1575] * | > [viii] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG > [9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT > qual_piles > <bstringset> > [i] < > [two] < > [three] << > [four] << > [five] <<< > [6] <<<-<;<<= > [7] > [8] <<<<;half dozen:<=iv<;16<<'+:. > [9] <<<;;=<<=<<<<<<:;'84;;<<;+one > --- > seqlengths: > seq1 seq2 > 1575 1584 > > Finally, to summarize A/C/G/T frequencies at each position: > > > alphabetFrequency(nucl_piles, baseOnly=TRUE) > A C G T other > [1,] 0 1 0 0 0 > [two,] ane 0 0 0 0 > [iii,] 0 2 0 0 0 > [four,] 0 0 0 2 0 > [five,] 3 0 0 0 0 > [6,] 0 0 0 9 0 > [7,] 0 0 0 0 0 > [8,] 0 0 18 2 0 > [9,] 1 0 0 26 0 > > Run into below for pileLettersAt's code. > > Cheers, > H. > > > ------------------------------**------------------------------** > ----------- > > ### Conceptually equivalent to: > ### sapply(x, part(xx) paste(xx, collapse="")) > > collapse_XStringSetList <- part(10) > { > unlisted_x <- unlist(x, utilise.names=FALSE) > unlisted_ans <- unlist(unlisted_x, utilize.names=Faux) > ans_width <- sum(relist(width(unlisted_x), x)) > extract_at <- as(PartitioningByEnd(cumsum(**ans_width)), "IRanges") > extractAt(unlisted_ans, extract_at) > } > > ### pileLettersOnSingleRefAt() is the workhorse backside pileLettersAt(). > ### 'x', 'pos', 'cigar': 3 parallel vectors describing Northward strings aligned > ### to the aforementioned reference sequence. '10' must be an XStringSet (typically > ### DNAStringSet) object containing the unaligned strings (a.chiliad.a. the > ### query sequences) reported with respect to the + strand. 'pos' must > ### be an integer vector where 'pos[i]' is the 1-based position on the > ### reference sequence of the first aligned alphabetic character in 'x[[i]]'. 'cigar' > ### must exist a graphic symbol vector containing the extended CIGAR strings. > ### 'at': must exist an integer vector containing single nucleotide > ### positions on the reference sequence. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.eastward. with 1 cord per single nucleotide position). > > pileLettersOnSingleRefAt <- office(x, pos, cigar, at) > { > stopifnot(is(x, "XStringSet")) > N <- length(10) # nb of alignments > stopifnot(is.integer(pos) && length(pos) == N) > stopifnot(is.character(cigar) && length(cigar) == Northward) > stopifnot(is.integer(at)) > > ops <- c("M", "=", "Ten") > ranges_on_ref <- cigarRangesAlongReferenceSpace**(cigar, pos=pos, > ops=ops) > ranges_on_query <- cigarRangesAlongQuerySpace(**cigar, ops=ops) > > ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects > parallel > ## to '10', 'pos', and 'cigar'. In addition, the ii IRangesList objects > ## accept the aforementioned "shape" (i.e. aforementioned elementLengths()), so, later > ## unlisting, the ii unlisted objects are parallel IRanges objects. > unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=FALSE) > unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=Imitation) > > ## 2 integer vectors parallel to IRanges objects > 'unlisted_ranges_on_ref' > ## and 'unlisted_ranges_on_query' above. > range_group <- togroup(ranges_on_ref) > query2ref_shift <- kickoff(unlisted_ranges_on_ref) - > first(unlisted_ranges_on_**query) > > hits <- findOverlaps(at, unlisted_ranges_on_ref) > hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(** > hits)] > hits_group <- range_group[subjectHits(hits)] > unlisted_piles <- subseq(x[hits_group], showtime=hits_at_in_x, width=1L) > piles_skeleton <- PartitioningByEnd(queryHits(**hits), NG=length(at), > names=names(at)) > piles <- relist(unlisted_piles, piles_skeleton) > collapse_XStringSetList(piles) > } > > ### 'x', 'seqnames', 'pos', 'cigar': iv parallel vectors describing N > ### aligned strings. 'ten', 'pos', and 'cigar' every bit to a higher place. 'seqnames' must > ### be a factor-Rle where 'seqnames[i]' is the proper name of the reference > ### sequence of the i-th alignment. > ### 'at': must be a GRanges object containing single nucleotide > ### positions. 'seqlevels(at)' must be identical to 'levels(seqnames)'. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.e. with i string per single nucleotide position). > > pileLettersAt <- function(x, seqnames, pos, cigar, at) > { > stopifnot(is(x, "XStringSet")) > N <- length(x) # nb of alignments > stopifnot(is(seqnames, "Rle")) > stopifnot(is.factor(runValue(**seqnames))) > stopifnot(length(seqnames) == N) > stopifnot(is.integer(pos) && length(pos) == N) > stopifnot(is.character(cigar) && length(cigar) == N) > stopifnot(is(at, "GRanges")) > stopifnot(all(width(at) == 1L)) > stopifnot(identical(seqlevels(**at), levels(seqnames))) > > ## We process 1 chromosome at a time. So we start by splitting > ## '10', 'pos', 'cigar', and 'start(at)' by chromosome. The four > ## resulting list-like objects have i list element per chromosome > ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical > ## to 'seqlevels(at)'). > x_by_chrom <- split(ten, seqnames) # XStringSetList > pos_by_chrom <- split(pos, seqnames) # IntegerList > cigar_by_chrom <- dissever(cigar, seqnames) # CharacterList > at_by_chrom <- separate(beginning(at), seqnames(at)) # IntegerList > > ## Unsplit alphabetize. > split_idx <- unlist(split(seq_along(at), seqnames(at)), > utilize.names=FALSE) > unsplit_idx <- integer(length(at)) > unsplit_idx[split_idx] <- seq_along(at) > > do.telephone call("c", lapply(seq_along(seqlevels(at)**), > function(i) > pileLettersOnSingleRefAt( > x_by_chrom[[i]], > pos_by_chrom[[i]], > cigar_by_chrom[[i]], > at_by_chrom[[i]])))[unsplit_**idx] > } > > > >> Give thanks y'all in advance for whatsoever advice. >> >> Kim Siegmund >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Enquiry Center > 1100 Fairview Ave. Due north, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > Due east-post: hpages@fhcrc.org > Telephone: (206) 667-5791 > Fax: (206) 667-1319 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-projection.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the athenaeum: http://news.gmane.org/gmane.** > science.biology.computer science.**conductor<http: news.gmane.org="" gmane.="" scientific discipline.biology.computer science.conductor=""> > [[alternative HTML version deleted]]
galleghanpailtaild.blogspot.com
Source: https://support.bioconductor.org/p/55564/
0 Response to "Get Number of Reads at Every Base From Bam File"
Post a Comment