Get Number of Reads at Every Base From Bam File

Question

Answer 1

@martin-morgan-1513

Last seen three days ago

United States

On 10/22/2013 02:58 AM, Hervé Pagès wrote: > Hi Kim, > > On 10/18/2013 11:04 PM, Kim Siegmund wrote: >> Dear Herve, >> >> I work with Tim Triche Jr. at USC and he told me you were the best person to >> contact well-nigh a particular programming question I have. I would like to >> estimate the SNP variant frequencies in a BAM file at a list of specific >> locations. Specifically, I have a file with a list of somatic SNP variant >> locations, e.g. >> chr1 31184771 >> chr1 40229367 >> chr1 48699354 >> ?. (~500 rows), If yous're interested in nucleotide counts, then ?applyPileups will provide this. Martin >> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd similar to >> know the base calls in the BAM file at the specific locations from the file of >> somatic variant positions. Is in that location an easy style to extract this data >> from the BAM file? If yes, can I also become the quality score for each base? >> >> I am familiar with GenomicRanges & ReadGappedAlignment, merely perhaps at only an >> intermediate level. I was told this is easy in samtools (is it the mpileup >> command?) but cannot observe any examples showing how this might be done in >> Bioconductor. Although information technology seems the packet VariantTools must accept this >> adequacy, I did non encounter from the vignette how I might do this targeted data >> summarization. > > Maybe you could enquire Michael about VariantTools. > > Hither beneath is a pure Biostrings + GenomicRanges solution. It uses > the pileLettersAt() part, which isn't in any of those packages > notwithstanding, just will be included in a package to be added presently to Bioc- devel. > > Here is how to use pileLettersAt(): > > Input: > > - A BAM file: > > bamfile <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools")) > > seqinfo(bamfile) # to run into the seqlevels and seqlengths > > stackStringsFromBam(bamfile, param="seq1:1-21") # a quick look at > # the reads > > - A GRanges object containing single nucleotide positions: > > snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)), > IRanges(c(ane:5, 21, 1575, 1513:1514), width=one)) > > Some preliminary massage on 'snpos': > > seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile)) > seqlevels(snpos) <- seqlevelsInUse(snpos) > > Load the BAM file in a GAlignments object. Nosotros load only the reads > aligned to the sequences in 'seqlevels(snpos)' and we filter out > reads not passing quality controls (flag bit 0x200) and PCR or > optical duplicates (flag flake 0x400). See ?ScanBamParam and the SAM > Spec for more than information. > > which <- equally(seqinfo(snpos), "GRanges") > flag <- scanBamFlag(isNotPassingQualityControls=Simulated, > isDuplicate=FALSE) > what <- c("seq", "qual") > param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which) > gal <- readGAlignmentsFromBam(bamfile, param=param) > seqlevels(gal) <- seqlevels(snpos) > > Extract the read sequences (a.m.a. query sequences) and quality > strings. Both are reported with respect to the + strand. > > qseq <- mcols(gal)$seq > qual <- mcols(gal)$qual > > nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), snpos) > qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal), snpos) > mcols(snpos)$nucl_piles <- nucl_piles > mcols(snpos)$qual_piles <- qual_piles > > Then: > > > snpos > GRanges with 9 ranges and ii metadata columns: > seqnames ranges strand | nucl_piles > <rle> <iranges> <rle> | <dnastringset> > [1] seq1 [ 1, i] * | C > [two] seq1 [ 2, ii] * | A > [3] seq1 [ 3, 3] * | CC > [4] seq1 [ 4, 4] * | TT > [5] seq1 [ five, 5] * | AAA > [6] seq1 [ 21, 21] * | TTTTTTTTT > [seven] seq1 [1575, 1575] * | > [8] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG > [9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT > qual_piles > <bstringset> > [1] < > [2] < > [3] << > [iv] << > [v] <<< > [6] <<<-<;<<= > [vii] > [viii] <<<<;6:<=iv<;16<<'+:. > [9] <<<;;=<<=<<<<<<:;'84;;<<;+i > --- > seqlengths: > seq1 seq2 > 1575 1584 > > Finally, to summarize A/C/Thou/T frequencies at each position: > > > alphabetFrequency(nucl_piles, baseOnly=Truthful) > A C G T other > [1,] 0 1 0 0 0 > [ii,] 1 0 0 0 0 > [iii,] 0 2 0 0 0 > [4,] 0 0 0 two 0 > [v,] 3 0 0 0 0 > [6,] 0 0 0 9 0 > [seven,] 0 0 0 0 0 > [viii,] 0 0 18 2 0 > [9,] 1 0 0 26 0 > > Encounter below for pileLettersAt's code. > > Cheers, > H. > > > -------------------------------------------------------------------- --- > > ### Conceptually equivalent to: > ### sapply(10, role(xx) paste(xx, collapse="")) > > collapse_XStringSetList <- function(10) > { > unlisted_x <- unlist(x, apply.names=FALSE) > unlisted_ans <- unlist(unlisted_x, use.names=FALSE) > ans_width <- sum(relist(width(unlisted_x), x)) > extract_at <- as(PartitioningByEnd(cumsum(ans_width)), "IRanges") > extractAt(unlisted_ans, extract_at) > } > > ### pileLettersOnSingleRefAt() is the workhorse behind pileLettersAt(). > ### 'ten', 'pos', 'cigar': 3 parallel vectors describing N strings aligned > ### to the same reference sequence. '10' must be an XStringSet (typically > ### DNAStringSet) object containing the unaligned strings (a.k.a. the > ### query sequences) reported with respect to the + strand. 'pos' must > ### be an integer vector where 'pos[i]' is the 1-based position on the > ### reference sequence of the first aligned alphabetic character in 'ten[[i]]'. 'cigar' > ### must be a character vector containing the extended CIGAR strings. > ### 'at': must be an integer vector containing single nucleotide > ### positions on the reference sequence. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.eastward. with 1 string per single nucleotide position). > > pileLettersOnSingleRefAt <- function(x, pos, cigar, at) > { > stopifnot(is(ten, "XStringSet")) > Due north <- length(ten) # nb of alignments > stopifnot(is.integer(pos) && length(pos) == Northward) > stopifnot(is.character(cigar) && length(cigar) == N) > stopifnot(is.integer(at)) > > ops <- c("Thou", "=", "X") > ranges_on_ref <- cigarRangesAlongReferenceSpace(cigar, pos=pos, ops=ops) > ranges_on_query <- cigarRangesAlongQuerySpace(cigar, ops=ops) > > ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects parallel > ## to 'x', 'pos', and 'cigar'. In addition, the 2 IRangesList objects > ## have the same "shape" (i.eastward. aforementioned elementLengths()), so, after > ## unlisting, the 2 unlisted objects are parallel IRanges objects. > unlisted_ranges_on_ref <- unlist(ranges_on_ref, apply.names=Faux) > unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=Faux) > > ## 2 integer vectors parallel to IRanges objects 'unlisted_ranges_on_ref' > ## and 'unlisted_ranges_on_query' higher up. > range_group <- togroup(ranges_on_ref) > query2ref_shift <- start(unlisted_ranges_on_ref) - > start(unlisted_ranges_on_query) > > hits <- findOverlaps(at, unlisted_ranges_on_ref) > hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(hits)] > hits_group <- range_group[subjectHits(hits)] > unlisted_piles <- subseq(x[hits_group], commencement=hits_at_in_x, width=1L) > piles_skeleton <- PartitioningByEnd(queryHits(hits), NG=length(at), > names=names(at)) > piles <- relist(unlisted_piles, piles_skeleton) > collapse_XStringSetList(piles) > } > > ### 'x', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing Northward > ### aligned strings. 'ten', 'pos', and 'cigar' every bit above. 'seqnames' must > ### be a gene-Rle where 'seqnames[i]' is the name of the reference > ### sequence of the i-th alignment. > ### 'at': must be a GRanges object containing single nucleotide > ### positions. 'seqlevels(at)' must exist identical to 'levels(seqnames)'. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.e. with ane string per single nucleotide position). > > pileLettersAt <- role(x, seqnames, pos, cigar, at) > { > stopifnot(is(ten, "XStringSet")) > N <- length(x) # nb of alignments > stopifnot(is(seqnames, "Rle")) > stopifnot(is.factor(runValue(seqnames))) > stopifnot(length(seqnames) == Due north) > stopifnot(is.integer(pos) && length(pos) == Due north) > stopifnot(is.grapheme(cigar) && length(cigar) == N) > stopifnot(is(at, "GRanges")) > stopifnot(all(width(at) == 1L)) > stopifnot(identical(seqlevels(at), levels(seqnames))) > > ## We process 1 chromosome at a time. And then we first past splitting > ## 'x', 'pos', 'cigar', and 'first(at)' by chromosome. The iv > ## resulting list-like objects accept 1 list chemical element per chromosome > ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical > ## to 'seqlevels(at)'). > x_by_chrom <- divide(x, seqnames) # XStringSetList > pos_by_chrom <- split(pos, seqnames) # IntegerList > cigar_by_chrom <- split(cigar, seqnames) # CharacterList > at_by_chrom <- split(start(at), seqnames(at)) # IntegerList > > ## Unsplit index. > split_idx <- unlist(split(seq_along(at), seqnames(at)), > employ.names=FALSE) > unsplit_idx <- integer(length(at)) > unsplit_idx[split_idx] <- seq_along(at) > > do.telephone call("c", lapply(seq_along(seqlevels(at)), > function(i) > pileLettersOnSingleRefAt( > x_by_chrom[[i]], > pos_by_chrom[[i]], > cigar_by_chrom[[i]], > at_by_chrom[[i]])))[unsplit_idx] > } > > >> >> Thank you in advance for whatsoever communication. >> >> Kim Siegmund >> > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

Answer 2

@michael-lawrence-3846

Terminal seen 12 weeks ago

United States

On Tue, October 22, 2013 at 2:58 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Kim, > > On x/18/2013 11:04 PM, Kim Siegmund wrote: > >> Dear Herve, >> >> I piece of work with Tim Triche Jr. at USC and he told me yous were the all-time person >> to contact about a particular programming question I take. I would like to >> estimate the SNP variant frequencies in a BAM file at a list of specific >> locations. Specifically, I accept a file with a list of somatic SNP variant >> locations, e.1000. >> chr1 31184771 >> chr1 40229367 >> chr1 48699354 >> . (~500 rows), >> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd like >> to know the base of operations calls in the BAM file at the specific locations from the >> file of somatic variant positions. Is there an easy way to extract this >> information from the BAM file? If yes, can I also get the quality score >> for each base of operations? >> >> I am familiar with GenomicRanges & ReadGappedAlignment, only perchance at >> only an intermediate level. I was told this is piece of cake in samtools (is it the >> mpileup command?) just cannot find any examples showing how this might be >> done in Bioconductor. Although it seems the package VariantTools must have >> this capability, I did not meet from the vignette how I might do this >> targeted data summarization. >> > > Possibly yous could ask Michael nigh VariantTools. > > For using VariantTools, you'll need a genome indexed for GMAP/GSNAP. It tin can be created from e.k. a BSgenome object: genome <- GmapGenome(BSgenome.Hsapiens.UCSC.hg19, create=TRUE) That will probably require more 8 GB of RAM, so be careful, but it will only generate the index once (information technology is cached on deejay). Then: param <- TallyVariantsParam(genome, which = your_snp_locations) vr <- tallyVariants(bam, param) The returned value is a VRanges, which is an extension of GRanges for working with variants. Here beneath is a pure Biostrings + GenomicRanges solution. It uses > the pileLettersAt() function, which isn't in whatsoever of those packages > yet, but will be included in a package to be added soon to Bioc- devel. > > Here is how to utilise pileLettersAt(): > > Input: > > - A BAM file: > > bamfile <- BamFile(system.file("extdata", "ex1.bam", > package="Rsamtools")) > > seqinfo(bamfile) # to encounter the seqlevels and seqlengths > > stackStringsFromBam(bamfile, param="seq1:i-21") # a quick wait at > # the reads > > - A GRanges object containing unmarried nucleotide positions: > > snpos <- GRanges(Rle(c("seq1", "seq2"), c(seven, 2)), > IRanges(c(1:5, 21, 1575, 1513:1514), width=1)) > > Some preliminary massage on 'snpos': > > seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile)) > seqlevels(snpos) <- seqlevelsInUse(snpos) > > Load the BAM file in a GAlignments object. We load but the reads > aligned to the sequences in 'seqlevels(snpos)' and we filter out > reads non passing quality controls (flag bit 0x200) and PCR or > optical duplicates (flag bit 0x400). Run into ?ScanBamParam and the SAM > Spec for more information. > > which <- as(seqinfo(snpos), "GRanges") > flag <- scanBamFlag(**isNotPassingQualityControls=**FALSE, > isDuplicate=Simulated) > what <- c("seq", "qual") > param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which) > gal <- readGAlignmentsFromBam(**bamfile, param=param) > seqlevels(gal) <- seqlevels(snpos) > > Extract the read sequences (a.m.a. query sequences) and quality > strings. Both are reported with respect to the + strand. > > qseq <- mcols(gal)$seq > qual <- mcols(gal)$qual > > nucl_piles <- pileLettersAt(qseq, seqnames(gal), outset(gal), cigar(gal), > snpos) > qual_piles <- pileLettersAt(qual, seqnames(gal), first(gal), cigar(gal), > snpos) > mcols(snpos)$nucl_piles <- nucl_piles > mcols(snpos)$qual_piles <- qual_piles > > And then: > > > snpos > GRanges with ix ranges and 2 metadata columns: > seqnames ranges strand | nucl_piles > <rle> <iranges> <rle> | <dnastringset> > [1] seq1 [ 1, 1] * | C > [ii] seq1 [ 2, 2] * | A > [iii] seq1 [ 3, three] * | CC > [four] seq1 [ 4, 4] * | TT > [5] seq1 [ 5, 5] * | AAA > [6] seq1 [ 21, 21] * | TTTTTTTTT > [seven] seq1 [1575, 1575] * | > [viii] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG > [9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT > qual_piles > <bstringset> > [i] < > [two] < > [three] << > [four] << > [five] <<< > [6] <<<-<;<<= > [7] > [8] <<<<;half dozen:<=iv<;16<<'+:. > [9] <<<;;=<<=<<<<<<:;'84;;<<;+one > --- > seqlengths: > seq1 seq2 > 1575 1584 > > Finally, to summarize A/C/G/T frequencies at each position: > > > alphabetFrequency(nucl_piles, baseOnly=TRUE) > A C G T other > [1,] 0 1 0 0 0 > [two,] ane 0 0 0 0 > [iii,] 0 2 0 0 0 > [four,] 0 0 0 2 0 > [five,] 3 0 0 0 0 > [6,] 0 0 0 9 0 > [7,] 0 0 0 0 0 > [8,] 0 0 18 2 0 > [9,] 1 0 0 26 0 > > Run into below for pileLettersAt's code. > > Cheers, > H. > > > ------------------------------**------------------------------** > ----------- > > ### Conceptually equivalent to: > ### sapply(x, part(xx) paste(xx, collapse="")) > > collapse_XStringSetList <- part(10) > { > unlisted_x <- unlist(x, utilise.names=FALSE) > unlisted_ans <- unlist(unlisted_x, utilize.names=Faux) > ans_width <- sum(relist(width(unlisted_x), x)) > extract_at <- as(PartitioningByEnd(cumsum(**ans_width)), "IRanges") > extractAt(unlisted_ans, extract_at) > } > > ### pileLettersOnSingleRefAt() is the workhorse backside pileLettersAt(). > ### 'x', 'pos', 'cigar': 3 parallel vectors describing Northward strings aligned > ### to the aforementioned reference sequence. '10' must be an XStringSet (typically > ### DNAStringSet) object containing the unaligned strings (a.chiliad.a. the > ### query sequences) reported with respect to the + strand. 'pos' must > ### be an integer vector where 'pos[i]' is the 1-based position on the > ### reference sequence of the first aligned alphabetic character in 'x[[i]]'. 'cigar' > ### must exist a graphic symbol vector containing the extended CIGAR strings. > ### 'at': must exist an integer vector containing single nucleotide > ### positions on the reference sequence. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.eastward. with 1 cord per single nucleotide position). > > pileLettersOnSingleRefAt <- office(x, pos, cigar, at) > { > stopifnot(is(x, "XStringSet")) > N <- length(10) # nb of alignments > stopifnot(is.integer(pos) && length(pos) == N) > stopifnot(is.character(cigar) && length(cigar) == Northward) > stopifnot(is.integer(at)) > > ops <- c("M", "=", "Ten") > ranges_on_ref <- cigarRangesAlongReferenceSpace**(cigar, pos=pos, > ops=ops) > ranges_on_query <- cigarRangesAlongQuerySpace(**cigar, ops=ops) > > ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects > parallel > ## to '10', 'pos', and 'cigar'. In addition, the ii IRangesList objects > ## accept the aforementioned "shape" (i.e. aforementioned elementLengths()), so, later > ## unlisting, the ii unlisted objects are parallel IRanges objects. > unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=FALSE) > unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=Imitation) > > ## 2 integer vectors parallel to IRanges objects > 'unlisted_ranges_on_ref' > ## and 'unlisted_ranges_on_query' above. > range_group <- togroup(ranges_on_ref) > query2ref_shift <- kickoff(unlisted_ranges_on_ref) - > first(unlisted_ranges_on_**query) > > hits <- findOverlaps(at, unlisted_ranges_on_ref) > hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(** > hits)] > hits_group <- range_group[subjectHits(hits)] > unlisted_piles <- subseq(x[hits_group], showtime=hits_at_in_x, width=1L) > piles_skeleton <- PartitioningByEnd(queryHits(**hits), NG=length(at), > names=names(at)) > piles <- relist(unlisted_piles, piles_skeleton) > collapse_XStringSetList(piles) > } > > ### 'x', 'seqnames', 'pos', 'cigar': iv parallel vectors describing N > ### aligned strings. 'ten', 'pos', and 'cigar' every bit to a higher place. 'seqnames' must > ### be a factor-Rle where 'seqnames[i]' is the proper name of the reference > ### sequence of the i-th alignment. > ### 'at': must be a GRanges object containing single nucleotide > ### positions. 'seqlevels(at)' must be identical to 'levels(seqnames)'. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.e. with i string per single nucleotide position). > > pileLettersAt <- function(x, seqnames, pos, cigar, at) > { > stopifnot(is(x, "XStringSet")) > N <- length(x) # nb of alignments > stopifnot(is(seqnames, "Rle")) > stopifnot(is.factor(runValue(**seqnames))) > stopifnot(length(seqnames) == N) > stopifnot(is.integer(pos) && length(pos) == N) > stopifnot(is.character(cigar) && length(cigar) == N) > stopifnot(is(at, "GRanges")) > stopifnot(all(width(at) == 1L)) > stopifnot(identical(seqlevels(**at), levels(seqnames))) > > ## We process 1 chromosome at a time. So we start by splitting > ## '10', 'pos', 'cigar', and 'start(at)' by chromosome. The four > ## resulting list-like objects have i list element per chromosome > ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical > ## to 'seqlevels(at)'). > x_by_chrom <- split(ten, seqnames) # XStringSetList > pos_by_chrom <- split(pos, seqnames) # IntegerList > cigar_by_chrom <- dissever(cigar, seqnames) # CharacterList > at_by_chrom <- separate(beginning(at), seqnames(at)) # IntegerList > > ## Unsplit alphabetize. > split_idx <- unlist(split(seq_along(at), seqnames(at)), > utilize.names=FALSE) > unsplit_idx <- integer(length(at)) > unsplit_idx[split_idx] <- seq_along(at) > > do.telephone call("c", lapply(seq_along(seqlevels(at)**), > function(i) > pileLettersOnSingleRefAt( > x_by_chrom[[i]], > pos_by_chrom[[i]], > cigar_by_chrom[[i]], > at_by_chrom[[i]])))[unsplit_**idx] > } > > > >> Give thanks y'all in advance for whatsoever advice. >> >> Kim Siegmund >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Enquiry Center > 1100 Fairview Ave. Due north, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > Due east-post: hpages@fhcrc.org > Telephone: (206) 667-5791 > Fax: (206) 667-1319 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-projection.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the athenaeum: http://news.gmane.org/gmane.** > science.biology.computer science.**conductor<http: news.gmane.org="" gmane.="" scientific discipline.biology.computer science.conductor=""> > [[alternative HTML version deleted]]

Get Number of Reads at Every Base From Bam File

0 Response to "Get Number of Reads at Every Base From Bam File"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel