Cook Book

This Cook Book will show code examples using the Jillion Library to perform common bioinformatics tasks. Simple code examples will be listed on this page. More complicated examples link out to their own pages.

Working With NucleotideSequences and NucleotideSequenceBuilders

This section deals with modifying a NucleotideSequence by using a NucleotideSequenceBuilder.

How to ungap a Sequence

If the sequence is already stored as a NucleotideSequence, you must use the NucleotideSequenceBuilder constructor that takes a NucleotideSequence object:

NucleotideSequence originalSequence = ...
 
NucleotideSequence ungappedSequence = new NucleotideSequenceBuilder(originalSequence)
                                            .ungap()
                                            .build();

If the sequence is a String, you must use the NucleotideSequenceBuilder constructor that takes a String object:

NucleotideSequence ungappedSequence = new NucleotideSequenceBuilder("AC-GT-AC-GT")
                                            .ungap()
                                            .build();
 
//ungappedSequence will now contain "ACGTACGT"


How to reverse complement a Sequence

NucleotideSequenceBuilders have a method called reverseComplement() which will reverse complement the current sequence stored in the builder. Any future additions to the sequence will not be auto-complemented.

NucleotideSequence complementedSeq = new NucleotideSequenceBuilder("AAAAGGGG")
                                            .reverseComplement()
                                            .build();
 
//complementedSeq will now contain "CCCCTTTT"

How to perform multiple changes to a sequence

NucleotideSequenceBuilders use method chaining to allow users to write more intent-revealing code:

NucleotideSequence seq = new NucleotideSequenceBuilder("AA--AAG-GG-G")
                                .reverseComplement()
                                .ungap()
                                .append("NNNN")
                                .build();
 
//seq will now contain "CCCCTTTTNNNN"

Working with QualitySequences and QualitySequenceBuilders

Jillion has a QualitySequenceBuilder which is very similar to the NucleotideSequenceBuilder and allows users to programmatically insert/delete/append/prepend PhredQuality objects.
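
For example, here is a minimal sketch of building up and editing a quality sequence; the exact append(byte[]) and prepend(byte[]) overloads used below are assumptions, so check the QualitySequenceBuilder javadoc for the available signatures:

//a minimal sketch of editing a quality sequence;
//the append(byte[]) and prepend(byte[]) overloads used here are assumptions
QualitySequence edited = new QualitySequenceBuilder(new byte[]{20,30,40})
                                .append(new byte[]{50,60})   //add values to the end
                                .prepend(new byte[]{10})     //add a value to the beginning
                                .build();

//edited now contains the quality values 10,20,30,40,50,60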

How to create a QualitySequence from a byte array of quality values

QualitySequenceBuilder has a constructor that takes a byte array and assumes each value in the array is a Phred quality score. The byte array should not be encoded: a byte value of 20 means QV 20, etc.

QualitySequence quals = new QualitySequenceBuilder(new byte[]{20,30,40,40,50,60})
                                .build();
 
//offset 4 (zero-based) has a quality value of 50
PhredQuality quality = quals.get(4);
 
assertEquals(50, quality.getQualityScore());
 
//a QV score of 50 has an error probability of 10^(-5) = 0.00001
//due to floating point approximations the value might actually
//be stored as 0.000009 instead, so we provide a delta range in this assertion
assertEquals(0.00001D, quality.getErrorProbability(), 0.00001D);

How to reverse a QualitySequence

If a use case requires manually coupling a read's sequence and quality values AND the sequence is to be reverse complemented, don't forget to also reverse the corresponding quality values. This should rarely need to be done; most Jillion objects will automatically reverse quality sequences for you when needed (for example, when writing out an ace file, which contains the quality values for the underlying read data).

NucleotideSequenceBuilder seqBuilder = new NucleotideSequenceBuilder(...);
 
 
QualitySequenceBuilder qualBuilder = new QualitySequenceBuilder(...);
 
seqBuilder.reverseComplement();
qualBuilder.reverse(); 
 
NucleotideSequence reverseComplementedSequence = seqBuilder.build();
QualitySequence reversedQualities = qualBuilder.build();

Working with Sanger Traces

Jillion Chromatogram objects are object representations of Sanger chromatogram trace files. Users can read in one chromatogram format and then write out the data in a different chromatogram format. In this way, Jillion provides similar functionality to the Staden IO_Lib C library.

Parse Chromatogram file and get the sequence and quality values

File traceFile = new File("path/to/trace.scf");
 
Chromatogram chromo = ChromatogramFactory.create(traceFile);
 
NucleotideSequence seq = chromo.getNucleotideSequence();
QualitySequence quals = chromo.getQualitySequence();


Parse Ab1 file and write it out as a ZTR or SCF encoded file

The following code parses an AB1 trace file into a Chromatogram object and then uses two ChromatogramWriter implementations to re-encode the trace as both an SCF and a ZTR file.

File ab1 = new File("path/to/trace.ab1");
//creates Abi chromatogram and sets name to file name without the file extension
Chromatogram chromo = ChromatogramFactory.create(ab1);
 
String name = chromo.getId();
 
//use Java 7 try-with-resource to auto-close writers
try( ChromatogramWriter scfWriter = new ScfChromatogramWriterBuilder(new File(name + ".scf"))
                                            .build();
 
     ChromatogramWriter ztrWriter = new ZtrChromatogramWriterBuilder(new File(name + ".ztr"))
                                            .build();
){
    scfWriter.write(chromo);
    ztrWriter.write(chromo);
}

Add or Remove Comments from Chromatogram file

Using the ChromatogramBuilder objects, you can easily add or remove comments (or bases and qualities).

File ab1 = new File("path/to/trace.ab1");
//creates Abi chromatogram and sets name to file name without the file extension
Chromatogram chromo = ChromatogramFactory.create(ab1);
 
//now add more comments
Map<String,String> myComments = new HashMap<String, String>(chromo.getComments());
myComments.put("extraComment", "value");
 
ScfChromatogram updatedChromo = new ScfChromatogramBuilder(chromo)
                                        .comments(myComments)
                                        .build();
 
try(ChromatogramWriter scfWriter = new ScfChromatogramWriterBuilder(new File(chromo.getId() + ".scf"))
                                            .build();
){
    scfWriter.write(updatedChromo);
}

Working with Contigs

How to get the Gapped contig consensus sequence

Jillion contig objects store the GAPPED consensus sequence. This is often much longer than the ungapped sequence but allows all the underlying reads to correctly align to the consensus. Luckily the NucleotideSequence interface has methods to convert from gapped to ungapped coordinates (and vice versa).

Contig contig = //...;
 
NucleotideSequence consensus = contig.getConsensusSequence();

How to get the Ungapped contig consensus sequence

Most of the time, other bioinformatics applications refer only to ungapped consensus positions.

Contig contig = //...;
 
NucleotideSequence consensus = contig.getConsensusSequence();
 
//convert gapped position 100 (residue based) into its ungapped offset
//remember that Jillion uses 0-based offsets, so subtract 1 from positions!
int ungappedOffset = consensus.getUngappedOffsetFor(100 - 1);
 
//convert ungapped position 1234 (residue based) into its gapped offset
int gappedOffset = consensus.getGappedOffsetFor(1234 - 1);

How to create Coverage Map

Using ContigCoverageMapBuilder

The ContigCoverageMapBuilder allows you to build CoverageMaps from Contig objects.

To use all the reads from a contig in the CoverageMap simply create a new ContigCoverageMapBuilder then call its build() method:

AceContig contig = //... get contig object somehow
 
CoverageMap<AceAssembledRead> coverageMap = new ContigCoverageMapBuilder<>(contig)
                                                    .build();

Restrict CoverageMap to only have a specified max allowed coverage

CoverageMapBuilders have an optional configuration method called maxAllowedCoverage(int) that will limit the depth of any CoverageRegion in the CoverageMap being built. Any read whose addition to the CoverageMap would cause ANY CoverageRegion's depth of coverage to go over that threshold will not be included.

AceContig contig = //... get contig object somehow
 
CoverageMap<AceAssembledRead> coverageMap = new ContigCoverageMapBuilder<>(contig)
                                                    .maxAllowedCoverage(30)
                                                    .build();

The above code sample limits the CoverageMap to have a max depth of coverage of 30. Any read in the contig that would cause the coverage to go above 30x anywhere is thrown out of the CoverageMap.

Advanced Contig Read Filtering

The org.jcvi.jillion.assembly.util.coverage.ContigCoverageMapBuilder.ReadFilter interface is used to filter which reads are considered for inclusion in the CoverageMap. A read that passes the filter may still not be included in the final CoverageMap if the Builder also uses the .maxAllowedCoverage() method AND including that read would cause the coverage depth to go over the max threshold.

public interface ReadFilter<R extends AssembledRead>{
 
    boolean accept(R read);
}

The ReadFilter interface is very simple. There is only one method: accept(read). The accept method should return true if the read should be included and false if the read should be excluded.

AceContig contig = //... get contig object somehow
 
 
ReadFilter<AceAssembledRead> readFilter = new ReadFilter<AceAssembledRead>() {
 
        @Override
        public boolean accept(AceAssembledRead read) {
            return read.getDirection() == Direction.FORWARD;
        }
    };
 
CoverageMap<AceAssembledRead> coverageMap = new ContigCoverageMapBuilder<>(contig)
                                                    .filter(readFilter)
                                                    .build();

This code example creates a read filter that will only include the FORWARD reads in the CoverageMap.

See ContigCoverageMapBuilder in the user guide for more information.

Read Filtering with Jillion 5 Lambdas

Jillion 5 modified the ReadFilter interface to extend Java 8's Predicate interface. This makes it even easier to write filtering expressions:

 
CoverageMap<AceAssembledRead> coverageMap = new ContigCoverageMapBuilder<>(contig)
                                                    .filter(read -> read.getDirection() == Direction.FORWARD)
                                                    .build();

You no longer have to write custom implementations of ReadFilter.

How to find low coverage regions

The following code block will find all coverage regions in the CoverageMap that have less than 10x coverage.

CoverageMap<T> coverageMap = //...;
 
for(CoverageRegion<T> region : coverageMap){
  if(region.getCoverageDepth() < 10){
	 //found low coverage region
  }
}

How do I find variants in underlying read data

SliceMaps can be used to get variant information about contigs. Each vertical Slice in the SliceMap corresponds to a single position in the contig. Slice objects have a method called getNucleotideCounts() which returns a Map<Nucleotide, Integer> of the number of each Nucleotide present in the slice. (Nucleotides not present will have an entry in the map with a value of 0.)

AceContig contig = ...
QualitySequenceDataStore readQualities = ...
 
SliceMap sliceMap = new SliceMapBuilder<>(contig, readQualities)
		           .build();
 
Slice slice = sliceMap.getSlice(1234);
 
Map<Nucleotide, Integer> counts = slice.getNucleotideCounts();
 
int numAs = counts.get(Nucleotide.Adenine);
int numCs = counts.get(Nucleotide.Cytosine);

The above code sample will get the number of A's and C's for the given contig at gapped offset 1234. Slice objects also allow the user to get the SliceElement data to see the read ids, Direction, quality value and basecall of each read that contributes to the slice.
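
As a quick sketch of that per-read access (the getter names getId(), getBase() and getQuality() are assumptions based on Jillion's naming conventions; getDirection() is shown in the binning example later on this page):

for(SliceElement element : slice){
    //getId(), getBase() and getQuality() are assumed getter names
    String readId = element.getId();
    Nucleotide basecall = element.getBase();
    PhredQuality quality = element.getQuality();
    Direction dir = element.getDirection();
    //...examine the variant evidence read by read
}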

How do I filter a SliceMap or get variant data for a subset of reads in a contig?

If your assembly contains read data produced by different sequencing technologies, it can be useful to compare the variant data for each technology separately, or to look at the data by direction, in order to detect sequencing errors and biases that are technology dependent. There are two ways to perform such analyses using SliceMaps:

Using ReadFilters

SliceMapBuilders also support ReadFilters to filter which reads are included in the SliceMap being built.

For example, let's say we want to compare the variant data by direction. We can write a ReadFilter to only accept reads in a specific direction like so:

ReadFilter<AssembledRead> forwardFilter = new ReadFilter<AssembledRead>() {
 
    @Override
    public boolean accept(AssembledRead read) {
        return read.getDirection() == Direction.FORWARD;
    }
 
};

This filter can be passed to the SliceMapBuilder.filter() method so that the resulting SliceMap only contains reads that match the filter.

By creating two different filter objects, one that accepts forward reads and one that accepts reverse reads, we can create two SliceMap objects that contain only the forward reads and the reverse reads respectively.

AceContig contig = ...
 
QualitySequenceDataStore readQualities =...;
 
SliceMap forwardSliceMap = new SliceMapBuilder<>(contig, readQualities)
                               .filter(forwardFilter)
                               .build();
 
SliceMap reverseSliceMap = new SliceMapBuilder<>(contig, readQualities)
                               .filter(reverseFilter)
                               .build();

Now each slice can be compared to see how different the forward and reverse assemblies are.

Slice forwardSlice = forwardSliceMap.getSlice(1234);
Slice reverseSlice = reverseSliceMap.getSlice(1234);
 
Map<Nucleotide, Integer> forwardCounts = forwardSlice.getNucleotideCounts();
Map<Nucleotide, Integer> reverseCounts = reverseSlice.getNucleotideCounts();

Without using ReadFilters

Creating multiple SliceMaps will take up more memory and require more runtime (because the contig has to be examined multiple times, once for each SliceMap being built). The advantage of using ReadFilters is that it produces cleaner, more intent-revealing code. However, it is possible to perform any of these analyses using a single unfiltered SliceMap.

Since each SliceElement in a Slice contains the read id, nucleotide, quality and direction, we can examine each SliceElement in a slice and bin them however we want. To continue with the example from the previous section, we will bin the reads in a Slice by direction.

AceContig contig = ...
 
QualitySequenceDataStore readQualities =...;
 
SliceMap sliceMap = new SliceMapBuilder<>(contig, readQualities)
		               .build();
 
 
Slice slice = sliceMap.getSlice(1234);
 
List<SliceElement> forwardElements = new ArrayList<>();
List<SliceElement> reverseElements = new ArrayList<>();
for(SliceElement element : slice){
  if(element.getDirection() == Direction.FORWARD){
     forwardElements.add(element);
  }else{
     reverseElements.add(element);
  }
}
//here we have binned all the elements from a single slice 
//by direction
 
//...compare forwardElements with reverseElements

DataStores

Randomly accessing records in a datastore
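
As a minimal sketch, a record can be fetched directly by its id; this example assumes a Fastq-backed DataStore and the DataStore.get(String id) accessor:

File fastqFile = //...

//"myReadId" is a hypothetical record id;
//get(String) is assumed to be the DataStore random access method
try(FastqDataStore datastore = new FastqFileDataStoreBuilder(fastqFile)
                                        .build();
){
    FastqRecord record = datastore.get("myReadId");
    NucleotideSequence seq = record.getNucleotideSequence();
}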

Iterating over all records
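
A sketch of iterating over every record, using the same StreamingIterator pattern shown in the Fastq examples later on this page; putting the iterator in the try-with-resources closes it automatically:

File fastqFile = //...

try(FastqDataStore datastore = new FastqFileDataStoreBuilder(fastqFile)
                                        .hint(DataStoreProviderHint.ITERATION_ONLY)
                                        .build();

    StreamingIterator<FastqRecord> iter = datastore.iterator();
){
    while(iter.hasNext()){
        FastqRecord record = iter.next();
        //...do something with each record
    }
}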

DataStores that wrap Files

Working with Fasta File DataStores

Jillion provides FastaFileDataStore implementations that can work with nucleotide sequences, quality sequences and amino acid sequences.
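
For example, a nucleotide fasta file can be wrapped like this (the NucleotideFastaFileDataStoreBuilder, NucleotideFastaDataStore and NucleotideFastaRecord names are assumptions based on Jillion's naming pattern; the quality and amino acid versions follow the same pattern):

//a sketch of wrapping a nucleotide fasta file in a DataStore;
//class names are assumed from Jillion's naming conventions
File fastaFile = new File("path/to/seq.fasta");

try(NucleotideFastaDataStore datastore = new NucleotideFastaFileDataStoreBuilder(fastaFile)
                                                 .build();

    StreamingIterator<NucleotideFastaRecord> iter = datastore.iterator();
){
    while(iter.hasNext()){
        NucleotideFastaRecord record = iter.next();
        NucleotideSequence seq = record.getSequence();
        //...
    }
}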

Working with Fastq File DataStores

Jillion can wrap a Fastq file in a DataStore using the FastqFileDataStoreBuilder and supports all major quality encoding formats. See Fastq Quality Encodings for more information.
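
If the quality encoding is known ahead of time it can be specified explicitly; otherwise Jillion will auto-detect it (both usages appear in the longer examples later on this page):

File fastqFile = //...

//explicitly specify the quality codec...
FastqDataStore sangerDataStore = new FastqFileDataStoreBuilder(fastqFile)
                                        .qualityCodec(FastqQualityCodec.SANGER)
                                        .build();

//...or let Jillion auto-detect the encoding
FastqFileDataStore autoDetected = new FastqFileDataStoreBuilder(fastqFile)
                                        .build();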

Working With Sff File DataStores

The SffFileDataStore object wraps a DataStore around a single sff encoded file.
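
A sketch of what that could look like, assuming a builder named SffFileDataStoreBuilder analogous to the Fastq builder and a record type named SffFlowgram (both names are assumptions; consult the Jillion javadoc):

//a sketch; SffFileDataStoreBuilder and SffFlowgram are assumed names
File sffFile = new File("path/to/reads.sff");

try(SffFileDataStore datastore = new SffFileDataStoreBuilder(sffFile)
                                        .build();

    StreamingIterator<SffFlowgram> iter = datastore.iterator();
){
    while(iter.hasNext()){
        SffFlowgram read = iter.next();
        NucleotideSequence seq = read.getNucleotideSequence();
        QualitySequence quals = read.getQualitySequence();
        //...
    }
}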

Working With SAM and BAM files

The following code examples use Jillion to parse SAM and BAM files and then write them out to new SAM or BAM files. See The User Guide's Sam Module for more details about the API.


Parsing SAM and BAM files

Jillion uses a SamParser object to parse both SAM and BAM formatted files. The easiest way to create a new SamParser object is to use the SamParserFactory class which has factory methods for creating parser instances from a File object. The factory methods work with both SAM and BAM encoded files.

File samFile =...
SamParser parser = SamParserFactory.create(samFile);

Same Methods Work on BAM Too

The factory methods can create SamParser instances that work on both SAM and BAM files. The type is determined by the file extension.

//same code works whether the file is SAM or BAM
File bamFile =...
SamParser parser = SamParserFactory.create(bamFile);

Creating Parser For BAM File That has an Index File

If the BAM file has a corresponding .bai index file, the factory methods will use a different, more efficient SamParser implementation that can quickly random-access the alignments for specific references. (See the examples below for use cases.)

If the .bai file is in the same directory as the BAM file and has the same file name plus an extra .bai extension, then the normal create() methods will find it automatically. This naming convention is the default output index file name for the samtools index command.


File bamFile = new File("myFile.bam");
//if myFile.bam.bai also exists, then it will automatically be found and used.
SamParser parser = SamParserFactory.create(bamFile);

Creating Parser For BAM File That has an Alternately Named Index File

If the bai file is in a different directory or is named differently, there is another factory method, createUsingIndex(File, File) (added in Jillion 5.0), that takes both a BAM file and a bai index file.

File bamFile =...
File bai = ...
SamParser parser = SamParserFactory.createUsingIndex(bamFile, bai);

Visiting All the records from a SAM or BAM file

You can pass a SamVisitor instance to a SamParser to visit all the records contained in the file, even the reads that did not map. This is equivalent to the Samtools view command without any additional filtering options.

File bamFile =...
SamParser parser = SamParserFactory.create(bamFile);
 
parser.parse(new AbstractSamVisitor(){
 
    @Override
    public void visitRecord(SamVisitorCallback callback, SamRecord record,
                            VirtualFileOffset start, VirtualFileOffset end) {
        //do stuff with the record object
        ...
    }
});


Visiting Only the Records that Mapped to a Specific Reference in a SAM or BAM file

You can pass a SamVisitor instance to a SamParser to visit only the records contained in the file that mapped to a specific reference id. This is equivalent to the Samtools view command with a REGIONS option specifying a reference name.

File bamFile =...
 
String referenceName = "chr1";
 
SamParser parser = SamParserFactory.create(bamFile);
 
//note the extra parameter with the referenceName
parser.parse(referenceName, new AbstractSamVisitor(){
 
    @Override
    public void visitRecord(SamVisitorCallback callback, SamRecord record,
                            VirtualFileOffset start, VirtualFileOffset end) {
        //do stuff with the record object
        ...
    }
});

If the BAM file is sorted and has an index, then the SamParser instance is smart enough to use that index to quickly seek to the alignments for just that reference.

Visiting Only the Records that Mapped to a Specific Reference in a particular Region

You can pass a SamVisitor instance to a SamParser to visit only the records in the file that mapped to a particular region of a specific reference. This is equivalent to the Samtools view command with a REGIONS option specifying a reference name and region coordinates. To specify the region, use the Jillion Range object.

File bamFile =...
 
String referenceName = "chr1";
 
//inclusive
Range region = Range.of(CoordinateSystem.RESIDUE, 1000, 2000);
SamParser parser = SamParserFactory.create(bamFile);
 
//note the extra parameters with the referenceName and region
parser.parse(referenceName, region, new AbstractSamVisitor(){
 
    @Override
    public void visitRecord(SamVisitorCallback callback, SamRecord record,
                            VirtualFileOffset start, VirtualFileOffset end) {
        //do stuff with the record object
        ...
    }
});

If the BAM file is sorted and has an index, then the SamParser instance is smart enough to use that index to quickly seek to the alignments for just that reference region.

Convert Fastq file into SAM or BAM

It is very easy to convert a Fastq file into an unmapped SAM or BAM file. The general idea is to create a SamWriter instance, and for each FastqRecord to write, create a new SamRecord instance with the values from the FastqRecord.

The following program duplicates the functionality of Picard's FastqToSam program. Most of the options in the Picard program are used to set fields in the header; this can all be done programmatically with Jillion's SamHeaderBuilder.

Create new SamHeader

For simplicity the code examples will put all the reads into the same ReadGroup, called "A", but Jillion gives you the freedom to do whatever you want.

 
    //create a new SamHeader object
    SamHeader header = new SamHeaderBuilder()
                              .setVersion(new SamVersion(1, 5))
                              .addProgram(new SamProgramBuilder("Fastq2Sam")
                                                .setVersion("1.0")
                                                .build())
                              .addReadGroup(new ReadGroupBuilder("A")
                                                .setSampleOrPoolName("my_sample")
                                                .build())
                              //add any other programs, read groups or comments here
                              .build();

Resort SAM or BAM File

Most programs that use unmapped SAM or BAM data want the records sorted by queryname, so when we create the SamWriter, we tell the builder that we want to re-sort the records by SortOrder.QUERY_NAME. If the re-sort value is different from the sort order specified in the header, then the new re-sort order is written out to the file. The SamWriter will handle sorting the records for us, so we don't have to worry about the order of the SamRecords we give it.

The SamFileWriterBuilder will determine if it should write out the data in SAM or BAM format based on what the file extension is in the output file provided; so make sure your file ends in either ".sam" or ".bam".

    File outputFile = //...file that ends in either ".sam" or ".bam"
 
    SamWriter samWriter = new SamFileWriterBuilder(outputFile, header)
                                 .reSortBy(SortOrder.QUERY_NAME)
                                 .build();

Writing out Fastq data as SAM or BAM

Now we iterate over the fastq file(s) and create a SamRecord for each read:

 
    try( FastqDataStore datastore = new FastqFileDataStoreBuilder(fastqFile)
                                            .qualityCodec(FastqQualityCodec.SANGER)
                                            .hint(DataStoreProviderHint.ITERATION_ONLY)
                                            .build();
 
         StreamingIterator<FastqRecord> iter = datastore.iterator();
    ){
        while(iter.hasNext()){
            FastqRecord fastq = iter.next();
            SamRecord samRecord = new SamRecord.Builder(header)
                                        .setFlags(EnumSet.of(SamRecordFlags.READ_UNMAPPED))
                                        .setQueryName(fastq.getId())
                                        .setQualities(fastq.getQualitySequence())
                                        .setSequence(fastq.getNucleotideSequence())
                                        .addAttribute(new SamAttribute(
                                                ReservedSamAttributeKeys.READ_GROUP, "A"))
                                        .build();
 
            samWriter.writeRecord(samRecord);
        }
    }
 
    //now that we are done writing, close the samWriter (preferably in a finally block)
    samWriter.close();

Notice that we give the FastqDataStore a hint that we will only be iterating so it can use a more efficient implementation. We also make sure that each SamRecord we create sets the READ_UNMAPPED SamRecordFlag. Finally, we add an attribute to each SamRecord to say that the read belongs to read group "A", which was defined in the SamHeader.



Writing Paired Data

If you want to write 2 Fastq files that are paired data, then the only change you have to make is to modify the SamRecordFlags to also say that the read is part of a pair, that it is either the 1st or 2nd read in the pair, and that its mate is unmapped.

Read 1's flags:

EnumSet.of(SamRecordFlags.READ_UNMAPPED,
           SamRecordFlags.HAS_MATE_PAIR,
           SamRecordFlags.FIRST_MATE_OF_PAIR,
           SamRecordFlags.MATE_UNMAPPED)

Read 2's flags:

EnumSet.of(SamRecordFlags.READ_UNMAPPED,
           SamRecordFlags.HAS_MATE_PAIR,
           SamRecordFlags.SECOND_MATE_OF_PAIR,
           SamRecordFlags.MATE_UNMAPPED)

Creating an index file from a sorted BAM file

BAM index files (.bai files) can be used to speed up random access of reads contained in a BAM file by their coordinates. BAM index files only work on BAM files that have been sorted in COORDINATE order.

If you want to create a BAM index file from an already sorted BAM file, all you need to do is create a new BamIndexFileWriterBuilder, passing in the required input sorted BAM file and the output bai file to write, then call the build() method. Optionally, the includeMetaData() method can be called to include additional metadata that is not yet in the BAM index file specification but is supported by both samtools and picard.

File bamFile = new File("/path/to/bam");
File baiFile = new File("/path/to/output/bai");
 
new BamIndexFileWriterBuilder(bamFile, baiFile)
			.includeMetaData(true) //includes metadata that Picard and samtools use
			.build();

This writerBuilder will examine the header in the BAM file to make sure that it is sorted correctly. If the header says the sort order is something other than COORDINATE, then build() will throw an IllegalStateException.

If you know the BAM file really is sorted even though the header says otherwise (this may happen with some old versions of samtools), the BamIndexFileWriterBuilder has an optional method to turn the sort validation off:

File bamFile = new File("/path/to/bam");
File baiFile = new File("/path/to/output/bai");
 
new BamIndexFileWriterBuilder(bamFile, baiFile)
			.includeMetaData(true) //includes metadata that Picard and samtools use
			.assumeSorted(true)
			.build();


Creating BAI index file While Writing the BAM File

Jillion can also write out the .bai file at the same time it is writing out the sorted BAM file. This saves you an extra step in your pipelines since you don't have to call either the samtools index command or manually use a BamIndexFileWriterBuilder object after the fact.

All you have to do is add the extra configuration line createBamIndex(boolean) to your WriterBuilder object. See Simultaneously Creating a BAM index for more details.


File outputBam = //...file that ends in ".bam"
 
    SamWriter samWriter = new SamFileWriterBuilder(outputBam, header)
                                 .reSortBy(SortOrder.QUERY_NAME)
                                 .createBamIndex(true)
                                 .build();

Putting It All Together

This section combines classes and modules explained above and puts them together to solve real world bioinformatics problems. These code snippets can be used to replace more complicated pipelines that string together dozens of different programs that each write out temp files. Writing out those temp files and converting those files to the appropriate formats for the next step in the pipelines adds a lot of processing time and makes the pipelines hard to maintain.

By using Jillion, a single cohesive program can be written without any temporary files or extra output formats.

Trimming Input Sequencing Data

Pipelines are commonly used to process input sequencing data, trimming or re-encoding it to prepare it for genomic assembly.

Here's how to use Jillion to accomplish these tasks:

Finding Good Quality Vector Free Sequences From Sanger Traces

Sanger sequencing is still used even in a Next-Gen world, especially to finish and close genomes and to get the termini or untranslated regions around genes.

Unfortunately, many Next-Generation assemblers do not understand the legacy Sanger trace formats, so the traces have to be converted into Fasta files to be used.

Sanger reads also have different quality profiles, so quality trimmers that are tuned for next-gen reads (like the BWA trimmer) will not produce optimal trimming ranges if run on Sanger data. There might also be vector or primer sequences in the Sanger reads that need to be trimmed off before the data is assembled.

Using Jillion, we can convert the Sanger read files (in ab1, ztr or scf formats) into the equivalent sequence and quality fasta files and also perform vector and quality trimming at the same time, so that the fasta files only contain the region of each read that is both free of any vector and in the good quality range. For this example, we will use the Jillion Lucy trimming classes, which are a reimplementation of the TIGR program Lucy.


Given you have a bunch of Sanger trace files in a List<File> called traceFiles and vector splice sequences that may appear upstream and downstream in your reads, here's how the program works:

  1. Create output Nucleotide and Quality Fasta Files where the trimmed sequences will be written
  2. For each Sanger Trace:
    1. parse the traceFile into a Chromatogram object.
    2. call the Lucy vector trimmer on the nucleotide sequence to get a Range for that read that is free of any vector sequence. Note this may be the whole sequence if the vector wasn't found in either direction.
    3. call the Lucy quality trimmer on the quality sequence to get a Range for that read that is only the "good" quality portion.
    4. We only care about the part of the read that is both free of vector and in the good quality portion. The Jillion Range object has a method "intersection(Range other)" which returns a new Range that is only the region that overlaps both input Ranges. This is our final "good" Range that we want to use in an assembly. TIGR calls this region the "Clear Range".
    5. If no bases overlap both the vector free range and the good quality range, then the intersection returns an empty Range. We don't need to write out an empty sequence so we can skip the read if the clearRange is empty.
    6. Using Jillion SequenceBuilders we can trim both the nucleotide and quality sequences down to just the clear range region.
    7. Write out the trimmed sequences to the output fasta writers.
        List<File> traceFiles = //...
 
        NucleotideSequence upStreamSpliceSeq =    //...
        NucleotideSequence downstreamSpliceSeq =  //...
 
        File seqOutFile = new File("/path/to/seq.fasta");
        File qualOutFile = new File("/path/to/qual.fasta");
 
        NucleotideTrimmer vectorTrimmer = new LucyVectorSpliceTrimmerBuilder(upStreamSpliceSeq, downstreamSpliceSeq)
                                                    .build();
 
        QualityTrimmer qualityTrimmer = new LucyQualityTrimmerBuilder(30)
                                                    .addTrimWindow(30, 0.01F)
                                                    .addTrimWindow(10, 0.035F)
                                                    .build();
 
 
        try(NucleotideFastaWriter seqFastaWriter = new NucleotideFastaWriterBuilder(seqOutFile)
                                                            .build();
           QualityFastaWriter qualFastaWriter = new QualityFastaWriterBuilder(qualOutFile)
                                                               .build();
 
        ){
            for(File traceFile : traceFiles){
                //works for ab1, ztr or scf encoded files
                Chromatogram chromo = ChromatogramFactory.create(traceFile);
 
                Range vectorFreeRange = vectorTrimmer.trim(chromo.getNucleotideSequence());
                Range goodQualityRange = qualityTrimmer.trim(chromo.getQualitySequence());
                //the "clearRange" is the only part of the
                //sequence that is in the good quality region AND free of vector
                Range clearRange = vectorFreeRange.intersection(goodQualityRange);
 
                if(clearRange.isEmpty()){
                    //completely trimmed!
                    //don't write anything out.
                    continue;
                }
                String id = chromo.getId();
 
                NucleotideSequence trimmedSeq = chromo.getNucleotideSequence()
                                                        .toBuilder()
                                                        .trim(clearRange)
                                                        .build();
 
                QualitySequence trimmedQual = chromo.getQualitySequence()
                                                    .toBuilder()
                                                    .trim(clearRange)
                                                    .build();
 
                seqFastaWriter.write(id, trimmedSeq);
                qualFastaWriter.write(id, trimmedQual);
 
            }
 
        }//auto-close fasta writers

Trimming Fastq Reads with BWA Style Quality Trimming

The Burrows Wheeler Aligner program has a '-q' option that trims the input reads based on a sliding quality window. Jillion has a built-in Java implementation of this algorithm that can be used to trim Next-gen input read data.

The following will parse an input fastq file into a FastqDataStore object, iterate over each FastqRecord, trim it using the BWA algorithm, and write the output to a trimmed fastq file if the trimmed sequence is >= 64bp. The output writer uses the same Fastq quality encoding as the input fastq (which was auto-detected) via the FastqFileDataStore method getQualityCodec(), which was added in Jillion 5.0.

Also notice that the input Fastq DataStore used the DataStoreProviderHint ITERATION_ONLY to let Jillion know that we are only going to iterate through the file, so Jillion can use an implementation that doesn't need memory-hungry random access support. This is especially useful if the fastq file is huge.

File inputFastq = //...
 
File outputFastq = //...
 
//minimum length a read after trimming must be
//to be written to the output file
int MIN_LENGTH = 64;
 
QualityTrimmer bwaTrimmer = new BwaQualityTrimmer(PhredQuality.valueOf(20));
 
try(FastqFileDataStore fastqDataStore = new FastqFileDataStoreBuilder(inputFastq)
                                                        .hint(DataStoreProviderHint.ITERATION_ONLY)
                                                        .build();
    //output writer uses the same qualityCodec as the input fastq (Sanger or Illumina)
    FastqWriter writer = new FastqWriterBuilder(outputFastq)
                                                .qualityCodec(fastqDataStore.getQualityCodec())
                                                .build();
 
    StreamingIterator<FastqRecord> iter = fastqDataStore.iterator();
){
    while(iter.hasNext()){
        FastqRecord record = iter.next();
        Range trimRange = bwaTrimmer.trim(record.getQualitySequence());
 
        //only include read if after trimming it meets our minimum threshold 
        if(trimRange.getLength() >= MIN_LENGTH){
 
            //use SequenceBuilders to trim the sequences down to just 
            //the good quality region
            NucleotideSequence trimmedSeq = record.getNucleotideSequence()
                                                        .toBuilder()
                                                        .trim(trimRange)
                                                        .build();
 
            QualitySequence trimmedQual = record.getQualitySequence()
                                                        .toBuilder()
                                                        .trim(trimRange)
                                                        .build();
            //write a new fastq record of just the trimmed region
            writer.write(record.getId(), trimmedSeq, trimmedQual);
        }
    }
}//autoclose writer
