To help you get a feel for what's possible with Jillion, this User Guide provides an overview of the major modules of Jillion, along with descriptions and code examples that use key classes and interfaces in those modules.
Focused on Abstraction
Jillion code heavily uses abstraction. Most non-value objects should only be referred to by their interface or abstract class. This allows several implementations to exist and get passed around without breaking any user code.
Most Jillion objects are immutable: they can't be modified once they have been built. This makes coding simpler since most of the code doesn't have to worry about variables changing state. Multi-threading is usually easier and more performant since there will be fewer places in the code that need to be synchronized.
Most Jillion objects are built using the Builder pattern. This lets objects be built over a series of different "steps". The constructor parameters for Jillion Builders are usually only the required parameters. There may be several additional optional configuration methods that the user may wish to also call on the Builder to set optional parameters. Once all the parameters have been set, calling the Builder.build() method will return a new instance of the type.
The above Sequence diagram shows the general usage of a Builder. This example creates a new XBuilder instance (a made-up class) that will build an immutable instance of type X. After setting some optional parameters, the user invokes the build() method, which returns a new X instance.
Some Jillion Builders may even use the configuration parameters to pick which implementation will get built (kind of combining the Builder pattern with the AbstractFactory pattern).
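The Builder usage described above can be sketched in plain Java. This is a minimal illustration, not Jillion code: X and XBuilder are the made-up classes from the text, and the minLength and comment parameters are hypothetical names chosen only for this example.

```java
import java.util.Objects;

// Demo entry point: builds an immutable X in a series of steps.
class BuilderDemo {
    public static void main(String[] args) {
        X x = new XBuilder("read1")   // required parameter goes in the constructor
                .minLength(50)        // optional parameter (hypothetical)
                .comment("example")   // optional parameter (hypothetical)
                .build();
        System.out.println(x.getId() + " " + x.getMinLength() + " " + x.getComment());
    }
}

// Hypothetical immutable type: all fields are final and there are no setters.
final class X {
    private final String id;
    private final int minLength;
    private final String comment;

    X(String id, int minLength, String comment) {
        this.id = id;
        this.minLength = minLength;
        this.comment = comment;
    }

    String getId() { return id; }
    int getMinLength() { return minLength; }
    String getComment() { return comment; }
}

// Hypothetical builder: mutable while configuring, produces immutable X instances.
final class XBuilder {
    private final String id;
    private int minLength = 0;      // default used when the option is not set
    private String comment = null;  // default used when the option is not set

    XBuilder(String id) {
        this.id = Objects.requireNonNull(id, "id is required");
    }

    XBuilder minLength(int minLength) {
        if (minLength < 0) {
            throw new IllegalArgumentException("minLength must be >= 0");
        }
        this.minLength = minLength;
        return this; // returning this allows method chaining
    }

    XBuilder comment(String comment) {
        this.comment = comment;
        return this;
    }

    X build() {
        return new X(id, minLength, comment);
    }
}
```

Because X has no mutators, an instance returned by build() can be shared between threads without synchronization, which is the benefit of immutability described earlier.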
The org.jcvi.jillion.core module contains common genomic objects that are useful to all genomic and bioinformatic investigations. This is the primary module of this library. All other modules are dependent on the core.
Important interfaces/classes:
- Range - Range is an object representing an immutable pair of coordinates which describes a contiguous subset of values. Ranges are used throughout the code base to represent everything from trim points and subsequence ranges to alignment coordinates and read and/or contig locations in a scaffold.
- Rangeable - An interface indicating that the object can be represented as a Range.
- Sequence - A Sequence abstracts how a list of objects is stored, which allows a variety of implementations using various encoding and compression methods to store a sequence. The most common Sequence classes are for representing Nucleotide and Quality sequences.
- DataStores - A DataStore is an abstraction for a repository of multiple genomic objects that can be fetched by an ID. This allows various DataStore implementations to store all of these objects in memory, store only byte offsets into various files, or possibly look up values in a database or from URLs to websites.
- StreamingIterator - An Iterator that is also closeable. Closing the iterator allows it to clean up its resources and provides a clean way to break iteration without leaking resources.
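The idea of an iterator that can be closed to stop iteration early can be sketched as follows. This is only an illustration of the concept: ClosableIterator and ListBackedIterator are hypothetical names invented here, not Jillion's StreamingIterator API, and a real implementation would release file handles or similar resources in close().

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

class ClosableIteratorDemo {
    public static void main(String[] args) {
        ListBackedIterator<String> iter =
                new ListBackedIterator<>(Arrays.asList("read1", "read2", "read3"));
        // try-with-resources guarantees close() is called even if we break early
        try (ClosableIterator<String> it = iter) {
            while (it.hasNext()) {
                String id = it.next();
                if (id.equals("read2")) {
                    break; // stop iterating before the end
                }
            }
        }
        System.out.println("closed = " + iter.isClosed());
    }
}

// Hypothetical interface: an Iterator that can also be closed.
interface ClosableIterator<T> extends Iterator<T>, AutoCloseable {
    @Override
    void close(); // narrowed so callers do not have to handle a checked exception
}

// Hypothetical implementation backed by an in-memory collection.
final class ListBackedIterator<T> implements ClosableIterator<T> {
    private final Iterator<T> delegate;
    private boolean closed = false;

    ListBackedIterator(Iterable<T> source) {
        this.delegate = source.iterator();
    }

    @Override
    public boolean hasNext() {
        // once closed, the iterator reports that nothing is left
        return !closed && delegate.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return delegate.next();
    }

    @Override
    public void close() {
        closed = true; // a real implementation would release its resources here
    }

    boolean isClosed() {
        return closed;
    }
}
```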
Parsing Files - If using DataStores that wrap genomic data files is not sufficient for your particular use case, the data files may be parsed directly. Jillion can parse many different genomic file formats, including binary encoded files. Each parser uses a “push approach” event notification system similar to the Visitor pattern.
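A "push" parser of this kind can be sketched with a small callback interface: the parser reads the data and calls the visitor each time it has something to report. The RecordVisitor interface and SimpleFastaParser class below are hypothetical names made up for this illustration and do not reflect Jillion's actual parser or visitor interfaces.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class PushParserDemo {
    public static void main(String[] args) {
        String data = ">read1 first read\nACGT\nTTGA\n>read2\nGGCC\n";
        final List<String> seen = new ArrayList<>();
        SimpleFastaParser.parse(new StringReader(data), new RecordVisitor() {
            @Override
            public void visitRecord(String id, String sequence) {
                seen.add(id + "=" + sequence);
            }

            @Override
            public void visitEnd() {
                seen.add("end");
            }
        });
        System.out.println(seen);
    }
}

// Hypothetical callback interface: the parser pushes events to it.
interface RecordVisitor {
    void visitRecord(String id, String sequence);

    void visitEnd();
}

// Hypothetical minimal parser for FASTA-like text.
final class SimpleFastaParser {
    private SimpleFastaParser() { }

    static void parse(Reader input, RecordVisitor visitor) {
        BufferedReader reader = new BufferedReader(input);
        String currentId = null;
        StringBuilder sequence = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    // a new header line ends the previous record, if any
                    if (currentId != null) {
                        visitor.visitRecord(currentId, sequence.toString());
                    }
                    // the ID is the header text up to the first whitespace
                    currentId = line.substring(1).trim().split("\\s+", 2)[0];
                    sequence.setLength(0);
                } else if (currentId != null) {
                    sequence.append(line.trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (currentId != null) {
            visitor.visitRecord(currentId, sequence.toString());
        }
        visitor.visitEnd();
    }
}
```

The caller never holds the whole file in memory; it only reacts to the events it cares about, which is the main appeal of the push approach.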
The Fasta module (org.jcvi.jillion.fasta) contains classes for reading and writing FASTA encoded files for nucleotide, protein, quality and Sanger position sequences. DataStores can be used to represent multi-FASTA files.
FastaRecord - An individual read from a FASTA file is referred to as a FastaRecord, which contains an ID, a sequence and an optional comment. There are FastaRecord implementations for nucleotide, protein, quality and Sanger position sequences.
FastaDataStore - Object representation of a multi-FASTA file.
FastaWriter - The FastaWriter interface is used to write out FASTA encoded files. There are implementations for nucleotide, protein, quality and Sanger position FastaRecords.
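To make the record layout concrete, here is a minimal sketch of how a record with an ID, an optional comment and a sequence maps onto FASTA text. FastaFormatter and its lineWidth parameter are hypothetical names for this example only; it does not use or mirror Jillion's FastaWriter.

```java
class FastaFormatDemo {
    public static void main(String[] args) {
        System.out.print(FastaFormatter.format("read1", "example comment", "ACGTACGTAC", 4));
    }
}

// Hypothetical helper that renders one record as FASTA text.
final class FastaFormatter {
    private FastaFormatter() { }

    static String format(String id, String comment, String sequence, int lineWidth) {
        if (lineWidth < 1) {
            throw new IllegalArgumentException("lineWidth must be >= 1");
        }
        // header line: '>' followed by the ID and, if present, the comment
        StringBuilder sb = new StringBuilder(">").append(id);
        if (comment != null && !comment.isEmpty()) {
            sb.append(' ').append(comment);
        }
        sb.append('\n');
        // sequence lines: wrap the sequence every lineWidth characters
        for (int start = 0; start < sequence.length(); start += lineWidth) {
            int end = Math.min(start + lineWidth, sequence.length());
            sb.append(sequence, start, end).append('\n');
        }
        return sb.toString();
    }
}
```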
Jillion considers a trace to be a genomic object that has an ID as well as nucleotide and quality sequences. This makes the output of most sequencing machines "traces". The org.jcvi.jillion.trace module and its file format specific subpackages support many different trace file formats for both Sanger and next-gen sequencers.
Next-Gen Trace Files Supported
The Fastq page explains all the classes and capabilities of Jillion's fastq package. Jillion can read and write Fastq files encoded in SANGER, ILLUMINA or SOLEXA quality formats. It is also possible to convert from any of these formats into any other format.
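The difference between these encodings comes down to how a quality score is mapped to an ASCII character: SANGER stores Phred scores with an offset of 33, ILLUMINA (1.3 to 1.7) stores Phred scores with an offset of 64, and SOLEXA stores Solexa-scaled scores with an offset of 64. The sketch below illustrates the arithmetic only; QualityEncodings and its methods are hypothetical names and are not Jillion's fastq API.

```java
class QualityEncodingDemo {
    public static void main(String[] args) {
        System.out.println(QualityEncodings.sangerToIllumina("I5!"));
        System.out.println(QualityEncodings.solexaToPhred(-5));
    }
}

// Hypothetical helper illustrating FASTQ quality encoding arithmetic.
final class QualityEncodings {
    static final int SANGER_OFFSET = 33;
    static final int ILLUMINA_OFFSET = 64;

    private QualityEncodings() { }

    // Turn an encoded quality string into numeric scores.
    static int[] decode(String encoded, int offset) {
        int[] scores = new int[encoded.length()];
        for (int i = 0; i < encoded.length(); i++) {
            int score = encoded.charAt(i) - offset;
            if (score < 0) {
                throw new IllegalArgumentException(
                        "character '" + encoded.charAt(i) + "' is below offset " + offset);
            }
            scores[i] = score;
        }
        return scores;
    }

    // Turn numeric scores back into an encoded quality string.
    static String encode(int[] scores, int offset) {
        StringBuilder sb = new StringBuilder(scores.length);
        for (int score : scores) {
            sb.append((char) (score + offset));
        }
        return sb.toString();
    }

    // Both formats hold Phred scores, so only the ASCII offset changes.
    static String sangerToIllumina(String sangerEncoded) {
        return encode(decode(sangerEncoded, SANGER_OFFSET), ILLUMINA_OFFSET);
    }

    // Solexa scores use a different scale: Q_phred = 10 * log10(10^(Q_solexa / 10) + 1).
    static int solexaToPhred(int solexaScore) {
        return (int) Math.round(10 * Math.log10(Math.pow(10, solexaScore / 10.0) + 1));
    }
}
```

For high scores the Solexa and Phred scales are nearly identical; they differ noticeably only for low-quality bases, where Solexa scores can be negative.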
Sanger Trace Files Supported
The Chromatogram page explains how Jillion can read and write chromatogram objects encoded in the ztr and scf formats, as well as read abi-formatted chromatogram files. It is also possible to convert from any of these formats into any other writable format.
The org.jcvi.jillion.assembly module contains classes for working with output from genome assemblers. Jillion can handle contigs created by de-novo assemblers or by reference assemblers. It is even possible to create contig objects from "scratch".
- Contig Objects - A Jillion Contig object is the base class that all contigs derive from. Contig objects have the consensus sequence and all the underlying read alignments and gapped sequences that provide coverage for the consensus.
- CoverageMap - A CoverageMap is an object that contains coverage information for a contiguous range of offset values. CoverageMaps can be created from contigs to get the depth of coverage at each point in the contig, or from any collection of objects that implement the Rangeable interface.
- SliceMap - Get the Slice representation of a Contig. Slices can be used for variant detection and consensus recalling.
- Consensus Recalling - Jillion supports many different consensus calling algorithms which can be used to change a contig's consensus.
- Contig Builders - Contig objects are immutable. Use ContigBuilders to modify already existing objects or to create new contigs from "scratch".
- Contig DataStores - DataStore implementations that wrap assembly files for many common assemblers including:
- Consed/Phrap ace - the .ace file is the file format used by the Phrap assembler and consed contig viewer/editor
- Celera Assembler asm - the .asm file is the output from the Celera Assembler
- CLC Bio cas - the .cas file is the output from the CLC Bio Assembly-Cell reference assembler.
- TIGR tasm - the .tasm file format is the output from the TIGR Assembler.
- TIGR contig - the .contig file format is a contig file format mostly used by internal TIGR applications. It is a more concise output than the verbose .tasm file.
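The depth-of-coverage idea behind the CoverageMap entry above can be sketched independently of Jillion. The example below assumes zero-based, inclusive read coordinates and uses hypothetical SimpleRange and CoverageDepth classes; Jillion's own Range and CoverageMap types are richer and are not used here.

```java
import java.util.Arrays;
import java.util.List;

class CoverageDemo {
    public static void main(String[] args) {
        List<SimpleRange> reads = Arrays.asList(
                new SimpleRange(0, 4), new SimpleRange(2, 6), new SimpleRange(5, 9));
        System.out.println(Arrays.toString(CoverageDepth.depthPerOffset(reads, 10)));
    }
}

// Hypothetical immutable range with zero-based, inclusive coordinates.
final class SimpleRange {
    final int begin;
    final int end;

    SimpleRange(int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid range " + begin + ".." + end);
        }
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical helper that computes how many ranges cover each offset.
final class CoverageDepth {
    private CoverageDepth() { }

    static int[] depthPerOffset(List<SimpleRange> reads, int consensusLength) {
        // delta[i] holds the change in depth when moving onto offset i
        int[] delta = new int[consensusLength + 1];
        for (SimpleRange read : reads) {
            // clip each read to the consensus length
            int begin = Math.min(read.begin, consensusLength);
            int endExclusive = Math.min(read.end + 1, consensusLength);
            if (begin < endExclusive) {
                delta[begin]++;
                delta[endExclusive]--;
            }
        }
        // a running sum over the deltas gives the depth at each offset
        int[] depth = new int[consensusLength];
        int running = 0;
        for (int i = 0; i < consensusLength; i++) {
            running += delta[i];
            depth[i] = running;
        }
        return depth;
    }
}
```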
The org.jcvi.jillion.maq module contains classes for working with binary encoded MAQ formats such as .bfq and .bfa files.
- Binary Fastq - package for reading and writing .bfq files.
- Binary Fasta - package for reading and writing .bfa files.
Jillion now supports reading and writing SAM and BAM files. The org.jcvi.jillion.sam module contains the classes for doing so. Jillion can also re-sort SAM and BAM files as well as read and write BAM indexes (.bai files).
- SamHeader - class for reading, writing and modifying information stored in a SAM or BAM header.
- Cigar - package for working with CIGAR data.
- SamRecord - class that represents a single line in a SAM file.
- Parsing SAM and BAM files
- SamFileDataStore - A special DataStore implementation that represents a single SAM or BAM file.
- SamWriter - classes and interfaces for writing SAM and BAM files.
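As a small illustration of the CIGAR data mentioned above, the sketch below parses a CIGAR string and computes how many reference and read bases the alignment spans. It follows the operation definitions in the SAM specification (M, D, N, = and X consume the reference; M, I, S, = and X consume the read). CigarSketch is a hypothetical class written for this example and is not Jillion's Cigar API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CigarDemo {
    public static void main(String[] args) {
        String cigar = "3S10M2I5M1D4M";
        System.out.println("reference length = " + CigarSketch.referenceLength(cigar));
        System.out.println("read length = " + CigarSketch.readLength(cigar));
    }
}

// Hypothetical helper for summing CIGAR element lengths.
final class CigarSketch {
    // one element is a length followed by a single operation code
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    private CigarSketch() { }

    // Number of reference bases spanned by the alignment.
    static int referenceLength(String cigar) {
        return sumLengths(cigar, "MDN=X");
    }

    // Number of bases in the read sequence (soft clips included).
    static int readLength(String cigar) {
        return sumLengths(cigar, "MIS=X");
    }

    private static int sumLengths(String cigar, String countedOps) {
        Matcher matcher = ELEMENT.matcher(cigar);
        int total = 0;
        int consumed = 0;
        while (matcher.find()) {
            // every element must start exactly where the previous one ended
            if (matcher.start() != consumed) {
                throw new IllegalArgumentException("invalid CIGAR: " + cigar);
            }
            consumed = matcher.end();
            int length = Integer.parseInt(matcher.group(1));
            char op = matcher.group(2).charAt(0);
            if (countedOps.indexOf(op) >= 0) {
                total += length;
            }
        }
        if (consumed != cigar.length()) {
            throw new IllegalArgumentException("invalid CIGAR: " + cigar);
        }
        return total;
    }
}
```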