Jillion
The Java Informatics Large Library fOr geNomics (Jillion) is a genomics software library written in Java to support bioinformatics. The library was created by a single software engineer at the J. Craig Venter Institute (JCVI) and has been used by several projects, including the Influenza Genome Project, Leptospira and the Human Microbiome Project.
Jillion requires Java 6 or later; however, Java 7 is recommended because its try-with-resources feature removes boilerplate code by automatically closing many of Jillion's frequently used objects, including DataStores, StreamingIterators and writers.
Jillion is open source and licensed under the GPL v3.0.
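The difference between the two Java versions can be sketched as follows. This is a minimal illustration only: `DemoDataStore` and its `get` method are hypothetical stand-ins that implement `AutoCloseable`, not Jillion's actual classes, and the returned record is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a closeable resource; NOT Jillion's real API.
class DemoDataStore implements AutoCloseable {
    private final List<String> log;
    private boolean closed = false;

    DemoDataStore(List<String> log) {
        this.log = log;
        log.add("open");
    }

    String get(String id) {
        if (closed) {
            throw new IllegalStateException("data store is closed");
        }
        log.add("get " + id);
        return "ACGT"; // placeholder record
    }

    @Override
    public void close() {
        closed = true;
        log.add("close");
    }
}

class TryWithResourcesSketch {
    // Java 6 style: the resource must be closed by hand in a finally block.
    static String readJava6Style(List<String> log) {
        DemoDataStore store = new DemoDataStore(log);
        try {
            return store.get("read1");
        } finally {
            store.close();
        }
    }

    // Java 7 style: try-with-resources calls close() automatically.
    static String readJava7Style(List<String> log) {
        try (DemoDataStore store = new DemoDataStore(log)) {
            return store.get("read1");
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<String>();
        System.out.println(readJava7Style(log));
        System.out.println(log);
    }
}
```

Both methods open, read and close the store in the same order; the Java 7 form simply leaves the `close()` call to the compiler.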
How is Jillion Different from BioJava?
BioJava is another Java library for bioinformatics that is similar to Jillion. Both libraries support some common bioinformatics read formats such as FASTA and FASTQ, but there the similarities end. BioJava focuses mainly on input reads and genome annotation, whereas Jillion focuses on genome assembly. Jillion has object representations of contigs as well as parsers and writers for many common assembly file formats such as Consed's ACE format.
Sequence Support
Like BioJava, Jillion can handle various read input formats such as fasta, fastq and scf encoded files, but Jillion can also natively handle other formats such as sff, ztr and abi chromatograms. Sequence objects also have different implementations depending on the use case and type of data. For example, a NucleotideSequence object that contains only the nucleotides A, C, G and T can store each nucleotide in 2 bits. A different implementation that stores each nucleotide in 4 bits is used if the sequence contains ambiguous bases. Since quality sequences often have consecutive quality scores of the same value, a run-length encoded implementation can compactly store the qualities of reads, or even of contig consensus sequences, in only a few bytes.
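The two compact encodings described above can be sketched as follows. This is an illustration of the general techniques under simple assumptions, not Jillion's actual implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of 2-bit base packing and run-length encoded qualities.
class CompactEncodingSketch {
    private static final String BASES = "ACGT"; // index in this string = 2-bit code

    // Pack an unambiguous sequence at 2 bits per base (4 bases per byte).
    static byte[] pack2Bit(String seq) {
        byte[] packed = new byte[(seq.length() + 3) / 4];
        for (int i = 0; i < seq.length(); i++) {
            int code = BASES.indexOf(seq.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("not an unambiguous base: " + seq.charAt(i));
            }
            packed[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
        return packed;
    }

    // The length is passed separately because padding bits look like 'A'.
    static String unpack2Bit(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
            sb.append(BASES.charAt(code));
        }
        return sb.toString();
    }

    // Run-length encode quality scores as {value, runLength} pairs.
    static List<int[]> runLengthEncode(int[] quals) {
        List<int[]> runs = new ArrayList<int[]>();
        for (int q : quals) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == q) {
                last[1]++;
            } else {
                runs.add(new int[] {q, 1});
            }
        }
        return runs;
    }

    // Expand the {value, runLength} pairs back into one score per base.
    static int[] runLengthDecode(List<int[]> runs) {
        int total = 0;
        for (int[] run : runs) {
            total += run[1];
        }
        int[] quals = new int[total];
        int pos = 0;
        for (int[] run : runs) {
            for (int k = 0; k < run[1]; k++) {
                quals[pos++] = run[0];
            }
        }
        return quals;
    }

    public static void main(String[] args) {
        byte[] packed = pack2Bit("ACGTACGTAC");
        System.out.println(packed.length + " bytes: " + unpack2Bit(packed, 10));
        int[] quals = {30, 30, 30, 30, 20, 20, 40};
        System.out.println(runLengthEncode(quals).size() + " runs");
    }
}
```

A sequence with ambiguous bases would need a larger alphabet and therefore more bits per base, which is why a separate 4-bit implementation is useful.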
| Format | Version | BioJava Read | BioJava Write | Jillion Read | Jillion Write |
|---|---|---|---|---|---|
| Abi | Yes | Yes | |||
| Ztr | 1.2 | No | No | Yes | Yes |
| Scf | 2 | Yes | No | Yes | Yes |
| Scf | 3 | Yes | No | Yes | Yes |
| Format | Encoding | BioJava Read | BioJava Write | Jillion Read | Jillion Write |
|---|---|---|---|---|---|
| Fasta | nucleotide | Yes | Yes | Yes | Yes |
| Fasta | qualities | No | No | Yes | Yes |
| Fastq | sanger/solexa/illumina | Yes | Yes | Yes | Yes |
| sff | | No | No | Yes | Yes |
Contig Support
Jillion has objects that represent contigs produced by several assembler programs that are used internally by JCVI, including Phrap/Consed .ace files, Celera Assembler .asm files and CLC Bio Assembly Cell .cas files, among others. Each contig object not only has the contig consensus sequence but also includes all the underlying read information. Coupled with support for all the various read formats, this makes it possible to analyze, edit and write out new assembly files. Even though all the underlying read data is stored for each contig, memory usage is kept low. Nucleotide sequence objects for reads that have been assembled into a contig can be encoded to store only a pointer to the contig consensus sequence, the read's start offset into the consensus and any differences between the aligned read sequence and the contig consensus. This greatly reduces the memory needed to store the underlying contig data, since most reads in an assembly have a high identity to the consensus sequence and therefore few differences.
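The reference-based encoding described above can be sketched as follows. This is a minimal illustration, not Jillion's actual class: `ReferenceEncodedRead` and its methods are hypothetical names, and the sketch assumes the read and the consensus are already aligned in the same gapped coordinate space, so that each read position maps to exactly one consensus position.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a read stored as a consensus reference, an offset and its differences.
class ReferenceEncodedRead {
    private final String consensus;   // shared reference to the consensus, not a copy
    private final int startOffset;    // 0-based start of the read in the consensus
    private final int length;         // number of aligned read positions
    private final TreeMap<Integer, Character> differences; // read offset -> read base

    private ReferenceEncodedRead(String consensus, int startOffset, int length,
                                 TreeMap<Integer, Character> differences) {
        this.consensus = consensus;
        this.startOffset = startOffset;
        this.length = length;
        this.differences = differences;
    }

    // Keep only the positions where the read disagrees with the consensus.
    static ReferenceEncodedRead encode(String consensus, int startOffset, String readBases) {
        if (startOffset < 0 || startOffset + readBases.length() > consensus.length()) {
            throw new IllegalArgumentException("read does not fit inside the consensus");
        }
        TreeMap<Integer, Character> diffs = new TreeMap<Integer, Character>();
        for (int i = 0; i < readBases.length(); i++) {
            if (readBases.charAt(i) != consensus.charAt(startOffset + i)) {
                diffs.put(i, readBases.charAt(i));
            }
        }
        return new ReferenceEncodedRead(consensus, startOffset, readBases.length(), diffs);
    }

    // Rebuild the read: copy the consensus slice, then apply the stored differences.
    String decode() {
        char[] bases = consensus.substring(startOffset, startOffset + length).toCharArray();
        for (Map.Entry<Integer, Character> e : differences.entrySet()) {
            bases[e.getKey()] = e.getValue();
        }
        return new String(bases);
    }

    int differenceCount() {
        return differences.size();
    }

    public static void main(String[] args) {
        String consensus = "ACGTACGTACGT";
        ReferenceEncodedRead read = encode(consensus, 2, "GTATGT");
        System.out.println(read.decode() + " stored with " + read.differenceCount() + " difference(s)");
    }
}
```

Because every read shares the same consensus object, the per-read cost is roughly the offset, the length and one entry per mismatch, which stays small when reads closely match the consensus.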
| Format | BioJava Read | BioJava Write | Jillion Read | Jillion Write |
|---|---|---|---|---|
| Phrap/Consed .ace | No | No | Yes | Yes |
| Celera .asm | No | Yes | ||
| CLC Bio .cas | No | No | Yes | Yes |
| TIGR .contig | No | No | Yes | Yes |
| TIGR .tasm | No | No | Yes | Yes |
Funding
This work has been funded in whole or part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under contract number HHSN272200900007C.