Main Page


Jillion

The Java Informatics Large Library fOr geNomics (Jillion) is a genomics software library written in Java to support bioinformatics. This library was created by a single Software Engineer at the J. Craig Venter Institute (JCVI) and is used by several projects, including The Influenza Genome Project, Leptospira and the Human Microbiome Project.

Jillion requires Java 6 or later; however, Java 7 is recommended because the Java 7 "try-with-resources" statement removes boilerplate code by automatically closing many of Jillion's frequently used objects, including DataStores, StreamingIterators and writers.
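The difference between the two Java versions can be sketched as follows. This is a minimal illustration only: ClosableRecordIterator and TryWithResourcesDemo are hypothetical stand-ins written for this example and are not Jillion's actual classes. The sketch assumes only that the object being iterated implements java.io.Closeable.

```java
import java.io.Closeable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a closeable iterator over records (NOT a Jillion class).
final class ClosableRecordIterator implements Iterator<String>, Closeable {
    private final Iterator<String> delegate;
    private boolean closed = false;

    ClosableRecordIterator(List<String> records) {
        this.delegate = records.iterator();
    }

    @Override
    public boolean hasNext() {
        return !closed && delegate.hasNext();
    }

    @Override
    public String next() {
        return delegate.next();
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException("read-only iterator");
    }

    @Override
    public void close() {
        closed = true; // a real implementation would release file handles here
    }

    boolean isClosed() {
        return closed;
    }
}

final class TryWithResourcesDemo {
    // Java 6 style: the caller must remember the finally block.
    static int countJava6Style(ClosableRecordIterator iter) {
        int count = 0;
        try {
            while (iter.hasNext()) {
                iter.next();
                count++;
            }
        } finally {
            iter.close();
        }
        return count;
    }

    // Java 7 style: the resource is closed automatically when the block exits.
    static int countJava7Style(ClosableRecordIterator iter) {
        int count = 0;
        try (ClosableRecordIterator it = iter) {
            while (it.hasNext()) {
                it.next();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("read1", "read2", "read3");
        System.out.println(countJava6Style(new ClosableRecordIterator(records)));
        System.out.println(countJava7Style(new ClosableRecordIterator(records)));
    }
}
```

Both methods close the iterator even if an exception is thrown while iterating; the Java 7 form simply does so without an explicit finally block.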

Jillion is open source and licensed under the GPL v3.0.



How is Jillion Different Than BioJava and Picard?

BioJava and Picard are other Java libraries for bioinformatics that are similar to Jillion. Each of these libraries supports some common bioinformatic read formats such as FASTA and FASTQ, but there the similarities end. BioJava focuses mainly on input reads and genome annotation, whereas Jillion focuses on genome assembly. Picard focuses mainly on SAM alignment data. Jillion supports not only input reads and alignments but also has object representations of contigs as well as parsers and writers for many common assembly file formats such as SAM/BAM and Consed's ACE format.

Sequence Support

Like BioJava, Jillion can handle various read input formats such as fasta, fastq, and scf encoded files, but Jillion can also natively handle other formats such as sff, ztr and abi chromatograms. Sequence objects also have different implementations depending on the use case and type of data. For example, a NucleotideSequence object which contains only the nucleotides A, C, G and T could represent each nucleotide in 2 bits. A different implementation that stores each nucleotide in 4 bits would be used if the sequence contained ambiguous bases. Since quality sequences often have consecutive quality scores of the same value, a run-length implementation can compactly store reads or even contig consensus qualities in only a few bytes.
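The two compact encodings described above can be illustrated with a short sketch. The class and method names here (CompactSequenceSketch, packTwoBit, runLengthEncode and so on) are hypothetical and chosen for this example; they are not Jillion's API, and Jillion's internal bit layout may differ. The sketch assumes bases are packed four per byte, lowest bits first, and that quality runs are stored as {value, runLength} pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helper, not part of Jillion.
final class CompactSequenceSketch {
    private static final String BASES = "ACGT"; // index in this string is the 2-bit code

    // Packs an unambiguous A/C/G/T string into 2 bits per base (4 bases per byte).
    static byte[] packTwoBit(String bases) {
        byte[] packed = new byte[(bases.length() + 3) / 4];
        for (int i = 0; i < bases.length(); i++) {
            int code = BASES.indexOf(bases.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("not an unambiguous base: " + bases.charAt(i));
            }
            packed[i / 4] |= code << ((i % 4) * 2);
        }
        return packed;
    }

    // Reverses packTwoBit; the length is needed because the last byte may be partly unused.
    static String unpackTwoBit(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
            sb.append(BASES.charAt(code));
        }
        return sb.toString();
    }

    // Run-length encodes quality scores; each element is {qualityValue, runLength}.
    static List<int[]> runLengthEncode(int[] qualities) {
        List<int[]> runs = new ArrayList<int[]>();
        for (int q : qualities) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == q) {
                last[1]++; // extend the current run
            } else {
                runs.add(new int[] {q, 1}); // start a new run
            }
        }
        return runs;
    }

    // Expands the runs back into one quality score per position.
    static int[] runLengthDecode(List<int[]> runs) {
        int total = 0;
        for (int[] run : runs) {
            total += run[1];
        }
        int[] qualities = new int[total];
        int pos = 0;
        for (int[] run : runs) {
            for (int i = 0; i < run[1]; i++) {
                qualities[pos++] = run[0];
            }
        }
        return qualities;
    }

    public static void main(String[] args) {
        byte[] packed = packTwoBit("ACGTTGCA");
        System.out.println(packed.length + " bytes for 8 bases");
        System.out.println(unpackTwoBit(packed, 8));
        System.out.println(runLengthEncode(new int[] {40, 40, 40, 40, 30, 30, 40}).size() + " runs");
    }
}
```

A 4-bit variant for ambiguous bases would follow the same pattern with a 16-symbol alphabet and two bases per byte.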

Sanger Chromatogram Format read and write support for both BioJava and Jillion

Format | Version | BioJava Read | BioJava Write | Picard Read | Picard Write | Jillion Read | Jillion Write
Abi | Yes | No | Yes
Ztr | 1.2 | No | No | No | No | Yes | Yes
Scf | 2 | Yes | No | No | No | Yes | Yes
Scf | 3 | Yes | No | No | No | Yes | Yes


Jillion and BioJava can both read and write fasta and fastq files, but only Jillion supports sff files. Jillion has been tested on sff files produced by 454 and Ion Torrent.

Format | Encoding | BioJava Read | BioJava Write | Picard Read | Picard Write | Jillion Read | Jillion Write
Fasta | nucleotide | Yes | Yes | Yes | No | Yes | Yes
Fasta | protein | Yes | Yes | No | No | Yes | Yes
Fasta | qualities | No | No | No | No | Yes | Yes
Fasta | positions | No | No | No | No | Yes | Yes
Fastq | sanger/solexa/illumina | Yes | Yes | Yes | Sanger only | Yes | Yes
sff | | No | No | No | No | Yes | Yes
bfa (MAQ binary fasta) | | No | No | Yes | Yes | Yes | Yes
bfq (MAQ binary fastq) | | No | No | Yes | Yes | Yes | Yes
sam | | No | No | Yes | Yes | Yes | Yes
bam | | No | No | Yes | Yes | Yes | Yes

Contig Support

Jillion has objects that represent contigs produced by several assembler programs that are used internally by JCVI, including Phrap/Consed .ace files, Celera Assembler .asm files and CLC Bio Assembly Cell .cas files, among others. Each contig object not only has the contig consensus sequence but also includes all the underlying read information. Coupled with support for all the various read formats, this makes it possible to analyze, edit and write out new assembly files. Even though all the underlying read data is stored for each contig, memory usage is kept low. Nucleotide sequence objects for reads that have been assembled into a contig can be encoded to store only a pointer to the contig consensus sequence, the read's start offset into the consensus and any differences between the aligned read sequence and the contig consensus. This greatly reduces the memory needed to store the underlying contig data, since most reads in an assembly have a high identity to the consensus sequence and therefore few differences.
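The reference-encoding idea described above can be sketched in a few lines. ReferenceEncodedRead is a hypothetical class written for this illustration and is not Jillion's implementation. The sketch assumes the read is already aligned base-for-base against the consensus (gaps, if any, are represented as ordinary characters such as '-'), so the read and the consensus region it covers have the same length.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical class, not part of Jillion.
final class ReferenceEncodedRead {
    private final String consensus;   // shared reference to the contig consensus, never copied
    private final int startOffset;    // where the read begins in the consensus
    private final int length;         // aligned length of the read
    private final Map<Integer, Character> differences; // read offset -> differing base

    private ReferenceEncodedRead(String consensus, int startOffset, int length,
                                 Map<Integer, Character> differences) {
        this.consensus = consensus;
        this.startOffset = startOffset;
        this.length = length;
        this.differences = differences;
    }

    // Records only the positions where the aligned read disagrees with the consensus.
    static ReferenceEncodedRead encode(String consensus, int startOffset, String alignedRead) {
        if (startOffset < 0 || startOffset + alignedRead.length() > consensus.length()) {
            throw new IllegalArgumentException("read does not fit inside the consensus");
        }
        Map<Integer, Character> diffs = new TreeMap<Integer, Character>();
        for (int i = 0; i < alignedRead.length(); i++) {
            char base = alignedRead.charAt(i);
            if (base != consensus.charAt(startOffset + i)) {
                diffs.put(i, base);
            }
        }
        return new ReferenceEncodedRead(consensus, startOffset, alignedRead.length(), diffs);
    }

    // Rebuilds the full read by copying the consensus region and applying the differences.
    String decode() {
        StringBuilder sb = new StringBuilder(
                consensus.substring(startOffset, startOffset + length));
        for (Map.Entry<Integer, Character> e : differences.entrySet()) {
            sb.setCharAt(e.getKey(), e.getValue());
        }
        return sb.toString();
    }

    int differenceCount() {
        return differences.size();
    }

    public static void main(String[] args) {
        String consensus = "ACGTACGTACGT";
        ReferenceEncodedRead read = encode(consensus, 2, "GTACCTAC");
        System.out.println(read.differenceCount() + " difference(s) stored");
        System.out.println(read.decode());
    }
}
```

With this layout, a read that matches the consensus exactly costs only the reference, the offset and the length, regardless of how long the read is.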


Unlike BioJava and Picard, Jillion can read and write several different assembly output formats. The Jillion contig objects include the consensus sequence as well as all the underlying sequence read data.

Format | BioJava Read | BioJava Write | Picard Read | Picard Write | Jillion Read | Jillion Write
Phrap/Consed .ace | No | No | No | No | Yes | Yes
Celera .asm | No | No | Yes
CLC Bio .cas | No | No | Yes
TIGR .contig | No | No | No | No | Yes | Yes
TIGR .tasm | No | No | No | No | Yes | Yes

Funding

This work has been funded in whole or part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under contract number HHSN272200900007C.
