Main Page

Jillion

The Java Informatics Large Library fOr geNomics (Jillion) is an open source genomics software library written in Java to support bioinformatics. It was created by a single software engineer at the J. Craig Venter Institute (JCVI) and has been used by several projects, including the Influenza Genome Project, Leptospira and the Human Microbiome Project, and in over 20,000 whole and draft viral genome submissions to GenBank.

In September 2015, the Viral Pathogen Database and Analysis Resource (ViPR) added a new Rotavirus A Genotype Detection Tool that is written using Jillion and is over an order of magnitude faster than other similar web tools.

Open Source Licenses

Jillion 5 is open source and licensed under LGPL 2.1

Jillion 4 is open source and licensed under GPL v3.0

Jillion 5.1 Released

2015-12-10 : Jillion 5.1 is now released. Download Jillion 5.1

New Features

  • Support for reading and writing Samtools fasta index files (.fai).
  • NucleotideFastaDataStore and ProteinFastaDataStore now use a fasta index file, if one is present, for faster random access lookups.
  • Added new getSequence() and getSubSequence() methods to FastaDataStore to quickly get a whole sequence, or part of a sequence, from a fasta file.
  • A few bug fixes

See the Jillion 5.1 Release Notes for more details
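
To illustrate why a fasta index speeds up random access, here is a minimal sketch of the offset arithmetic that a Samtools .fai file makes possible. Each .fai line holds five tab-separated fields: sequence name, sequence length, byte offset of the first base, bases per line, and bytes per line. This is not Jillion's implementation or API; FaiEntry and its methods are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical holder for one line of a Samtools .fai index (not a Jillion class). */
final class FaiEntry {
    final String name;
    final long length;    // number of bases in the sequence
    final long offset;    // byte offset of the first base in the fasta file
    final int lineBases;  // bases on each full line
    final int lineWidth;  // bytes on each full line, including the line terminator

    FaiEntry(String name, long length, long offset, int lineBases, int lineWidth) {
        this.name = name;
        this.length = length;
        this.offset = offset;
        this.lineBases = lineBases;
        this.lineWidth = lineWidth;
    }

    /** Parse one tab-delimited .fai line: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH. */
    static FaiEntry parse(String line) {
        String[] f = line.split("\t");
        return new FaiEntry(f[0], Long.parseLong(f[1]), Long.parseLong(f[2]),
                Integer.parseInt(f[3]), Integer.parseInt(f[4]));
    }

    /** Byte position in the fasta file of the base at a 0-based sequence offset. */
    long byteOffsetOf(long basePos) {
        if (basePos < 0 || basePos >= length) {
            throw new IndexOutOfBoundsException("base offset " + basePos);
        }
        return offset + (basePos / lineBases) * lineWidth + (basePos % lineBases);
    }

    /** Read bases [start, end) by seeking to the first byte and skipping line terminators. */
    String readSubSequence(RandomAccessFile fasta, long start, long end) throws IOException {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("range " + start + ".." + end);
        }
        StringBuilder bases = new StringBuilder();
        if (start == end) {
            return "";
        }
        fasta.seek(byteOffsetOf(start));
        while (bases.length() < end - start) {
            int b = fasta.read();
            if (b < 0) {
                break; // unexpected end of file
            }
            if (b != '\n' && b != '\r') {
                bases.append((char) b);
            }
        }
        return bases.toString();
    }
}
```

Because the byte position is computed directly from the index entry, the reader never has to scan the records that precede the one it wants.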

Jillion 5 Released

2015-10-26 : Jillion 5.0 is now released. Download Jillion 5

Over 300 commits were made by a single developer and include many new features. Jillion 5 now requires Java 8 since it leverages Java 8 Lambda expressions for a more concise and simpler API.
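
As a rough illustration of why lambdas make a filtering API more concise, the sketch below applies the same filter twice, once as an anonymous class and once as a lambda. The class, the method names and the "read" prefix rule are all hypothetical and are not part of Jillion; only java.util.function.Predicate is a real Java 8 type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical example class, not part of Jillion. */
final class LambdaFilterSketch {

    /** Keep only the ids accepted by the given filter. */
    static List<String> filter(List<String> ids, Predicate<String> accept) {
        List<String> kept = new ArrayList<>();
        for (String id : ids) {
            if (accept.test(id)) {
                kept.add(id);
            }
        }
        return kept;
    }

    /** The filter written as an anonymous class: five lines of ceremony. */
    static List<String> anonymousClassStyle(List<String> ids) {
        return filter(ids, new Predicate<String>() {
            @Override
            public boolean test(String id) {
                return id.startsWith("read");
            }
        });
    }

    /** The same filter written as a Java 8 lambda. */
    static List<String> lambdaStyle(List<String> ids) {
        return filter(ids, id -> id.startsWith("read"));
    }
}
```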

New Features

  • Java 8 Lambda support
  • Simplified syntax for filtering using lambdas
  • Improved SAM and BAM parsing and writing, including use of BAM indexes for faster parsing.
  • New GenomeStatistics class that can compute N50 and similar statistics
  • Performance improvements, especially to Fastq file reading and writing.
  • Bug fixes
  • OSGi-compliant jar

See the Jillion 5 Release Notes for more details
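
For readers unfamiliar with N50, the statistic mentioned in the feature list can be sketched as follows: sort the contig lengths from longest to shortest and return the length at which the running total first reaches half of the total assembly length. This is a standalone illustration under that common definition, not the GenomeStatistics API; N50Sketch and n50 are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical example class, not Jillion's GenomeStatistics. */
final class N50Sketch {

    /**
     * Returns the N50 of the given contig lengths: the largest length L such that
     * contigs of length L or longer contain at least half of all bases.
     */
    static long n50(long[] contigLengths) {
        if (contigLengths.length == 0) {
            throw new IllegalArgumentException("at least one contig is required");
        }
        long[] sorted = contigLengths.clone();
        Arrays.sort(sorted); // ascending, so walk it backwards below
        long total = Arrays.stream(sorted).sum();
        long running = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            running += sorted[i];
            if (2 * running >= total) {
                return sorted[i];
            }
        }
        throw new AssertionError("unreachable: the full sum always reaches half the total");
    }
}
```

For example, contigs of lengths 2, 3, 4, 5, 6, 7, 8, 9 and 10 total 54 bases; the three longest contigs (10 + 9 + 8 = 27) reach half of that, so the N50 is 8.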

Old Release Notes

Click Here for the listing of all release notes for all previous versions.

Jillion 4 requires Java 6 or later; however, Java 7 is recommended since some boilerplate code can be removed by using the Java 7 "try-with-resources" feature to automatically close many of Jillion's frequently used objects, including DataStores, StreamingIterators and writers.
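
The boilerplate saving can be sketched with a stand-in class. DemoStore below is hypothetical and only mimics a closeable object such as a DataStore; it is not a Jillion type. The two methods do the same work, first with an explicit finally block and then with try-with-resources.

```java
import java.io.Closeable;

/** Hypothetical stand-in for a closeable object such as a DataStore (not a Jillion class). */
final class DemoStore implements Closeable {
    private boolean closed = false;

    int size() {
        if (closed) {
            throw new IllegalStateException("store is closed");
        }
        return 3;
    }

    @Override
    public void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

final class TryWithResourcesDemo {

    /** Java 6 style: the caller must remember the finally block. */
    static int java6Style(DemoStore store) {
        try {
            return store.size();
        } finally {
            store.close();
        }
    }

    /** Java 7 style: the resource is closed automatically when the block exits. */
    static int java7Style(DemoStore store) {
        try (DemoStore s = store) {
            return s.size();
        }
    }
}
```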


How is Jillion Different Than BioJava and Picard?

BioJava and Picard are other Java libraries for bioinformatics that are similar to Jillion. Each of these libraries supports some common bioinformatics read formats such as FASTA and FASTQ, but there the similarities end. BioJava focuses mainly on input reads and genome annotation, whereas Jillion focuses on genome assembly. Picard focuses mainly on SAM alignment data. Jillion supports not only input reads and alignments but also has object representations of contigs, as well as parsers and writers for many common assembly file formats such as SAM/BAM and Consed's ACE format.

Sequence Support

Like BioJava, Jillion can handle various read input formats such as fasta, fastq and scf encoded files, but Jillion can also natively handle other formats such as sff, ztr and abi chromatograms. Sequence objects also have different implementations depending on the use case and type of data. For example, a NucleotideSequence object that contains only the nucleotides A, C, G and T could store each nucleotide in 2 bits. A different implementation that stores each nucleotide in 4 bits would be used if the sequence contained ambiguous bases. Since quality sequences often have consecutive quality scores of the same value, a run-length encoded implementation can compactly store read qualities, or even contig consensus qualities, in only a few bytes.
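
The two compact encodings described above can be sketched as follows. This is a simplified illustration, not Jillion's internal representation: the class name, the bit layout (four bases per byte, lowest bits first) and the {value, run length} pair format are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example class, not Jillion's actual encoding. */
final class CompactEncodingSketch {
    private static final String BASES = "ACGT"; // index in this string is the 2-bit code

    /** Pack an A/C/G/T-only sequence into 2 bits per base, four bases per byte. */
    static byte[] packTwoBit(String sequence) {
        byte[] packed = new byte[(sequence.length() + 3) / 4];
        for (int i = 0; i < sequence.length(); i++) {
            int code = BASES.indexOf(sequence.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("not A, C, G or T: " + sequence.charAt(i));
            }
            packed[i / 4] |= code << (2 * (i % 4));
        }
        return packed;
    }

    /** Unpack the first {@code length} bases from a 2-bit packed array. */
    static String unpackTwoBit(byte[] packed, int length) {
        StringBuilder bases = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> (2 * (i % 4))) & 0x3;
            bases.append(BASES.charAt(code));
        }
        return bases.toString();
    }

    /** Run-length encode quality scores as {value, runLength} pairs. */
    static List<int[]> runLengthEncode(byte[] qualities) {
        List<int[]> runs = new ArrayList<>();
        for (byte q : qualities) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == q) {
                last[1]++;                     // extend the current run
            } else {
                runs.add(new int[] {q, 1});    // start a new run
            }
        }
        return runs;
    }
}
```

With this layout a 10-base read occupies 3 bytes instead of 10, and a quality string made of three runs is stored as three pairs regardless of its length.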

Sanger Chromatogram Format read and write support for both BioJava and Jillion

Format  Version  BioJava  BioJava  Picard  Picard  Jillion  Jillion
                 Read     Write    Read    Write   Read     Write
Abi              Yes               No              Yes
Ztr     1.2      No       No       No      No      Yes      Yes
Scf     2        Yes      No       No      No      Yes      Yes
Scf     3        Yes      No       No      No      Yes      Yes

The Abi row gives a single value per library.


All the popular bioinformatics libraries can read and write fasta and fastq files, but only Jillion supports sff files. Jillion has been tested on sff files produced by 454 and Ion Torrent.

Format                  Encoding                BioJava  BioJava  Picard  Picard       Jillion  Jillion
                                                Read     Write    Read    Write        Read     Write
Fasta                   nucleotide              Yes      Yes      Yes     No           Yes      Yes
Fasta                   protein                 Yes      Yes      No      No           Yes      Yes
Fasta                   qualities               No       No       No      No           Yes      Yes
Fasta                   positions               No       No       No      No           Yes      Yes
Fasta index (fai)       nucleotide              No       No       Yes     Yes          Yes      Yes
Fasta index (fai)       protein                 No       No       Yes     Yes          Yes      Yes
Fastq                   sanger/solexa/illumina  Yes      Yes      Yes     Sanger only  Yes      Yes
sff                                             No       No       No      No           Yes      Yes
bfa (MAQ binary fasta)                          No       No       Yes     Yes          Yes      Yes
bfq (MAQ binary fastq)                          No       No       Yes     Yes          Yes      Yes

Assembly Support

Jillion has objects that represent contigs produced by several assembler programs used internally by JCVI, including Phrap/Consed .ace files, Celera Assembler .asm files and CLC Bio Assembly Cell .cas files, among others. Each contig object has not only the contig consensus sequence but also all the underlying read information. Coupled with support for all the various read formats, this makes it possible to analyze, edit and write out new assembly files. Even though all the underlying read data is stored for each contig, memory usage is kept low. Nucleotide sequence objects for reads that have been assembled into a contig can be encoded to store only a pointer to the contig consensus sequence, the read's start offset into the consensus, and any differences between the aligned read sequence and the contig consensus. This greatly reduces the memory needed to store the underlying contig data, since most reads in an assembly have a high identity to the consensus sequence and therefore few differences.
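
The reference-based encoding described above can be sketched as follows. This is a simplified illustration, not Jillion's implementation: ReferenceEncodedRead is a hypothetical class, and it assumes the read and the consensus are already gapped so that they align position for position.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical example class: a read stored as differences from a shared consensus. */
final class ReferenceEncodedRead {
    private final String consensus;  // shared by every read in the contig, not copied
    private final int startOffset;   // 0-based offset of the read's first base in the consensus
    private final int length;        // number of (gapped) bases in the read
    private final TreeMap<Integer, Character> differences = new TreeMap<>(); // read offset -> base

    ReferenceEncodedRead(String consensus, int startOffset, String readBases) {
        if (startOffset < 0 || startOffset + readBases.length() > consensus.length()) {
            throw new IllegalArgumentException("read does not fit inside the consensus");
        }
        this.consensus = consensus;
        this.startOffset = startOffset;
        this.length = readBases.length();
        // Only positions where the read disagrees with the consensus are stored.
        for (int i = 0; i < length; i++) {
            char base = readBases.charAt(i);
            if (base != consensus.charAt(startOffset + i)) {
                differences.put(i, base);
            }
        }
    }

    /** Rebuild the full read by copying the consensus slice and applying the differences. */
    String decode() {
        StringBuilder bases =
                new StringBuilder(consensus.substring(startOffset, startOffset + length));
        for (Map.Entry<Integer, Character> diff : differences.entrySet()) {
            bases.setCharAt(diff.getKey(), diff.getValue());
        }
        return bases.toString();
    }

    int differenceCount() {
        return differences.size();
    }
}
```

A read that matches the consensus exactly costs only the reference, an offset and a length; each mismatch adds one map entry.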


Unlike BioJava and Picard, Jillion can read and write several different assembly output formats. The Jillion contig objects include the consensus sequence as well as all the underlying sequence read data.

Format              BioJava  BioJava  Picard  Picard  Jillion  Jillion
                    Read     Write    Read    Write   Read     Write
Phrap/Consed .ace   No       No       No      No      Yes      Yes
Celera .asm         No                No              Yes
CLC Bio .cas        No                No              Yes
TIGR .contig        No       No       No      No      Yes      Yes
TIGR .tasm          No       No       No      No      Yes      Yes
sam                 No       No       Yes     Yes     Yes      Yes
bam                 No       No       Yes     Yes     Yes      Yes

The Celera .asm and CLC Bio .cas rows give a single value per library.

Funding

This work has been funded in whole or part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under contract numbers HHSN272200900007C and U19AI110819.
