DataStores


A DataStore is an abstraction for a repository of multiple genomic objects that can be fetched by an ID. This is very similar to the java.util.Map interface.

Comparison of Jillion DataStore methods and their java.util.Map equivalents:

  DataStore<T>            Map<String, T>
  ----------------------  ------------------------
  contains(String id)     containsKey(String key)
  get(String id)          get(String key)
  getNumberOfRecords()    size()
  idIterator()            keySet().iterator()
  iterator()              values().iterator()
  isClosed()              (no equivalent)
  close()                 (no equivalent)


The DataStore interface abstracts away how implementations store their objects. Many DataStores keep all of their values in memory, but some may wrap flat files on disk, or look up values in a database or from remote URLs.

Using the abstraction simplifies client code and allows implementations to be swapped without breaking that code.
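For example, a helper method can be written purely against the DataStore interface, with no knowledge of how the records are actually stored (a minimal sketch; it assumes the Jillion DataStore and DataStoreException types are imported):

//Illustrative client code written only against the DataStore abstraction;
//it behaves the same whether the records live in memory, in a flat file on
//disk, or somewhere else entirely.  The caller is responsible for closing
//the datastore when it is finished with it.
public static <T> void printRecord(DataStore<T> datastore, String id)
                                        throws DataStoreException{
    if(datastore.contains(id)){          //like Map.containsKey(key)
        T record = datastore.get(id);    //like Map.get(key)
        System.out.println(id + " = " + record);
    }else{
        System.out.println(id + " not found among "
                + datastore.getNumberOfRecords() + " records"); //like Map.size()
    }
}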


Working with DataStores

DataStores have a simple interface:

public interface DataStore<T> extends Closeable{
 
    StreamingIterator<String> idIterator() throws DataStoreException;
 
    T get(String id) throws DataStoreException;
 
    boolean contains(String id) throws DataStoreException;
 
    long getNumberOfRecords() throws DataStoreException;
 
    boolean isClosed() throws DataStoreException;
 
    StreamingIterator<T> iterator() throws DataStoreException;
 
}

StreamingIterators

A StreamingIterator is an iterator that is also Closeable.

Closing a StreamingIterator before it has finished iterating over all of its elements will change the iterator's state to act as if it has finished iterating: iterator.hasNext() will return false and iterator.next() will throw NoSuchElementException.


[Figure: StreamingIterator sequence diagram]

Closing an iterator before it has finished iterating is very useful because it allows stopping iteration early and tells the iterator to clean up its resources. For example, suppose a DataStore implementation wraps the contents of a flat file. Calling datastore.iterator() will open a file handle and read the file, iterating over each record as it goes (similar to a generator in Python). If the client code only needs to iterate over the contents of the datastore until it finds a particular record, then the iterator can be closed once that record is found, since there is no need to iterate over the rest of the file. However, if the client doesn't call iterator.close(), then the iterator and file handle will remain open.
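As a concrete sketch of that scenario (illustrative client code, not part of the Jillion API; it assumes the relevant Jillion types and java.io.IOException are imported), the id scan below stops as soon as a matching record is found, and the finally block makes sure the iterator, and therefore any underlying file handle, gets closed:

//Find the first record whose id starts with the given prefix, stopping the
//iteration as soon as a match is found.
public static <T> T findFirstWithPrefix(DataStore<T> datastore, String idPrefix)
                                        throws DataStoreException, IOException{
    StreamingIterator<String> idIter = datastore.idIterator();
    try{
        while(idIter.hasNext()){
            String id = idIter.next();
            if(id.startsWith(idPrefix)){
                //found a match; stop iterating and fetch just this record
                return datastore.get(id);
            }
        }
        return null; //no id matched the prefix
    }finally{
        //always close, even when stopping early or if an exception is thrown;
        //this releases the underlying file handle (if any)
        idIter.close();
    }
}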

StreamingIterators must always be closed in a finally block in case an Exception is thrown while iterating. Without the call to close() in a finally block, the iterator has no way of cleaning itself up and could cause deadlock or forever-blocked threads. If users are using Java 7, it is recommended that StreamingIterators be created inside the new Java 7 try-with-resources statement.

try(StreamingIterator<T> iter = datastore.iterator() ){
  while(iter.hasNext() ){
      T record = iter.next();
      //.. do stuff with record
  }
} // end of try will automatically close iter in Java 7 try-with-resource

DataStores have 2 methods that return a StreamingIterator: idIterator(), which returns a StreamingIterator of Strings that are the ids of each record in the datastore (such as sequence or contig names), and iterator(), which returns a StreamingIterator of the elements in the DataStore. Since StreamingIterators are used by DataStores, which are immutable, StreamingIterator.remove() will always throw an UnsupportedOperationException.
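For example, printing the id of every record (a sketch, assuming a datastore instance already exists and the Jillion types are imported):

//list every record id; try-with-resources closes the iterator when done
try(StreamingIterator<String> idIter = datastore.idIterator()){
    while(idIter.hasNext()){
        System.out.println(idIter.next());
    }
}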

Wrapping genomic files using DataStoreBuilders

Many Jillion DataStore implementations wrap genomic files. For example, clients can wrap a fastq file in a DataStore so that each fastq record in the file can be randomly accessed using the DataStore.get() method. These types of DataStores are created using FileDataStoreBuilders.

FileDataStoreBuilders
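For example, a fastq file might be wrapped roughly like this (a sketch only: FastqFileDataStoreBuilder, FastqDataStore and FastqRecord are assumed to follow the same Builder pattern shown for ace files in the caching example further down this page; check the Jillion javadocs for the exact class names):

//sketch: build a DataStore view of a fastq file so records can be fetched by id
//(assumes java.io.File and the Jillion fastq classes are imported, and that the
//enclosing method declares throws IOException, DataStoreException)
File fastqFile = new File("reads.fastq");
 
FastqDataStore datastore = new FastqFileDataStoreBuilder(fastqFile)
                                    .build();
try{
    FastqRecord record = datastore.get("someReadId"); //"someReadId" is an id in the file
    //... use record ...
}finally{
    datastore.close();
}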

DataStoreFilters

DataStores that wrap files don't necessarily need to include all the records in the file. DataStoreFilters allow users to tell the Builder which records in the file to include or exclude based on the record's id. If a record contained in the file is excluded from the DataStore, then DataStore.contains() will return false and DataStore.get() will return null just as if that record did not exist in the file.

DataStoreFilter is a simple interface in Jillion

public interface DataStoreFilter {
    boolean accept(String id);
}

If a record should be included in the datastore, then the accept method should return true; if the record should be excluded, then it should return false.

The org.jcvi.jillion.core.datastore.DataStoreFilters class provides many built-in implementations of DataStoreFilter to cover most common use cases, including include and exclude filters and regular expression filters.
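For example, a custom filter that only accepts ids starting with a given prefix can be written directly against the interface above (illustrative code; how the filter is handed to a particular FileDataStoreBuilder depends on that builder's API):

//A simple custom DataStoreFilter: only include records whose id starts
//with a given prefix.  Everything here is illustrative client code.
public class PrefixFilter implements DataStoreFilter{
 
    private final String prefix;
 
    public PrefixFilter(String prefix){
        this.prefix = prefix;
    }
 
    @Override
    public boolean accept(String id){
        //return true to include the record, false to exclude it
        return id.startsWith(prefix);
    }
}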

DataStoreProviderHints

FileDataStoreBuilders are similar to other Jillion Builders; a FileDataStoreBuilder's build() method controls which implementation is returned. Users can influence which implementation is returned by providing "hints" to the Builder that describe how the DataStore will be used.

There are currently 3 DataStoreProviderHints that can be passed to FileDataStoreBuilders:

RANDOM_ACCESS_OPTIMIZE_SPEED

This allows for quick random access to any object in the datastore at the price of taking up a lot of memory. Most implementations will use a java.util.Map to store all the records in the DataStore. If no DataStoreProviderHint is given, most Builders currently default to this implementation.

RANDOM_ACCESS_OPTIMIZE_MEMORY

This hint still allows random access but will try to keep memory use as low as possible. These implementations will not store all records in memory. Instead, many FileDataStore implementations for this hint will store only the file offsets into the wrapped file for each record. When client code asks for a specific object by id, the datastore will re-open the file, "seek" to the corresponding byte offset, and re-parse only that record.

This type of implementation takes up very little space. For example, an ace file with several contigs might take up hundreds of MB of memory if all the contigs are stored in memory (using RANDOM_ACCESS_OPTIMIZE_SPEED), whereas the RANDOM_ACCESS_OPTIMIZE_MEMORY version will only take up 3KB (0.001% of the size!). The file used by the datastore needs to exist and remain unchanged during the lifetime of the datastore object (otherwise the byte offsets will be wrong). Also, note that whenever an object is fetched, that one object will take up its full amount of memory.

RANDOM_ACCESS_OPTIMIZE_MEMORY DataStores are recommended if the client code can be structured so that each object in the datastore is processed one at a time (or a few at a time); this allows the Java garbage collector to free the memory between calls so that only a few of the large objects are in memory at any one time, as in the sketch below.
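A minimal sketch of that pattern, reusing the ace classes from the caching example further down this page ('datastore' is an AceContigDataStore built with the RANDOM_ACCESS_OPTIMIZE_MEMORY hint, and 'processContig' is a hypothetical per-record method):

//Process contigs one at a time so only the current contig is strongly
//referenced; previously fetched contigs can be garbage collected.
try(StreamingIterator<String> idIter = datastore.idIterator()){
    while(idIter.hasNext()){
        String id = idIter.next();
        AceContig contig = datastore.get(id); //re-parses just this contig from the file
        processContig(contig);
        //no reference to 'contig' is kept after this iteration
    }
}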

ITERATION_ONLY

This hint tells the Builder that the DataStore will never need random access to records; only DataStore.iterator() and DataStore.idIterator() will ever be used, to fetch records sequentially. Any DataStore method call will cause the wrapped file to be re-parsed. This is very efficient for DataStore.iterator() and DataStore.idIterator(), since clients will be iterating over the records as they are parsed; but other methods such as DataStore.get(), DataStore.contains() or DataStore.getNumberOfRecords() are extremely inefficient since they too must re-parse the entire file. This is the extreme case: the DataStore implementation doesn't store any records in memory but requires a lot of extra CPU and I/O time to get any data out of it.

The ITERATION_ONLY hint is recommended if the client code can be structured to stream through the data so that each object in the datastore is processed one at a time; this allows the Java garbage collector to free the memory between iterations so that only a few of the large objects are in memory at any one time (see the sketch below). Also, since the other DataStore implementations use a Map or array backing, they are usually limited to a maximum of 2^31 - 1 records (assuming the computer has enough memory to store them all); if users require a DataStore with more records than that, then this is currently the only Jillion built-in implementation that will work.
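A minimal sketch of streaming with this hint, again reusing the ace classes from the caching example below (the ITERATION_ONLY constant is assumed to be statically imported, as the hint constant is there):

//Build an ITERATION_ONLY datastore from an ace file; no records (or offsets)
//are kept in memory.
AceContigDataStore datastore = new AceFileContigDataStoreBuilder(aceFile)
                                        .hint(ITERATION_ONLY)
                                        .build();
try(StreamingIterator<AceContig> iter = datastore.iterator()){
    while(iter.hasNext()){
        AceContig contig = iter.next(); //contigs are parsed lazily as the file is read
        //... process contig, then drop the reference so it can be garbage collected
    }
}
datastore.close(); //also close the datastore itself when done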

Caching DataStore results

For DataStore implementations that don't keep all of their contents in memory, having to re-parse the file for each request can consume a lot of time and CPU. There is a special class named CachedDataStore which uses the Java Proxy class to wrap a DataStore implementation, match the required interface, and cache calls to datastore.get(id). The internal cache uses a Least Recently Used (LRU) map of soft references, so only the most recently used objects are stored in memory and they can be garbage collected if the JVM needs more memory.

  1. AceContigDataStore slowDataStore = new AceFileContigDataStoreBuilder(aceFile)
  2.                                          .hint(RANDOM_ACCESS_OPTIMIZE_MEMORY)
  3.                                          .build();
  4.  
  5. //create cached datastore with LRU cache of up to 10 records
  6. AceContigDataStore cachedDataStore = CachedDataStore.create(AceContigDataStore.class,       
  7.                                                             slowDataStore, 
  8.                                                             10);
  9. //1st time getting this contig needs to re-parse portion of ace file
  10. AceContig contig1 = cachedDataStore.get("contig1");  
  11.  
  12. //already cached this; can return it from cache, no need to re-parse
  13. AceContig contig1Again = cachedDataStore.get("contig1"); 
  14.  
  15. //do something memory intensive or get lots more contigs 
  16. //so cache no longer has contig1
  17. //...
  18.  
  19. //no longer in cache, need to re-parse portion of ace file.
  20. AceContig yetAnotherContig1 = cachedDataStore.get("contig1");

This code will first create a DataStore of AceContigs from an ace file using the RANDOM_ACCESS_OPTIMIZE_MEMORY hint, whose implementation might require re-parsing portions of the ace file for every call to DataStore.get(). Lines 6-8 wrap the DataStore in a CachedDataStore which will try to store the 10 most recently fetched contigs. Each call to datastore.get(id) is intercepted; if the object with that id is already present in the cache, then the cachedDataStore can just return the cached value and does not have to re-parse the file.
