Projections meta filesystem

Open-source solution for scientific collaboration and data exchange
The project is inspired by the idea of global collaboration and data exchange
Scientific data may exist in different forms, such as text and binary files, SQL and NoSQL database records, and data objects behind common or vendor-specific application programming interfaces (APIs). A typical bioinformatics project requires simultaneous access to data spread across multiple locations and formats, each with its own logical structure.

The Projections meta filesystem aims to provide uniform file-based access to heterogeneous resources and to decouple logical resource representation from physical data storage. The system uses small text files for a resource-agnostic description of logical data organization, together with a set of drivers that project actual data objects from a local or remote resource onto a local FUSE-mounted filesystem. Metadata is a first-class citizen, enabling versatile metadata descriptions and flexible search capabilities.

Usage scenarios
File access to non-file resources
Access to database records, object- and file-based storage, and HTTP API entities as filesystem objects
Many general-purpose and bioinformatics tools are designed to operate on data stored in files organized in directories. These tools have difficulty consuming data stored on remote services or hidden behind vendor- or project-specific APIs. Projections expose remote data objects as local filesystem resources, thus enabling traditional data-consuming patterns without data exports or scripting.
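As an illustration of the file-based pattern described above, here is a minimal sketch of a tool that only knows how to walk directories and read files. It would work unchanged on a mounted projection or on an ordinary directory. The function name count_fasta_records and the file extensions it looks for are assumptions made for this example, not part of Projections.

```python
import os
from typing import Dict


def count_fasta_records(root: str) -> Dict[str, int]:
    """Count sequence records in every FASTA file below `root`.

    `root` may be any directory, for example the mount point of a
    projection (the mount path itself is up to the user).
    """
    counts: Dict[str, int] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Assumed naming convention: FASTA files end in .fasta or .fa
            if not name.endswith((".fasta", ".fa")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "r", encoding="utf-8") as handle:
                # Each FASTA record starts with a '>' header line
                counts[path] = sum(1 for line in handle if line.startswith(">"))
    return counts
```

Because the function uses only standard file operations, it needs no knowledge of where the data physically lives.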
Data objects exchange
Exchange of data resource representations and selected data objects by means of text files
The structure of a projection is described in YAML-formatted text files called prototype files. A prototype can depict a whole resource or a part of it. The example provided shows how a GenBank query targeting Escherichia coli whole genome sequences can be described with prototype syntax. The resulting projection includes FASTA and GenBank files named according to GenBank IDs. Prototype files can be easily transferred over a network.
Projections meta filesystem: data objects exchange
Data analysis on user request
Data discovery and transfer-free data and metadata sharing
A projection provides a logical representation of a resource, including its metadata, which can be searched and analyzed, while data transfer is typically deferred until the data is actually needed. Projection content can also be defined as the result of a search query, so a projection can act as a data discovery resource. For example, if you are interested in the genome sequence of a certain organism, creating a GenBank projection for the corresponding search query will give you the genome file and its metadata once the sequence becomes available.
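The idea of keeping metadata available while deferring data transfer can be sketched in a few lines. This is an illustration of the general pattern only, not the Projections implementation; the class LazyObject, its fields, and the fetch callable are all hypothetical names introduced for this example.

```python
from typing import Callable, Dict, Optional


class LazyObject:
    """A data object whose metadata is available immediately,
    while its content is fetched only on the first read."""

    def __init__(self, metadata: Dict[str, str], fetch: Callable[[], bytes]):
        self.metadata = metadata      # searchable without any data transfer
        self._fetch = fetch           # hypothetical callable that transfers data
        self._data: Optional[bytes] = None
        self.fetch_count = 0          # how many times data was transferred

    def read(self) -> bytes:
        # Transfer the data on first access only, then serve the cached copy
        if self._data is None:
            self._data = self._fetch()
            self.fetch_count += 1
        return self._data
```

In this sketch, inspecting obj.metadata never triggers a transfer; only obj.read() does, and only once.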
Metadata
Metadata-based search and the ability to annotate data objects with searchable custom metadata
Metadata is a first-class citizen: schemaless JSON descriptions and the ability to link objects go beyond traditional tag and key-value approaches to metadata storage.

  • Metadata is exposed as file objects and can be used to guide the data analysis process with traditional tools such as Make or Snakemake. For example, the parameters of bioinformatics tools may be adjusted based on run metadata (insert size, sequencing primer sets, etc.).
  • Metadata can be searched both inside a single projection and across projections, which enables cross-resource searches.
  • While metadata is readily accessible for reading and searching, the typical data access scenario assumes that actual data transfer is deferred until the data is actually needed. This data-upon-request paradigm reduces local storage requirements and enables continuous data discoverability.
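Since metadata is exposed as file objects, a cross-projection search can in principle be written with ordinary file operations. The sketch below assumes, purely for illustration, that each metadata record is a JSON file with a .json extension holding a top-level object; the actual layout and naming used by Projections may differ, and the function name find_by_metadata is hypothetical.

```python
import json
import os
from typing import Any, Iterable, List


def find_by_metadata(roots: Iterable[str], key: str, value: Any) -> List[str]:
    """Return paths of JSON metadata files whose top-level `key` equals `value`.

    `roots` is a list of directories to search, e.g. the mount points
    of several projections (an assumption of this sketch).
    """
    matches: List[str] = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if not name.endswith(".json"):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "r", encoding="utf-8") as handle:
                        record = json.load(handle)
                except (OSError, ValueError):
                    # Skip unreadable files and files that are not valid JSON
                    continue
                if isinstance(record, dict) and record.get(key) == value:
                    matches.append(path)
    return sorted(matches)
```

Only the small metadata files are read here; the data files next to them are never opened.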
Types of projections
Projections is an extensible system that is currently equipped with drivers enabling access to the following resources:
  • ThermoFisher Torrent Suite™ Software: access to the sequencing results from Life Technologies (Thermo Fisher Scientific) sequencers (Ion PGM / Ion Proton)
  • Illumina MiSeq/HiSeq: access to the sequencing results from MiSeq/HiSeq sequencers
  • File System: access to the local files
  • Amazon S3: access to the Amazon Simple Storage Service objects
  • NCBI SRA: access to the data (BAM) and metadata
  • GenBank: access to the data (FASTA) and metadata
Projections are based on Filesystem in Userspace (FUSE) and can run on any modern Linux machine. The Projections project is at the pre-alpha stage: we are evaluating different technical solutions and trying to cover the widest possible range of usage scenarios. However, a number of key features are already implemented and can be tested in practice. To simplify this, we have packaged the application in a Docker container.
The project is licensed under GPLv3.
The source code and documentation are available on GitHub.