NeIC Sensitive Data Archive

The NeIC Sensitive Data Archive (SDA) is an encrypted data archive for storing sensitive data. It is implemented as a modular microservice system that can be deployed in different configurations depending on service needs.

The modular architecture of SDA supports both stand-alone deployment of an archive and deployment as a federated node in the Federated European Genome-phenome Archive (FEGA) network, serving sensitive datasets that are discoverable in the main EGA web portal.

Note

Throughout this documentation, Central EGA may be referred to as CEGA or CentralEGA, and any Federated EGA instance may be referred to as FEGA, LEGA, or LocalEGA. Within the context of NeIC, the Federated EGA is denoted as the Sensitive Data Archive or SDA.

Organisation of the NeIC SDA Operations Handbook

This operations handbook is organized in four main parts, each with its own main section in the left navigation menu. Here is a condensed summary; follow the links below or use the menu to reach each section's own detailed introduction page:

  1. Structure: Provides overview material for how the services can be deployed in different constellations and highlights communication paths.

  2. Communication: Provides more detailed documentation focused on inter-service communication, such as OpenAPI specs for the APIs, the RabbitMQ message flow, and database information flow details.

  3. Services: Detailed per-service specifications and documentation.

  4. Guides: Topic guides on subjects such as "Deployment", "Federated vs. Stand-alone", "Troubleshooting services", etc.

SDA Components and Architecture

The main components and the interaction between them, based on the NeIC Sensitive Data Archive deployment in a FederatedEGA setup, are illustrated in the figure below. The different colored backgrounds represent different zones of separation in the federated deployment.

The components illustrated can be classified by which archive sub-process they take part in:

  • Submission - the process of submitting sensitive data and metadata to the inbox staging area
  • Ingestion - the process of verifying uploaded data and securely storing it in archive storage, while synchronizing state and identifier information with CEGA
  • Data Retrieval - the process of re-encrypting and staging data for retrieval/download
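The classification can also be written down as a small lookup, which is sometimes handy when reasoning about which services a given deployment needs. The sketch below is only an illustration: the component names and their sub-process assignments are copied from the component table in this section, while `SubProcess`, `COMPONENTS` and `components_for` are hypothetical names that are not part of the SDA code base.

```python
from enum import Enum


class SubProcess(Enum):
    """The three archive sub-processes described in this section."""
    SUBMISSION = "Submission"
    INGESTION = "Ingestion"
    DATA_RETRIEVAL = "Data Retrieval"


S, I, D = SubProcess.SUBMISSION, SubProcess.INGESTION, SubProcess.DATA_RETRIEVAL

# Component -> sub-processes, transcribed from the component table.
# The dictionary itself is a hypothetical illustration, not SDA configuration.
COMPONENTS = {
    "Database": {S, I, D},
    "MQ": {S, I},
    "Inbox": {S},
    "Intercept": {S, I},
    "Ingest": {I},
    "Verify": {I},
    "Finalize": {I},
    "Mapper": {I},
    "Archive (Storage)": {I, D},
    "Data Retrieval API": {D},
    "Inbox (Storage)": {I},
    "Backup (Storage)": {I},
}


def components_for(sub_process):
    """Return the sorted names of all components taking part in a sub-process."""
    return sorted(name for name, procs in COMPONENTS.items() if sub_process in procs)
```

For example, `components_for(SubProcess.DATA_RETRIEVAL)` lists the components involved in Data Retrieval: the archive storage, the Data Retrieval API and the database.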

| Service/component | Description | Archive sub-process |
| --- | --- | --- |
| Database | A Postgres database with an appropriate schema. Stores the file header, the accession ID, the file path and checksums, as well as other relevant information. | Submission, Ingestion and Data Retrieval |
| MQ | A RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings. A federated queue is used to get messages from CentralEGA's broker, and shovels are used to send answers back. | Submission and Ingestion |
| Inbox | Upload service for incoming data, acting as a dropbox. Uses credentials from CentralEGA. | Submission |
| Intercept | Relays messages between the queue provided by the federated service and local queues. | Submission and Ingestion |
| Ingest | Splits off the Crypt4GH header and moves it to the database. The remainder of the file is sent to the storage backend (archive). No cryptographic tasks are done. | Ingestion |
| Verify | Using the archive Crypt4GH secret key, this service decrypts the stored files and checksums them against the embedded checksum for the unencrypted file. | Ingestion |
| Finalize | Handles the so-called Accession ID (stable ID) to filename mappings from CentralEGA. | Ingestion |
| Mapper | Registers the mapping of accession IDs (stable IDs for files) to dataset IDs. | Ingestion |
| Archive (Storage) | Storage backend: can be a regular (POSIX) file system or an S3 object store. | Ingestion and Data Retrieval |
| Data Retrieval API | Provides a download/data access API for streaming archived data in either encrypted or decrypted format. | Data Retrieval |
| Inbox (Storage) | Storage backend: can be a regular (POSIX) file system or an S3 object store. | Ingestion |
| Backup (Storage) | Storage backend: can be a regular (POSIX) file system or an S3 object store. | Ingestion |
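
To make the Ingest step more concrete, the following is a minimal sketch of how a Crypt4GH header can be separated from the encrypted body without performing any cryptographic operation. It relies only on the public Crypt4GH container layout (the 8-byte magic `crypt4gh`, a little-endian version number, a packet count, and length-prefixed header packets). It is not the SDA implementation; `split_crypt4gh_header` and `build_demo_file` are hypothetical names, and the demo packet is an opaque placeholder rather than a real encrypted header packet.

```python
import io
import struct

MAGIC = b"crypt4gh"  # 8-byte magic number that starts every Crypt4GH file


def split_crypt4gh_header(stream):
    """Read the Crypt4GH header from `stream` and return it as bytes.

    On return, the stream is positioned at the first byte of the encrypted
    body. Nothing is decrypted: header packets are copied as opaque blobs.
    """
    preamble = stream.read(16)
    if len(preamble) != 16 or preamble[:8] != MAGIC:
        raise ValueError("not a Crypt4GH stream")
    version, packet_count = struct.unpack("<II", preamble[8:])
    if version != 1:
        raise ValueError(f"unsupported Crypt4GH version: {version}")

    parts = [preamble]
    for _ in range(packet_count):
        length_field = stream.read(4)
        if len(length_field) != 4:
            raise ValueError("truncated header")
        # The packet length counts the 4-byte length field itself.
        (packet_len,) = struct.unpack("<I", length_field)
        if packet_len < 4:
            raise ValueError("invalid header packet length")
        payload = stream.read(packet_len - 4)
        if len(payload) != packet_len - 4:
            raise ValueError("truncated header packet")
        parts.append(length_field + payload)
    return b"".join(parts)


def build_demo_file():
    """Build a fake container with one opaque packet (illustration only)."""
    payload = b"\x00" * 12  # stand-in for an encrypted header packet
    packet = struct.pack("<I", 4 + len(payload)) + payload
    header = MAGIC + struct.pack("<II", 1, 1) + packet
    body = b"encrypted-data-blocks"
    return header, body


demo_header, demo_body = build_demo_file()
demo_stream = io.BytesIO(demo_header + demo_body)
extracted_header = split_crypt4gh_header(demo_stream)  # would go to the database
remaining_body = demo_stream.read()                    # would go to archive storage
```

In this sketch the header bytes and the remaining body are simply held in memory; in a deployment they would be written to the database and to the archive storage backend respectively, as described in the table.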