Data Submission

Ingestion Procedure

For a given FederatedEGA node, CentralEGA selects the associated vhost and drops one message per file to ingest into the files queue.

Structure of the message and its contents are described in Message Format.
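
For orientation, a minimal sketch of how such a message could be modelled is shown below. The struct and field names are illustrative only; the authoritative schema is the one given in Message Format.

// Illustrative sketch of the shape of a per-file ingest message from
// CentralEGA. Field names are indicative; see Message Format for the schema.
package messages

// IngestTrigger is a hypothetical struct mirroring the message dropped in
// the files queue for each file to ingest.
type IngestTrigger struct {
	Type               string     `json:"type"`     // e.g. "ingest"
	User               string     `json:"user"`     // submitter's CentralEGA username
	FilePath           string     `json:"filepath"` // location of the file in the Submission Inbox
	EncryptedChecksums []Checksum `json:"encrypted_checksums"`
}

// Checksum carries one checksum of the encrypted file.
type Checksum struct {
	Type  string `json:"type"`  // e.g. "sha256"
	Value string `json:"value"`
}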

Note

Source code repository for Submission components is available at: https://github.com/neicnordic/sensitive-data-archive

Ingestion Workflow

sequenceDiagram
    autonumber
    participant Upload Tool
    box SDA
    participant Inbox
    participant Ingest
    participant Verify
    participant Finalize
    participant Mapper
    participant SDA Database
    participant Intercept
    participant SDA RabbitMQ
    end
    box Central EGA
    participant Central EGA RabbitMQ
    end
    Upload Tool->>Inbox: upload encrypted file
    activate Inbox
    Inbox-->>SDA RabbitMQ: msg: Upload Done
    SDA RabbitMQ-->>Central EGA RabbitMQ: shovel msg:[to_cega][files.inbox]
    deactivate Inbox
    Central EGA RabbitMQ-->>SDA RabbitMQ: federated msg: [from_cega][ingest type]
    SDA RabbitMQ-->>Intercept: Intercept reads message
    Intercept-->>SDA RabbitMQ: Forwards ingest message <br/> to queue
    alt Ingest is successful
        SDA RabbitMQ->>Ingest: msg: [sda][ingest] begin ingestion
        activate Ingest
        Ingest->>SDA Database: mark ingested
        Note over Ingest: store file in Archive
        Ingest->>SDA Database: mark archived
        Ingest-->>SDA RabbitMQ: msg [sda][archived]
    else Error occurred in ingestion process
        Ingest-->>SDA RabbitMQ: msg: error
        SDA RabbitMQ-->>Central EGA RabbitMQ: shovel msg:[to_cega][files.error]
    end
    deactivate Ingest
    alt Verify is successful
        activate Verify
        SDA RabbitMQ-->>Verify: msg [sda][archived] triggers verify
        Verify->>SDA Database: mark verified
        Verify-->>SDA RabbitMQ: msg: [sda][verified]
    else Error occurred in verify process
        Verify-->>SDA RabbitMQ: msg: error
        SDA RabbitMQ-->>Central EGA RabbitMQ: shovel msg:[to_cega][files.error]
    end
    deactivate Verify
    SDA RabbitMQ-->>Central EGA RabbitMQ: shovel msg:[to_cega][files.verified]
    Central EGA RabbitMQ-->>SDA RabbitMQ: federated msg: [from_cega][accession type]
    SDA RabbitMQ-->>Intercept: Intercept reads message
    Intercept-->>SDA RabbitMQ: Forwards accession ID message <br/> to queue
    SDA RabbitMQ->>Finalize: msg: [sda][accession] map file to accession ID
    alt Finalize is successful
        activate Finalize
        note right of Finalize: Finalize makes the file backup
        Finalize->>SDA Database: mark completed
        Finalize-->>SDA RabbitMQ: msg: [sda][completed]
    else Error occurred in finalize process
        Finalize-->>SDA RabbitMQ: msg: error
        SDA RabbitMQ-->>Central EGA RabbitMQ: shovel msg:[to_cega][files.error]
    end
    deactivate Finalize
    SDA RabbitMQ-->>Central EGA RabbitMQ: shovel msg:[to_cega][files.completed]
    Central EGA RabbitMQ-->>SDA RabbitMQ: federated msg: [from_cega][mappings type]
    SDA RabbitMQ-->>Intercept: Intercept reads message
    Intercept-->>SDA RabbitMQ: Forwards mapper message of type mapping <br/> to queue
    SDA RabbitMQ->>Mapper: msg: [sda][mappings] map dataset to file accession ID
    alt Mapper creates dataset ID to file accession ID mapping
        activate Mapper
        Mapper->>SDA Database: map file to dataset accession ID
        Mapper->>Inbox: remove file from inbox
    else Error occurred in mapper process
        Mapper-->>SDA RabbitMQ: msg: error
        SDA RabbitMQ-->>Central EGA RabbitMQ: shovel msg:[to_cega][files.error]
    end
    Central EGA RabbitMQ-->>SDA RabbitMQ: federated msg: [from_cega][release type]
    SDA RabbitMQ-->>Intercept: Intercept reads message
    Intercept-->>SDA RabbitMQ: Forwards mapper message <br/> to queue
    SDA RabbitMQ->>Mapper: msg: [sda][mappings] release dataset
    alt Mapper flags dataset ready for release
        activate Mapper
        Mapper->>SDA Database: flag dataset ready for release
    else Error occurred in mapper process
        Mapper-->>SDA RabbitMQ: msg: error
        SDA RabbitMQ-->>Central EGA RabbitMQ: shovel msg:[to_cega][files.error]
    end
    Central EGA RabbitMQ-->>SDA RabbitMQ: federated msg: [from_cega][deprecate type]
    SDA RabbitMQ-->>Intercept: Intercept reads message
    Intercept-->>SDA RabbitMQ: Forwards mapper message <br/> to queue
    SDA RabbitMQ->>Mapper: msg: [sda][mappings] deprecate dataset
    alt Mapper flags dataset as deprecated
        activate Mapper
        Mapper->>SDA Database: flag dataset as deprecated
    else Error occurred in mapper process
        Mapper-->>SDA RabbitMQ: msg: error
        SDA RabbitMQ-->>Central EGA RabbitMQ: shovel msg:[to_cega][files.error]
    end
    deactivate Mapper

Note

Ingestion Workflow Legend

The sequence diagram describes the different phases during the ingestion process. The elements at the top represent each of the services or actuators involved in the workflow. The interaction between these is depicted by horizontal arrows connecting the elements.

The vertical axis represents time progressing down the page, with active processes marked by colored vertical bars. The colors used for the services/actuators match those used for the events initiated by the respective services, except for interactions in case of errors, which are highlighted in red. The error branches are only executed if errors occur in the Ingest, Verify, Finalize or Mapper services. Note that the vertical axis conveys only the order of events, not their duration.

Ingestion Steps

The Ingest service (which can be replicated) reads each file from the Submission Inbox, splits the Crypt4GH header from the beginning of the file, stores the header in a database, and sends the remainder to the Archive, leveraging the Crypt4GH format.

Note

No decryption key is retrieved during this step. The Archive can be either a regular file system on disk or an S3 object storage. The Submission Inbox backend can likewise be a regular file system or S3 object storage.

The files are read chunk by chunk in order to bound memory usage. After completion, a message is dropped into the local message broker to signal that the Verify service can check that the archived file corresponds to what was submitted: that the stored file is decryptable and that the integrated checksum is valid.
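
For illustration only, and not the actual Ingest implementation (which relies on the SDA Crypt4GH tooling), the following sketch shows the idea: the Crypt4GH header is read off the front of the stream, and the remaining payload is copied to the Archive through a fixed-size buffer so memory use stays bounded.

// Illustrative sketch of the header-split-and-stream step.
package ingest

import (
	"encoding/binary"
	"fmt"
	"io"
)

// readCrypt4GHHeader consumes the Crypt4GH header from the start of the
// stream and returns its raw bytes, following the public Crypt4GH layout:
// a 16-byte preamble (magic, version, packet count) followed by
// length-prefixed header packets.
func readCrypt4GHHeader(r io.Reader) ([]byte, error) {
	pre := make([]byte, 16)
	if _, err := io.ReadFull(r, pre); err != nil {
		return nil, err
	}
	if string(pre[:8]) != "crypt4gh" {
		return nil, fmt.Errorf("not a Crypt4GH stream")
	}
	header := append([]byte{}, pre...)
	packets := binary.LittleEndian.Uint32(pre[12:16])
	for i := uint32(0); i < packets; i++ {
		lenBuf := make([]byte, 4)
		if _, err := io.ReadFull(r, lenBuf); err != nil {
			return nil, err
		}
		// The packet length field counts itself, so the body is length-4 bytes.
		pktLen := binary.LittleEndian.Uint32(lenBuf)
		if pktLen < 4 {
			return nil, fmt.Errorf("invalid header packet length %d", pktLen)
		}
		body := make([]byte, pktLen-4)
		if _, err := io.ReadFull(r, body); err != nil {
			return nil, err
		}
		header = append(header, lenBuf...)
		header = append(header, body...)
	}
	return header, nil
}

// ingestFile splits the header off an inbox stream and copies the remaining
// payload to the archive with a fixed-size buffer. The header is returned to
// the caller to be stored in the database.
func ingestFile(inbox io.Reader, archive io.Writer) ([]byte, int64, error) {
	header, err := readCrypt4GHHeader(inbox)
	if err != nil {
		return nil, 0, err
	}
	buf := make([]byte, 4<<20) // 4 MiB chunks; the real chunk size may differ
	written, err := io.CopyBuffer(archive, inbox, buf)
	return header, written, err
}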

At this stage, the associated decryption key is retrieved. If decryption completes and the checksum is valid, a message of completion is sent to CentralEGA: Ingestion completed.
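
Purely as a sketch, the verification step boils down to decrypting the archived stream with the retrieved key and comparing a checksum; newDecryptingReader below is a hypothetical placeholder for the Crypt4GH decryption used by the real Verify service, and the expected checksum stands in for the value received with the submission.

// Minimal sketch of the verification idea; not the Verify implementation.
package verify

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
)

// newDecryptingReader is a placeholder: in the real service the archived
// stream is decrypted with a Crypt4GH library and the retrieved private key.
func newDecryptingReader(archived io.Reader, privateKey []byte) (io.Reader, error) {
	return nil, errors.New("placeholder: wire in Crypt4GH decryption here")
}

// verifyArchivedFile decrypts the archived stream and checks that the
// SHA-256 checksum of the decrypted content matches the expected value.
func verifyArchivedFile(archived io.Reader, privateKey []byte, expectedSHA256 string) error {
	plain, err := newDecryptingReader(archived, privateKey)
	if err != nil {
		return err
	}
	h := sha256.New()
	if _, err := io.Copy(h, plain); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != expectedSHA256 {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, expectedSHA256)
	}
	return nil
}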

Important: If a file disappears or is overwritten in the inbox before ingestion is completed, ingestion may not be possible.

Should any of the aforementioned steps result in an error, the workflow is terminated, and the error is logged. If the error is attributed to user misuse, such as providing an incorrect checksum or tampering with the encrypted file, it is reported to CentralEGA for display in the Submission Interface.

Submission Inbox

CentralEGA contains a database of users, with IDs and passwords. Multiple solutions have been developed to facilitate user authentication against the CentralEGA user database.

Every solution utilizes CentralEGA's user IDs and is planned for extension to incorporate Elixir IDs, from which the @elixir-europe.org suffix is removed.

The procedure is as follows: the inbox is started without any created users. When a user wants to log into the inbox (via sftp, s3 or https), the inbox service looks up the username in a local cache and, if not found, queries the CentralEGA REST endpoint. Once the user record is retrieved, the credentials are stored in the local cache and a home directory is created for the user. The user is then logged in if the password or public key authentication succeeds.

SFTP Inbox

Federated EGA/LocalEGA login system

CentralEGA contains a database of users, with IDs and passwords.

A solution has been devised using Apache Mina SSHD to facilitate user authentication through either a password or an RSA key, directly against the CentralEGA database. The user is locked within their home folder, which is done programmatically using RootedFileSystem.

The solution uses CentralEGA's user IDs but can also be extended to use LifeScience AAI IDs (from which the @elixir-europe.org suffix is removed).

The procedure is as follows. The inbox is started without any created users. When a user wants to log into the inbox (only sftp uploads are allowed), the code looks up the username in a local cache and, if not found, queries the CentralEGA REST endpoint. Once the user record is retrieved, the credentials are stored in the local cache and a home directory is created for the user. The user is then logged in if the password or public key authentication succeeds. Upon subsequent login attempts only the local cache is queried, until the user's credentials expire. The cache has a default TTL of 5 minutes and is wiped clean upon reboot (as a cache should be). The default TTL can be configured via the CACHE_TTL environment variable.
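
The inbox itself is a Java/Spring service (see the note below), but the cache-then-REST lookup described above can be sketched roughly as follows; the cache layout, endpoint path and credential fields are illustrative assumptions rather than the real CentralEGA API.

// Illustrative sketch only: the real SFTP inbox is implemented in Java/Spring.
package inbox

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// cachedCreds is a hypothetical representation of what CentralEGA returns.
type cachedCreds struct {
	PasswordHash string   `json:"password_hash"`
	SSHKeys      []string `json:"ssh_keys"`
	fetchedAt    time.Time
}

type credsCache struct {
	mu      sync.Mutex
	ttl     time.Duration // CACHE_TTL
	entries map[string]cachedCreds
}

// lookup returns cached credentials if they are still fresh; otherwise it
// queries the CentralEGA REST endpoint and caches the answer.
func (c *credsCache) lookup(cegaEndpoint, username string) (cachedCreds, error) {
	c.mu.Lock()
	entry, ok := c.entries[username]
	c.mu.Unlock()
	if ok && time.Since(entry.fetchedAt) < c.ttl {
		return entry, nil
	}

	resp, err := http.Get(fmt.Sprintf("%s/username/%s", cegaEndpoint, username)) // illustrative URL shape
	if err != nil {
		return cachedCreds{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return cachedCreds{}, fmt.Errorf("CentralEGA lookup failed: %s", resp.Status)
	}
	if err := json.NewDecoder(resp.Body).Decode(&entry); err != nil {
		return cachedCreds{}, err
	}
	entry.fetchedAt = time.Now()

	c.mu.Lock()
	c.entries[username] = entry
	c.mu.Unlock()
	return entry, nil
}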

The user's home directory is created upon successful login. Moreover, for each user, the service detects when a file upload is completed and computes the checksum of the uploaded file.

S3 integration

The default storage back-end for the inbox is the local file system. Additionally, S3 is supported as a back-end option; it can be enabled using the S3-related environment variables (see the configuration details below).

If S3 is enabled, files are still stored locally first; after a successful upload they are transferred to the specified S3 back-end. With this approach the local file system acts as a staging area, while S3 is the final destination for the uploaded files.

Configuration

Environment variables used:

| Variable name | Default value | Description |
| --- | --- | --- |
| BROKER_USERNAME | guest | RabbitMQ broker username |
| BROKER_PASSWORD | guest | RabbitMQ broker password |
| BROKER_HOST | mq | RabbitMQ broker host |
| BROKER_PORT | 5672 | RabbitMQ broker port |
| BROKER_VHOST | / | RabbitMQ broker vhost |
| BROKER_EXCHANGE | sda | RabbitMQ broker exchange |
| BROKER_ROUTING_KEY | files | RabbitMQ broker routing key |
| INBOX_PORT | 2222 | Inbox port |
| INBOX_LOCATION | /ega/inbox/ | Path to POSIX Inbox backend |
| INBOX_FS_PATH | | Prefix path when a custom filesystem is used on top of POSIX |
| INBOX_KEYPAIR | | Path to RSA keypair file |
| KEYSTORE_TYPE | JKS | Keystore type to use, JKS or PKCS12 |
| KEYSTORE_PATH | /etc/ega/inbox.jks | Path to Keystore file |
| KEYSTORE_PASSWORD | | Password to access the Keystore |
| CACHE_TTL | 300.0 | CEGA credentials time-to-live, in seconds |
| CEGA_ENDPOINT | | CEGA REST endpoint |
| CEGA_ENDPOINT_CREDS | | CEGA REST credentials |
| S3_ENDPOINT | inbox-backend:9000 | Inbox S3 backend URL |
| S3_REGION | us-east-1 | Inbox S3 backend region (us-east-1 is the Minio default) |
| S3_ACCESS_KEY | | Inbox S3 backend access key (S3 disabled if not specified) |
| S3_SECRET_KEY | | Inbox S3 backend secret key (S3 disabled if not specified) |
| S3_BUCKET | | Inbox S3 backend bucket (S3 disabled if not specified) |
| USE_SSL | true | true if the S3 Inbox backend should be accessed over HTTPS |
| LOGSTASH_HOST | | Hostname of the Logstash instance (if any) |
| LOGSTASH_PORT | | Port of the Logstash instance (if any) |

If LOGSTASH_HOST or LOGSTASH_PORT is empty, Logstash logging will not be enabled.

In addition, environment variables can be used to configure the log level for individual packages. Package loggers can be configured using the corresponding package names; for example, to turn off Spring's logs, set the environment variable LOGGING_LEVEL_ORG_SPRINGFRAMEWORK=OFF, or to set the inbox's own logs to debug: LOGGING_LEVEL_SE_NBIS_LEGA_INBOX=DEBUG, etc.

SFTP Inbox Local Development/Testing

For local development and testing, see the instructions in the dev_utils folder. The README file in that folder has sections for running the pipeline locally using Docker Compose.

Note

Sources are located at https://github.com/neicnordic/sensitive-data-archive/tree/main/sda-sftp-inbox. Essentially, it is a Spring-based Maven project, integrated with the local message broker.

TSD File API

In order to utilise Tryggve2 SDA within TSD, several components have been developed.

Note

Access is restricted to the UiO network. Please contact TSD support for access if needed. Documentation: https://test.api.tsd.usit.no/v1/docs/tsd-api-integration.html

S3 Proxy Inbox

Note

Sources are located at https://github.com/neicnordic/sensitive-data-archive/blob/main/sda/cmd/s3inbox/

The S3 Proxy uses access tokens as the main authentication mechanism.

The sda authentication service (https://github.com/neicnordic/sensitive-data-archive/tree/main/sda-auth) is designed to convert CEGA REST endpoint authentication to a JWT that can be used when uploading to the S3 proxy.

The proxy requires the bucket name to match the username when uploading data, e.g. s3cmd put FILE s3://USER_NAME/path/to/file

s3inbox Service

The s3inbox proxies uploads to an S3-compatible storage backend. Users are authenticated with a JWT instead of the access_key and secret_key normally used for S3.

Service Description

The s3inbox proxies uploads to an S3 compatible storage backend.

  1. Parses and validates the JWT token (access_token in the S3 config file) against the public keys, either locally provisioned or from OIDC JWK endpoints.
  2. If the token is valid, the file is passed on to the S3 backend.
  3. The file is registered in the database.
  4. The inbox-upload message is sent to the inbox queue, with the sub field from the token as the user in the message. If this fails, an error will be written to the logs.
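
As an illustration of step 1, the sketch below validates a token against a locally provisioned RSA public key and extracts the sub claim. The choice of the golang-jwt library and the RS256-only restriction are assumptions for the sketch; the actual service also supports keys fetched from OIDC JWK endpoints.

// Illustrative sketch of JWT validation against a local public key.
package s3inbox

import (
	"fmt"
	"os"

	"github.com/golang-jwt/jwt/v5"
)

// validateToken parses the raw access token and returns the subject claim,
// which is used as the username for the upload.
func validateToken(rawToken, pubKeyPath string) (string, error) {
	pemBytes, err := os.ReadFile(pubKeyPath)
	if err != nil {
		return "", err
	}
	pubKey, err := jwt.ParseRSAPublicKeyFromPEM(pemBytes) // assumes an RSA-signed token
	if err != nil {
		return "", err
	}

	token, err := jwt.Parse(rawToken, func(t *jwt.Token) (interface{}, error) {
		return pubKey, nil
	}, jwt.WithValidMethods([]string{"RS256"}))
	if err != nil {
		return "", fmt.Errorf("invalid token: %w", err)
	}
	if !token.Valid {
		return "", fmt.Errorf("invalid token")
	}

	sub, err := token.Claims.GetSubject()
	if err != nil || sub == "" {
		return "", fmt.Errorf("token has no usable sub claim")
	}
	return sub, nil
}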
Communication
  • s3inbox proxies uploads to inbox storage.
  • s3inbox inserts file information in the database using the RegisterFile database function and marks it as uploaded in the file_event_log
  • s3inbox writes messages to one RabbitMQ queue (commonly: inbox).
Configuration

There are a number of options that can be set for the s3inbox service. These settings can be provided by mounting a YAML file at /config.yaml.

For example:

log:
  level: "debug"
  format: "json"

They may also be set using environment variables like:

export LOG_LEVEL="debug"
export LOG_FORMAT="json"
Server settings

These settings control the TLS status and where the service gets the public keys to validate the JWT tokens.

  • SERVER_CERT: path to the x509 certificate used by the service
  • SERVER_KEY: path to the x509 private key used by the service
  • SERVER_JWTPUBKEYPATH: full path to the folder containing public keys used to validate JWT tokens
  • SERVER_JWTPUBKEYURL: URL to OIDC JWK endpoint
RabbitMQ broker settings

These settings control how the s3inbox service connects to the RabbitMQ message broker.

  • BROKER_HOST: hostname of the RabbitMQ server
  • BROKER_PORT: RabbitMQ broker port (commonly: 5671 with TLS and 5672 without)
  • BROKER_QUEUE: message queue to read messages from (commonly: archived)
  • BROKER_ROUTINGKEY: routing key for publishing messages (commonly: inbox)
  • BROKER_USER: username to connect to RabbitMQ
  • BROKER_PASSWORD: password to connect to RabbitMQ
  • BROKER_PREFETCHCOUNT: number of messages to pull from the message server at a time (defaults to 2)
PostgreSQL Database settings
  • DB_HOST: hostname for the postgresql database
  • DB_PORT: database port (commonly: 5432)
  • DB_PASSWORD: password for the database
  • DB_DATABASE: database name
  • DB_SSLMODE: The TLS encryption policy to use for database connections, valid options are:
    • disable
    • allow
    • prefer
    • require
    • verify-ca
    • verify-full

More information is available in the PostgreSQL documentation.

Note that if DB_SSLMODE is set to anything but disable, then DB_CACERT needs to be set; if it is set to verify-full, then DB_CLIENTCERT and DB_CLIENTKEY must also be set.

  • DB_CLIENTKEY: key-file for the database client certificate
  • DB_CLIENTCERT: database client certificate file
  • DB_CACERT: Certificate Authority (CA) certificate for the database to use
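
Purely as an illustration of how the DB_* settings map onto a connection, a libpq-style DSN with full certificate verification could look like the sketch below. The certificate paths and the user parameter are placeholders, and the service may assemble its connection differently.

// Illustrative only: how the DB_* settings could translate into a libpq-style
// connection string with sslmode=verify-full.
package dbexample

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // PostgreSQL driver (an assumption for the sketch)
)

func openArchiveDB(host string, port int, user, password, dbname string) (*sql.DB, error) {
	dsn := fmt.Sprintf(
		"host=%s port=%d user=%s password=%s dbname=%s "+
			"sslmode=verify-full sslrootcert=/certs/ca.crt sslcert=/certs/client.crt sslkey=/certs/client.key",
		host, port, user, password, dbname,
	)
	return sql.Open("postgres", dsn)
}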
Storage settings
  • INBOX_TYPE: Valid value is S3
  • INBOX_URL: URL to the S3 system
  • INBOX_ACCESSKEY: The S3 access key used to authenticate to S3, more info at AWS
  • INBOX_SECRETKEY: The S3 secret key used to authenticate to S3, more info at AWS
  • INBOX_BUCKET: The S3 bucket to use as the storage root
  • INBOX_PORT: S3 connection port (default: 443)
  • INBOX_REGION: S3 region (default: us-east-1)
  • INBOX_CHUNKSIZE: S3 chunk size for multipart uploads.
  • INBOX_CACERT: Certificate Authority (CA) certificate for the storage system, this is only needed if the S3 server has a certificate signed by a private entity
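
To illustrate how the INBOX_* settings map onto an S3 client, the sketch below builds one with the MinIO Go SDK. The choice of SDK is an assumption made for illustration, not a statement about the service's internals.

// Illustrative only: constructing an S3 client from INBOX_*-style settings.
package storage

import (
	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// newInboxClient maps INBOX_URL, INBOX_ACCESSKEY, INBOX_SECRETKEY and
// INBOX_REGION onto an S3 client configuration.
func newInboxClient(endpoint, accessKey, secretKey, region string, useTLS bool) (*minio.Client, error) {
	return minio.New(endpoint, &minio.Options{
		Creds:  credentials.NewStaticV4(accessKey, secretKey, ""),
		Secure: useTLS,
		Region: region,
	})
}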
Logging settings
  • LOG_FORMAT can be set to “json” to get logs in json format. All other values result in text logging
  • LOG_LEVEL can be set to one of the following, in increasing order of severity:
    • trace
    • debug
    • info
    • warn (or warning)
    • error
    • fatal
    • panic