Data Submission
Ingestion Procedure
For a given FederatedEGA node, CentralEGA selects the associated vhost and
drops, in the files queue, one message per file to ingest.
The structure of the message and its contents are described in Message Format.
Note
Source code repository for Submission components is available at: https://github.com/neicnordic/sensitive-data-archive
Ingestion Workflow
Note
Ingestion Workflow Legend
The sequence diagram describes the different phases during the ingestion process. The elements at the top represent each of the services or actuators involved in the workflow. The interaction between these is depicted by horizontal arrows connecting the elements.
The vertical axis represents time progression down the page, where
processes are marked with colored vertical bars. The colors used for the
services/actuators match those used for the events initiated by the
respective services, except for the interactions in case of errors,
which are highlighted in red. The optional fragments are only executed
if errors occur in the ingest, verify or finalize services.
Note that the time axis in this diagram shows the order of events, not their duration.
Ingestion Steps
The Ingest service (which can be replicated) reads a file from the
Submission Inbox, splits the Crypt4GH header from the beginning of the
file, stores the header in a database and sends the remainder to the Archive,
leveraging the Crypt4GH format.
Note
There is no decryption key retrieved during that step. The Archive can
be either a regular file system on disk or an S3 object storage.
The Submission Inbox can likewise use a regular file system or
S3 object storage as its backend.
The files are read chunk by chunk in order to bound the memory usage.
After completion, a message is dropped into the local message broker to
signal that the Verify service can check that the file corresponds to what
was submitted. The Verify service also ensures that the stored file is
decryptable and that the integrated checksum is valid.
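The header split and chunked copy described above can be sketched as follows. This is a minimal illustration, not the service's actual code: it assumes the header layout from the Crypt4GH specification (an 8-byte magic string, a version, a packet count, then length-prefixed packets), and the names `split_header`, `read_exact` and `CHUNK_SIZE` are hypothetical. Storing the header in the database is left out.

```python
import struct

MAGIC = b"crypt4gh"      # 8-byte magic string that opens a Crypt4GH file
CHUNK_SIZE = 64 * 1024   # hypothetical chunk size; the real value is not stated here


def read_exact(stream, size):
    """Read exactly `size` bytes, failing loudly on a truncated file."""
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file inside the Crypt4GH header")
    return data


def split_header(inbox_file, archive_file):
    """Return the raw header bytes and copy the remainder to `archive_file`."""
    preamble = read_exact(inbox_file, 16)
    if preamble[:8] != MAGIC:
        raise ValueError("not a Crypt4GH file")
    # Version and packet count are little-endian unsigned 32-bit integers.
    _version, packet_count = struct.unpack("<II", preamble[8:])
    header = bytearray(preamble)
    for _ in range(packet_count):
        length_field = read_exact(inbox_file, 4)
        (packet_length,) = struct.unpack("<I", length_field)
        if packet_length < 4:
            raise ValueError("invalid header packet length")
        # The packet length includes its own 4-byte length field.
        header += length_field + read_exact(inbox_file, packet_length - 4)
    # Copy the encrypted body chunk by chunk so memory use stays bounded.
    while True:
        chunk = inbox_file.read(CHUNK_SIZE)
        if not chunk:
            break
        archive_file.write(chunk)
    return bytes(header)
```

No decryption key is needed for this step, since the header is only located and copied, never opened.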
At this stage, the associated decryption key is retrieved. If decryption
completes and the checksum is valid, a message of completion is sent to
CentralEGA: Ingestion completed.
Important: If a file disappears or is overwritten in the inbox before ingestion is completed, ingestion may not be possible.
Should any of the aforementioned steps result in an error, the workflow is terminated, and the error is logged. If the error is attributed to user misuse, such as providing an incorrect checksum or tampering with the encrypted file, it is reported to CentralEGA for display in the Submission Interface.
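The error policy above (always log, but only forward user-caused errors to CentralEGA) could look roughly like this. It is a sketch under assumptions: `UserError`, `run_step` and the `report_to_cega` callback are hypothetical names, and how the real services classify or transmit errors is not described here.

```python
import logging

logger = logging.getLogger("ingestion")


class UserError(Exception):
    """Hypothetical marker for errors caused by the submitter,
    e.g. an incorrect checksum or a tampered encrypted file."""


def run_step(step, report_to_cega):
    """Run one workflow step; return True on success, False if it was aborted."""
    try:
        step()
        return True
    except UserError as err:
        # User misuse: log it and forward it for the Submission Interface.
        logger.error("user error: %s", err)
        report_to_cega(str(err))
    except Exception as err:
        # Any other failure: log only, nothing is reported to CentralEGA.
        logger.error("system error: %s", err)
    return False
```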
Submission Inbox
CentralEGA contains a database of users, with IDs and passwords. Multiple solutions
have been developed to facilitate user authentication
against the CentralEGA user database.
Every solution uses CentralEGA's user IDs and is planned to be
extended to also accept Elixir IDs, from which the @elixir-europe.org
suffix is removed.
The procedure is as follows: the inbox is started without any created
user. When a user wants to log into the inbox (via sftp, s3 or https),
the inbox service looks up the username in a local cache and, if it is
not found, queries the CentralEGA REST endpoint. Once the user's record is
returned, their credentials are stored in the local cache, and a home
directory for the user is created.
The user is then logged in if the password or public key authentication succeeds.
SFTP Inbox
Federated EGA/LocalEGA login system
CentralEGA
contains a database of users, with IDs and passwords.
A solution has been devised using Apache Mina SSHD to facilitate user authentication through either a password or an RSA key, directly against the CentralEGA database. The user is locked within their home folder, which is done programmatically using RootedFileSystem.
The solution uses CentralEGA's user IDs but can also be extended to
use LifeScience AAI IDs (from which the @elixir-europe.org
suffix is removed).
The procedure is as follows. The inbox is started without any created
user. When a user wants to log into the inbox (actually, only sftp
uploads are allowed), the code looks up the username in a local
cache and, if it is not found, queries the CentralEGA REST endpoint.
Once the user's record is returned, their credentials are stored in the
local cache, and a home directory is created for the user. The user is
then logged in if the password or public key authentication succeeds.
Upon subsequent login attempts, only the local cache is queried, until
the user's credentials expire. The cache has a default TTL of 5 minutes,
and is wiped clean upon reboot (as a cache should). The default TTL can be
configured via the CACHE_TTL env var.
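The lookup order described above (local cache first, CentralEGA only on a miss or after expiry) can be illustrated with a small in-memory cache. This is a sketch in Python rather than the inbox's own implementation; `CredentialCache` and `fetch_remote` are hypothetical names, and `fetch_remote` stands in for the call to the CentralEGA REST endpoint.

```python
import time


class CredentialCache:
    """In-memory credential cache with a time-to-live, lost on restart."""

    def __init__(self, fetch_remote, ttl=300.0, clock=time.monotonic):
        self._fetch_remote = fetch_remote   # hypothetical CentralEGA lookup
        self._ttl = ttl                     # seconds; 300.0 mirrors the CACHE_TTL default
        self._clock = clock
        self._entries = {}                  # username -> (expiry time, credentials)

    def lookup(self, username):
        now = self._clock()
        entry = self._entries.get(username)
        if entry is not None and entry[0] > now:
            return entry[1]                 # still valid: no remote query
        credentials = self._fetch_remote(username)
        if credentials is None:
            self._entries.pop(username, None)
        else:
            self._entries[username] = (now + self._ttl, credentials)
        return credentials
```

Injecting the clock keeps the expiry logic testable without waiting for five minutes to pass.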
The user's home directory is created upon successful login. Moreover, for each user, the inbox detects when a file upload is completed and computes the checksum of the uploaded file.
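Computing the checksum of a completed upload can be done without loading the whole file into memory. The sketch below assumes SHA-256 as the default algorithm, which is an assumption made for illustration since the algorithm is not named here; `file_checksum` is a hypothetical helper.

```python
import hashlib


def file_checksum(stream, algorithm="sha256", chunk_size=64 * 1024):
    """Return the hex digest of a binary stream, reading it chunk by chunk."""
    digest = hashlib.new(algorithm)
    # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

A typical call would open the uploaded file in binary mode, for example `with open(path, "rb") as f: file_checksum(f)`.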
S3 integration
The default storage back-end for the inbox is the local file system. Additionally, S3 is supported as a back-end option. It can be enabled using S3-related env vars (see configuration details below).
If S3 is enabled, files are still stored locally first, but after a successful upload they are uploaded to the specified S3 back-end. With this approach the local file system plays the role of a so-called "staging area", while S3 is the final destination for the uploaded files.
Configuration
Environment variables used:
Variable name | Default value | Description |
---|---|---|
BROKER_USERNAME | guest | RabbitMQ broker username |
BROKER_PASSWORD | guest | RabbitMQ broker password |
BROKER_HOST | mq | RabbitMQ broker host |
BROKER_PORT | 5672 | RabbitMQ broker port |
BROKER_VHOST | / | RabbitMQ broker vhost |
BROKER_EXCHANGE | sda | RabbitMQ broker exchange |
BROKER_ROUTING_KEY | files | RabbitMQ broker routing key |
INBOX_PORT | 2222 | Inbox port |
INBOX_LOCATION | /ega/inbox/ | Path to POSIX Inbox backend |
INBOX_FS_PATH | | Prefix path when custom filesystem is used on top of POSIX |
INBOX_KEYPAIR | | Path to RSA keypair file |
KEYSTORE_TYPE | JKS | Keystore type to use, JKS or PKCS12 |
KEYSTORE_PATH | /etc/ega/inbox.jks | Path to Keystore file |
KEYSTORE_PASSWORD | | Password to access the Keystore |
CACHE_TTL | 300.0 | CEGA credentials time-to-live (in seconds) |
CEGA_ENDPOINT | | CEGA REST endpoint |
CEGA_ENDPOINT_CREDS | | CEGA REST credentials |
S3_ENDPOINT | inbox-backend:9000 | Inbox S3 backend URL |
S3_REGION | us-east-1 | Inbox S3 backend region (us-east-1 is default in Minio) |
S3_ACCESS_KEY | | Inbox S3 backend access key (S3 disabled if not specified) |
S3_SECRET_KEY | | Inbox S3 backend secret key (S3 disabled if not specified) |
S3_BUCKET | | Inbox S3 backend bucket (S3 disabled if not specified) |
USE_SSL | true | true if S3 Inbox backend should be accessed by HTTPS |
LOGSTASH_HOST | | Hostname of the Logstash instance (if any) |
LOGSTASH_PORT | | Port of the Logstash instance (if any) |
If LOGSTASH_HOST or LOGSTASH_PORT is empty, Logstash logging will not be enabled.
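The rules for optional features in the configuration above can be written down as two small checks. This is only an illustration of the documented conditions, not the inbox's own code; `s3_enabled` and `logstash_enabled` are hypothetical helper names that take any mapping of environment variables.

```python
def s3_enabled(env):
    """S3 stays disabled unless access key, secret key and bucket are all set."""
    required = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET")
    return all(env.get(name) for name in required)


def logstash_enabled(env):
    """Logstash logging needs both a non-empty host and a non-empty port."""
    return bool(env.get("LOGSTASH_HOST")) and bool(env.get("LOGSTASH_PORT"))
```

Both functions can be called with `os.environ` at start-up to decide which back-ends to initialise.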
In addition, environment variables can be used to configure the log level for different packages. Package loggers are configured using the corresponding package names. For example, to turn off Spring logs, set the environment variable LOGGING_LEVEL_ORG_SPRINGFRAMEWORK=OFF, or to set the Mina-based inbox's own logs to debug, set LOGGING_LEVEL_SE_NBIS_LEGA_INBOX=DEBUG, etc.
SFTP Inbox Local Development/Testing
For local development/testing, see the instructions in the dev_utils folder. There is a README file in the dev_utils folder with sections for running the pipeline locally using Docker Compose.
Note
Sources are located in a separate repository: https://github.com/neicnordic/sensitive-data-archive/tree/main/sda-sftp-inbox
Essentially, it's a Spring-based Maven project, integrated with the Local Message Broker.
TSD File API
In order to utilise Tryggve2 SDA within TSD, several components have been developed:
- https://github.com/unioslo/tsd-file-api
- https://github.com/uio-bmi/LocalEGA-TSD-proxy
- https://github.com/unioslo/tsd-api-client
Note
Access is restricted to the UiO network. Please contact TSD support for access, if needed. Documentation: https://test.api.tsd.usit.no/v1/docs/tsd-api-integration.html
S3 Proxy Inbox
Note
Sources are located in a separate repository: https://github.com/neicnordic/sensitive-data-archive/blob/main/sda/cmd/s3inbox/
The S3 Proxy uses access tokens as the main authentication mechanism.
The sda authentication service (https://github.com/neicnordic/sensitive-data-archive/tree/main/sda-auth) is designed to convert CEGA REST endpoint authentication to a JWT that can be used when uploading to the S3 proxy.
The proxy requires the user to set the bucket name to be the same as the
username when uploading data, for example:
s3cmd put FILE s3://USER_NAME/path/to/file
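The bucket-equals-username rule can be checked with a few lines of URL parsing. The sketch below is illustrative only; `check_upload_target` is a hypothetical helper and does not reflect how the proxy itself enforces the rule.

```python
from urllib.parse import urlparse


def check_upload_target(s3_url, username):
    """Return the object path if the bucket in `s3_url` matches `username`."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URL")
    bucket = parsed.netloc  # netloc keeps the original case of the bucket name
    if bucket != username:
        raise PermissionError("bucket name must be the same as the username")
    return parsed.path.lstrip("/")
```

For instance, `check_upload_target("s3://USER_NAME/path/to/file", "USER_NAME")` yields the object path, while any other bucket name is rejected.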
s3inbox Service
The s3inbox service proxies uploads to an S3 compatible storage backend. Users are authenticated with a JWT instead of the access_key and secret_key normally used for S3.
Service Description
The s3inbox proxies uploads to an S3 compatible storage backend.
- Parses and validates the JWT token (access_token in the S3 config file) against the public keys, either locally provisioned or from OIDC JWK endpoints.
- If the token is valid the file is passed on to the S3 backend.
- The file is registered in the database.
- The inbox-upload message is sent to the inbox queue, with the sub field from the token as the user in the message. If this fails an error will be written to the logs.
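To make the last step concrete, the sketch below reads the `sub` claim from a token and places it in the `user` field of a message. Two caveats apply: the signature check is deliberately omitted here (the real service validates the token against its public keys first), and apart from `user` the message layout (`operation`, `filepath`) is a hypothetical placeholder rather than the documented format.

```python
import base64
import json


def unverified_claims(token):
    """Decode the payload of a JWT WITHOUT verifying its signature.
    Illustration only; never trust claims read this way."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("a JWT has three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def inbox_upload_message(token, object_path):
    """Build a message whose `user` field is the token's `sub` claim."""
    claims = unverified_claims(token)
    return {
        "operation": "upload",      # hypothetical field
        "user": claims["sub"],      # from the token, as described above
        "filepath": object_path,    # hypothetical field
    }
```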
Communication
- s3inbox proxies uploads to inbox storage.
- s3inbox inserts file information in the database using the RegisterFile database function and marks it as uploaded in the file_event_log.
- s3inbox writes messages to one RabbitMQ queue (commonly: inbox).
Configuration
There are a number of options that can be set for the s3inbox service.
These settings can be set by mounting a YAML file with the settings at /config.yaml, for example:

    log:
      level: "debug"
      format: "json"

They may also be set using environment variables like:

    export LOG_LEVEL="debug"
    export LOG_FORMAT="json"
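The two examples above suggest a simple naming relationship: a nested YAML key such as log.level corresponds to the environment variable LOG_LEVEL. A minimal sketch of that mapping follows; `env_names` is a hypothetical helper, and the precedence between the file and the environment is not covered because it is not stated here.

```python
def env_names(settings, prefix=""):
    """Flatten nested settings into environment-variable style names,
    e.g. {"log": {"level": "debug"}} -> {"LOG_LEVEL": "debug"}."""
    flat = {}
    for key, value in settings.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(env_names(value, name))
        else:
            flat[name.upper()] = value
    return flat
```

The settings dictionary is what a YAML parser would return for the file shown above.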
Server settings
These settings control the TLS status and where the service gets the public keys to validate the JWT tokens.
- SERVER_CERT: path to the x509 certificate used by the service
- SERVER_KEY: path to the x509 private key used by the service
- SERVER_JWTPUBKEYPATH: full path to the folder containing public keys used to validate JWT tokens
- SERVER_JWTPUBKEYURL: URL to OIDC JWK endpoint
RabbitMQ broker settings
These settings control how s3inbox connects to the RabbitMQ message broker.
- BROKER_HOST: hostname of the RabbitMQ server
- BROKER_PORT: RabbitMQ broker port (commonly: 5671 with TLS and 5672 without)
- BROKER_QUEUE: message queue to read messages from (commonly: archived)
- BROKER_ROUTINGKEY: routing key for publishing messages (commonly: verified)
- BROKER_USER: username to connect to RabbitMQ
- BROKER_PASSWORD: password to connect to RabbitMQ
- BROKER_PREFETCHCOUNT: number of messages to pull from the message server at a time (defaults to 2)
PostgreSQL Database settings
- DB_HOST: hostname for the postgresql database
- DB_PORT: database port (commonly: 5432)
- DB_PASSWORD: password for the database
- DB_DATABASE: database name
- DB_SSLMODE: the TLS encryption policy to use for database connections, valid options are:
  - disable
  - allow
  - prefer
  - require
  - verify-ca
  - verify-full

  More information is available in the postgresql documentation.

  Note that if DB_SSLMODE is set to anything but disable, then DB_CACERT needs to be set, and if set to verify-full, then DB_CLIENTCERT and DB_CLIENTKEY must also be set.
- DB_CLIENTKEY: key-file for the database client certificate
- DB_CLIENTCERT: database client certificate file
- DB_CACERT: Certificate Authority (CA) certificate for the database to use
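The dependency between DB_SSLMODE and the certificate settings can be expressed as a small validation routine. This is a sketch of the rule as stated above, not the service's own start-up check; `missing_db_tls_settings` is a hypothetical helper that takes any mapping of settings.

```python
VALID_SSLMODES = ("disable", "allow", "prefer", "require", "verify-ca", "verify-full")


def missing_db_tls_settings(env):
    """Return the names of TLS settings required by DB_SSLMODE but not set."""
    mode = env.get("DB_SSLMODE", "")
    if mode not in VALID_SSLMODES:
        raise ValueError(f"invalid DB_SSLMODE: {mode!r}")
    required = []
    if mode != "disable":
        required.append("DB_CACERT")        # needed for every mode except disable
    if mode == "verify-full":
        required += ["DB_CLIENTCERT", "DB_CLIENTKEY"]
    return [name for name in required if not env.get(name)]
```

An empty list means the TLS-related settings are complete for the chosen mode.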
Storage settings
- INBOX_TYPE: valid value is S3
- INBOX_URL: URL to the S3 system
- INBOX_ACCESSKEY: the S3 access and secret key are used to authenticate to S3, more info at AWS
- INBOX_SECRETKEY: the S3 access and secret key are used to authenticate to S3, more info at AWS
- INBOX_BUCKET: the S3 bucket to use as the storage root
- INBOX_PORT: S3 connection port (default: 443)
- INBOX_REGION: S3 region (default: us-east-1)
- INBOX_CHUNKSIZE: S3 chunk size for multipart uploads
- INBOX_CACERT: Certificate Authority (CA) certificate for the storage system, this is only needed if the S3 server has a certificate signed by a private entity
Logging settings
- LOG_FORMAT can be set to "json" to get logs in JSON format. All other values result in text logging.
- LOG_LEVEL can be set to one of the following, in increasing order of severity:
  - trace
  - debug
  - info
  - warn (or warning)
  - error
  - fatal
  - panic
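The severity ordering above implies the usual filtering behaviour: a message is emitted when its level is at least as severe as the configured one. The sketch below illustrates that under the assumption that values are matched exactly as listed; `normalise_level` and `is_logged` are hypothetical helpers, not part of the service.

```python
LEVELS = ("trace", "debug", "info", "warn", "error", "fatal", "panic")


def normalise_level(value):
    """Map the accepted spellings onto one canonical level name."""
    level = "warn" if value == "warning" else value
    if level not in LEVELS:
        raise ValueError(f"unknown log level: {value!r}")
    return level


def is_logged(message_level, configured_level):
    """True if a message at `message_level` passes the configured threshold."""
    message_rank = LEVELS.index(normalise_level(message_level))
    threshold = LEVELS.index(normalise_level(configured_level))
    return message_rank >= threshold
```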