ingest Service
Splits the Crypt4GH header from the file and stores it in the database. The remainder of the file is sent to the storage backend (archive). No cryptographic tasks are done.
Service Description
The ingest service copies files from the file inbox to the archive, and registers them in the database.
When running, ingest reads messages from the configured RabbitMQ queue (commonly: ingest).
For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):
- The message is validated as valid JSON that matches the ingestion-trigger schema.
- If the message can’t be validated it is discarded with an error message in the logs.
- If the message is of type cancel, the file will be marked as disabled and the next message in the queue will be read.
- A file reader is created for the filepath in the message.
- If the file reader can’t be created an error is written to the logs, the message is Nacked and forwarded to the error queue.
- The file size is read from the file reader.
- On error, the error is written to the logs, the message is Nacked and forwarded to the error queue.
- A uuid is generated, and a file writer is created in the archive using the uuid as filename.
- On error the error is written to the logs and the message is Nacked and then re-queued.
- The filename is inserted into the database along with the user id of the uploading user. If the file already exists in the database, its status is updated.
- Errors are written to the error log.
- Errors writing the filename to the database do not halt ingestion progress.
- The header is read from the file, and decrypted to ensure that it’s encrypted with the correct key.
- If the decryption fails, an error is written to the error log, the message is Nacked, and the message is forwarded to the error queue.
- The header is written to the database.
- Errors are written to the error log.
- The header is stripped from the file data, and the remaining file data is written to the archive.
- Errors are written to the error log.
- The size of the archived file is read.
- Errors are written to the error log.
- The database is updated with the file size, archive path, and archive checksum, and the file is set as archived.
- Errors are written to the error log.
- This error does not halt ingestion.
- A message is sent back to the original RabbitMQ broker containing the upload user, upload file path, database file id, archive file path and checksum of the archived file.
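The header-stripping steps above can be sketched in a few lines. This is a minimal illustration, not the service's code. It assumes the container layout described in the Crypt4GH specification: an 8-byte magic string, a 4-byte little-endian version, a 4-byte little-endian packet count, then header packets that each start with a 4-byte little-endian length which includes the length field itself. The function name `split_crypt4gh_header` is hypothetical, and unlike the service the sketch does not decrypt the header to check the key.

```python
import struct

MAGIC = b"crypt4gh"  # magic string at the start of a Crypt4GH file


def split_crypt4gh_header(stream):
    """Read the header from a binary stream.

    Returns (header_bytes, stream); the stream is left positioned at the
    first byte of the encrypted file data, ready to be copied to the archive.
    """
    preamble = stream.read(16)
    if len(preamble) != 16 or preamble[:8] != MAGIC:
        raise ValueError("not a Crypt4GH file")
    version, packet_count = struct.unpack("<II", preamble[8:16])
    if version != 1:
        raise ValueError(f"unsupported Crypt4GH version {version}")

    header = bytearray(preamble)
    for _ in range(packet_count):
        raw_len = stream.read(4)
        if len(raw_len) != 4:
            raise ValueError("truncated header")
        (packet_len,) = struct.unpack("<I", raw_len)
        if packet_len < 4:
            raise ValueError("invalid header packet length")
        # The stored length counts its own 4 bytes, so read the remainder.
        body = stream.read(packet_len - 4)
        if len(body) != packet_len - 4:
            raise ValueError("truncated header packet")
        header += raw_len + body
    return bytes(header), stream
```

In this sketch the returned header bytes are what would be stored in the database, and everything still unread in the stream is what would be written to the archive.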
Communication
- Ingest reads messages from one RabbitMQ queue (commonly: ingest).
- Ingest publishes messages to one RabbitMQ queue (commonly: archived).
- Ingest inserts file information in the database using three database functions, InsertFile, StoreHeader, and SetArchived.
- Ingest reads file data from inbox storage and writes data to archive storage.
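To make the two message directions concrete, here is a rough sketch of checking an incoming trigger message and building the outgoing one. The real service validates against the ingestion-trigger JSON schema, which is not reproduced here. The required fields (`type`, `user`, `filepath`) and all field names in the outgoing message are assumptions chosen for illustration, as are the function names.

```python
import json

# Assumed minimal set of fields; the real ingestion-trigger schema may differ.
REQUIRED_FIELDS = ("type", "user", "filepath")


def parse_trigger(raw):
    """Return the decoded message, or None if it should be discarded."""
    try:
        msg = json.loads(raw)
    except (ValueError, TypeError):
        return None  # not valid JSON
    if not isinstance(msg, dict):
        return None
    if any(not isinstance(msg.get(key), str) for key in REQUIRED_FIELDS):
        return None  # a required field is missing or not a string
    return msg


def is_cancel(msg):
    """True if the file should be marked as disabled instead of ingested."""
    return msg["type"] == "cancel"


def build_archived_message(trigger, file_id, archive_path, checksum):
    """Build the message published after archiving (hypothetical field names)."""
    return json.dumps({
        "user": trigger["user"],
        "filepath": trigger["filepath"],
        "file_id": file_id,
        "archive_path": archive_path,
        "checksum": checksum,
    })
```

A message for which `parse_trigger` returns `None` corresponds to the "discarded with an error message in the logs" case described in the step list.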
Configuration
There are a number of options that can be set for the ingest service.
These settings can be set by mounting a YAML file containing them at /config.yaml.
For example:
log:
  level: "debug"
  format: "json"
They may also be set using environment variables like:
export LOG_LEVEL="debug"
export LOG_FORMAT="json"
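The two examples suggest a simple naming rule: a nested YAML key such as `log.level` corresponds to the environment variable `LOG_LEVEL`. The sketch below spells that rule out. It is inferred from the examples in this document rather than taken from the service's configuration code, and the function names are hypothetical.

```python
def env_name(key_path):
    """Map a dotted YAML key path to the matching environment variable name."""
    return key_path.replace(".", "_").upper()


def flatten(settings, prefix=""):
    """Flatten nested YAML-style settings into {ENV_NAME: value}."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[env_name(path)] = value
    return flat


print(flatten({"log": {"level": "debug", "format": "json"}}))
# prints {'LOG_LEVEL': 'debug', 'LOG_FORMAT': 'json'}
```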
Keyfile settings
These settings control which crypt4gh keyfile is loaded.
- C4GH_FILEPATH: filepath to the crypt4gh keyfile
- C4GH_PASSPHRASE: pass phrase to unlock the keyfile
RabbitMQ broker settings
These settings control how ingest connects to the RabbitMQ message broker.
- BROKER_HOST: hostname of the RabbitMQ server
- BROKER_PORT: RabbitMQ broker port (commonly: 5671 with TLS and 5672 without)
- BROKER_QUEUE: message queue to read messages from (commonly: ingest)
- BROKER_ROUTINGKEY: routing key for publishing messages (commonly: archived)
- BROKER_USER: username to connect to RabbitMQ
- BROKER_PASSWORD: password to connect to RabbitMQ
- BROKER_PREFETCHCOUNT: number of messages to pull from the message server at a time (defaults to 2)
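As an illustration of how these values fit together, the sketch below assembles an AMQP connection URI from them. The service itself may connect differently. The `use_tls` flag is a hypothetical parameter (this document does not name a TLS switch), and the fallback ports are simply the common values mentioned above.

```python
from urllib.parse import quote


def broker_url(env, use_tls):
    """Build an AMQP URI from BROKER_* settings held in a dict."""
    scheme = "amqps" if use_tls else "amqp"
    # Fall back to the common ports when BROKER_PORT is not set.
    port = env.get("BROKER_PORT") or ("5671" if use_tls else "5672")
    # Percent-encode credentials so characters like '@' or '/' are safe.
    user = quote(env["BROKER_USER"], safe="")
    password = quote(env["BROKER_PASSWORD"], safe="")
    # No trailing slash: the broker's default virtual host is used.
    return f"{scheme}://{user}:{password}@{env['BROKER_HOST']}:{port}"
```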
PostgreSQL Database settings:
- DB_HOST: hostname for the postgresql database
- DB_PORT: database port (commonly: 5432)
- DB_USER: username for the database
- DB_PASSWORD: password for the database
- DB_DATABASE: database name
- DB_SSLMODE: The TLS encryption policy to use for database connections, valid options are: disable, allow, prefer, require, verify-ca, verify-full
More information is available in the postgresql documentation
Note that if DB_SSLMODE is set to anything but disable, then DB_CACERT needs to be set,
and if set to verify-full, then DB_CLIENTCERT, and DB_CLIENTKEY must also be set.
- DB_CLIENTKEY: key-file for the database client certificate
- DB_CLIENTCERT: database client certificate file
- DB_CACERT: Certificate Authority (CA) certificate for the database to use
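The dependency between DB_SSLMODE and the certificate settings can be checked before deployment. The following is a small sketch of such a check, written directly from the rules stated above; the function name `check_db_tls` is hypothetical and the service's own validation may behave differently.

```python
VALID_SSLMODES = ("disable", "allow", "prefer", "require", "verify-ca", "verify-full")


def check_db_tls(env):
    """Return a list of problems with the DB_* TLS settings (empty if none)."""
    mode = env.get("DB_SSLMODE", "")
    if mode not in VALID_SSLMODES:
        return [f"DB_SSLMODE has invalid value {mode!r}"]

    problems = []
    # Any mode other than 'disable' needs the CA certificate.
    if mode != "disable" and not env.get("DB_CACERT"):
        problems.append("DB_CACERT must be set when DB_SSLMODE is not 'disable'")
    # 'verify-full' additionally needs the client certificate and key.
    if mode == "verify-full":
        for name in ("DB_CLIENTCERT", "DB_CLIENTKEY"):
            if not env.get(name):
                problems.append(f"{name} must be set when DB_SSLMODE is 'verify-full'")
    return problems
```

It could be called with `dict(os.environ)` in a start-up script, failing early if the returned list is not empty.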
Storage settings
The ingest service requires access to the "inbox" and "archive" storages. The "backup" storage is optional; configure it if cancelled files should also be deleted automatically from the backup location.
storage:
  inbox:
    ${STORAGE_IMPLEMENTATION}:
  archive:
    ${STORAGE_IMPLEMENTATION}:
  backup: # Exclude if no backup storage
    ${STORAGE_IMPLEMENTATION}:
For more details on available configuration see storage/v2 README.md
Logging settings:
- LOG_FORMAT can be set to json to get logs in JSON format. All other values result in text logging.
- LOG_LEVEL can be set to one of the following, in increasing order of severity: trace, debug, info, warn (or warning), error, fatal, panic
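The effect of these two settings can be illustrated with a short sketch. It only models the behaviour described above (severity ordering, `warning` as an alias for `warn`, and anything other than `json` meaning text output). The helper names are hypothetical and the service's logging library is not involved.

```python
# Levels in increasing order of severity, as listed above.
LEVELS = ("trace", "debug", "info", "warn", "error", "fatal", "panic")


def normalize_level(name):
    """Treat 'warning' as an alias for 'warn'."""
    return "warn" if name == "warning" else name


def is_logged(configured_level, message_level):
    """True if a message at message_level passes the configured LOG_LEVEL."""
    threshold = LEVELS.index(normalize_level(configured_level))
    return LEVELS.index(normalize_level(message_level)) >= threshold


def log_format(value):
    """Only the exact value 'json' selects JSON output; anything else is text."""
    return "json" if value == "json" else "text"
```

With LOG_LEVEL set to `info`, for example, `is_logged("info", "debug")` is false and `is_logged("info", "warning")` is true.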