# ingest Service

Splits the Crypt4GH header off each incoming file and stores it in the database. The remainder of the file is sent to the storage backend (archive). No cryptographic tasks are performed.
## Service Description

The `ingest` service copies files from the file inbox to the archive, and registers them in the database.

When running, `ingest` reads messages from the configured RabbitMQ queue (commonly: `ingest`).
For each message, the following steps are performed (unless otherwise noted, an error halts progress and the service moves on to the next message):

- The message is validated as valid JSON that matches the `ingestion-trigger` schema.
  - If the message can't be validated, it is discarded with an error message in the logs.
- If the message is of type `cancel`, the file is marked as `disabled` and the next message in the queue is read.
- A file reader is created for the file path in the message.
  - If the file reader can't be created, an error is written to the logs and the message is Nacked and forwarded to the error queue.
- The file size is read from the file reader.
  - On error, the error is written to the logs and the message is Nacked and forwarded to the error queue.
- A UUID is generated, and a file writer is created in the archive using the UUID as the filename.
  - On error, the error is written to the logs and the message is Nacked and re-queued.
- The filename is inserted into the database along with the ID of the uploading user. If the file already exists in the database, its status is updated instead.
  - Errors are written to the error log.
  - Errors writing the filename to the database do not halt ingestion.
- The header is read from the file and decrypted, to verify that the file is encrypted with the correct key.
  - If decryption fails, an error is written to the error log and the message is Nacked and forwarded to the error queue.
- The header is written to the database.
  - Errors are written to the error log.
- The header is stripped from the file data, and the remaining file data is written to the archive.
  - Errors are written to the error log.
- The size of the archived file is read.
  - Errors are written to the error log.
- The database is updated with the file size, archive path, and archive checksum, and the file is marked as archived.
  - Errors are written to the error log.
  - This error does not halt ingestion.
- A message is sent back to the original RabbitMQ broker containing the upload user, upload file path, database file ID, archive file path, and checksum of the archived file.
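The final notification can be pictured as a JSON message along the following lines. This is only a sketch: the field names and values here are illustrative, and the authoritative shape is defined by the message schemas, not by this example.

```json
{
  "user": "alice@example.org",
  "filepath": "inbox/alice/study1/sample.c4gh",
  "file_id": "aa5a6bcf-7c55-4cce-b087-1a8f1e876c4a",
  "archive_path": "aa5a6bcf-7c55-4cce-b087-1a8f1e876c4a",
  "encrypted_checksums": [
    {
      "type": "sha256",
      "value": "5b112299beb6c90d1a3c7e2b3e0f9c8d4a5b6c7d8e9f0a1b2c3d4e5f60718293"
    }
  ]
}
```

Note that the archive path reuses the generated UUID as the filename, as described in the steps above.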
## Communication

- `ingest` reads messages from one RabbitMQ queue (commonly: `ingest`).
- `ingest` publishes messages to one RabbitMQ queue (commonly: `archived`).
- `ingest` inserts file information into the database using three database functions: `InsertFile`, `StoreHeader`, and `SetArchived`.
- `ingest` reads file data from inbox storage and writes data to archive storage.
## Configuration

There are a number of options that can be set for the `ingest` service.
These settings can be set by mounting a yaml-file at `/config.yaml` with settings, e.g.:

```yaml
log:
  level: "debug"
  format: "json"
```

They may also be set using environment variables like:

```shell
export LOG_LEVEL="debug"
export LOG_FORMAT="json"
```
### Keyfile settings

These settings control which crypt4gh keyfile is loaded.

- `C4GH_FILEPATH`: filepath to the crypt4gh keyfile
- `C4GH_PASSPHRASE`: pass phrase to unlock the keyfile
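Following the environment-variable pattern shown above, the keyfile settings might be supplied like this (the path and passphrase are made-up examples):

```shell
# Point ingest at the service's crypt4gh private key (illustrative values)
export C4GH_FILEPATH="/keys/c4gh.sec.pem"
export C4GH_PASSPHRASE="example-passphrase"
```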
### RabbitMQ broker settings

These settings control how `ingest` connects to the RabbitMQ message broker.

- `BROKER_HOST`: hostname of the RabbitMQ server
- `BROKER_PORT`: RabbitMQ broker port (commonly: `5671` with TLS and `5672` without)
- `BROKER_QUEUE`: message queue to read messages from (commonly: `ingest`)
- `BROKER_ROUTINGKEY`: routing key for publishing messages (commonly: `archived`)
- `BROKER_USER`: username to connect to RabbitMQ
- `BROKER_PASSWORD`: password to connect to RabbitMQ
- `BROKER_PREFETCHCOUNT`: number of messages to pull from the message server at a time (defaults to `2`)
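In yaml form, the broker settings above might look like the following. This assumes the same environment-variable-to-yaml mapping as `LOG_LEVEL` → `log.level`; the hostname and credentials are made up.

```yaml
# Illustrative broker section (values are examples, not defaults)
broker:
  host: "rabbitmq.example.org"
  port: 5671
  queue: "ingest"
  routingkey: "archived"
  user: "ingest"
  password: "change-me"
  prefetchcount: 2
```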
### PostgreSQL database settings

- `DB_HOST`: hostname for the postgresql database
- `DB_PORT`: database port (commonly: `5432`)
- `DB_USER`: username for the database
- `DB_PASSWORD`: password for the database
- `DB_DATABASE`: database name
- `DB_SSLMODE`: the TLS encryption policy to use for database connections; valid options are:
  - `disable`
  - `allow`
  - `prefer`
  - `require`
  - `verify-ca`
  - `verify-full`

  More information is available in the postgresql documentation.

  Note that if `DB_SSLMODE` is set to anything but `disable`, then `DB_CACERT` needs to be set,
  and if set to `verify-full`, then `DB_CLIENTCERT` and `DB_CLIENTKEY` must also be set.
- `DB_CLIENTKEY`: key-file for the database client certificate
- `DB_CLIENTCERT`: database client certificate file
- `DB_CACERT`: Certificate Authority (CA) certificate for the database to use
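A database section satisfying the `verify-full` requirements above might look like this in yaml. As with the other examples, the environment-variable-to-yaml mapping is assumed from the `LOG_LEVEL` → `log.level` pattern, and all hostnames, credentials, and paths are made up.

```yaml
# Illustrative database section; verify-full requires CA cert,
# client cert, and client key to all be set
db:
  host: "postgres.example.org"
  port: 5432
  user: "sda"
  password: "change-me"
  database: "sda"
  sslmode: "verify-full"
  cacert: "/certs/ca.pem"
  clientcert: "/certs/client.pem"
  clientkey: "/certs/client-key.pem"
```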
### Storage settings

The storage backends are defined by the `ARCHIVE_TYPE` and `INBOX_TYPE` variables.
Valid values for these options are `S3` or `POSIX` (defaults to `POSIX` on unknown values).

The value of these variables defines which other variables are read.
The same variables are available for all storage types, differing only by prefix (`ARCHIVE_` or `INBOX_`).

If `*_TYPE` is `S3`, then the following variables are available:

- `*_URL`: URL to the S3 system
- `*_ACCESSKEY`: the S3 access and secret key are used to authenticate to S3, more info at AWS
- `*_SECRETKEY`: the S3 access and secret key are used to authenticate to S3, more info at AWS
- `*_BUCKET`: the S3 bucket to use as the storage root
- `*_PORT`: S3 connection port (default: `443`)
- `*_REGION`: S3 region (default: `us-east-1`)
- `*_CHUNKSIZE`: S3 chunk size for multipart uploads
- `*_CACERT`: Certificate Authority (CA) certificate for the storage system; this is only needed if the S3 server has a certificate signed by a private entity

And if `*_TYPE` is `POSIX`:

- `*_LOCATION`: POSIX path to use as storage root
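Putting the prefix scheme together, a setup with an S3 archive and a POSIX inbox might be configured like this in yaml (endpoint, credentials, and paths are made-up examples, and the environment-variable-to-yaml mapping is assumed from the `LOG_LEVEL` → `log.level` pattern):

```yaml
# Illustrative storage section: S3 archive, POSIX inbox
archive:
  type: "S3"
  url: "https://s3.example.org"
  port: 443
  accesskey: "AKIAEXAMPLE"
  secretkey: "change-me"
  bucket: "archive"
  region: "us-east-1"
inbox:
  type: "POSIX"
  location: "/inbox"
```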
### Logging settings

- `LOG_FORMAT` can be set to `json` to get logs in JSON format. All other values result in text logging.
- `LOG_LEVEL` can be set to one of the following, in increasing order of severity:
  - `trace`
  - `debug`
  - `info`
  - `warn` (or `warning`)
  - `error`
  - `fatal`
  - `panic`