finalize Service
Handles the so-called Accession ID (stable ID) to filename mappings from CentralEGA.
At the same time the service fulfills the replication requirement of having distinct backup copies.
For more information, see the Federated EGA Node Operations v2 document.
Service Description
`Finalize` adds stable, shareable Accession IDs to archive files.
If a backup location is configured, it also backs up each file.
When running, `finalize` reads messages from the configured RabbitMQ queue (commonly: `accession`).
For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):
- The message is validated as valid JSON that matches the `ingestion-accession` schema.
  - If the message can’t be validated, it is discarded with an error message in the logs.
- If the service is configured to perform backups, i.e. the `ARCHIVE_` and `BACKUP_` storage backends are set, archived files are copied to the backup location:
  - The file size on disk is requested from the storage system.
  - The database file size is compared against the disk file size.
  - A file reader is created for the archive storage file, and a file writer is created for the backup storage file.
  - The file data is copied from the archive file reader to the backup file writer.
- If the type of the `DecryptedChecksums` field in the message is `sha256`, the value is stored.
- A new RabbitMQ `complete` message is created and validated against the `ingestion-completion` schema.
  - If the validation fails, an error message is written to the logs.
- The file accession ID in the message is marked as ready in the database.
  - On error, the service sleeps for up to 5 minutes to allow for database recovery; after 5 minutes the message is Nacked and re-queued, and an error message is written to the logs.
- The complete message is sent to RabbitMQ. On error, a message is written to the logs.
- The original RabbitMQ message is Ack'ed.
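The validation, backup, and checksum steps above can be illustrated with a minimal Python sketch. This is not the service's implementation: the required-field check is only a stand-in for real schema validation, and the field names (`accession_id`, `filepath`, `decrypted_checksums`) and function names are assumptions made for illustration.

```python
import io
import json
import shutil
from typing import Optional

# Hypothetical field names; the real ingestion-accession schema may differ.
REQUIRED_FIELDS = ("accession_id", "filepath", "decrypted_checksums")


def parse_message(body: bytes) -> Optional[dict]:
    """Return the decoded message, or None if it should be discarded."""
    try:
        msg = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(msg, dict) or any(f not in msg for f in REQUIRED_FIELDS):
        return None
    return msg


def backup(reader, writer, db_size: int, disk_size: int) -> None:
    """Copy archive data to the backup writer only if the two sizes agree."""
    if db_size != disk_size:
        raise ValueError(f"size mismatch: database={db_size} disk={disk_size}")
    shutil.copyfileobj(reader, writer)


def sha256_checksum(msg: dict) -> Optional[str]:
    """Pick the value of the sha256 entry from the checksum list, if present."""
    for entry in msg["decrypted_checksums"]:
        if entry.get("type") == "sha256":
            return entry.get("value")
    return None
```

In this sketch a `None` result from `parse_message` corresponds to discarding the message, and the `ValueError` from `backup` corresponds to halting progress on that message.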
Communication
- `Finalize` reads messages from one RabbitMQ queue (commonly: `accession`).
- `Finalize` publishes messages with one routing key (commonly: `completed`).
- `Finalize` assigns the accession ID to a file in the database using the `SetAccessionID` function.
Configuration
There are a number of options that can be set for the `finalize` service.
These settings can be set by mounting a YAML file at `/config.yaml`.
For example:
log:
  level: "debug"
  format: "json"
They may also be set using environment variables like:
export LOG_LEVEL="debug"
export LOG_FORMAT="json"
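The two examples suggest that a nested YAML key corresponds to an environment variable made of the upper-cased key path joined by underscores. A small Python sketch of that naming rule follows; the helper name `to_env_names` is hypothetical, and the rule is inferred from the `log.level` / `LOG_LEVEL` pair rather than taken from the service's code.

```python
def to_env_names(settings: dict, prefix: str = "") -> dict:
    """Flatten nested settings into upper-case names joined by underscores."""
    flat = {}
    for key, value in settings.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(to_env_names(value, name))
        else:
            flat[name.upper()] = value
    return flat


config = {"log": {"level": "debug", "format": "json"}}
print(to_env_names(config))  # {'LOG_LEVEL': 'debug', 'LOG_FORMAT': 'json'}
```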
RabbitMQ broker settings
These settings control how `finalize` connects to the RabbitMQ message broker.
- `BROKER_HOST`: hostname of the RabbitMQ server
- `BROKER_PORT`: RabbitMQ broker port (commonly: `5671` with TLS and `5672` without)
- `BROKER_QUEUE`: message queue to read messages from (commonly: `accession`)
- `BROKER_ROUTINGKEY`: routing key for publishing messages (commonly: `completed`)
- `BROKER_USER`: username to connect to RabbitMQ
- `BROKER_PASSWORD`: password to connect to RabbitMQ
- `BROKER_PREFETCHCOUNT`: number of messages to pull from the message server at a time (defaults to `2`)
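As an illustration of how these variables fit together, the following Python sketch collects them into one settings object. The `BrokerSettings` class and `broker_from_env` helper are hypothetical, and treating every variable except `BROKER_PREFETCHCOUNT` as required is an assumption; only the default of `2` comes from the list above.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class BrokerSettings:
    host: str
    port: int
    queue: str
    routing_key: str
    user: str
    password: str
    prefetch_count: int = 2  # documented default


def broker_from_env(env: Mapping[str, str]) -> BrokerSettings:
    """Build broker settings from a mapping such as os.environ."""
    return BrokerSettings(
        host=env["BROKER_HOST"],
        port=int(env["BROKER_PORT"]),
        queue=env["BROKER_QUEUE"],
        routing_key=env["BROKER_ROUTINGKEY"],
        user=env["BROKER_USER"],
        password=env["BROKER_PASSWORD"],
        prefetch_count=int(env.get("BROKER_PREFETCHCOUNT", 2)),
    )
```

Passing `os.environ` to `broker_from_env` would raise `KeyError` for any missing required variable, which makes misconfiguration visible at startup in this sketch.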
PostgreSQL Database settings
- `DB_HOST`: hostname for the PostgreSQL database
- `DB_PORT`: database port (commonly: `5432`)
- `DB_USER`: username for the database
- `DB_PASSWORD`: password for the database
- `DB_DATABASE`: database name
- `DB_SSLMODE`: the TLS encryption policy to use for database connections. Valid options are:
  - `disable`
  - `allow`
  - `prefer`
  - `require`
  - `verify-ca`
  - `verify-full`

  More information is available in the PostgreSQL documentation.

  Note that if `DB_SSLMODE` is set to anything but `disable`, then `DB_CACERT` needs to be set, and if set to `verify-full`, then `DB_CLIENTCERT` and `DB_CLIENTKEY` must also be set.
- `DB_CLIENTKEY`: key file for the database client certificate
- `DB_CLIENTCERT`: database client certificate file
- `DB_CACERT`: Certificate Authority (CA) certificate for the database to use
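The dependency between `DB_SSLMODE` and the certificate variables can be expressed as a small check. The Python sketch below encodes only the rule stated in the note above; the helper name `missing_db_tls_settings` is hypothetical and is not part of the service.

```python
from typing import List, Mapping

VALID_SSLMODES = ("disable", "allow", "prefer", "require", "verify-ca", "verify-full")


def missing_db_tls_settings(env: Mapping[str, str]) -> List[str]:
    """Return the certificate variables that the chosen DB_SSLMODE requires but are unset."""
    mode = env.get("DB_SSLMODE", "")
    if mode not in VALID_SSLMODES:
        raise ValueError(f"invalid DB_SSLMODE: {mode!r}")
    required = []
    if mode != "disable":
        required.append("DB_CACERT")
    if mode == "verify-full":
        required += ["DB_CLIENTCERT", "DB_CLIENTKEY"]
    return [name for name in required if not env.get(name)]
```

An empty list means the TLS-related settings are consistent with the selected mode.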
Logging settings
- `LOG_FORMAT` can be set to `json` to get logs in JSON format. All other values result in text logging.
- `LOG_LEVEL` can be set to one of the following, in increasing order of severity:
  - `trace`
  - `debug`
  - `info`
  - `warn` (or `warning`)
  - `error`
  - `fatal`
  - `panic`
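A short Python sketch of how these two settings could be interpreted is given below. The function names are hypothetical, exact (case-sensitive) matching is an assumption, and the filtering rule in `is_enabled` is the conventional one for levelled loggers rather than something stated in this document.

```python
# Levels in increasing order of severity, as listed above.
LEVELS = ("trace", "debug", "info", "warn", "error", "fatal", "panic")


def normalize_level(value: str) -> str:
    """Map the `warning` alias to `warn` and reject unknown levels."""
    level = "warn" if value == "warning" else value
    if level not in LEVELS:
        raise ValueError(f"unknown log level: {value!r}")
    return level


def is_enabled(configured: str, message_level: str) -> bool:
    """Assumed rule: emit messages at least as severe as the configured level."""
    return LEVELS.index(normalize_level(message_level)) >= LEVELS.index(normalize_level(configured))


def log_format(value: str) -> str:
    """Only the exact value `json` selects JSON output; anything else is text."""
    return "json" if value == "json" else "text"
```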
Storage settings
The storage backend is defined by the `ARCHIVE_TYPE` and `BACKUP_TYPE` variables.
Valid values for these options are `S3` or `POSIX` (defaults to `POSIX` on unknown values).
The value of these variables defines which other variables are read.
The same variables are available for all storage types, differing by prefix (`ARCHIVE_` or `BACKUP_`).
If `*_TYPE` is `S3`, the following variables are available:
- `*_URL`: URL to the S3 system
- `*_ACCESSKEY`: the S3 access and secret key are used to authenticate to S3, more info at AWS
- `*_SECRETKEY`: the S3 access and secret key are used to authenticate to S3, more info at AWS
- `*_BUCKET`: the S3 bucket to use as the storage root
- `*_PORT`: S3 connection port (default: `443`)
- `*_REGION`: S3 region (default: `us-east-1`)
- `*_CHUNKSIZE`: S3 chunk size for multipart uploads
- `*_CACERT`: Certificate Authority (CA) certificate for the storage system; this is only needed if the S3 server has a certificate signed by a private entity

If `*_TYPE` is `POSIX`:
- `*_LOCATION`: POSIX path to use as storage root
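The way `*_TYPE` selects which variables are read, including the fallback to `POSIX` and the documented S3 defaults, can be sketched in Python as follows. The helper name `storage_settings` is hypothetical, and exact matching on the string `S3` is an assumption.

```python
from typing import Dict, Mapping

S3_KEYS = ("URL", "ACCESSKEY", "SECRETKEY", "BUCKET", "PORT", "REGION", "CHUNKSIZE", "CACERT")
S3_DEFAULTS = {"PORT": "443", "REGION": "us-east-1"}  # defaults from the list above


def storage_settings(env: Mapping[str, str], prefix: str) -> Dict[str, str]:
    """Collect the variables relevant to one backend; prefix is ARCHIVE or BACKUP."""
    kind = env.get(f"{prefix}_TYPE", "")
    if kind == "S3":
        keys, defaults = S3_KEYS, S3_DEFAULTS
    else:
        # Unknown or missing values fall back to POSIX.
        kind, keys, defaults = "POSIX", ("LOCATION",), {}
    settings = {"TYPE": kind}
    for key in keys:
        value = env.get(f"{prefix}_{key}", defaults.get(key))
        if value is not None:
            settings[key] = value
    return settings
```

Calling `storage_settings(os.environ, "ARCHIVE")` and `storage_settings(os.environ, "BACKUP")` would then give the two backend configurations side by side.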