sync Service
The sync service is used in the Bigpicture project.
It copies files from the archive to the sync destination, including the header, so that the files can be ingested at the remote site.
Service Description
The sync service copies files from the archive storage to sync storage.
When running, sync reads messages from the mapping_stream RabbitMQ queue.
For each message, these steps are taken (unless otherwise noted, an error halts progress, the message is Nack'ed, and the service moves on to the next message):
- The message is validated as valid JSON that matches the "dataset-mapping" schema. If the message can’t be validated, it is sent to the error queue for later analysis.
- The origin of the dataset is determined by checking the center prefix of the dataset ID. If the ID is a remote ID, processing stops.
- For each stable ID in the dataset, the following is performed:
  - The archive file path and file size are fetched from the database.
  - The file size on disk is requested from the storage system.
  - A file reader is created for the archive storage file, and a file writer is created for the sync storage file.
  - The header is read from the database.
  - The header is decrypted.
  - The header is re-encrypted with the destination's public key.
  - The header is written to the sync file writer.
  - The file data is copied from the archive file reader to the sync file writer.
- Once all files have been copied to the destination, a JSON structure is created according to the file-sync schema.
- A POST message is sent to the remote API host with the JSON data.
- The message is Ack'ed.
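The first two steps above can be sketched in Python as follows. This is only an illustration, not the service's actual code: the field names `type`, `dataset_id`, and `accession_ids` are assumptions about what the "dataset-mapping" schema contains, and `parse_mapping` and `is_local_dataset` are hypothetical helpers.

```python
import json


def parse_mapping(body: bytes) -> dict:
    """Parse a mapping message and check the fields this sketch assumes.

    Raises ValueError if the body is not valid JSON or a field is missing,
    which corresponds to sending the message to the error queue.
    """
    try:
        message = json.loads(body)
    except json.JSONDecodeError as err:
        raise ValueError(f"invalid JSON: {err}") from err
    if not isinstance(message, dict):
        raise ValueError("message must be a JSON object")
    # Assumed field names; the real schema is authoritative.
    for field in ("type", "dataset_id", "accession_ids"):
        if field not in message:
            raise ValueError(f"missing field: {field}")
    return message


def is_local_dataset(dataset_id: str, center_prefix: str) -> bool:
    """Return True when the dataset ID carries the local center prefix."""
    return dataset_id.startswith(center_prefix)


if __name__ == "__main__":
    msg = parse_mapping(
        b'{"type": "mapping", "dataset_id": "EX-dataset-1", "accession_ids": ["EX-file-1"]}'
    )
    # Only locally minted datasets are synced; remote ones are skipped.
    print(is_local_dataset(msg["dataset_id"], "EX"))
```

Keeping validation and the prefix check as separate functions mirrors the two different outcomes described above: an invalid message goes to the error queue, while a remote dataset simply stops processing.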
Communication
- Sync reads messages from one RabbitMQ stream (mapping_stream).
- Sync reads file information and headers from the database and cannot be started without a database connection.
- Sync re-encrypts the header with the receiving end's public key.
- Sync reads data from archive storage and writes data to sync destination storage with the re-encrypted headers attached.
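As a rough illustration of the last point, the sketch below writes a header first and then streams the file body after it. It assumes the header has already been re-encrypted (that step is left out here), uses in-memory buffers in place of the real storage readers and writers, and `write_sync_file` is a hypothetical name.

```python
import io
import shutil


def write_sync_file(new_header: bytes, archive_reader, sync_writer) -> None:
    """Write the re-encrypted header, then stream the archived body after it."""
    sync_writer.write(new_header)
    # Copy in 1 MiB chunks so large files are never held fully in memory.
    shutil.copyfileobj(archive_reader, sync_writer, 1024 * 1024)


if __name__ == "__main__":
    archive = io.BytesIO(b"encrypted-body")  # stands in for the archive reader
    sync = io.BytesIO()                      # stands in for the sync writer
    write_sync_file(b"HEADER", archive, sync)
    print(sync.getvalue())
```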
Configuration
There are a number of options that can be set for the sync service.
They can be set by mounting a YAML file containing the settings at /config.yaml, for example:
log:
  level: "debug"
  format: "json"
They may also be set using environment variables like:
export LOG_LEVEL="debug"
export LOG_FORMAT="json"
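Judging from the two examples above, a nested YAML key appears to correspond to an environment variable formed by joining the key path with underscores and upper-casing it. The sketch below illustrates that naming rule. It is an assumption inferred from these examples, not a statement about how the service parses its configuration, and `env_names` is a hypothetical helper.

```python
def env_names(config: dict, prefix: str = "") -> dict:
    """Flatten nested settings into ENVIRONMENT_STYLE names."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested sections, carrying the path so far.
            flat.update(env_names(value, name))
        else:
            flat[name.upper()] = value
    return flat


if __name__ == "__main__":
    print(env_names({"log": {"level": "debug", "format": "json"}}))
```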
Service settings
- SYNC_CENTERPREFIX: Prefix of the dataset ID, used to detect whether the dataset was minted locally or not
- SYNC_REMOTE_HOST: URL to the remote API host
- SYNC_REMOTE_PORT: Port for the remote API host, if other than the standard HTTP(S) ports
- SYNC_REMOTE_USER: Username for connecting to the remote API
- SYNC_REMOTE_PASSWORD: Password for the API user
Keyfile settings
These settings control which crypt4gh keyfile is loaded.
- C4GH_FILEPATH: path to the crypt4gh keyfile
- C4GH_PASSPHRASE: passphrase to unlock the keyfile
- C4GH_SYNCPUBKEYPATH: path to the crypt4gh public key to use for re-encrypting file headers
RabbitMQ broker settings
These settings control how sync connects to the RabbitMQ message broker.
- BROKER_HOST: hostname of the RabbitMQ server
- BROKER_PORT: RabbitMQ broker port (commonly 5671 with TLS and 5672 without)
- BROKER_QUEUE: message queue or stream to read messages from (commonly mapping_stream)
- BROKER_USER: username to connect to RabbitMQ
- BROKER_PASSWORD: password to connect to RabbitMQ
- BROKER_PREFETCHCOUNT: number of messages to pull from the message server at a time (defaults to 2)
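To show how these settings fit together, the sketch below assembles them into an AMQP connection URI, using the `amqps` scheme for TLS and `amqp` otherwise. Whether the service builds its connection this way is an assumption, and `broker_uri` is a hypothetical helper. Credentials are percent-encoded so that characters such as `@` or `/` do not break the URI.

```python
from urllib.parse import quote


def broker_uri(host: str, port: int, user: str, password: str, tls: bool) -> str:
    """Build an AMQP URI from the broker settings."""
    scheme = "amqps" if tls else "amqp"
    # safe="" forces encoding of every reserved character in the credentials.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"{scheme}://{credentials}@{host}:{port}"


if __name__ == "__main__":
    print(broker_uri("mq.example.org", 5671, "sync", "s3cret/pw", tls=True))
```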
PostgreSQL Database settings
- DB_HOST: hostname for the PostgreSQL database
- DB_PORT: database port (commonly 5432)
- DB_USER: username for the database
- DB_PASSWORD: password for the database
- DB_DATABASE: database name
- DB_SSLMODE: The TLS encryption policy to use for database connections. Valid options are:
  - disable
  - allow
  - prefer
  - require
  - verify-ca
  - verify-full
More information is available in the PostgreSQL documentation.
Note that if DB_SSLMODE is set to anything but disable, then DB_CACERT needs to be set,
and if it is set to verify-full, then DB_CLIENTCERT and DB_CLIENTKEY must also be set.
- DB_CLIENTKEY: key file for the database client certificate
- DB_CLIENTCERT: database client certificate file
- DB_CACERT: Certificate Authority (CA) certificate for the database to use
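The dependency between DB_SSLMODE and the certificate settings can be checked before deployment. The sketch below encodes only the rules stated above. `check_db_tls` is a hypothetical helper, and the service itself may report such problems differently.

```python
VALID_SSLMODES = ("disable", "allow", "prefer", "require", "verify-ca", "verify-full")


def check_db_tls(settings: dict) -> list:
    """Return a list of problems found in the TLS-related database settings."""
    problems = []
    mode = settings.get("DB_SSLMODE", "")
    if mode not in VALID_SSLMODES:
        problems.append(f"DB_SSLMODE has unsupported value {mode!r}")
        return problems
    # Any mode other than "disable" needs the CA certificate.
    if mode != "disable" and not settings.get("DB_CACERT"):
        problems.append("DB_CACERT must be set unless DB_SSLMODE is disable")
    # verify-full additionally needs the client certificate and key.
    if mode == "verify-full":
        for name in ("DB_CLIENTCERT", "DB_CLIENTKEY"):
            if not settings.get(name):
                problems.append(f"{name} must be set when DB_SSLMODE is verify-full")
    return problems


if __name__ == "__main__":
    for problem in check_db_tls({"DB_SSLMODE": "verify-full", "DB_CACERT": "/certs/ca.pem"}):
        print(problem)
```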
Storage settings
The sync service requires access to the "archive" and "sync" storage backends. To configure these, the following configuration is required:
storage:
  archive:
    ${STORAGE_IMPLEMENTATION}:
  sync:
    ${STORAGE_IMPLEMENTATION}:
For more details on the available configuration, see the storage/v2 README.md.
Sync operates by reading file data from the "archive" backend and replicating it to the "sync" backend for all files associated with a dataset.
Logging settings
- LOG_FORMAT can be set to “json” to get logs in JSON format. All other values result in text logging.
- LOG_LEVEL can be set to one of the following, in increasing order of severity:
  - trace
  - debug
  - info
  - warn (or warning)
  - error
  - fatal
  - panic
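The severity ordering can be made concrete with a small sketch. It assumes the usual convention that a message is emitted when its level is at least as severe as the configured LOG_LEVEL. That convention is not stated above, and `severity` and `is_logged` are hypothetical helpers.

```python
LEVELS = ["trace", "debug", "info", "warn", "error", "fatal", "panic"]
ALIASES = {"warning": "warn"}


def severity(level: str) -> int:
    """Return the position of a level in the severity order.

    Raises ValueError for a name that is not a known level.
    """
    name = ALIASES.get(level, level)
    return LEVELS.index(name)


def is_logged(message_level: str, configured_level: str) -> bool:
    """Assumed rule: emit messages at or above the configured severity."""
    return severity(message_level) >= severity(configured_level)


if __name__ == "__main__":
    print(is_logged("error", "info"))
    print(is_logged("debug", "info"))
```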