sync Service
The sync service is used in the Bigpicture project.
Copies files from the archive to the sync destination, including the header so that the files can be ingested at the remote site.
Service Description
The sync service copies files from the archive storage to sync storage.
When running, sync reads messages from the mapping_stream
RabbitMQ queue.
For each message, these steps are taken (if not otherwise noted, errors halts progress, the message is Nack'ed, and the service moves on to the next message):
- The message is validated as valid JSON that matches the "dataset-mapping" schema. If the message can’t be validated it is sent to the error queue for later analysis.
- Checks where the dataset is created by comparing the center prefix on the dataset ID, if it is a remote ID processing stops.
- For each stable ID in the dataset the following is performed:
- The archive file path and file size is fetched from the database.
- The file size on disk is requested from the storage system.
- A file reader is created for the archive storage file, and a file writer is created for the sync storage file.
- The header is read from the database.
- The header is decrypted.
- The header is reencrypted with the destinations public key.
- The header is written to the sync file writer.
- The file data is copied from the archive file reader to the sync file writer.
- Once all files have been copied to the destination a JSON structure is created according to
file-sync
schema. - A POST message is sent to the remote api host with the JSON data.
- The message is Ack'ed.
Communication
- Sync reads messages from one rabbitmq stream (
mapping_stream
) - Sync reads file information and headers from the database and can not be started without a database connection.
- Sync re-encrypts the header with the receiving end's public key.
- Sync reads data from archive storage and writes data to sync destination storage with the re-encrypted headers attached.
Configuration
There are a number of options that can be set for the sync service.
These settings can be set by mounting a yaml-file at /config.yaml
with settings.
ex.
log:
level: "debug"
format: "json"
They may also be set using environment variables like:
export LOG_LEVEL="debug"
export LOG_FORMAT="json"
Service settings
SYNC_CENTERPREFIX
: Prefix of the dataset ID to detect if the dataset was minted locally or notSYNC_REMOTE_HOST
: URL to the remote API hostSYNC_REMOTE_POST
: Port for the remote API host, if other than the standard HTTP(S) portsSYNC_REMOTE_USER
: Username for connecting to the remote APISYNC_REMOTE_PASSWORD
: Password for the API user
Keyfile settings
These settings control which crypt4gh keyfile is loaded.
C4GH_FILEPATH
: path to the crypt4gh keyfileC4GH_PASSPHRASE
: pass phrase to unlock the keyfileC4GH_SYNCPUBKEYPATH
: path to the crypt4gh public key to use for reencrypting file headers.
RabbitMQ broker settings
These settings control how sync connects to the RabbitMQ message broker.
BROKER_HOST
: hostname of the rabbitmq serverBROKER_PORT
: rabbitmq broker port (commonly5671
with TLS and5672
without)BROKER_QUEUE
: message queue or stream to read messages from (commonlymapping_stream
)BROKER_USER
: username to connect to rabbitmqBROKER_PASSWORD
: password to connect to rabbitmqBROKER_PREFETCHCOUNT
: Number of messages to pull from the message server at the time (default to 2)
PostgreSQL Database settings
DB_HOST
: hostname for the postgresql databaseDB_PORT
: database port (commonly 5432)DB_USER
: username for the databaseDB_PASSWORD
: password for the databaseDB_DATABASE
: database nameDB_SSLMODE
: The TLS encryption policy to use for database connections. Valid options are:disable
allow
prefer
require
verify-ca
verify-full
More information is available in the postgresql documentation
Note that if DB_SSLMODE
is set to anything but disable
, then DB_CACERT
needs to be set,
and if set to verify-full
, then DB_CLIENTCERT
, and DB_CLIENTKEY
must also be set.
DB_CLIENTKEY
: key-file for the database client certificateDB_CLIENTCERT
: database client certificate fileDB_CACERT
: Certificate Authority (CA) certificate for the database to use
Storage settings
Storage backend is defined by the ARCHIVE_TYPE
, and SYNC_DESTINATION_TYPE
variables.
Valid values for these options are S3
or POSIX
for ARCHIVE_TYPE
and POSIX
, S3
or SFTP
for SYNC_DESTINATION_TYPE
.
The value of these variables define what other variables are read.
The same variables are available for all storage types, differing by prefix (ARCHIVE_
, or SYNC_DESTINATION_
)
if *_TYPE
is S3
then the following variables are available:
*_URL
: URL to the S3 system*_ACCESSKEY
: The S3 access and secret key are used to authenticate to S3, more info at AWS*_SECRETKEY
: The S3 access and secret key are used to authenticate to S3, more info at AWS*_BUCKET
: The S3 bucket to use as the storage root*_PORT
: S3 connection port (default:443
)*_REGION
: S3 region (default:us-east-1
)*_CHUNKSIZE
: S3 chunk size for multipart uploads.*_CACERT
: Certificate Authority (CA) certificate for the storage system, CA certificate is only needed if the S3 server has a certificate signed by a private entity
if *_TYPE
is POSIX
:
*_LOCATION
: POSIX path to use as storage root
and if *_TYPE
is SFTP
:
*_HOST
: URL to the SFTP server*_PORT
: Port of the SFTP server to connect to*_USERNAME
: Username connection to the SFTP server*_HOSTKEY
: The SFTP server's public key*_PEMKEYPATH
: Path to the ssh private key used to connect to the SFTP server*_PEMKEYPASS
: Passphrase for the ssh private key
Logging settings
LOG_FORMAT
can be set to “json” to get logs in json format. All other values result in text loggingLOG_LEVEL
can be set to one of the following, in increasing order of severity:trace
debug
info
warn
(orwarning
)error
fatal
panic