Troubleshooting
This guide aims to provide general tips on troubleshooting and restoring services to a functional state.
Note
Please also check the questions other users have faced in the sensitive-data-archive Discussions. We encourage starting a discussion there, as it will help us improve the current guides.
After deployment checklist
After having deployed the SDA services in a FederatedEGA setup, the following steps can be followed to ensure that everything is up and running correctly.
Services running
The first step is to verify that the services are up and running and that the credentials are valid. Make sure that:

- credentials for access to RabbitMQ and Postgres are securely injected into the respective services in the form of secrets
- all the pods/containers are in `Ready`/`Up` status and there are no restarts among the pods/containers
- for a FederatedEGA setup the following pods are required: `intercept`, `ingest`, `verify`, `finalize`, `mapper` and a Data Retrieval API
- the pod/container logs for `intercept`, `ingest`, `verify`, `finalize` and `mapper` contain the following as the last message (after they have started):

      time="2023-12-12T19:25:02Z" level=info msg="Starting <service-name> service"

- the pod/container logs for the `download` Data Retrieval service contain the following as the last message (after it has started):

      time="2023-12-12T19:28:16Z" level=info msg="(5/5) Starting web server"
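The log check above can be scripted. The pattern below matches the expected startup line; the `kubectl` command in the comment assumes a Kubernetes deployment in a namespace called `sda`, which is purely an assumption — adjust it to your setup.

```shell
# Expected final log line format for intercept, ingest, verify, finalize and mapper.
# In a Kubernetes deployment the last line could be fetched with, e.g.:
#   kubectl logs -n sda deploy/ingest --tail=1
# (namespace and deployment names are assumptions)
logline='time="2023-12-12T19:25:02Z" level=info msg="Starting ingest service"'

# Match the generic "Starting <service-name> service" startup message.
pattern='level=info msg="Starting [a-z]+ service"'
if printf '%s\n' "$logline" | grep -Eq "$pattern"; then
  echo "service started"
fi
```

Looping the same check over all five service names gives a quick pass/fail view of the whole pipeline.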
The next step is to make sure that the remote connections (CentralEGA RabbitMQ) are working. Log in to the RabbitMQ admin page and check that:

- the Federation status in the Admin tab is in state `running`, or use `rabbitmqctl federation_status` from the command line of a RabbitMQ pod/container
- the Shovel status in the Admin tab is in state `running` for all shovels, or use `rabbitmqctl shovel_status` from the command line of a RabbitMQ pod/container
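A scripted check can flag any link that is not in state `running`. The states below are simulated stand-ins; on the broker itself they would come from `rabbitmqctl federation_status` and `rabbitmqctl shovel_status`.

```shell
# Simulated per-link states; on a RabbitMQ pod/container these would be
# extracted from the output of:
#   rabbitmqctl federation_status
#   rabbitmqctl shovel_status
states='running
running
terminated'

# Count lines that are not exactly "running".
bad=$(printf '%s\n' "$states" | grep -cv '^running$')
if [ "$bad" -eq 0 ]; then
  echo "all links running"
else
  echo "$bad link(s) not running"
fi
```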
End-to-end testing
Note
This guide assumes that there exists a test instance account with CentralEGA. Make sure that the account is approved and added to the submitters group.
The local development and testing guide provides scripts for testing different parts of the setup, which can be used as a reference.
Upload file(s)
Upload one or a number of files of different sizes and check that:

- the file(s) exist in the configured `inbox` of the storage backend (e.g. S3 bucket or POSIX path)
- entries for the file(s) exist in the database, in the `sda.files` and `sda.file_event_log` tables
- if the `s3inbox` is used, there should be an `uploaded` event for each specific file in the `sda.file_event_log`
- the file(s) exist in the CentralEGA Submission portal (the submission portal URL address is specific to each FederatedEGA node), under the `Files` listing, which can be accessed after pressing the three-lines menu button
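The database checks above can be run with `psql`. The table names and the `uploaded` event come from this guide, but the column names and connection details are assumptions — verify them against your schema. The sketch below only prints the queries:

```shell
# Table names and the 'uploaded' event are from this guide;
# column names (id, submission_user, file_id, event) are assumptions.
SQL="SELECT id, submission_user FROM sda.files;
SELECT file_id, event FROM sda.file_event_log WHERE event = 'uploaded';"

# Run against the SDA database with, e.g.:
#   printf '%s\n' "$SQL" | psql -h <db-host> -U <db-user> -d sda
printf '%s\n' "$SQL"
```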
Make a test submission
Make a submission with the portal and select the file(s) that were uploaded in the previous step. Once the analysis or runs step (one of the two is required) is finished, the messages for the ingestion of the files should appear in the logs of the `ingest` service. Make sure that:

- the messages arrive for the file(s) included in the submission
- the `ingest`, `verify` and `finalize` processes are started and send a message when finished
- the data in the `sda.files` table are correct
- the file events are logged in the `sda.file_event_log` table for each of the services and files
- the file(s) exist in the configured `archive` storage backend; see the `archive_file_path` in the `sda.files` table for the name of the archived file(s)
- the archived file(s) exist in the configured `backup` storage backend
- deleting one run in the submitter portal and then adding it back again works, to make sure the cancel message is handled as intended
Finally, when all files have been ingested, the submission portal should allow finalising the submission. The submission first needs to be accepted through a helpdesk portal. Once this step is done, make sure that:

- the message for the dataset arrives at the `mapper` service
- the dataset is created in the database and includes the correct files, by checking the `sda.datasets` and `sda.file_dataset` tables
- the dataset has the status `registered` in the `sda.dataset_event_log`
- the dataset gets the status `released` in the `sda.dataset_event_log`; this might take a while depending on what date was chosen in the submitter portal
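The dataset checks can also be expressed as queries. The table names and the `registered`/`released` statuses come from this guide, but the column names and join keys below are assumptions — adjust them to the actual schema. The sketch only prints the queries:

```shell
# Table names and the 'registered'/'released' statuses are from this guide;
# column names and join keys are assumptions.
SQL="SELECT d.stable_id, f.id
  FROM sda.datasets d
  JOIN sda.file_dataset fd ON fd.dataset_id = d.id
  JOIN sda.files f ON f.id = fd.file_id;
SELECT dataset_id, event FROM sda.dataset_event_log
 WHERE event IN ('registered', 'released');"

# Run against the SDA database with, e.g.:
#   printf '%s\n' "$SQL" | psql -h <db-host> -U <db-user> -d sda
printf '%s\n' "$SQL"
```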
Once all submission steps have been confirmed, the pipeline component of the deployment can be considered to be functioning correctly.