On Amazon Web Services HealthOmics
AWS HealthOmics is a cloud-based service that provides a secure, compliant, and scalable platform for storing, analyzing, and sharing genomic data. It is built on Amazon Web Services (AWS) and gives researchers, clinicians, and other stakeholders a suite of tools for working with that data at scale.
Workbench supports running workflows on AWS HealthOmics. This guide provides detailed instructions on setting up an engine for AWS HealthOmics in Workbench.
Setting up Your AWS Environment
To use AWS HealthOmics as an engine in Workbench, you need to set up an AWS account and configure the necessary permissions for it to access the required AWS services. Additionally, HealthOmics itself requires some configuration prior to its first use, to grant the service the necessary permissions to access your resources. For an in-depth guide on HealthOmics, please refer to the official user guide.
There are multiple ways to configure your account to use AWS HealthOmics as an engine in Workbench. You can create the required resources manually through the console or the AWS CLI, or you can use the Terraform engine installer provided by DNAstack. The simplest way to get started is to use the installer, which creates the necessary resources for you.
Using the Installer
For complete instructions including all of the configuration options available, see the installer README. The script has a number of defaults but requires minimal configuration to get started.
Clone the installer repository:
Create a variables.tfvars file with the following content:
Initialize the Terraform environment:
Apply the configuration:
Retrieve the output values from the Terraform state. Retain these values for later:
Retrieve the sensitive output values from the Terraform state. Retain these values for later and ensure they are kept in a secure location:
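The file creation, initialization, and apply steps above could be sketched roughly as follows. This is a minimal sketch, not the installer's documented procedure: the bucket name and region are placeholder values, the helper name deploy_healthomics_engine is hypothetical, and passing the file with -var-file is an assumption about how the installer expects its variables. Only the two required variables are set.

```shell
# Write a minimal variables file with the two required settings.
# Both values are placeholders; replace them with your own.
cat > variables.tfvars <<'EOF'
output_bucket_name = "my-healthomics-outputs"
region             = "us-east-1"
EOF

# Hypothetical helper: initialize Terraform, then apply the configuration.
# Run it from the root of the cloned installer repository.
# The body runs in a subshell so that `set -e` stays local to the helper.
deploy_healthomics_engine() (
  set -e
  var_file="${1:-variables.tfvars}"
  terraform init
  terraform apply -var-file="$var_file"
)
```

Calling deploy_healthomics_engine variables.tfvars then runs both Terraform steps and stops at the first failure.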
Creating Resources Directly
If you prefer to use the AWS console or the AWS CLI to create the necessary resources, the easiest way to get started is to follow the AWS HealthOmics Getting Started Guide. In addition to the resources required by HealthOmics, Workbench requires an access key and secret for an IAM User that has the necessary permissions to access the following services:
HealthOmics
CloudWatch
S3
You can create an IAM user by following the Managing Users guide and the Access Key and Secret Key by following the AWS Account and Access Keys guide. Once you have created the user, you will need to assign the necessary permissions to the user using an IAM policy similar to the following:
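As an illustration only, a policy covering the three services listed above might be written to a file like this. The bucket name, account ID, and role ARN are placeholders, the specific actions chosen are assumptions rather than the exact policy Workbench requires, and the iam:PassRole statement is an added assumption based on HealthOmics needing a service role passed to it when a run starts. Tighten the actions and resources to suit your environment.

```shell
# Placeholder values; replace with your own bucket and service role.
BUCKET="my-healthomics-outputs"
ROLE_ARN="arn:aws:iam::123456789012:role/HealthOmicsRole"

# Write an illustrative IAM policy document (assumed permissions).
cat > workbench-user-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "HealthOmicsAccess",
      "Effect": "Allow",
      "Action": "omics:*",
      "Resource": "*"
    },
    {
      "Sid": "ReadRunLogs",
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
        "logs:GetLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Sid": "OutputBucketAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::${BUCKET}", "arn:aws:s3:::${BUCKET}/*"]
    },
    {
      "Sid": "PassServiceRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "${ROLE_ARN}",
      "Condition": {
        "StringEquals": {"iam:PassedToService": "omics.amazonaws.com"}
      }
    }
  ]
}
EOF

# To attach it as an inline policy (needs the AWS CLI and IAM admin rights):
# aws iam put-user-policy --user-name workbench-health-omics \
#   --policy-name HealthOmicsUserPolicy \
#   --policy-document file://workbench-user-policy.json
```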
Configuring AWS HealthOmics as an Engine in Workbench
Once you have set up your AWS environment and configured the necessary permissions, you can add AWS HealthOmics as an engine to your Workbench account. To do this, you will need to provide the following information (in addition to the usual configuration):
Region: The AWS region where the HealthOmics service is deployed.
Access Key ID and Secret Access Key: These are the credentials for the IAM user that has the necessary permissions to access the required AWS services.
Output Bucket Name: The name of the S3 bucket where the outputs of the workflows will be stored.
Role ARN: The ARN of the IAM role that will be assumed by the HealthOmics service to access the required resources.
If you used the installer to set up HealthOmics, you can retrieve the necessary information by running the terraform output command:
Which should return something like the following:
To retrieve the value of the secret access key, you can run:
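A small wrapper along these lines could collect both the regular and the sensitive values in one go. The helper name print_engine_outputs is hypothetical, and the output names are passed as arguments because the installer's actual output names are not reproduced here. Run it from the installer directory that holds the Terraform state.

```shell
# Hypothetical helper: list all outputs, then reveal selected ones by name.
print_engine_outputs() (
  set -e
  # Non-sensitive values print in full; sensitive ones show as <sensitive>.
  terraform output
  # Print each requested output as a bare string, including sensitive ones.
  for name in "$@"; do
    printf '%s = ' "$name"
    terraform output -raw "$name"
    printf '\n'
  done
)
```

Pass the name of the secret access key output as the argument, and avoid redirecting the result into a file that is not stored securely.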
From the AWS HealthOmics Engine Configuration page, ensure all of the general fields are filled out as per the usual configuration.
Fill in the additional fields with the information retrieved from the Terraform output.
Click Save to add the engine to your Workbench account.
Advanced Configuration
Installer Reference
The installer is highly configurable and allows you to customize the resources created for HealthOmics. Configuration is done by modifying values in your variables file.
output_bucket_name (required)
Name of the S3 bucket to store output data in, without the s3:// prefix. This bucket will be created if it does not already exist.
region (required)
AWS region to create the resources in and run workflows from. Since HealthOmics is region-specific, this should match one of the following regions: us-east-1, us-west-2, ap-southeast-1, eu-central-1, eu-west-1, eu-west-2, il-central-1.
additional_buckets
Names of additional S3 buckets that the service role and the generated IAM user are granted read access to. These buckets are assumed to already exist in the same account and region as the HealthOmics service.
workbench_service_account_name
Name of the IAM user that will be generated for Workbench to access the AWS services. This defaults to workbench-health-omics. Change this if you want to use a different name.
health_omics_user_policy_name
The name of the policy to attach to the HealthOmics user. This policy will contain all the permissions needed by the generated IAM user to access AWS services. This defaults to HealthOmicsUserPolicy.
health_omics_service_policy_name
The name of the policy to attach to the HealthOmics service. This policy will contain all the permissions needed by HealthOmics to read from S3 buckets and write to CloudWatch. This defaults to HealthOmicsServicePolicy.
health_omics_role_name
The name of the IAM role to create for the HealthOmics service. This defaults to HealthOmicsRole.
ecr_repositories
A list of ECR repository names to create and attach the appropriate IAM policies to. This is useful if you want to use HealthOmics to run workflows that require Docker images stored in ECR. For more information on configuring ECR repositories and uploading images, see the Configuring Container Images section.
external_ecr_accounts
HealthOmics can be configured to read from ECR repositories in other accounts. This is useful if you want to use a central repository for your Docker images. This list should contain the account IDs of the accounts you want to allow HealthOmics to read from.
Note: This will only allow HealthOmics to pull images from the specified accounts. You will still need to configure the policies on the ECR repositories to allow HealthOmics to access them.
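Putting the options above together, a fuller variables file might look like the sketch below. The value types (plain strings and lists of strings) are assumptions, every value except the documented defaults is a placeholder, and the file name variables.full.tfvars and the helper is_supported_region are hypothetical. The region check simply mirrors the supported list given under region.

```shell
# Return success only for regions named in the supported list above.
is_supported_region() {
  case "$1" in
    us-east-1|us-west-2|ap-southeast-1|eu-central-1|eu-west-1|eu-west-2|il-central-1)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

REGION="eu-west-1"   # placeholder

if ! is_supported_region "$REGION"; then
  echo "Region $REGION is not in the supported list" >&2
else
  # Assumed types: strings for names, lists of strings for the rest.
  cat > variables.full.tfvars <<EOF
output_bucket_name               = "my-healthomics-outputs"
region                           = "${REGION}"
additional_buckets               = ["my-reference-data"]
workbench_service_account_name   = "workbench-health-omics"
health_omics_user_policy_name    = "HealthOmicsUserPolicy"
health_omics_service_policy_name = "HealthOmicsServicePolicy"
health_omics_role_name           = "HealthOmicsRole"
ecr_repositories                 = ["ubuntu"]
external_ecr_accounts            = ["111122223333"]
EOF
fi
```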
Configuring Container Images
HealthOmics uses container images to run workflows. All images must be located in a private container registry in the same region as the HealthOmics service. You can use Amazon Elastic Container Registry (ECR) to store your images.
To upload an image to ECR, you can use the following commands:
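One way to do this, sketched with standard AWS CLI and Docker commands, is shown below. The helper name push_image_to_ecr is hypothetical, the account ID and region in the usage note are placeholders, and the sketch assumes the source image is pulled from a public registry under the same name as the target repository.

```shell
# Hypothetical helper: copy a public image into a private ECR repository.
# Usage: push_image_to_ecr <account-id> <region> <repository> [tag]
push_image_to_ecr() (
  set -e
  account_id="$1"; region="$2"; repo="$3"; tag="${4:-latest}"
  registry="${account_id}.dkr.ecr.${region}.amazonaws.com"

  # Create the repository (this fails if it already exists).
  aws ecr create-repository --repository-name "$repo" --region "$region"

  # Log Docker in to the private registry with a temporary token.
  aws ecr get-login-password --region "$region" |
    docker login --username AWS --password-stdin "$registry"

  # Pull the public image, retag it for ECR, and push it.
  docker pull "${repo}:${tag}"
  docker tag "${repo}:${tag}" "${registry}/${repo}:${tag}"
  docker push "${registry}/${repo}:${tag}"
)
```

For example, push_image_to_ecr 123456789012 us-east-1 ubuntu latest would create an ubuntu repository and push ubuntu:latest into it.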
Once a repository is created, you will need to ensure HealthOmics has the necessary permissions to access the images.
Go to the AWS Console and navigate to the ECR service.
Ensure you are in the correct region.
Under Private registry, select Repositories.
Select the repository you want to grant access to (ubuntu in the example above).
Select the Permissions option from the sidebar.
Click Edit Policy JSON and paste the following content into the editor, then click Save.
Note: If there is already a policy, simply append the statements to the existing JSON.
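A repository policy of the following shape grants the HealthOmics service principal read access to the images. It is offered as a sketch: the statement ID and the file name are arbitrary, and the repository name ubuntu follows the earlier example.

```shell
# Write a repository policy that lets the HealthOmics service pull images.
cat > ecr-healthomics-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowHealthOmicsPull",
      "Effect": "Allow",
      "Principal": {"Service": "omics.amazonaws.com"},
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ]
    }
  ]
}
EOF

# The same policy can be applied from the CLI instead of the console.
# Note that this replaces any existing repository policy:
# aws ecr set-repository-policy --repository-name ubuntu \
#   --policy-text file://ecr-healthomics-policy.json
```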
If you do not want to manually upload images to ECR, you can use the Amazon ECR Helper provided by the HealthOmics team, which allows you to copy images from a public registry to your private ECR.
Each distinct image name is stored in its own ECR repository, so an image with a different name needs a separate repository. New versions of an existing image are pushed to the same repository under a different tag.
Troubleshooting
Workflow fails to submit: ECR Access Denied
When writing a workflow that uses a Docker image stored in ECR, it is common practice to parameterize the image name within the WDL and pass it as an input to the workflow. This lets you switch easily between image repositories and helps keep the workflow cloud- and region-agnostic.
If you are using the Amazon ECR Helper to copy images, or a pull through cache to sync images from upstream repositories, ECR adds a namespace prefix to the image name. For example, if you configure a pull through cache of dockerhub to pull the ubuntu:latest image, your input to the above workflow may look like the following:
Unfortunately, this will not work as expected. HealthOmics performs a preemptive check on any image repository that is passed as an input to the workflow. Because the repository name is actually dockerhub/ubuntu, with dockerhub as the namespace, the check fails and the workflow submission is rejected.
To fix this, you need to break the input into two fields, one for the namespace and one for the image name. This way, you can combine them in the WDL and pass the full repository name to the runtime.
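The split itself can be sketched in shell when preparing the workflow inputs. The helper name split_image_ref and the input keys ecr_namespace and image_name are hypothetical; the workflow would need matching inputs and would join the two values with a slash before passing them to the runtime. The sketch handles only the repository path, not the registry host.

```shell
# Hypothetical helper: split "namespace/image:tag" at the first slash
# and print the two parts as a JSON inputs fragment.
split_image_ref() (
  ref="$1"
  case "$ref" in
    */*) namespace="${ref%%/*}"; image="${ref#*/}" ;;
    *)   namespace=""; image="$ref" ;;   # no namespace prefix present
  esac
  printf '{"ecr_namespace": "%s", "image_name": "%s"}\n' "$namespace" "$image"
)

split_image_ref "dockerhub/ubuntu:latest"
# prints {"ecr_namespace": "dockerhub", "image_name": "ubuntu:latest"}
```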
Image in external account not accessible
If you are trying to run a workflow with an image in an external account, you may encounter this error if the permissions are not set correctly within the current account. In addition to granting the role permissions at the repository level, you need to explicitly grant the HealthOmics role permission to submit requests to external ECR repositories.
The simplest way to do this is to add the external account to the external_ecr_accounts list in the installer script and re-run the installer. This will add the necessary permissions to the HealthOmics role to access the external ECR repositories.
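As a rough sketch of that change, the helper below adds the setting to a variables file and re-applies the configuration. The helper name add_external_ecr_account is hypothetical, the account ID in the usage note is a placeholder, and both the list-of-strings format and the -var-file flag are assumptions about the installer. If the variable is already set, the helper stops so that the existing list can be edited by hand.

```shell
# Hypothetical helper: allow one external account, then re-run the installer.
# Usage: add_external_ecr_account <account-id> [variables-file]
add_external_ecr_account() (
  set -e
  account_id="$1"
  var_file="${2:-variables.tfvars}"

  # Refuse to add a second definition of the same variable.
  if grep -q '^[[:space:]]*external_ecr_accounts' "$var_file"; then
    echo "external_ecr_accounts is already set in $var_file; edit the list by hand" >&2
    exit 1
  fi

  printf 'external_ecr_accounts = ["%s"]\n' "$account_id" >> "$var_file"
  terraform apply -var-file="$var_file"
)
```

For example, add_external_ecr_account 111122223333 appends the account to variables.tfvars and re-applies the Terraform configuration.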