On Amazon Web Services HealthOmics

AWS HealthOmics is a cloud-based service that provides a secure, compliant, and scalable platform for storing, analyzing, and sharing genomic data. It is built on Amazon Web Services (AWS) and is aimed at researchers, clinicians, and other stakeholders who need to work with genomic data at scale.

Workbench supports running workflows on AWS HealthOmics. This guide provides detailed instructions on setting up an engine for AWS HealthOmics in Workbench.

Setting up Your AWS Environment

To use AWS HealthOmics as an engine in Workbench, you need to set up an AWS account and configure the necessary permissions for it to access the required AWS services. Additionally, HealthOmics itself requires some configuration prior to its first use, to grant the service the necessary permissions to access your resources. For an in-depth guide on HealthOmics, please refer to the official user guide.

There are multiple ways to configure your account to use AWS HealthOmics as an engine in Workbench. You can either create the required resources manually through the console or the AWS CLI, or you can use the Terraform engine installer provided by DNAstack. The simplest way to get started is to use the installer, which will create the necessary resources for you.

Using the Installer

For complete instructions including all of the configuration options available, see the installer README. The script has a number of defaults but requires minimal configuration to get started.

  1. Clone the installer repository:

git clone https://github.com/DNAstack/aws-healthomics-engine-installer.git
cd aws-healthomics-engine-installer

  2. Create a variables.tfvars file with the following content:

output_bucket_name = "my-healthomics-bucket"
region             = "ap-southeast-1"

  3. Initialize the Terraform environment:

terraform init

  4. Apply the configuration:

terraform apply -var-file=variables.tfvars

  5. Retrieve the output values from the Terraform state. Retain these values for later:

terraform output

  6. Retrieve the sensitive output values from the Terraform state. Retain these values for later and ensure they are kept in a secure location:

terraform output --raw secret_access_key

Creating Resources Directly

If you prefer to use the AWS console or the AWS CLI to create the necessary resources, the easiest way to get started is to follow the AWS HealthOmics Getting Started Guide. In addition to the resources required by HealthOmics, Workbench requires an access key and secret for an IAM User that has the necessary permissions to access the following services:

  • HealthOmics

  • CloudWatch

  • S3

You can create an IAM user by following the Managing Users guide, and create its access key and secret key by following the AWS Account and Access Keys guide. Once you have created the user, assign it the necessary permissions using an IAM policy similar to the following:

{
  "Statement": [
    {
      "Action": "iam:PassRole",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "omics.amazonaws.com"
        }
      },
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "AllowPassRole"
    },
    {
      "Action": "omics:*",
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "AllowOmicsActions"
    },
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::${bucket_name}/*",
        "arn:aws:s3:::${bucket_name}"
      ],
      "Sid": "AllowS3ReadOnlyAccess"
    },
    {
      "Action": [
        "logs:GetLogEvents",
        "logs:DescribeLogStreams"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:logs:${region}:${account_id}:log-group:/aws/omics/*",
      "Sid": "AllowReadLogs"
    }
  ],
  "Version": "2012-10-17"
}
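The policy above contains three placeholders: ${bucket_name}, ${region}, and ${account_id}. As a minimal sketch, the following Python helper fills them in and returns the policy as a dictionary. The function name build_workbench_user_policy is hypothetical and not part of Workbench or the installer; the statements are copied from the policy above.

```python
import json


def build_workbench_user_policy(bucket_name: str, region: str, account_id: str) -> dict:
    """Return the IAM user policy from this guide with its placeholders filled in.

    Hypothetical helper for illustration; it only builds the document and
    does not call AWS.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPassRole",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"iam:PassedToService": "omics.amazonaws.com"}
                },
            },
            {
                "Sid": "AllowOmicsActions",
                "Effect": "Allow",
                "Action": "omics:*",
                "Resource": "*",
            },
            {
                "Sid": "AllowS3ReadOnlyAccess",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject"],
                # Object-level and bucket-level ARNs, as in the policy above.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            },
            {
                "Sid": "AllowReadLogs",
                "Effect": "Allow",
                "Action": ["logs:GetLogEvents", "logs:DescribeLogStreams"],
                "Resource": f"arn:aws:logs:{region}:{account_id}:log-group:/aws/omics/*",
            },
        ],
    }


if __name__ == "__main__":
    # Example values only; substitute your own bucket, region, and account ID.
    policy = build_workbench_user_policy(
        "my-healthomics-bucket", "ap-southeast-1", "123456789012"
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor. Treat it as a starting point and review it against your own security requirements before attaching it to a user.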

Configuring AWS HealthOmics as an Engine in Workbench

Once you have set up your AWS environment and configured the necessary permissions, you can add AWS HealthOmics as an engine to your Workbench account. To do this, you will need to provide the following information (in addition to the usual configuration):

  • Region: The AWS region where the HealthOmics service is deployed.

  • Access Key ID and Secret Access Key: These are the credentials for the IAM user that has the necessary permissions to access the required AWS services.

  • Output Bucket Name: The name of the S3 bucket where the outputs of the workflows will be stored.

  • Role ARN: The ARN of the IAM role that will be assumed by the HealthOmics service to access the required resources.

If you used the installer to set up HealthOmics, you can retrieve the necessary information by running the terraform output command:

terraform output

This should return output similar to the following:

access_key_id = "AK2813KA123MD01"
output_bucket = "my-healthomics-bucket"
role_arn = "arn:aws:iam::123456789:role/HealthOmicsRole"
secret_access_key = <sensitive>

To retrieve the value of the secret access key, you can run:

terraform output --raw secret_access_key

  1. From the AWS HealthOmics Engine Configuration page, ensure all of the general fields are filled out as per the usual configuration.

  2. Fill in the additional fields with the information retrieved from the Terraform output.

  3. Click Save to add the engine to your Workbench account.
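If you would rather collect the engine fields in a script than copy them by hand, Terraform can also emit its outputs as JSON with terraform output -json. The sketch below parses that JSON and returns the four values shown in the sample output above. The function name workbench_fields_from_terraform is hypothetical, and the output names are assumed to match the sample in this guide. Note that the JSON form includes the secret access key in plain text, so handle it accordingly.

```python
import json


def workbench_fields_from_terraform(output_json: str) -> dict:
    """Extract the Workbench engine fields from `terraform output -json` text.

    Assumes the installer exposes the output names shown in this guide.
    The region is not among those outputs, so take it from your variables file.
    """
    outputs = json.loads(output_json)
    wanted = ("access_key_id", "secret_access_key", "output_bucket", "role_arn")
    missing = [name for name in wanted if name not in outputs]
    if missing:
        raise KeyError(f"missing Terraform outputs: {', '.join(missing)}")
    # Each entry in the JSON is an object whose "value" key holds the output.
    return {name: outputs[name]["value"] for name in wanted}
```

One way to use it is to run terraform output -json > outputs.json inside the installer directory, read the file, pass its contents to the function, and then delete the file once the engine has been saved.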

Advanced Configuration

Installer Reference

The installer is highly configurable and allows you to customize the resources created for HealthOmics. Configuration is done by modifying values in your variables file.

output_bucket_name (required)

Name of the S3 bucket to store output data in, without the s3:// prefix. This bucket will be created if it does not already exist.

output_bucket_name = "my-healthomics-bucket"

region (required)

AWS region to create the resources in and run workflows from. Since HealthOmics is region-specific, this should match one of the following regions: us-east-1, us-west-2, ap-southeast-1, eu-central-1, eu-west-1, eu-west-2, il-central-1.

region = "ap-southeast-1"
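A typo in the region only surfaces once Terraform starts creating resources, so it can be worth checking the value up front. The following is a minimal sketch; the name check_region is hypothetical, and the set of regions is copied from the list above, which may lag behind what AWS currently supports.

```python
# Regions copied from the list in this guide; AWS may have added more since.
SUPPORTED_REGIONS = frozenset(
    {
        "us-east-1",
        "us-west-2",
        "ap-southeast-1",
        "eu-central-1",
        "eu-west-1",
        "eu-west-2",
        "il-central-1",
    }
)


def check_region(region: str) -> str:
    """Return the region unchanged, or raise if it is not in the list above."""
    if region not in SUPPORTED_REGIONS:
        raise ValueError(
            f"{region!r} is not in the supported list: {sorted(SUPPORTED_REGIONS)}"
        )
    return region
```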

additional_buckets

Names of additional S3 buckets that the service role and the generated IAM user are allowed to read from. It is assumed that these buckets already exist and are in the same account and region as the HealthOmics service.

additional_buckets = ["my-second-bucket", "my-third-bucket"]

workbench_service_account_name

Name of the IAM user that will be generated for Workbench to access the AWS services. This defaults to workbench-health-omics. Change this if you want to use a different name.

workbench_service_account_name = "my-workbench-user"

health_omics_user_policy_name

The name of the policy to attach to the HealthOmics user. This policy will contain all the permissions needed by the generated IAM user to access AWS services. This defaults to HealthOmicsUserPolicy.

health_omics_user_policy_name = "MyHealthOmicsUserPolicy"

health_omics_service_policy_name

The name of the policy to attach to the HealthOmics service. This policy will contain all the permissions needed by HealthOmics to read from S3 buckets and write to CloudWatch. This defaults to HealthOmicsServicePolicy.

health_omics_service_policy_name = "MyHealthOmicsServicePolicy"

health_omics_role_name

The name of the IAM role to create for the HealthOmics service. This defaults to HealthOmicsRole.


health_omics_role_name = "MyHealthOmicsRole"

ecr_repositories

A list of ECR repository names to create and attach the appropriate IAM policies to. This is useful if you want to use HealthOmics to run workflows that require Docker images stored in ECR. For more information on configuring ECR repositories and uploading images, see the Configuring Container Images section.

ecr_repositories = ["ubuntu", "my-custom-image"]

external_ecr_accounts

HealthOmics can be configured to read from ECR repositories in other accounts. This is useful if you want to use a central repository for your Docker images. This list should contain the account IDs of the accounts you want to allow HealthOmics to read from.

Note:

This will only allow HealthOmics to pull images from the specified accounts. You will still need to configure the policies on the ECR repositories to allow HealthOmics to access them.

external_ecr_accounts = ["123456789012", "2222213132314"]
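When several of the options in this reference are set at once, the variables file can be generated rather than written by hand. The sketch below renders string and list-of-string values in the same name = value form used throughout this reference. The function name render_tfvars is hypothetical, and it assumes plain values without characters that need HCL-specific escaping.

```python
import json


def render_tfvars(values: dict) -> str:
    """Render string and list-of-string settings as tfvars assignments.

    JSON string and array literals are used for the right-hand side, which
    matches the form of the examples in this reference for simple values.
    """
    width = max((len(name) for name in values), default=0)
    lines = []
    for name, value in values.items():
        is_string_list = isinstance(value, list) and all(
            isinstance(item, str) for item in value
        )
        if not (isinstance(value, str) or is_string_list):
            raise TypeError(f"{name}: expected a string or a list of strings")
        # Pad names so the equals signs line up, as in variables.tfvars above.
        lines.append(f"{name.ljust(width)} = {json.dumps(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_tfvars(
            {
                "output_bucket_name": "my-healthomics-bucket",
                "region": "ap-southeast-1",
                "ecr_repositories": ["ubuntu", "my-custom-image"],
            }
        ),
        end="",
    )
```

Write the returned text to variables.tfvars and pass it to terraform apply with -var-file as described in the installer steps.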

Configuring Container Images

HealthOmics uses container images to run workflows. All images must be located in a private container registry in the same region as the HealthOmics service. You can use Amazon Elastic Container Registry (ECR) to store your images.

To upload an image to ECR, you can use the following commands:

aws ecr create-repository --repository-name ubuntu --region ap-southeast-1
aws ecr get-login-password --region ap-southeast-1 | docker login --username AWS --password-stdin 123456789.dkr.ecr.ap-southeast-1.amazonaws.com
docker tag ubuntu:latest 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/ubuntu:latest
docker push 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/ubuntu:latest
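The commands above repeat the same registry host and image URI several times. As a small sketch of how those strings are composed, the helper below builds them from the account ID, region, repository, and tag. The names ecr_registry_host and ecr_image_uri are hypothetical; the URI layout follows the commands above.

```python
def ecr_registry_host(account_id: str, region: str) -> str:
    """Return the private ECR registry host for an account and region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Return the full image URI used with `docker tag` and `docker push`."""
    return f"{ecr_registry_host(account_id, region)}/{repository}:{tag}"
```

The value returned by ecr_image_uri is also what a workflow would reference in its runtime docker attribute.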

Once a repository is created, you will need to ensure HealthOmics has the necessary permissions to access the images.

  1. Go to the AWS Console and navigate to the ECR service.

  2. Ensure you are in the correct region.

  3. Under Private registry, select Repositories.

  4. Select the repository you want to grant access to (ubuntu in the example above).

  5. Select the Permissions option from the sidebar.

  6. Click Edit Policy JSON and paste the following content into the editor, then click Save.

    • Note: If there is already a policy, simply append the statements to the existing JSON.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OmicsAccessPrincipal",
      "Effect": "Allow",
      "Principal": {
        "Service": "omics.amazonaws.com"
      },
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ]
    }
  ]
}
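The note about appending to an existing policy can be sketched in Python as follows. The helper add_omics_statement is hypothetical; it takes an existing repository policy (or nothing) and returns a copy that contains the OmicsAccessPrincipal statement exactly once, so running it twice does not duplicate the statement.

```python
import copy
from typing import Optional

# The statement from the policy above that grants HealthOmics pull access.
OMICS_STATEMENT = {
    "Sid": "OmicsAccessPrincipal",
    "Effect": "Allow",
    "Principal": {"Service": "omics.amazonaws.com"},
    "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
    ],
}


def add_omics_statement(existing_policy: Optional[dict] = None) -> dict:
    """Return a repository policy that includes the HealthOmics statement.

    The input is never modified; a deep copy is made first.
    """
    policy = copy.deepcopy(existing_policy) if existing_policy else {"Version": "2012-10-17"}
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may hold a single statement object instead of a list.
        statements = [statements]
    if not any(s.get("Sid") == OMICS_STATEMENT["Sid"] for s in statements):
        statements = statements + [copy.deepcopy(OMICS_STATEMENT)]
    policy["Statement"] = statements
    return policy
```

The result can be serialized with json.dumps and pasted into the Edit Policy JSON editor described in the steps above.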

If you do not want to manually upload images to ECR, you can use the Amazon ECR Helper provided by the HealthOmics team, which allows you to copy images from a public registry to your private ECR.

Each image requires its own repository in ECR. For example, if you want to upload a different image, you must first create a new repository for it; new versions of the same image are stored as additional tags in the existing repository.

Troubleshooting

Workflow fails to submit: ECR Access Denied

ECR access denied (omics.amazonaws.com): 123124123123123.dkr.ecr.us-east-1.amazonaws.com/dockerhub

When writing a workflow that uses a Docker image stored in ECR, it is common practice to parameterize the image name within the WDL and pass it as an input to the workflow. This allows you to easily switch between different image repositories and helps keep the workflow cloud- and region-agnostic.

task echo {
    input {
        String image_repository
    }
    command <<< echo "Hello, World!" >>>
    
    runtime {
        docker: "~{image_repository}/ubuntu:latest"
    }
}
  
workflow say_hello {
  input {
    String image_repository
  }
  call echo {
      input: image_repository = image_repository
  }
}

If you are using the Amazon ECR Helper to copy images, or a pull through cache to sync images from upstream repositories, ECR adds a namespace prefix to the image name. For example, if you configure a pull through cache of dockerhub to pull the ubuntu:latest image, your input to the above workflow may look like the following:

{
  "inputs": {
    "say_hello.image_repository": "123124123123123.dkr.ecr.us-east-1.amazonaws.com/dockerhub"
  }
}

Unfortunately, this will not work as expected. HealthOmics performs a preemptive check on any image repository that is passed as an input to the workflow. Since the repository name is actually dockerhub/ubuntu and dockerhub is only the namespace, the check fails and the workflow submission is rejected.

To fix this, you need to break the input into two fields, one for the registry and one for the namespace. This way, you can combine them in the WDL and pass the full repository name to the runtime.

workflow say_hello {
  input {
    String registry_name
    String? namespace
  }
  
  String image_repository = if defined(namespace) then "${registry_name}/${namespace}" else registry_name 
  
  call echo {
      input: image_repository = image_repository
  }
}
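To see which part of an image URI is the registry, which is the namespace, and which is the image name, the following Python sketch splits a URI into those pieces. The names EcrImage and split_ecr_image are hypothetical, and the parser assumes tag-style references (not digests) of the form registry/[namespace/]name[:tag].

```python
from typing import NamedTuple, Optional


class EcrImage(NamedTuple):
    registry: str
    namespace: Optional[str]
    name: str
    tag: Optional[str]


def split_ecr_image(uri: str) -> EcrImage:
    """Split 'registry/[namespace/]name[:tag]' into its components."""
    registry, sep, path = uri.partition("/")
    if not sep or not path:
        raise ValueError(f"{uri!r} has no repository path after the registry host")
    # Everything after the first colon in the path is treated as the tag.
    path, _, tag = path.partition(":")
    # The last path segment is the image name; anything before it is the namespace.
    namespace, _, name = path.rpartition("/")
    return EcrImage(registry, namespace or None, name, tag or None)
```

For the pull through cache example, the URI 123124123123123.dkr.ecr.us-east-1.amazonaws.com/dockerhub/ubuntu:latest splits into the registry host, the namespace dockerhub, the name ubuntu, and the tag latest. The registry and namespace are the two values to supply as separate workflow inputs.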

Image in external account not accessible

 ERROR ECR image 123124123123123.dkr.ecr.us-east-1.amazonaws.com/ubuntu:latest is not accessible, not in the same account

If you are trying to run a workflow with an image in an external account, you may encounter the above error if the permissions are not set correctly WITHIN the current account. You need to explicitly grant the HealthOmics role permissions to submit requests to external ECR repositories in addition to granting the role permissions at the repository level.

The simplest way to do this is to add the external account ID to the external_ecr_accounts list in your variables file and re-run the installer. This will add the necessary permissions to the HealthOmics role to access the external ECR repositories.

© DNAstack. All rights reserved.