Python Library


DNAstack provides a Python client library called dnastack-client-library 3.1, which can be used to interact with DNAstack services from Python scripts and Jupyter notebooks.

Prerequisites

  • Python 3.8 or newer

  • pip 21.3 or newer

Only for Windows

  • PowerShell

Installation

See the installation guide for instructions on installing the library.
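
Assuming the package is published on PyPI under the name given in the introduction, installation is typically a single pip command:

pip install -U dnastack-client-library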

Usage

Before we start...

Generally, functions and methods that require authorization trigger the authorization process automatically. However, many of them also allow anonymous access when the no_auth argument is set to True.

Set up a client factory with Explorer or a service registry

To get started, we retrieve the service endpoints from the service registry by specifying the hostname of a service that implements the GA4GH Service Registry API.

In this example, we set up a client factory for Viral AI (Explorer) using the use function.

from dnastack import use

factory = use('viral.ai')

The use function allows anonymous access when the no_auth argument is set to True. For example:

factory = use('viral.ai', no_auth=True)

The factory has two methods:

  • factory.all() returns the list of dnastack.ServiceEndpoint objects,

  • factory.get(id: str) instantiates a service client for the requested endpoint.

The factory.get method relies on the type property of the ServiceEndpoint object to determine which client class to use; a sketch for inspecting the discovered endpoints follows the list below.

It will instantiate a dnastack.CollectionServiceClient for:

  • com.dnastack:collection-service:1.0.0

  • com.dnastack.explorer:collection-service:1.1.0

It will instantiate a dnastack.DataConnectClient for:

  • org.ga4gh:data-connect:1.0.0

It will instantiate a dnastack.DrsClient for:

  • org.ga4gh:drs:1.1.0
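
To see which endpoints a factory has discovered, you can iterate over factory.all(). A minimal sketch, assuming each ServiceEndpoint exposes the id and type properties described above:

# Print every endpoint advertised by the registry, along with the
# type that factory.get uses to pick a client class.
for endpoint in factory.all():
    print(endpoint.id, endpoint.type)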

Interact with Collection Service API

Now that the factory has the endpoint information from the service registry, we can create a client for the Collection Service.

from dnastack import CollectionServiceClient
collection_service_client = factory.get_one_of(client_class=CollectionServiceClient)

Here is how to list all available collections:

import json

collections = collection_service_client.list_collections()

print(json.dumps(
    [
        {
            'id': c.id,
            'slugName': c.slugName,
            'itemsQuery': c.itemsQuery,
        }
        for c in collections
    ],
    indent=2
))

where slugName is the alternative ID of a collection and itemsQuery is the SQL query that defines the items in the collection.

The list_collections method allows anonymous access by setting the no_auth argument to True.
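
For example, a minimal sketch of listing collections anonymously:

# Anonymous listing; no authorization flow is triggered.
collections = collection_service_client.list_collections(no_auth=True)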

Set up a client for Data Connect Service

In this section, we switch to a Data Connect client.

Suppose you know which collection you want to work with. Use the factory to get the Data Connect client for that collection, where the service ID is data-connect-<collection.slugName>.

from dnastack import DataConnectClient

data_connect_client: DataConnectClient = factory.get('data-connect-<collection.slugName>')

For example, if the collection is ncbi-sra, it will look like this.

data_connect_client: DataConnectClient = factory.get('data-connect-ncbi-sra')

where data-connect-ncbi-sra is the service ID of the Data Connect service corresponding to the collection.
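
If you already hold a Collection object from the previous section, you can derive the service ID from its slugName. A minimal sketch, assuming the naming convention described above:

# Build the Data Connect service ID from the collection's slug name.
collection = collections[0]
data_connect_client: DataConnectClient = factory.get(f'data-connect-{collection.slugName}')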

List all accessible tables

Before we can run a query, we need to get the list of available tables (dnastack.client.data_connect.TableInfo objects).

tables = data_connect_client.list_tables()

print(json.dumps(
    [
        dict(
            name=table.name
        )
        for table in tables
    ],
    indent=2
))

where the name property of each item (TableInfo object) in tables is the name of the table that we can use in the query.

Note

Depending on the implementation of the /tables endpoint, the TableInfo objects in the list may be incomplete; for example, the data model (data_model) may contain only a reference URL instead of an object schema. To get more complete information, use the Table wrapper described in the next section.

The list_tables method allows anonymous access by setting the no_auth argument to True.

Get the table information and data

To get started, we need to use the table method, which returns a table wrapper object (dnastack.client.data_connect.Table). In this example, we use the first table available.

table = data_connect_client.table(tables[0])

The table method also accepts a string, which it treats as the table name, e.g.,

table = data_connect_client.table(tables[0].name)

or

table = data_connect_client.table('cat.sch.tbl')

A Table object also has a name property, which is the table name (the same as TableInfo.name). In addition, it provides two properties:

  • The info property provides more complete table information as a TableInfo object,

  • The data property provides an iterator over the actual table data.

The table method allows anonymous access by setting the no_auth argument to True.
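
For example, here is a minimal sketch that inspects the complete data model and peeks at the first row, assuming info exposes the data_model field mentioned in the note above:

# The Table wrapper provides the full schema via its info property.
print(json.dumps(table.info.data_model, indent=2))

# table.data is an iterator; pull a single row to inspect its shape.
first_row = next(iter(table.data))
print(first_row)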

Integrate a Table object with pandas.DataFrame

You can easily instantiate a pandas.DataFrame object as shown below:

import pandas

csv_df = pandas.DataFrame(table.data)

where table is a Table object.

Query data

Now, let’s say we will select up to 10 rows from the first table.

result_iterator = data_connect_client.query(f'SELECT * FROM {table.name} LIMIT 10')

The query method returns an iterator over the result, where each item is a string-to-anything dictionary.
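
For instance, a minimal sketch that iterates over the rows directly:

# Each row is a plain dictionary keyed by column name.
for row in data_connect_client.query(f'SELECT * FROM {table.name} LIMIT 10'):
    print(row)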

The query method allows anonymous access by setting the no_auth argument to True.

Integrate the query result (iterator) with pandas.DataFrame

You can instantiate a pandas.DataFrame object as shown below. Note that the DataFrame constructor consumes the iterator, so rerun the query if you need the results again:

import pandas

csv_df = pandas.DataFrame(result_iterator)

Download blobs with DRS API

To download a blob, you first need to find the blobs that you have access to in a collection. To get the list of available blob items, run the collection's items query with a Data Connect client.

In this example, suppose that the ncbi-sra collection contains blobs and we want the first 20 items.

blob_collection = [c for c in collections if c.slugName == 'ncbi-sra'][0]
items = [i
         for i in data_connect_client.query(blob_collection.itemsQuery + ' LIMIT 20')
         if i['type'] == 'blob']

Tip

The items query may return both “table” and “blob” items, so you may want to filter the results, as shown above.

Here is how to get a blob object.

from dnastack import DrsClient

drs_client: DrsClient = factory.get("drs")
blob = drs_client.get_blob(items[0]['id'])

Tip

If you have an external DRS URL, you can use it by setting the url parameter instead of id. For example:

blob = drs_client.get_blob('drs://viral.ai/fmyfkmy1230-3rhbfa8weyf')

If the endpoint is publicly accessible, you can set no_auth to True to ensure that the client never initiates the authentication procedure.

blob = drs_client.get_blob(..., no_auth=True)

Here is how to download the blob data.

blob.data

where the data property returns the blob content as a byte array.
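
If you want to keep the content, you can write it to a local file (blob_output.bin is a hypothetical file name):

# blob.data is a byte array, so open the file in binary mode.
with open('blob_output.bin', 'wb') as f:
    f.write(blob.data)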

Integrate Blob objects with pandas.DataFrame

You can easily instantiate a pandas.DataFrame object as shown below:

import pandas

csv_df = pandas.read_csv(blob.get_download_url())

where blob.get_download_url() returns the access URL.
