Python Library
DNAstack provides a Python client library called dnastack-client-library 3.1. It can be used to interact with DNAstack from Python scripts and Jupyter notebooks.
Prerequisite
Python 3.8 or newer
pip 21.3 or newer
PowerShell (Windows only)
Installation
See here for instructions on installing the library
Usage
Before we start...
Some functions and methods automatically trigger the authorization process when it is required. However, they may allow anonymous access when the no_auth argument is set to True.
Set up a client factory with Explorer or a service registry
To get started, we retrieve the endpoints from the service registry by specifying the hostname of a service that implements the GA4GH Service Registry API.
In this example, we set up a client factory for Viral AI (Explorer) with the use function.
The use function allows anonymous access when the no_auth argument is set to True.
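A minimal sketch of this step follows. It assumes that the use function can be imported from the top-level dnastack package and that viral.ai is the hostname of Viral AI; neither detail is stated above. The import sits inside the function so that the definition loads even where the library is not installed.

```python
def make_factory(hostname: str = "viral.ai", anonymous: bool = True):
    """Set up a client factory for an Explorer host or a service registry."""
    # Assumption: `use` is exported by the top-level `dnastack` package.
    from dnastack import use

    # no_auth=True requests anonymous access, so no authorization is triggered.
    return use(hostname, no_auth=anonymous)
```

Calling make_factory() contacts the service registry, so it needs network access and an installed copy of the library.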
The factory has two methods:
factory.all() returns the list of dnastack.ServiceEndpoint objects.
factory.get(id: str) instantiates a service client for the requested endpoint.
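To see which IDs can be passed to factory.get, it may help to print what factory.all() returns. The sketch below assumes that each ServiceEndpoint exposes id and type attributes; the stand-in factory and its endpoint IDs are hypothetical and exist only so that the example runs offline.

```python
from types import SimpleNamespace


def describe_endpoints(factory):
    """Return (id, type) pairs for every endpoint the factory knows about."""
    # Assumption: each ServiceEndpoint exposes `id` and `type` attributes.
    return [(endpoint.id, str(endpoint.type)) for endpoint in factory.all()]


# Stand-in factory with hypothetical endpoint IDs, so the sketch runs offline.
fake_factory = SimpleNamespace(
    all=lambda: [
        SimpleNamespace(id="example-collections",
                        type="com.dnastack:collection-service:1.0.0"),
        SimpleNamespace(id="data-connect-ncbi-sra",
                        type="org.ga4gh:data-connect:1.0.0"),
    ]
)

for endpoint_id, endpoint_type in describe_endpoints(fake_factory):
    print(f"{endpoint_id}: {endpoint_type}")
```

With a real factory, pass it to describe_endpoints in place of fake_factory.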
The factory.get method relies on the type property of the ServiceEndpoint object to determine which client class to use. The following examples show how it does that.
It instantiates a dnastack.CollectionServiceClient for:
com.dnastack:collection-service:1.0.0
com.dnastack.explorer:collection-service:1.1.0
It instantiates a dnastack.DataConnectClient for:
org.ga4gh:data-connect:1.0.0
It instantiates a dnastack.DrsClient for:
org.ga4gh:drs:1.1.0
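The lookup described above can be pictured as a table from service type to client class. The following is an illustration only, not the library's internal code: the dictionary and the helper function are hypothetical names, and the class names are kept as strings so that the sketch runs without the library.

```python
# Service types listed above, mapped to the client class that handles them.
CLIENT_CLASS_BY_TYPE = {
    "com.dnastack:collection-service:1.0.0": "CollectionServiceClient",
    "com.dnastack.explorer:collection-service:1.1.0": "CollectionServiceClient",
    "org.ga4gh:data-connect:1.0.0": "DataConnectClient",
    "org.ga4gh:drs:1.1.0": "DrsClient",
}


def client_class_name(service_type: str) -> str:
    """Return the name of the client class for a service type string."""
    try:
        return CLIENT_CLASS_BY_TYPE[service_type]
    except KeyError:
        raise ValueError(
            f"No client class registered for service type {service_type!r}"
        ) from None


print(client_class_name("org.ga4gh:data-connect:1.0.0"))
```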
Interact with Collection Service API
Now that the factory has the endpoint information from the service registry, we can create a client for the collection service.
The list_collections method lists all available collections. In each collection, slugName is the alternative ID of the collection and itemsQuery is the SQL query that selects the items in the collection.
The list_collections method allows anonymous access when the no_auth argument is set to True.
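A short sketch of reading slugName and itemsQuery from the listed collections is given here. The helper name is hypothetical, and the stand-in client with its placeholder query only imitates the shape described above so that the example runs offline; a real client would come from factory.get with the ID of the collection service endpoint.

```python
from types import SimpleNamespace


def summarize_collections(collection_client, anonymous=True):
    """Return the slug name and items query of every available collection."""
    collections = collection_client.list_collections(no_auth=anonymous)
    return [
        {"slugName": c.slugName, "itemsQuery": c.itemsQuery}
        for c in collections
    ]


# Stand-in client; the items query is a placeholder, not a real one.
fake_collection_client = SimpleNamespace(
    list_collections=lambda no_auth=False: [
        SimpleNamespace(slugName="ncbi-sra",
                        itemsQuery="SELECT * FROM example_items"),
    ]
)

summary = summarize_collections(fake_collection_client)
print(summary)
```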
Set up a client for Data Connect Service
In this section, we switch to a Data Connect client.
Suppose that you know which collection you want to work with. Use factory to get the Data Connect client for that collection, where the service ID is data-connect-<collection.slugName>.
For example, if the collection is ncbi-sra, the service ID of the corresponding Data Connect service is data-connect-ncbi-sra.
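The naming rule can be captured in a small helper, sketched here. Both function names are hypothetical, and the stand-in factory simply echoes the requested ID so that the example runs without the library.

```python
from types import SimpleNamespace


def data_connect_service_id(slug_name: str) -> str:
    """Build the Data Connect service ID for a collection slug name."""
    return f"data-connect-{slug_name}"


def get_data_connect_client(factory, slug_name: str):
    """Ask the factory for the Data Connect client of one collection."""
    return factory.get(data_connect_service_id(slug_name))


# Stand-in factory that returns the requested ID instead of a real client.
echo_factory = SimpleNamespace(get=lambda service_id: service_id)

print(get_data_connect_client(echo_factory, "ncbi-sra"))
```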
List all accessible tables
Before we can run a query, we need to get the list of available tables (dnastack.client.data_connect.TableInfo objects) with the list_tables method.
The name property of each TableInfo object in the list is the name of the table that we can use in the query.
Note
Depending on the implementation of the /tables endpoint, a TableInfo object in the list may be incomplete. For example, the data model (data_model) may contain only the reference URL instead of an object schema. To get more complete information, use Table, which is described in the next section.
The list_tables method allows anonymous access when the no_auth argument is set to True.
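As a sketch, collecting the table names from list_tables could look like the following. The helper name and the table names in the stand-in client are hypothetical; the stand-in is there only so that the example runs offline.

```python
from types import SimpleNamespace


def table_names(data_connect_client, anonymous=True):
    """Return the name of every table the client can see."""
    tables = data_connect_client.list_tables(no_auth=anonymous)
    return [info.name for info in tables]


# Stand-in client that returns TableInfo-like objects with a `name` property.
fake_tables_client = SimpleNamespace(
    list_tables=lambda no_auth=False: [
        SimpleNamespace(name="example_variants"),
        SimpleNamespace(name="example_samples"),
    ]
)

names = table_names(fake_tables_client)
print(names)
```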
Get the table information and data
To get started, we use the table method, which returns a table wrapper object (dnastack.client.data_connect.Table). In this example, we use the first table available.
The table method also accepts a string, which it treats as the name of the table.
A Table object also has the name property, which is the table name (same as TableInfo.name). In addition, it provides two properties:
The info property provides more complete table information as a TableInfo object.
The data property provides an iterator over the actual table data.
The table method allows anonymous access when the no_auth argument is set to True.
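Because data is an iterator, a preview can stop after a few rows instead of reading the whole table. The sketch below assumes that the table method accepts the table name as its first argument together with no_auth, as described above; the helper name and the stand-in objects are hypothetical.

```python
from itertools import islice
from types import SimpleNamespace


def preview_table(data_connect_client, table_name, limit=5, anonymous=True):
    """Return the table information and at most `limit` rows of its data."""
    table = data_connect_client.table(table_name, no_auth=anonymous)
    # islice stops early, so the iterator is not consumed past `limit` rows.
    return table.info, list(islice(table.data, limit))


# Stand-in Table-like object and client, so the sketch runs offline.
fake_table = SimpleNamespace(
    name="example_table",
    info=SimpleNamespace(name="example_table"),
    data=iter([{"id": 1}, {"id": 2}, {"id": 3}]),
)
fake_table_client = SimpleNamespace(
    table=lambda table_name, no_auth=False: fake_table
)

info, rows = preview_table(fake_table_client, "example_table", limit=2)
print(info.name, rows)
```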
Integrate a Table object with pandas.DataFrame
You can instantiate a pandas.DataFrame object from a Table object.
Query data
Now, suppose we select up to 10 rows from the first table.
The query method returns an iterator over the result, where each item is a dictionary that maps a string key to a value of any type.
The query method allows anonymous access when the no_auth argument is set to True.
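One way to express the "up to 10 rows" query is sketched here. It assumes that the query method takes the SQL text as its first argument; the helper name, the table name, and the stand-in client are hypothetical.

```python
from types import SimpleNamespace


def first_rows(data_connect_client, table_name, limit=10, anonymous=True):
    """Select up to `limit` rows from a table and return them as a list."""
    # The table name is interpolated into the SQL text, so it should come
    # from list_tables rather than from untrusted input.
    sql = f"SELECT * FROM {table_name} LIMIT {int(limit)}"
    return list(data_connect_client.query(sql, no_auth=anonymous))


issued_queries = []


def fake_query(sql, no_auth=False):
    """Stand-in for the query method: record the SQL and return two rows."""
    issued_queries.append(sql)
    return iter([{"run": "r1"}, {"run": "r2"}])


fake_query_client = SimpleNamespace(query=fake_query)
result_rows = first_rows(fake_query_client, "example_table")
print(issued_queries[0])
print(result_rows)
```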
Integrate the query result (iterator) with pandas.DataFrame
You can instantiate a pandas.DataFrame object from the iterator that the query method returns.
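A possible sketch of the conversion is shown next. It first materializes the row dictionaries into a list, which also makes it easy to cap the number of rows, and then hands the list to pandas. Both helper names are hypothetical, and pandas is imported lazily so that the first helper works without it.

```python
from itertools import islice


def rows_to_records(rows, limit=None):
    """Materialize an iterator of row dictionaries into a list."""
    # With limit=None, islice yields every row.
    return list(islice(rows, limit))


def rows_to_dataframe(rows, limit=None):
    """Build a pandas.DataFrame from an iterator of row dictionaries."""
    import pandas as pd  # third-party dependency, imported only when needed

    return pd.DataFrame(rows_to_records(rows, limit))


sample_rows = iter([{"run": "r1", "bases": 10}, {"run": "r2", "bases": 20}])
records = rows_to_records(sample_rows, limit=1)
print(records)
```

In practice the rows argument would be the iterator returned by the query method.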
Download blobs with DRS API
To download a blob, first find the blobs in a collection that you have access to. To get the list of available blob items, run the collection's items query with a Data Connect client.
In this example, suppose that the first collection has blobs. We would like to get the first 20 blobs.
Tip
The result of the items query may contain both “table” and “blob” items. You may want to filter them.
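The filtering could be done as in the sketch below. It assumes that each item in the query result is a dictionary whose type key holds either "table" or "blob"; the key name is an assumption, since the column layout is not given here, and the helper name and sample items are hypothetical.

```python
def split_items(items, kind_key="type"):
    """Separate items into blobs and tables based on their kind."""
    blobs, tables = [], []
    for item in items:
        kind = item.get(kind_key)
        if kind == "blob":
            blobs.append(item)
        elif kind == "table":
            tables.append(item)
        # Items of any other kind are ignored.
    return blobs, tables


sample_items = [
    {"id": "a1", "type": "blob"},
    {"id": "t1", "type": "table"},
    {"id": "a2", "type": "blob"},
]
blob_items, table_items = split_items(sample_items)
print([item["id"] for item in blob_items])
```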
Here is how to get a blob object.
Tip
If you have an external DRS URL, you can use it by setting the url parameter instead of id.
If the endpoint is publicly accessible, you can set no_auth to True to ensure that the client never initiates the authentication procedure.
To download the blob data, read the data property of the blob object, which returns a byte array.
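Once a blob object is in hand, its bytes can be written to a local file. The sketch below relies only on the data property described above; the helper name, the file name, and the stand-in blob are hypothetical.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_blob(blob, destination):
    """Write the bytes of a blob object to a file and return the byte count."""
    payload = bytes(blob.data)  # `data` returns a byte array
    Path(destination).write_bytes(payload)
    return len(payload)


# Stand-in blob with five bytes of content, so the sketch runs offline.
fake_blob = SimpleNamespace(data=bytearray(b"ACGT\n"))

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example.bin"
    written = save_blob(fake_blob, target)
    saved = target.read_bytes()

print(written)
```

Reading data loads the whole blob into memory, so this approach suits blobs that fit comfortably in RAM.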
Integrate Blob objects with pandas.DataFrame
You can instantiate a pandas.DataFrame object from the access URL of a blob, which blob.get_download_url() returns.