Skip to content

Latest commit

 

History

History
137 lines (100 loc) · 6.85 KB

Getting_Started.md

File metadata and controls

137 lines (100 loc) · 6.85 KB

Getting Started

The GDC Application Programming Interface (API): An Overview

The GDC API drives the GDC Data and Submission Portals and provides programmatic access to GDC functionality. This includes searching for, downloading, and submitting data and metadata. The GDC API uses JSON as its communication format, and standard HTTP methods like GET, PUT, POST and DELETE.

This guide explains how to construct and execute API requests and interpret API responses.

Tools for communicating with the GDC API

Many third-party tools can be used for communicating with the GDC API and for preparing and visualizing API calls.

Examples of tools for communicating with the GDC API:

Tool Type
Curl Command line tool
HTTPie Command line tool
Postman REST Client App for Google Chrome and OS X
DHC REST Client Google Chrome extension
Google Chrome Google Chrome web browser

Examples of tools that can help build GDC API calls:

Tool Description
JSONLint Validate JSON
JSON Formatter Format, validate, and convert JSON to other formats
Percent-(URL)-encoding tool Tool for percent-encoding strings
JSON escape tool Tool for escaping strings using JSON string rules

API Endpoints

Communicating with the GDC API involves making calls to API endpoints. Each GDC API endpoint represents specific API functionality, as summarized in the following table:

Endpoint Type Description
status Status Get the API status and version information
projects Search & Retrieval Search all data generated by a project
cases Search & Retrieval Find all files related to a specific case, or sample donor.
files Search & Retrieval Find all files with specific characteristics such as file_name, md5sum, data_format and others.
annotations Search & Retrieval Search annotations added to data after curation
data Download Used to download GDC data
manifest Download Generates manifests for use with GDC Data Transfer Tool
slicing BAM Slicing Allows remote slicing of BAM format objects
submission Submission Returns the available resources at the top level above programs i.e., registered programs

The HTTP URL that corresponds to the latest version of a GDC API endpoint is https://api.gdc.cancer.gov/<endpoint>, where <endpoint> is the name of the endpoint.

The HTTP URL of an endpoint corresponding to a specific major version of the GDC API is https://api.gdc.cancer.gov/<version>/<endpoint>, where <endpoint> is the name of the endpoint and <version> is the GDC API version.

For example, the address of the latest version of the status endpoint is https://api.gdc.cancer.gov/status, whereas the address of the status endpoint corresponding to version 0 of GDC API is https://api.gdc.cancer.gov/v0/status.

GDC Legacy Archive

To interact with data in the GDC Legacy Archive, add legacy to the endpoint URL:

https://api.gdc.cancer.gov/<version>/legacy/<endpoint>

Entity UUIDs

All objects (entities) in the GDC are assigned a unique identifier in the form of a version 4 universally unique identifier (UUID). The UUID uniquely identifies the entity in the GDC, and is stored in the entity's id property.

UUIDs are frequently used in GDC API requests and responses to identify specific entities like files, cases, and samples.

See GDC Data Model for details.

Sample Request

The following is an example of a request to the files endpoint, which retrieves information about a BAM file stored in the GDC.

curl https://api.gdc.cancer.gov/files/d853e541-f16a-4345-9f00-88e03c2dc0bc?pretty=true
import requests
import json

file_endpt = 'https://api.gdc.cancer.gov/files/'
file_uuid = 'd853e541-f16a-4345-9f00-88e03c2dc0bc'
response = requests.get(file_endpt + file_uuid)
print json.dumps(response.json(), indent=2)
``` Response
{
  "data": {
    "data_type": "Aligned Reads",
    "updated_datetime": "2016-05-26T17:06:40.003624-05:00",
    "created_datetime": "2016-05-26T17:06:40.003624-05:00",
    "file_name": "0017ba4c33a07ba807b29140b0662cb1_gdc_realn.bam",
    "md5sum": "a08304b120c5df76b6532da0e9a35ced",
    "data_format": "BAM",
    "acl": [
      "phs000178"
    ],
    "access": "controlled",
    "platform": "Illumina",
    "state": "submitted",
    "file_id": "d853e541-f16a-4345-9f00-88e03c2dc0bc",
    "data_category": "Raw Sequencing Data",
    "file_size": 23650901931,
    "submitter_id": "c30188d7-be1a-4b43-9a17-e19ccd71792e",
    "type": "aligned_reads",
    "file_state": "processed",
    "experimental_strategy": "WXS"
  },
  "warnings": {}
}

Authentication

Authentication is required for downloading controlled-access data, and for all data submission functionality. The GDC API uses tokens for authentication.

Users can obtain authentication tokens from the GDC Data Portal and the GDC Data Submission Portal. See the GDC Data Portal User's Guide and the GDC Data Submission Portal User's Guide for instructions.

Using Authentication Tokens

All API requests that require authentication must include a token as an X-Auth-Token custom HTTP header.

In the following example, an authentication token is saved as an environment variable and passed to curl to download a controlled-access file:

token=$(<gdc-token-text-file.txt)

curl -O -J -H "X-Auth-Token: $token" 'https://api.gdc.cancer.gov/data/a1c1b23b-cc41-4e85-b1b7-62a42873c5af'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31.4M  100 31.4M    0     0   290k      0  0:01:50  0:01:50 --:--:--  172k
curl: Saved to filename 'ACOLD_p_TCGA_Batch17_SNP_N_GenomeWideSNP_6_A03_466078.tangent.copynumber.data.txt'

For more information about authentication tokens, including token expiration and rotation, see Data Security.

NOTE: The authentication token should be kept in a secure location, as it allows access to all data accessible by the associated user account.