Available only from the Repository view, the Advanced Search page offers complex query-building capabilities for identifying a specific set of cases and files.
Advanced Search uses Genomic Query Language (GQL) to run structured queries against files and cases.
A simple query in GQL (also known as a 'clause') consists of a field, followed by an operator, followed by one or more values. For example, the simple query cases.primary_site = Brain
will find all cases for projects in which the primary site is Brain.
Note that it is not possible to compare two fields (e.g. disease_type = project.name).
Note: GQL is not a database query language. For example, GQL does not have a "SELECT" statement.
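The field, operator, value structure of a clause can be illustrated with a minimal Python sketch. The `Clause` class below is a hypothetical helper written for this illustration; it is not part of the GDC Data Portal or the GDC API, and it assumes the value is a single word.

```python
from dataclasses import dataclass


# Hypothetical helper for illustration only; not part of any GDC library.
@dataclass(frozen=True)
class Clause:
    field: str      # e.g. "cases.primary_site"
    operator: str   # e.g. "="
    value: str      # e.g. "Brain" (assumed to be a single word here)

    def render(self) -> str:
        # A clause is the field, then the operator, then the value.
        return f"{self.field} {self.operator} {self.value}"


print(Clause("cases.primary_site", "=", "Brain").render())
# cases.primary_site = Brain
```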
When accessing Advanced Search from Repository View, a query created using facet filters in Repository View will be automatically translated to an Advanced Search GQL Query.
A query created in Advanced Search is not translated back to facet filters. Clicking on "Back to Facet Search" will return the user to Data View and reset the filters.
When opening the Advanced Search page (via the Repository view), the search field is automatically populated with any facet filters already applied.
This default query can be removed by pressing "Reset".
Once the query has been entered and is identified as a "Valid Query", click on "Search" to run your query.
As a query is being written, the GDC Data Portal will analyze the context and offer a list of auto-complete suggestions. Auto-complete suggests both fields and values as described below.
The list of auto-complete suggestions includes all available fields matching the user text input. The user has to scroll down to see more fields in the dropdown.
The list of auto-complete suggestions includes the top 100 values that match the user text input. The user has to scroll down to see more values in the dropdown.
The value auto-complete is not aware of the general context of the query: the system displays all values available in GDC for the selected field. As a result, a query could return 0 results depending on the other filters.
Note: Quotes are automatically added to the value if it contains spaces.
You can use parentheses in complex GQL statements to enforce the precedence of operators.
For example, if you want to find all the open files in the TCGA program as well as all the files in the TARGET program, you can use parentheses to enforce the precedence of the boolean operators in your query, i.e.:
(files.access = open and cases.project.program.name = TCGA) or cases.project.program.name = TARGET
Note: Without parentheses, the statement will be evaluated left-to-right.
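To see why the grouping matters, the sketch below evaluates the two possible groupings of the example above with plain Python booleans. The `record` dictionary is a made-up file record used only for this illustration; it does not come from GDC.

```python
# Hypothetical file record, made up for illustration.
record = {"access": "controlled", "program": "TARGET"}

is_open = record["access"] == "open"
is_tcga = record["program"] == "TCGA"
is_target = record["program"] == "TARGET"

# Grouping used in the example query: (open AND TCGA) OR TARGET
grouped_left = (is_open and is_tcga) or is_target

# Alternative grouping: open AND (TCGA OR TARGET)
grouped_right = is_open and (is_tcga or is_target)

print(grouped_left)   # True: any TARGET file matches, whatever its access level
print(grouped_right)  # False: only open files can match
```

The same record is matched by one grouping and rejected by the other, which is why explicit parentheses are worth writing in any query that mixes the two keywords.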
A GQL keyword is a word that joins two or more clauses together to form a complex GQL query.
List of Keywords:
- AND
- OR
Note: parentheses can be used to control the order in which clauses are executed.
The "AND" keyword is used to combine multiple clauses, allowing you to refine your search.
Examples:
-
Find all open files in breast cancer
cases.project.primary_site = Breast and files.access = open
-
Find all open files in breast cancer whose data type is copy number variation
cases.project.primary_site = Breast and files.access = open and files.data_type = "Copy number variation"
The "OR" keyword is used to combine multiple clauses, allowing you to expand your search.
Note: The IN operator can be an alternative to OR and results in simpler queries.
Examples:
-
Find all files that are raw sequencing data or raw microarray data:
files.data_type = "Raw microarray data" or files.data_type = "Raw sequencing data"
-
Find all files where donors are male or vital status is alive:
cases.demographic.gender = male or cases.diagnoses.vital_status = alive
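Queries like the ones above can also be assembled programmatically. The following is a minimal sketch, assuming each clause is already a complete, valid GQL string; `join_clauses` is a hypothetical helper name, not a GDC function.

```python
from typing import Sequence


def join_clauses(keyword: str, clauses: Sequence[str]) -> str:
    """Join complete clauses with a single GQL keyword (AND or OR)."""
    if keyword.upper() not in ("AND", "OR"):
        raise ValueError(f"unsupported keyword: {keyword}")
    return f" {keyword} ".join(clauses)


narrowed = join_clauses(
    "and",
    ["cases.project.primary_site = Breast", "files.access = open"],
)
print(narrowed)
# cases.project.primary_site = Breast and files.access = open
```

Because the helper joins with one keyword at a time, mixing AND and OR would require wrapping an intermediate result in parentheses before joining again.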
An operator in GQL is one or more symbols or words comparing the value of a field on its left with one or more values on its right, such that only true results are retrieved by the clause.
Operator | Description |
---|---|
= | Field EQUAL Value (String or Number) |
!= | Field NOT EQUAL Value (String or Number) |
< | Field LESS THAN Value (Number or Date) |
<= | Field LESS THAN OR EQUAL Value (Number or Date) |
> | Field GREATER THAN Value (Number or Date) |
>= | Field GREATER THAN OR EQUAL Value (Number or Date) |
IN | Field IN [Value 1, Value 2] |
EXCLUDE | Field EXCLUDE [Value 1, Value 2] |
IS MISSING | Field IS MISSING |
NOT MISSING | Field NOT MISSING |
The "=" operator is used to search for files where the value of the specified field exactly matches the specified value.
Examples:
-
Find all files that are gene expression:
files.data_type = "Gene expression"
-
Find all cases whose gender is female:
cases.demographic.gender = female
The "!=" operator is used to search for files where the value of the specified field does not match the specified value.
The "!=" operator will not match a field that has no value (i.e. a field that is empty). For example, 'gender != male' will only match cases who have a gender and the gender is not male. To find cases other than male or with no gender populated, you would need to type gender != male or gender is missing.
Example:
-
Find all files with an experimental strategy different from genotyping array:
files.experimental_strategy != "Genotyping array"
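The handling of empty fields by "!=" can be mimicked in a few lines of Python. This is only a sketch of the behaviour described above, using `None` to stand for a field with no value; the function names and the sample list are hypothetical.

```python
from typing import Optional


def not_equal(field_value: Optional[str], value: str) -> bool:
    """Mimic "!=": a field with no value never matches."""
    return field_value is not None and field_value != value


def not_equal_or_missing(field_value: Optional[str], value: str) -> bool:
    """Mimic "field != value or field is missing"."""
    return field_value is None or field_value != value


# Made-up gender values for three cases; None means the field is empty.
genders = ["male", "female", None]

print([g for g in genders if not_equal(g, "male")])
# ['female']
print([g for g in genders if not_equal_or_missing(g, "male")])
# ['female', None]
```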
The ">" operator is used to search for files where the value of the specified field is greater than the specified value.
Example:
-
Find all cases whose number of days to death is greater than 60:
cases.diagnoses.days_to_death > 60
The ">=" operator is used to search for files where the value of the specified field is greater than or equal to the specified value.
Example:
-
Find all cases whose number of days to death is greater than or equal to 60:
cases.diagnoses.days_to_death >= 60
The "<" operator is used to search for files where the value of the specified field is less than the specified value.
Example:
-
Find all cases whose age at diagnosis is less than 400 days:
cases.diagnoses.age_at_diagnosis < 400
The "<=" operator is used to search for files where the value of the specified field is less than or equal to the specified value.
Example:
-
Find all cases with a number of days to death less than or equal to 20:
cases.diagnoses.days_to_death <= 20
The "IN" operator is used to search for files where the value of the specified field is one of multiple specified values. The values are specified as a comma-delimited list, surrounded by brackets [ ].
Using "IN" is equivalent to using multiple 'EQUALS (=)' statements, but is shorter and more convenient. That is, typing 'project IN [ProjectA, ProjectB, ProjectC]' is the same as typing 'project = "ProjectA" OR project = "ProjectB" OR project = "ProjectC"'.
Examples:
-
Find all files in brain, breast, and lung cancer:
cases.project.primary_site IN [Brain, Breast,Lung]
-
Find all files whose data type is aligned reads or unaligned reads:
files.data_type IN ["Aligned reads", "Unaligned reads"]
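The equivalence between "IN" and a chain of "=" clauses can be made concrete with a short sketch. `expand_in` is a hypothetical helper that writes out the longer form; it always quotes the values, as in the ProjectA example above.

```python
from typing import Sequence


def expand_in(field: str, values: Sequence[str]) -> str:
    """Rewrite 'field IN [v1, v2, ...]' as '=' clauses joined by OR."""
    return " OR ".join(f'{field} = "{value}"' for value in values)


print(expand_in("project", ["ProjectA", "ProjectB", "ProjectC"]))
# project = "ProjectA" OR project = "ProjectB" OR project = "ProjectC"
```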
The "EXCLUDE" operator is used to search for files where the value of the specified field is not one of multiple specified values.
Using "EXCLUDE" is equivalent to using multiple 'NOT_EQUALS (!=)' statements, but is shorter and more convenient. That is, typing 'project EXCLUDE [ProjectA, ProjectB, ProjectC]' is the same as typing 'project != "ProjectA" AND project != "ProjectB" AND project != "ProjectC"'.
The "EXCLUDE" operator will not match a field that has no value (i.e. a field that is empty). For example, 'experimental strategy EXCLUDE ["WGS","WXS"]' will only match files that have an experimental strategy and the experimental strategy is not "WGS" or "WXS". To find files whose experimental strategy is different from "WGS" or "WXS" or is not assigned, you would need to type: files.experimental_strategy exclude ["WXS","WGS"] or files.experimental_strategy is missing.
Examples:
-
Find all files where experimental strategy is not WXS, WGS, Genotyping array:
files.experimental_strategy EXCLUDE [WXS, WGS, "Genotyping array"]
The "IS" operator can only be used with "MISSING". That is, it is used to search for files where the specified field has no value.
Examples:
-
Find all cases where gender is missing:
cases.demographic.gender is MISSING
The "NOT" operator can only be used with "MISSING". That is, it is used to search for files where the specified field has a value.
Examples:
-
Find all cases where race is not missing:
cases.demographic.race NOT MISSING
The date format should be the following: YYYY-MM-DD (without quotes).
Example:
files.updated_datetime > 2015-12-31
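When a query is generated from code, the standard library can produce the expected date format. The sketch below relies on `datetime.date.isoformat()`, which returns YYYY-MM-DD; `date_clause` is a hypothetical helper name.

```python
from datetime import date


def date_clause(field: str, operator: str, day: date) -> str:
    """Build a clause whose value is an unquoted YYYY-MM-DD date."""
    # date.isoformat() returns the date as YYYY-MM-DD.
    return f"{field} {operator} {day.isoformat()}"


print(date_clause("files.updated_datetime", ">", date(2015, 12, 31)))
# files.updated_datetime > 2015-12-31
```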
A value must be quoted if it contains a space. Otherwise the advanced search will not be able to interpret the value.
Quotes are not necessary if the value consists of one single word.
-
Example: Find all cases where the primary site is brain and the data type is copy number variation:
cases.project.primary_site = Brain and files.data_type = "Copy number variation"
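The quoting rule is simple enough to express as a one-line function. This is a sketch under the rule stated above (quote only when the value contains a space); `quote_value` is a hypothetical name, and values that themselves contain double quotes are not handled.

```python
def quote_value(value: str) -> str:
    """Wrap a value in double quotes only when it contains a space."""
    return f'"{value}"' if " " in value else value


print(quote_value("Brain"))                  # Brain
print(quote_value("Copy number variation"))  # "Copy number variation"
```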
Age at diagnosis is expressed in days, so the user has to convert a number of years to a number of days.
The conversion factor is 1 year = 365.25 days.
-
Example: Find all cases whose age at diagnosis is greater than 40 years (40 * 365.25 = 14610 days):
cases.diagnoses.age_at_diagnosis > 14610
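The conversion can be scripted so that thresholds are always computed with the same factor. In this sketch, rounding the result to a whole number of days is an assumption made for readability; `years_to_days` is a hypothetical helper.

```python
DAYS_PER_YEAR = 365.25  # conversion factor given above


def years_to_days(years: float) -> int:
    """Convert an age in years to days, rounded to a whole day (assumption)."""
    return round(years * DAYS_PER_YEAR)


print(f"cases.diagnoses.age_at_diagnosis > {years_to_days(40)}")
# cases.diagnoses.age_at_diagnosis > 14610
```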
The full list of fields available on the GDC Data Portal can be found through the GDC API using the following endpoint:
https://api.gdc.cancer.gov/gql/_mapping
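The endpoint can be queried from a script. The following is a minimal sketch using only the Python standard library; it assumes the response body is JSON, and it makes no assumption about the structure of the returned document, which should be inspected before use. `fetch_mapping` is a hypothetical helper name.

```python
import json
from urllib.request import urlopen

MAPPING_URL = "https://api.gdc.cancer.gov/gql/_mapping"


def fetch_mapping(url: str = MAPPING_URL, timeout: float = 30.0):
    """Download the field mapping and decode it (assumes a JSON body)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_mapping()` requires network access to the GDC API.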
Alternatively, a static list of fields is available below (not exhaustive).
- files.access
- files.acl
- files.archive.archive_id
- files.archive.revision
- files.archive.submitter_id
- files.center.center_id
- files.center.center_type
- files.center.code
- files.center.name
- files.center.namespace
- files.center.short_name
- files.data_format
- files.data_subtype
- files.data_type
- files.experimental_strategy
- files.file_id
- files.file_name
- files.file_size
- files.md5sum
- files.origin
- files.platform
- files.related_files.file_id
- files.related_files.file_name
- files.related_files.md5sum
- files.related_files.type
- files.state
- files.state_comment
- files.submitter_id
- files.tags
- cases.case_id
- cases.submitter_id
- cases.diagnoses.age_at_diagnosis
- cases.diagnoses.days_to_death
- cases.demographic.ethnicity
- cases.demographic.gender
- cases.demographic.race
- cases.diagnoses.vital_status
- cases.project.disease_type
- cases.project.name
- cases.project.program.name
- cases.project.program.program_id
- cases.project.project_id
- cases.project.state
- cases.samples.sample_id
- cases.samples.submitter_id
- cases.samples.sample_type
- cases.samples.sample_type_id
- cases.samples.shortest_dimension
- cases.samples.time_between_clamping_and_freezing
- cases.samples.time_between_excision_and_freezing
- cases.samples.tumor_code
- cases.samples.tumor_code_id
- cases.samples.current_weight
- cases.samples.days_to_collection
- cases.samples.days_to_sample_procurement
- cases.samples.freezing_method
- cases.samples.initial_weight
- cases.samples.intermediate_dimension
- cases.samples.is_ffpe
- cases.samples.longest_dimension
- cases.samples.oct_embedded
- cases.samples.pathology_report_uuid
- cases.samples.portions.analytes.a260_a280_ratio
- cases.samples.portions.analytes.aliquots.aliquot_id
- cases.samples.portions.analytes.aliquots.amount
- cases.samples.portions.analytes.aliquots.center.center_id
- cases.samples.portions.analytes.aliquots.center.center_type
- cases.samples.portions.analytes.aliquots.center.code
- cases.samples.portions.analytes.aliquots.center.name
- cases.samples.portions.analytes.aliquots.center.namespace
- cases.samples.portions.analytes.aliquots.center.short_name
- cases.samples.portions.analytes.aliquots.concentration
- cases.samples.portions.analytes.aliquots.source_center
- cases.samples.portions.analytes.aliquots.submitter_id
- cases.samples.portions.analytes.amount
- cases.samples.portions.analytes.analyte_id
- cases.samples.portions.analytes.analyte_type
- cases.samples.portions.analytes.concentration
- cases.samples.portions.analytes.spectrophotometer_method
- cases.samples.portions.analytes.submitter_id
- cases.samples.portions.analytes.well_number
- cases.samples.portions.center.center_id
- cases.samples.portions.center.center_type
- cases.samples.portions.center.code
- cases.samples.portions.center.name
- cases.samples.portions.center.namespace
- cases.samples.portions.center.short_name
- cases.samples.portions.is_ffpe
- cases.samples.portions.portion_id
- cases.samples.portions.portion_number
- cases.samples.portions.slides.number_proliferating_cells
- cases.samples.portions.slides.percent_eosinophil_infiltration
- cases.samples.portions.slides.percent_granulocyte_infiltration
- cases.samples.portions.slides.percent_inflam_infiltration
- cases.samples.portions.slides.percent_lymphocyte_infiltration
- cases.samples.portions.slides.percent_monocyte_infiltration
- cases.samples.portions.slides.percent_necrosis
- cases.samples.portions.slides.percent_neutrophil_infiltration
- cases.samples.portions.slides.percent_normal_cells
- cases.samples.portions.slides.percent_stromal_cells
- cases.samples.portions.slides.percent_tumor_cells
- cases.samples.portions.slides.percent_tumor_nuclei
- cases.samples.portions.slides.section_location
- cases.samples.portions.slides.slide_id
- cases.samples.portions.slides.submitter_id
- cases.samples.portions.submitter_id
- cases.samples.portions.weight