NCG collects 2,372 cancer genes from different sources:
  • Cancer Gene Census (CGC) and Vogelstein studies
  • Screenings of predefined gene panels
  • Whole-exome sequencing (WES) of cancer samples
  • Whole-genome sequencing (WGS) of cancer samples
For each cancer gene, NCG provides information on:
  • duplicability
  • orthology
  • evolutionary appearance
  • protein-protein interactions
  • miRNA-gene interactions
  • functional properties
  • gene expression
  • essentiality

Search

From the homepage, the user can retrieve information from the database in several ways:


Gene Search

The user may search with a single gene identifier or a list of gene identifiers, using one of the six options below:

  1. Gene symbol: to query for a list of genes, use the * wildcard (e.g. MDM* will display 3 genes: MDM2, MDM4, MDM1);
  2. Entrez identifier (e.g. 5728);
  3. RefSeq protein identifier (e.g. NP_000305);
  4. Ensembl protein identifier (e.g. ENSP00000418960);
  5. Ensembl gene identifier (e.g. ENSG00000171862);
  6. All cancer genes within a genomic region (it is possible to either select a chromosome or the genomic coordinates, human genome hg38).
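
The wildcard behaviour described for gene symbols can be illustrated with a minimal sketch. It assumes that * stands for any run of characters and that matching is case-insensitive; the function name match_symbols and the short symbol list are hypothetical and do not reflect how NCG implements the search.

```python
from fnmatch import fnmatchcase


def match_symbols(pattern, symbols):
    """Return the symbols matching a pattern in which * matches any run of characters.

    Note: fnmatchcase also interprets ? and [...] as wildcards, which is
    an artefact of this sketch rather than a documented NCG feature.
    """
    pattern = pattern.upper()
    return [s for s in symbols if fnmatchcase(s.upper(), pattern)]


# Hypothetical subset of gene symbols, for illustration only.
symbols = ["MDM1", "MDM2", "MDM4", "TP53", "PTEN"]
print(match_symbols("MDM*", symbols))
```

With this toy list, the pattern MDM* selects MDM1, MDM2 and MDM4, mirroring the example given for option 1.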

In addition, lists of cancer genes, screenings and possible false positives can be browsed, and an advanced search can be conducted.


Screenings

The user may choose to retrieve the list of genes for any one of the studies in the following sources:

  1. Known Cancer Genes;
  2. Screenings of predefined gene panels;
  3. Whole-exome sequencing (WES) of cancer samples;
  4. Whole-genome sequencing (WGS) of cancer samples.

Advanced Search

The advanced search allows the user to analyze lists of cancer genes with similar properties based on user-defined filters.

The filters are based on the following properties:

  1. Screenings: Allows users to select only the cancer genes reported in one or more screenings;
  2. Primary Sites: Defined by the organ / tissue in which the genes are reported to be involved in cancer. Default selection is all primary sites;
  3. Cancer Types: Defined by the specific cancer type (narrower search than by primary site). Default selection is all cancer types;
  4. Protein Function (Reactome): Based on the level 1 pathways from Reactome that the cancer genes are involved in. This filter is not applied by default;
  5. Appearance in Evolution: Based on the origin of the genes. This filter is not applied by default;
  6. Duplicability: Duplicated or singleton genes. This filter is not applied by default;
  7. PPIN: Hubs or non-hubs. Hubs are defined as the top 25% most connected nodes in the human protein interaction network. In our case, hubs correspond to genes with degree higher than 41. This filter is not applied by default;
  8. microRNA-gene Network: Hubs or non-hubs. Hubs are defined as the top 25% most targeted genes in the miRNA-gene interaction network. In our case, hubs correspond to genes targeted by more than 35 miRNAs. This filter is not applied by default.
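
The hub definitions in filters 7 and 8 can be sketched as follows. The cutoffs 41 and 35 are taken from the text above; the helper top_quartile_cutoff is a hypothetical illustration of one way to derive such a cutoff from a list of degrees (ties at the cutoff are treated as non-hubs), and it is not necessarily the procedure used by NCG.

```python
PPIN_HUB_CUTOFF = 41   # hubs have degree higher than 41 (from the text)
MIRNA_HUB_CUTOFF = 35  # hubs are targeted by more than 35 miRNAs (from the text)


def top_quartile_cutoff(values):
    """Return a cutoff such that at most 25% of values are strictly above it."""
    ordered = sorted(values, reverse=True)
    k = len(ordered) // 4  # maximum number of hubs allowed
    return ordered[k]


def is_hub(value, cutoff):
    """A node is a hub if its value is strictly higher than the cutoff."""
    return value > cutoff


# Toy degree distribution, for illustration only.
degrees = [1, 2, 3, 4, 5, 6, 7, 8]
cutoff = top_quartile_cutoff(degrees)
hubs = [d for d in degrees if is_hub(d, cutoff)]
print(cutoff, hubs)
```

In the toy distribution, two of the eight nodes (25%) lie strictly above the derived cutoff.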

Possible False Positives

The list of 250 possible false positives has been compiled based on:

  1. Functional irrelevance (e.g. olfactory receptor genes);
  2. Gene length (i.e. long exons and/or introns);
  3. Literature evidence (as reported in Lawrence, Nature 2013 and Bailey, Cell 2018).


Results

The results page contains ten sections for each gene:


Gene Description

This section includes the general information about the queried gene: symbol, description and links to external databases, such as Entrez, COSMIC, OMIM, RefSeq, Ensembl.


Cancer Information

The button Details opens a new page, containing a list of screenings in which the gene has been reported as a cancer driver.


Duplicability

Duplicability is defined as in Rambaldi D et al. (2008): it is measured by aligning the corresponding protein sequences directly to the human genome, using the BLAST-like Alignment Tool (BLAT). We define as duplicates all additional genomic matches covering at least 60% of the query length. Singletons are genes that have no additional hit covering at least 60% of the query length.

The button Duplicability opens a new page, which describes all the duplicated loci related to the studied gene.


Orthology

The appearance of a gene is defined as the deepest taxonomic branch of the tree of life where an ortholog can be detected. Orthology relationships are retrieved from eggNOG 4.5.1 (Huerta-Cepas et al., 2016).

Seven branches of the tree of life are defined:

  1. Last Universal Common Ancestor (LUCA)
  2. Eukaryotes
  3. Opisthokonts
  4. Metazoans
  5. Vertebrates
  6. Mammals
  7. Primates
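
The definition of evolutionary appearance can be sketched as a lookup over the seven branches, ordered from deepest to most recent. The function name origin_of and its input (the set of branches in which an ortholog was detected) are hypothetical; in NCG the underlying orthology data come from eggNOG.

```python
# The seven branches listed above, ordered from deepest to most recent.
BRANCHES = [
    "LUCA",
    "Eukaryotes",
    "Opisthokonts",
    "Metazoans",
    "Vertebrates",
    "Mammals",
    "Primates",
]


def origin_of(branches_with_orthologs):
    """Return the deepest branch in which an ortholog was detected, or None."""
    for branch in BRANCHES:
        if branch in branches_with_orthologs:
            return branch
    return None


# Hypothetical gene with orthologs detected only from Vertebrates onwards.
print(origin_of({"Primates", "Mammals", "Vertebrates"}))
```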

The button details opens a new page, which describes all the orthology relationships of the gene of interest in detail.


Network Properties

This section provides information on the interactions of the protein encoded by the gene of interest. It provides the number of human proteins interacting with the protein of interest and the number of complexes it is part of. The button details opens a new page which describes all the network properties and complexes in detail.


The network properties are derived from four major databases of protein-protein interaction networks:

Dataset   Version                   Nodes    Interactions   Publications
BioGRID   3.4.157 (February 2018)   15,893   260,523        24,662
IntAct    4.2.10 (Jan 27th 2018)    11,802   59,910         6,791
DIP       February 5th 2018         2,879    4,511          1,919
HPRD      9 (Apr 13th 2010)         9,305    36,185         18,539
Total                               16,322   289,368        36,561


The complex interactions are derived from three major databases:

Dataset    Version                   Proteins   Complexes   Publications
CORUM      July 2nd 2017             3,146      2,382       1,585
HPRD       9 (April 13th 2010)       2,696      1,521       1,278
Reactome   63 (December 18th 2017)   6,626      4,315       7,912
Total                                8,080      8,218       10,126


miRNA Information

The number of miRNAs regulating the gene is reported.

The button details opens a new page which shows the graphical representation of all the miRNAs regulating the queried gene, along with the other genes regulated by the same miRNAs.


Protein Function

The functional classes of the gene are listed.

The button Protein Function opens a new page, which describes gene functions along with GO ids and terms.


Gene Expression in Normal Tissues

The number of normal tissues where the gene is expressed is shown.

The button Gene Expression opens a new page, which shows further information on the expression of the gene across normal tissues.


Gene Expression in Cancer Cell Lines

The number of cancer cell lines where the gene is expressed is shown.

The button Gene Expression opens a new page, which shows further information on the expression of the gene across cancer cell lines.


Essentiality

This section reports the number of human cell lines in which the gene has been found essential. The button details opens a new page, which describes all the essentiality information related to the studied gene.

Essentiality is derived from two databases:

Dataset   Version         Genes    Cell lines   Average cell lines per gene
OGEE      May 2018        18,975   7            6.9
PICKLES   Sep 22nd 2017   20,749   174          125.5


Cancer Information

This page provides a list of screenings in which the gene has been reported as a driver.


The summary box gives an overview of the sources that report the gene as a cancer driver. This includes whether the gene is a known cancer gene (reported in the Cancer Gene Census or the Vogelstein list of cancer genes), and how many publications, cancer types and primary sites it is associated with. It also indicates whether candidate genes have strong support. NCG includes two lists of candidate cancer genes with strong support (available for download):

  • Candidates that were reported by more than one mutational screening within the same primary site (or more than one pan-cancer screening),
  • Candidates that were reported by a study involving at least 140 cancer donors (top 25% of studies by sample size).

Furthermore, potential false positives are not included in either list of candidates with strong support.
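
The two strong-support criteria can be sketched as below. The record layout (dictionaries with the keys pmid, site and donors), the labels returned, and the treatment of pan-cancer screenings as one more site value are assumptions made for illustration; only the 140-donor threshold and the exclusion of possible false positives come from the text.

```python
from collections import Counter

MIN_DONORS = 140  # threshold stated in the text (top 25% of studies by sample size)


def strong_support(screenings, possible_false_positive=False):
    """Return the set of strong-support criteria met by a candidate gene.

    screenings: list of dicts with hypothetical keys 'pmid', 'site', 'donors'.
    Pan-cancer screenings are assumed to carry site == 'pan-cancer'.
    """
    if possible_false_positive:
        return set()  # possible false positives are excluded from both lists
    support = set()
    # Count distinct publications per primary site.
    distinct = {(s["site"], s["pmid"]) for s in screenings}
    per_site = Counter(site for site, _ in distinct)
    if max(per_site.values(), default=0) > 1:
        support.add("multiple_screenings")
    if any(s["donors"] >= MIN_DONORS for s in screenings):
        support.add("large_study")
    return support


# Made-up records, for illustration only.
records = [
    {"pmid": "A", "site": "breast", "donors": 50},
    {"pmid": "B", "site": "breast", "donors": 60},
    {"pmid": "C", "site": "lung", "donors": 200},
]
print(strong_support(records))
```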

In the table, the first column, Type of Screening, describes the type of screening in which the gene is reported. This can be one of the following:

The second and third columns describe the Primary Site and the Cancer Type where the gene is reported as a driver. For a summary of the primary sites and cancer types annotated in NCG, see the statistics page.

The fourth column, Method, describes the method used in the original screening to determine that the gene is a driver. This can be one of the following:

The last column, Reference, provides a link to the screening where the gene was reported as a cancer driver.


Duplicability

Duplicability is defined as in Rambaldi D et al. (2008): it is measured by aligning the corresponding protein sequences directly to the human genome (hg38), using the BLAST-like Alignment Tool (BLAT). We define as duplicates all additional genomic matches covering at least 60% of the query length. Singletons are genes that have no additional hit covering at least 60% of the query length.

Three types of Hit are defined, depending on the genomic location of the duplicated locus:

  • Best Hit, which corresponds to the original gene locus;
  • Other Gene Hits, which include other gene loci where the gene of interest is duplicated;
  • Genomic, which include loci with no known genes mapped (no genes are defined by the UCSC Genome Browser, but mRNAs or ESTs may be present).

The default cutoff to display genomic hits is 60% of the original length, but the user is allowed to choose different cutoffs from the widget next to the table. The range of choice varies from 10% of the query length to 100%.
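
The coverage rule and the adjustable cutoff can be sketched as follows. The Hit record, its kind labels ("best", "gene", "genomic") and the function names are hypothetical; coverage is expressed as a fraction of the query length, and the hits themselves are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    locus: str
    kind: str        # "best", "gene" or "genomic" (hypothetical labels)
    coverage: float  # fraction of the query length covered, between 0 and 1


def additional_hits(hits, cutoff=0.60):
    """Return the hits other than the best hit that cover at least `cutoff` of the query."""
    if not 0.10 <= cutoff <= 1.00:
        raise ValueError("cutoff must lie between 10% and 100% of the query length")
    return [h for h in hits if h.kind != "best" and h.coverage >= cutoff]


def is_duplicated(hits):
    """A gene is duplicated if any additional hit covers at least 60% of the query."""
    return bool(additional_hits(hits, cutoff=0.60))


# Made-up hits for a hypothetical gene.
hits = [
    Hit("locus_1", "best", 1.00),
    Hit("locus_2", "gene", 0.72),
    Hit("locus_3", "genomic", 0.35),
]
print(is_duplicated(hits))
print([h.locus for h in additional_hits(hits, cutoff=0.30)])
```

Lowering the cutoff only changes which hits are displayed; in this sketch the duplicated/singleton call always uses the 60% definition.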


Orthology

The orthology relationships are derived from eggNOG 4.5.1 (Huerta-Cepas et al., 2016).


Tree Of Life

The Tree of Life visualizes the origin and the orthologs of the gene of interest. The origin of the gene is shown in red and the presence of orthologs in yellow. Nodes that do not have any orthologs of the gene of interest are shown in white.

Clicking on the node of interest displays a short description of the node in the Orthology Information section above the legend.

The Orthology Information describes the number of orthologous genes found and the number of species containing the orthologs.

The user can view detailed information about the orthologs by clicking on the link in the Orthology Information section, or scroll down to see the results for all the nodes.


Orthology Table

The table lists all the species and the corresponding orthologs. If a node branches further into nodes with orthologous genes, the species from the lower nodes are also shown. For example, in the table below, Mammals has two branches, Primates and Rodents, and the orthologs from these nodes are also reported in the table.


Protein-Protein Interactions

The network is displayed using Cytoscape Web v1.0.3.


Network visualization

On the left, the first-level network for the protein encoded by the gene of interest (which is in the center of the image) is displayed. On the right, information on the nodes and edges is displayed. It can be changed by clicking on the edges and nodes. By default, the node and edge information section displays information for the gene of interest.


Primary interactions, i.e. interactions between the gene of interest and other genes, are coloured in green. By default, only primary interactions are displayed. To show secondary interactions (i.e. interactions of an interactor of the protein of interest), click on the interactor and then on “Show interactions” in the section Node and edge information on the right. Secondary interactions are shown in purple.


The thickness of the interaction lines is based on the number of experiments which support the interaction:

  • The thinner lines represent interactions found in one experiment (supported by only 1 publication).
  • The thicker lines represent interactions found in more than one experiment (more than one publication).

The colour of the gene name represents the number of duplicates:

  • Singleton genes are coloured in black.
  • Duplicated genes are coloured in red.

The shape of the nodes denotes the category of the gene:

  • Triangles represent known cancer genes.
  • Diamonds represent candidate cancer genes.
  • Circles represent genes that are not associated with cancer.

The colour of the node defines the origin of the gene:

  • Cyan represents young genes originated in Metazoans, Vertebrates, Mammals or Primates.
  • Blue represents old genes originated in Last Universal Common Ancestor, Eukaryotes or Opisthokonts.

In the example above, PRF1 is a singleton gene (name displayed in black) whose encoded protein interacts with 7 other proteins. One of them, KRT31, is encoded by a duplicated gene (red font). PRF1 and CALR are candidate cancer genes (diamond shape), the other interactors are non-cancer genes (circle). Six interactions are supported by only one publication (thin edges), one by >1 publication (thick edge). PRF1 is a recent gene (represented in cyan). Upon clicking on SRGN and “Show 18 interactions”, the interactors of SRGN become visible in purple. One protein, GZMB, interacts with both PRF1 and SRGN.
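
The legend rules for shapes, colours and line thickness can be summarised in a minimal sketch. The function names, the category labels ("known", "candidate", "other") and the returned dictionary are hypothetical and are not part of Cytoscape Web or NCG; only the mapping itself is taken from the lists above.

```python
YOUNG = {"Metazoans", "Vertebrates", "Mammals", "Primates"}
OLD = {"LUCA", "Eukaryotes", "Opisthokonts"}
SHAPES = {"known": "triangle", "candidate": "diamond", "other": "circle"}


def node_style(category, duplicated, origin):
    """Map gene properties to the node shape, name colour and node colour."""
    if origin in YOUNG:
        fill = "cyan"
    elif origin in OLD:
        fill = "blue"
    else:
        raise ValueError(f"unknown origin: {origin}")
    return {
        "shape": SHAPES[category],
        "label_colour": "red" if duplicated else "black",
        "fill_colour": fill,
    }


def edge_width(n_publications):
    """Thin for one supporting publication, thick for more than one."""
    return "thick" if n_publications > 1 else "thin"


# Hypothetical candidate cancer gene: singleton, originated in Mammals.
print(node_style("candidate", duplicated=False, origin="Mammals"))
print(edge_width(1), edge_width(3))
```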

In the node and edge information, the degree, betweenness and clustering coefficient are displayed for the selected protein (default: protein of interest). The degree is the number of interactions with other proteins. The betweenness measures the number of times the protein lies on the shortest path between two other proteins. The clustering coefficient measures how connected the interactors of the protein are to each other. It is defined as the number of observed interactions between interactors divided by the number of possible interactions between interactors.
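
The degree and clustering coefficient definitions can be illustrated with a short sketch on a toy network. The protein names and edges are placeholders, the helper names are hypothetical, and proteins with fewer than two interactors are given a coefficient of 0 by assumption. Betweenness is omitted because it requires shortest paths over the whole network.

```python
from itertools import combinations


def build_network(edges):
    """Build an undirected network as a dict mapping each protein to its set of interactors."""
    network = {}
    for a, b in edges:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


def degree(network, protein):
    """Number of interactions with other proteins."""
    return len(network[protein])


def clustering_coefficient(network, protein):
    """Observed interactions between interactors divided by possible interactions."""
    neighbours = network[protein]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumption: undefined cases are reported as 0
    possible = k * (k - 1) // 2
    observed = sum(1 for a, b in combinations(neighbours, 2) if b in network[a])
    return observed / possible


# Toy network: P interacts with A, B and C; A and B also interact with each other.
net = build_network([("P", "A"), ("P", "B"), ("P", "C"), ("A", "B")])
print(degree(net, "P"), clustering_coefficient(net, "P"))
```

Here P has three interactors, so three interactions between them are possible, and one (A with B) is observed, giving a clustering coefficient of 1/3.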


Network table

The table lists the various properties of the human genes encoding proteins that the protein of interest interacts with. The properties are:

  • Cancer Gene: Information on whether the gene is a known cancer gene or a candidate cancer gene present in any cancer study.
  • Duplicated: States whether the gene has duplicates in the human genome.
  • Origin: The deepest taxonomic branch of the tree of life where an ortholog can be detected.
  • Degree: The number of interactions with other proteins.
  • Betweenness: Describes how many times the protein of interest lies on the shortest path between two other proteins.
  • Clustering Coefficient: Measures how connected the interactors of the protein of interest are to each other. It is defined as the number of observed interactions between interactors divided by the number of possible interactions between interactors.
  • PubMed ID(s) supporting the interaction: Publications supporting this interaction.

Complex table

The table lists the complexes which the protein of interest is a part of.

  • Complex: The name given to the complex by the original database indicated (CORUM, HPRD or Reactome). Note that the same complex can be listed more than once, since different databases may name it differently.
  • Components: Human proteins involved in this complex. Note that the complex may have other components that are not proteins; to find these, follow the link in the Complex column. Two different complexes listed with the same components are composed of the same proteins but may differ in their non-protein components.
  • PubMed ID(s) supporting the complex: Publications supporting this complex.


miRNA-Gene interactions

The miRNA-target interaction network is composed of cancer genes and the miRNAs targeting them. The network displays only interactions that are supported by experimental validation. The miRNA data are derived from miRecords v.4.0 (Xiao F et al., 2009) and miRTarBase v.7.0 (Chou et al., 2018).


Network visualization

The shape of the nodes denotes the category of the gene:

  • Triangles represent known cancer genes.
  • Diamonds represent candidate cancer genes.
  • Circles represent genes that are not associated with cancer.

The colour of the gene name represents the number of duplicates:

  • Singleton genes are coloured in black.
  • Duplicated genes are coloured in red.

The colour of the node defines the origin of the gene:

  • Cyan represents young genes originated in Metazoans, Vertebrates, Mammals or Primates.
  • Blue represents old genes originated in Last Universal Common Ancestor, Eukaryotes or Opisthokonts.

The thickness of the interaction lines is based on the number of experiments which support the interaction:

  • Thin lines represent interactions supported by a single publication.
  • Thick lines represent interactions supported by more than one publication.

Clicking on a miRNA node or a gene node displays information about that miRNA or gene.


Network table

The table includes all miRNAs and target genes visualized in the network. Each row provides information on the target gene: involvement in cancer, evolutionary origin, and duplicability. The PubMed IDs column contains links to the publications supporting the interaction, while the last column describes the methods used to experimentally validate the interaction.


Protein Function

This page displays functional information from KEGG v85.1, Reactome v63, and BioCarta (downloaded from CGAP) for the gene of interest.

KEGG is a three-level hierarchical database of biological pathways. This table lists the lowest-level pathways to which the gene belongs ('Description'); links to the corresponding pathway maps ('ID'); and higher-level functional information ('Level A', 'Level B').

Reactome is a multi-level hierarchical database of biological pathways. This table lists the pathways at level two or greater to which the gene belongs ('Description'); the levels of these pathways ('Level'); links to the relevant section of the Pathway Browser ('ID'); and the corresponding level one pathways ('Branch').

BioCarta is a non-hierarchical database of biological pathways. This table lists the pathways to which the gene belongs ('Description') and links to the corresponding pathway diagrams ('ID').


Gene Expression in Normal Tissues

Expression levels are derived from two sources:

  1. GTEx provides RNA-seq expression data from 11,688 samples from 714 individuals, classified into 30 tissues. We report the TPM values from GTEx v7. Genes are considered to be expressed in a tissue if they have a TPM value of at least 1.
  2. The Human Protein Atlas version 18 provides RNA-seq data (HPA dataset) as average (mean) TPM values across samples in each of 37 tissue types. We use the Protein Atlas' recommended threshold of 1 TPM to determine categorical expression.
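
The 1 TPM rule used for both sources can be sketched as below. The function name and the TPM values are made up for illustration; only the threshold of at least 1 TPM comes from the text.

```python
TPM_THRESHOLD = 1.0  # a gene is considered expressed at a TPM value of at least 1


def expressed_tissues(tpm_by_tissue, threshold=TPM_THRESHOLD):
    """Return the tissues, sorted by name, in which the gene is considered expressed."""
    return sorted(t for t, tpm in tpm_by_tissue.items() if tpm >= threshold)


# Made-up TPM values for a hypothetical gene.
example = {"Liver": 12.4, "Lung": 0.3, "Brain": 1.0}
tissues = expressed_tissues(example)
print(len(tissues), tissues)
```

A value exactly at the threshold counts as expressed, so the hypothetical gene is reported in two of the three tissues.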

The results are plotted separately for each experiment:

[Figures: Expression GTEx; Expression ProteinAtlas]


Protein Expression in Normal Tissues

The Human Protein Atlas version 18 provides protein expression data from immunohistochemistry assays in normal tissue samples. Expression is reported as Not detected, Low, Medium, or High in 44 tissue types.

The results are shown in a column chart:

[Figure: Expression ProteinAtlas]


Gene Expression in Cancer Cell Lines

Expression levels in cancer cell lines are derived from three sources:

  1. The Cancer Cell Line Encyclopedia (CCLE) provides RNA-Seq data from 1,048 cancer cell lines in 26 tissues. We report the RPKM values from the 14th February 2018 version.
  2. The COSMIC Cell Lines Project (CLP) provides Affymetrix U219 expression data from 970 cancer cell lines in 30 tissues. We report the normalized z-scores and the classification into over, normally, and under-expressed genes from COSMIC v84.
  3. The Genentech Cell Lines dataset (Klijn, 2015) provides RNA-seq expression data from 675 cancer cell lines in 31 tissues. We report the RPKM values downloaded from ArrayExpress.

The results are shown separately for each experiment:

[Figure: Expression Cancer Cell Lines]


Download

This page allows users to download information from the NCG database. There are six files available to download.

The first downloadable file is a list of cancer genes and their supporting literature. There is one row per gene-screen pair. The columns contain:

  • Gene Entrez ID
  • Gene symbol
  • PubMed ID
  • Screen type (e.g. WES, WGS)
  • Primary site
  • Cancer type
  • Method by which the gene was identified.

The second downloadable file is a list of cancer genes and their systems-level properties. There is one row per gene. The columns contain:

  • Gene Entrez ID
  • Gene symbol
  • Number of duplicated loci at 60% coverage
  • Evolutionary origin (taxonomic group)
  • Percentage of cell lines in which the gene was found to be essential
  • Number of tissues in which the gene is expressed according to RNA-Seq data (out of 43)
  • Number of tissues in which the protein is expressed according to immunohistochemistry data (out of 44)
  • Degree of the gene in the protein-protein interaction network (PPIN)
  • Betweenness in the PPIN
  • Clustering coefficient in the PPIN
  • Number of complexes in which the protein is a component
  • Number of targeting miRNAs
  • Protein function (pathways) from Level 1 of Reactome, separated by '|'
  • Pathways from Reactome below Level 1
  • Pathways from Level 1 of KEGG
  • Pathways from Level 2 of KEGG
  • Pathways from KEGG below Level 2
  • Pathways from BioCarta.
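
Reading the '|'-separated Reactome column might look like the sketch below. The file layout is an assumption: a tab-delimited file with a header row is assumed, and the column names (entrez, symbol, reactome_level1) and the two sample rows are hypothetical. Check the header of the actual download before adapting this.

```python
import csv
import io


def parse_rows(text):
    """Parse tab-delimited text and split the '|'-separated pathway column into a list."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        field = row["reactome_level1"]  # hypothetical column name
        row["reactome_level1"] = [p for p in field.split("|") if p]
        rows.append(row)
    return rows


# Hypothetical file content with placeholder genes.
sample = (
    "entrez\tsymbol\treactome_level1\n"
    "0\tGENE_A\tSignal Transduction|Immune System\n"
    "1\tGENE_B\t\n"
)
for row in parse_rows(sample):
    print(row["symbol"], row["reactome_level1"])
```

An empty pathway field is returned as an empty list rather than a list containing an empty string.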

The third downloadable file is the same as the second, but with additional columns containing the full expression data taken from GTEx and Protein Atlas.

The fourth downloadable file is a list of known cancer genes and their annotations as tumour suppressor genes/oncogenes in the Cancer Gene Census (CGC) and the list from Vogelstein et al. There is one row per gene. The columns contain:

  • Gene Entrez ID
  • Gene symbol
  • Annotation in CGC
  • Annotation in Vogelstein
  • Boolean indicating whether the gene is considered an oncogene in NCG6 (i.e. no conflicting annotations)
  • Boolean indicating whether the gene is considered a tumour suppressor gene in NCG6.

The last two downloadable files are lists of candidate cancer genes with strong support, and a description of this support. There are two lists of well-supported candidates: candidates that were reported by more than one mutational screening within the same primary site (or more than one pan-cancer screening); and candidates that were reported by a study involving at least 140 cancer donors (top 25% of studies by sample size). There is one row per gene. The columns contain:

  • Gene Entrez ID
  • Gene symbol
  • Support type, i.e. which list the gene belongs to (note that genes in both lists are in two rows)
  • Support details. For genes found by multiple screenings within the same primary site, the sites and numbers of screenings are given. For genes found by studies with at least 140 donors, the largest number of donors is given.
  • PubMed IDs of the associated screenings (either within the listed primary sites or with the given number of donors).


Essentiality

This table provides information on the essentiality of the gene for cell survival. It lists whether the gene has been found essential or not in the respective human cell line and tissue in several screens obtained from the OGEE and PICKLES databases.

  • Essentiality: The score can either be Essential or Not_essential for each cell line or tissue in each screen. The cell lines in which the gene is essential are listed at the top. In OGEE, an essentiality score is provided directly for each gene. In PICKLES, Bayes Factors are provided which indicate the likelihood of a gene being essential. A gene is listed as essential according to PICKLES if the Bayes Factor is greater than 3, following the guideline of the PICKLES database.
  • Cell line: This column contains the cell line in which the gene has been tested for essentiality. When a screen used several cell lines of the same tissue to derive the essentiality score, this column contains a dash.
  • Tissue: This column contains the tissue in which the gene has been tested for essentiality.
  • Source: The database which was used to obtain the information.
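
The PICKLES rule and the ordering of the table can be sketched as follows. The threshold (Bayes Factor greater than 3) and the two labels come from the text; the row layout, the cell line names and the Bayes Factor values are made up for illustration.

```python
BF_THRESHOLD = 3.0  # PICKLES guideline cited in the text


def essentiality_label(bayes_factor, threshold=BF_THRESHOLD):
    """Label a gene in one cell line from its PICKLES Bayes Factor."""
    return "Essential" if bayes_factor > threshold else "Not_essential"


def essential_first(rows):
    """Order rows so that cell lines in which the gene is essential come first."""
    # sorted() is stable, so the original order is kept within each group.
    return sorted(rows, key=lambda r: r["essentiality"] != "Essential")


# Made-up Bayes Factors for hypothetical cell lines.
bayes_factors = {"line_1": -12.5, "line_2": 48.0, "line_3": 3.0}
rows = [
    {"cell_line": name, "essentiality": essentiality_label(bf), "source": "PICKLES"}
    for name, bf in bayes_factors.items()
]
for row in essential_first(rows):
    print(row["cell_line"], row["essentiality"])
```

A Bayes Factor of exactly 3 is labelled Not_essential in this sketch, because the text requires the value to be greater than 3.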