How to make a TxDb object for the T2T-CHM13v2.0 genome (telomere to telomere Human genome) #1

@hpages

[Moved from https://github.com/Bioconductor/GenomicFeatures/issues/65 on March 22, 2024]

Question: How do I make a TxDb object for the T2T-CHM13v2.0 genome (telomere-to-telomere human genome), a.k.a. the hs1 genome at UCSC?

Answer: Unfortunately, makeTxDbFromUCSC() doesn't support hs1 at the moment, so we're going to use the GFF file provided by NCBI.
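
The download described in step 1 below can also be scripted from R. This is only a minimal sketch: the directory and file name are the ones given in step 1, the full URL is assumed to be simply the two pasted together, and `fetch_gff()` is a hypothetical helper name, not part of any package.

```r
## Directory and file name as given in step 1.
base_url <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/"
gff_file <- "GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz"

## Assumption: the file sits directly under the directory above.
gff_url <- paste0(base_url, gff_file)

## Hypothetical helper: download the file only if it is not already on disk.
fetch_gff <- function(url, destfile) {
    if (!file.exists(destfile)) {
        ## mode = "wb" keeps the gzip-compressed file intact on Windows.
        download.file(url, destfile, mode = "wb")
    }
    invisible(destfile)
}
```

Calling `fetch_gff(gff_url, gff_file)` should then leave the compressed GFF file in the current working directory, ready for `import()`.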

  1. Download GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/

  2. Import the GFF file as a GRanges object:

    library(rtracklayer)
    
    ## Takes < 1 min, consumes about 7 GB of RAM
    gff <- import("GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz")
    
  3. Note that the sequence names in the GRanges object are RefSeq accessions:

    seqlevels(gff)
    #  [1] "NC_060925.1" "NC_060926.1" "NC_060927.1" "NC_060928.1" "NC_060929.1"
    #  [6] "NC_060930.1" "NC_060931.1" "NC_060932.1" "NC_060933.1" "NC_060934.1"
    # [11] "NC_060935.1" "NC_060936.1" "NC_060937.1" "NC_060938.1" "NC_060939.1"
    # [16] "NC_060940.1" "NC_060941.1" "NC_060942.1" "NC_060943.1" "NC_060944.1"
    # [21] "NC_060945.1" "NC_060946.1" "NC_060947.1" "NC_060948.1"
    

    Let's change them to the official chromosome names:

    library(GenomeInfoDb)
    chrominfo <- getChromInfoFromNCBI("T2T-CHM13v2.0")
    seqlevels(gff) <- setNames(chrominfo$SequenceName, chrominfo$RefSeqAccn)
    seqlevels(gff)
    #  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
    # [16] "16" "17" "18" "19" "20" "21" "22" "X"  "Y"  "MT"
    
  4. Add the complete sequence info to the GRanges object:

    seqinfo(gff) <- Seqinfo(genome="T2T-CHM13v2.0")
    seqinfo(gff)
    # Seqinfo object with 25 sequences (1 circular) from T2T-CHM13v2.0 genome:
    #   seqnames seqlengths isCircular        genome
    #   1         248387328      FALSE T2T-CHM13v2.0
    #   2         242696752      FALSE T2T-CHM13v2.0
    #   3         201105948      FALSE T2T-CHM13v2.0
    #   4         193574945      FALSE T2T-CHM13v2.0
    #   5         182045439      FALSE T2T-CHM13v2.0
    #   ...             ...        ...           ...
    #   21         45090682      FALSE T2T-CHM13v2.0
    #   22         51324926      FALSE T2T-CHM13v2.0
    #   X         154259566      FALSE T2T-CHM13v2.0
    #   Y          62460029      FALSE T2T-CHM13v2.0
    #   MT            16569       TRUE T2T-CHM13v2.0
    
  5. Use makeTxDbFromGRanges() to make a TxDb object from the GRanges object:

    library(txdbmaker)
    
    ## This will emit 3 warnings that can be ignored.
    txdb <- makeTxDbFromGRanges(gff, taxonomyId=9606)
    
    txdb
    # TxDb object:
    ## Db type: TxDb
    ## Supporting package: GenomicFeatures
    ## Genome: T2T-CHM13v2.0
    ## Organism: Homo sapiens
    ## Taxonomy ID: 9606
    ## Nb of transcripts: 188205
    ## Db created by: txdbmaker package from Bioconductor
    ## Creation time: 2024-03-22 16:56:52 -0700 (Fri, 22 Mar 2024)
    ## txdbmaker version at creation time: 0.99.7
    ## RSQLite version at creation time: 2.3.5
    ## DBSCHEMAVERSION: 1.2
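
The renaming in step 3 relies on a named character vector whose names are the current sequence names and whose values are the replacements. Here is a toy base-R sketch of that lookup, in case it helps to check the mapping before touching the real object. The three-row `chrominfo` data frame and the `current` vector are made-up stand-ins for the output of `getChromInfoFromNCBI()` and `seqlevels(gff)`; only the first three accessions shown in step 3 are used.

```r
## Toy stand-in for the table returned by getChromInfoFromNCBI("T2T-CHM13v2.0").
## Only the two columns used in step 3 are mimicked here.
chrominfo <- data.frame(
    SequenceName = c("1", "2", "3"),
    RefSeqAccn   = c("NC_060925.1", "NC_060926.1", "NC_060927.1"),
    stringsAsFactors = FALSE
)

## Named character vector: names are the old names, values are the new names.
old2new <- setNames(chrominfo$SequenceName, chrominfo$RefSeqAccn)

## Stand-in for seqlevels(gff).
current <- c("NC_060925.1", "NC_060926.1", "NC_060927.1")

## Check that every current name has a replacement before renaming.
unmapped <- setdiff(current, names(old2new))
stopifnot(length(unmapped) == 0L)

## Look up the new names in the order of the current ones.
renamed <- unname(old2new[current])
renamed
# [1] "1" "2" "3"
```

If `unmapped` is not empty on the real data, it is probably worth looking at those sequence names before running the `seqlevels(gff) <- ...` assignment.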
    

Note that if you need the UCSC chromosome names instead of the NCBI ones, you can switch them with seqlevelsStyle():

    seqlevelsStyle(txdb)
    # [1] "NCBI"

    seqlevelsStyle(txdb) <- "UCSC"

    seqlevelsStyle(txdb)
    # [1] "UCSC"

    seqlevels(txdb)
    #  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"
    # [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
    # [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM"
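
The `seqlevelsStyle()` setter is not specific to TxDb objects; it also works on GRanges objects. Below is a small self-contained sketch on a made-up three-range GRanges object (not the real `gff` object), with no genome set, so the switch here relies on the generic NCBI-to-UCSC name mapping rather than on the T2T-CHM13v2.0 registration.

```r
library(GenomicRanges)
library(GenomeInfoDb)

## Toy GRanges object with NCBI-style sequence names (made-up ranges).
gr <- GRanges(c("1:1-10", "X:1-10", "MT:1-10"))

## Switch the sequence names to the UCSC style.
seqlevelsStyle(gr) <- "UCSC"

seqlevels(gr)
# [1] "chr1" "chrX" "chrM"
```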

H.
