GeneCup: Mining gene relationships from PubMed using custom ontology

/Enhanced with AI LLM search!/

GeneCup automatically extracts information from PubMed and the NHGRI-EBI GWAS Catalog on the relationship of any gene with a custom list of keywords hierarchically organized into an ontology. Users create an ontology by identifying categories of concepts and a list of keywords for each concept.

As an example, we created an ontology for drug addiction related concepts. Over 300 keywords are organized into six categories:

names of abused drugs, e.g., opioids
terms describing addiction, e.g., relapse
key brain regions implicated in addiction, e.g., ventral striatum
neurotransmission, e.g., dopaminergic
synaptic plasticity, e.g., long term potentiation
intracellular signaling, e.g., phosphorylation
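
To make the two-level structure concrete, here is a minimal sketch of such an ontology as a mapping from category to keywords. It is not GeneCup's storage format: the dictionary layout, the short category names, and the `categories_for` helper are assumptions made for illustration, and only the example keywords listed above are included.

```python
# Hypothetical in-memory form of a two-level ontology: category -> keywords.
# The category names are shortened labels chosen for this sketch.
ONTOLOGY = {
    "drug": ["opioids"],
    "addiction": ["relapse"],
    "brain": ["ventral striatum"],
    "neurotransmission": ["dopaminergic"],
    "plasticity": ["long term potentiation"],
    "signaling": ["phosphorylation"],
}


def categories_for(sentence, ontology=ONTOLOGY):
    """Return the sorted categories whose keywords occur in the sentence."""
    text = sentence.lower()
    return sorted(
        category
        for category, keywords in ontology.items()
        # Plain case-insensitive substring match; a real matcher may be stricter.
        if any(keyword.lower() in text for keyword in keywords)
    )
```

Calling `categories_for` on a sentence returns every category with at least one matching keyword, which is the information needed to link a gene to ontology concepts.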

Live searches are conducted through PubMed to get relevant PMIDs, which are then used to retrieve the abstracts from a local archive. The relationships are presented as an interactive Cytoscape graph. The nodes can be moved around to better reveal the connections. Clicking on a link brings up the corresponding sentences in a new browser window. Stress-related sentences for addiction keywords are further classified as describing either systemic or cellular stress using a convolutional neural network.

Top addiction related genes for addiction ontology

Extract gene symbols, aliases, and names from NCBI gene_info for taxid 9606 (human).
Search PubMed to get a count of these names and aliases together with addiction keywords and drug names.
Sort the genes by count, keep the top ones, retrieve their abstracts, and extract sentences that contain 1) a symbol or alias and 2) one of the keywords. Manually check whether any stop words need to be removed.
Sort the genes by the number of abstracts with useful sentences.
Generate the final list, including symbol, alias, and name.
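
The sentence filtering and ranking steps above can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the published list: the function names are hypothetical, the regular-expression sentence splitter is a stand-in for a real tokenizer, and stop words are assumed to be ambiguous aliases that should be dropped before matching.

```python
import re


def useful_sentences(abstract, names, keywords, stop_words=()):
    """Return sentences that mention a gene symbol/alias and a keyword."""
    # Assumption: stop words are ambiguous aliases to ignore when matching.
    blocked = {word.lower() for word in stop_words}
    names = [name for name in names if name.lower() not in blocked]
    hits = []
    # Naive splitter on sentence-ending punctuation (stand-in for a tokenizer).
    for sentence in re.split(r"(?<=[.!?])\s+", abstract.strip()):
        lowered = sentence.lower()
        has_gene = any(
            re.search(rf"\b{re.escape(name)}\b", sentence, re.IGNORECASE)
            for name in names
        )
        has_keyword = any(keyword.lower() in lowered for keyword in keywords)
        if has_gene and has_keyword:
            hits.append(sentence)
    return hits


def rank_genes(abstracts_by_gene, aliases, keywords, stop_words=()):
    """Sort genes by the number of abstracts with at least one useful sentence."""
    counts = {}
    for gene, abstracts in abstracts_by_gene.items():
        names = aliases.get(gene, [gene])
        counts[gene] = sum(
            1
            for abstract in abstracts
            if useful_sentences(abstract, names, keywords, stop_words)
        )
    # Highest count first; ties keep their input order.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

Counting abstracts rather than sentences keeps one long abstract from dominating the ranking, which matches the sorting criterion in the steps above.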

Run a test server

You can use the guix.scm container to run genecup:

`guix build -L . genecup-gemini`/bin/genecup --port 4201

Note that the build includes minipubmed and punkt for testing!

Run a production server

Install local mirror of PubMed

Follow the instructions provided by NCBI (https://www.nlm.nih.gov/dataguide/edirect/archive.html) and unpack the data in, for example, /export/PubMed/.

Point the environment variable to the archive directory and run from the local source tree:

env EDIRECT_LOCAL_ARCHIVE=/export3/PubMed/Source `guix build -L . genecup-gemini`/bin/genecup --port 4201

You can run from a proper container:

guix shell -L ~/guix-bioinformatics -L . -C -N -F genecup-gemini -- genecup --port 4201

Environment variables used:

EDIRECT_LOCAL_ARCHIVE: PubMed datadir (defaults to built-in minipubmed)
GENECUP_DATADIR: SQLITE DB directory (default .)
NLTK_DATA: punkt_tab directory (defaults to built-in nltk-punkt)
TMPDIR (default /tmp)
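
As a rough sketch of how these settings and their documented defaults could be read in Python: the `genecup_settings` helper is hypothetical and is not taken from the server code. The archive variable is spelled as in the command line shown earlier, and `None` stands in for the two built-in defaults, whose actual paths are not given here.

```python
import os


def genecup_settings(environ=os.environ):
    """Collect the documented environment variables with their defaults."""
    return {
        # None means "fall back to the built-in minipubmed".
        "EDIRECT_LOCAL_ARCHIVE": environ.get("EDIRECT_LOCAL_ARCHIVE"),
        # Directory holding the SQLite database; defaults to the current dir.
        "GENECUP_DATADIR": environ.get("GENECUP_DATADIR", "."),
        # None means "fall back to the built-in nltk-punkt".
        "NLTK_DATA": environ.get("NLTK_DATA"),
        "TMPDIR": environ.get("TMPDIR", "/tmp"),
    }
```

Passing a plain dictionary instead of `os.environ` makes it easy to check the fallbacks without touching the real environment.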

Gemini API credentials

For stress classification via Gemini, create a credentials file:

mkdir -p ~/.config/gemini
echo "YOUR_API_KEY_HERE" > ~/.config/gemini/credentials

The server reads the API key from ~/.config/gemini/credentials on startup.
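
For illustration, reading the key from that file might look like the sketch below. The `read_gemini_key` function is a hypothetical helper, not the server's own loader; it assumes the file holds only the key, as written by the `echo` command above, and strips the trailing newline.

```python
from pathlib import Path

# Location taken from the instructions above.
DEFAULT_CREDENTIALS = Path.home() / ".config" / "gemini" / "credentials"


def read_gemini_key(path=DEFAULT_CREDENTIALS):
    """Return the API key stored in the credentials file, or None if absent."""
    path = Path(path)
    if not path.is_file():
        return None
    # echo appends a newline, so strip surrounding whitespace.
    key = path.read_text(encoding="utf-8").strip()
    return key or None
```

Returning `None` for a missing or empty file lets the caller decide whether to disable the Gemini-based classification or stop with an error.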

Development

Mini PubMed for testing

For testing or code development, it is useful to have a small collection of PubMed abstracts in the same format as the local PubMed mirror. We provide 2473 abstracts that can be used to test four gene symbols (gria1, crhr1, drd2, and penk).

Install edirect (make sure you refresh your shell after installing so that the PATH is updated).
Unpack the minipubmed.tgz file.
Test the installation by running:

cd minipubmed
cat pmid.list | fetch-pubmed -path PubMed/Archive/ > test.xml

You should see 2473 abstracts in the test.xml file.
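
One way to check the count is to scan test.xml for record start tags. The sketch below assumes each abstract is delivered as a `PubmedArticle` element, which is the usual PubMed XML layout but is not stated here; `count_records` is a hypothetical helper. It counts tags line by line instead of parsing the file, so it does not depend on whether the output has a single root element.

```python
import re

# Matches "<PubmedArticle>" or "<PubmedArticle ...>", but not
# "<PubmedArticleSet>" and not the closing "</PubmedArticle>".
RECORD_START = re.compile(r"<PubmedArticle[\s>]")


def count_records(xml_path):
    """Count PubmedArticle start tags in the given file."""
    count = 0
    with open(xml_path, encoding="utf-8") as handle:
        for line in handle:
            count += len(RECORD_START.findall(line))
    return count
```

If the assumption about the record element holds, `count_records("test.xml")` should match the number of abstracts stated above.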

NLTK tokens

You also need to fetch punkt_tab.zip from https://www.nltk.org/nltk_data/

cd minipubmed
mkdir tokenizers
cd tokenizers
wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip
unzip punkt_tab.zip

Source code

The source code and data are in a git repository: https://git.genenetwork.org/genecup/

Support

E-mail Pjotr Prins or Hao Chen.

License

GeneCup source code is published under the liberal free software MIT license (also known as the Expat license).

Cite

GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships by Gunturkun MH, Flashner E, Wang T, Mulligan MK, Williams RW, Prins P, and Chen H.

G3 (Bethesda). 2022 May 6;12(5):jkac059. doi: 10.1093/g3journal/jkac059. PMID: 35285473; PMCID: PMC9073678.

@article{GeneCup,
  pmid         = {35285473},
  author       = {Gunturkun, M. H. and Flashner, E. and Wang, T. and Mulligan, M. K. and Williams, R. W. and Prins, P. and Chen, H.},
  title        = {{GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships}},
  journal      = {G3 (Bethesda)},
  year         = {2022},
  doi          = {10.1093/g3journal/jkac059},
  url          = {http://www.ncbi.nlm.nih.gov/pubmed/35285473}
}
