This repository contains a bioinformatic tool developed to predict the structure of the IGH locus of any desired genome. It relies on commonly used alignment tools together with purpose-built analysis scripts. The information produced is used to predict the location of the IGH-V, -D, -J, and -C segments within the IGH locus.
This section outlines the general steps performed by AIRLoM, to give an overview of the pipeline.
- Confirm variables.
- Make CDHIT reductions.
- Align with CLUSTALW2 the RSS reductions.
- Make BLAST analysis.
- Convert BLAST m6 table to gff.
- Extract scaffolds of interest.
- Make EXONERATE analysis.
- Clean vulgar format.
- Make HMMER analysis for RSS.
- Filter EXONERATE files for exons and genes hits.
- Reduce to eliminate redundancy in filtered EXONERATE files.
- Make overlap analysis to detect V segments with both an exon and a signal peptide (SP).
- Make MINIPROT analysis.
- Correct V segments and RSS-J coordinates.
- Predict D segments based on the RSS-D signals found.
AIRLoM is controlled by a master script, structured as a series of functions located in the fun.sh script. This modular design gives better control over every step of the analysis and makes it possible to exclude parts of it.
Subscripts were written to format the results obtained, reduce redundancy in the coordinates, and perform the overlap analysis.
This script transforms BLAST output format 6 (tabular, 12 columns by default) into a gff file, applying some filters along the way.
blastm6_to_gff.py --file [FILE] --source [SOURCE] --bitscore [BITSCORE]
Options | Description |
---|---|
file | The file produced by blast with output format 6 (tabular). |
source | The mode in which blast was performed, e.g. blastn, blastp, tblastx, ... |
bitscore | Minimum bitscore a BLAST hit must reach to be kept. Adds a filtering level to the result. If 0, all results are kept. |
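
The conversion described above can be sketched in Python as follows. This is only an illustration, not the code of `blastm6_to_gff.py`: it assumes the default twelve-column layout of BLAST output format 6, and the function name `blast_m6_line_to_gff`, the `match` feature type and the attribute keys are hypothetical choices.

```python
def blast_m6_line_to_gff(line, source="blastn", min_bitscore=0.0):
    """Convert one BLAST outfmt 6 line into a GFF line (None if filtered out)."""
    fields = line.rstrip("\n").split("\t")
    # Assumed default column order of outfmt 6:
    # qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    qseqid, sseqid = fields[0], fields[1]
    sstart, send = int(fields[8]), int(fields[9])
    evalue, bitscore = fields[10], float(fields[11])

    # A threshold of 0 keeps every hit, mirroring the --bitscore option.
    if min_bitscore and bitscore < min_bitscore:
        return None

    # GFF needs start <= end; the hit orientation goes into the strand column.
    start, end = min(sstart, send), max(sstart, send)
    strand = "+" if sstart <= send else "-"

    # Hypothetical attribute layout: query name plus e-value.
    attributes = f"ID={qseqid};evalue={evalue}"
    return "\t".join(
        [sseqid, source, "match", str(start), str(end),
         f"{bitscore:g}", strand, ".", attributes]
    )


if __name__ == "__main__":
    hit = "IGHV1\tscaffold_7\t95.0\t300\t10\t0\t1\t300\t5300\t5001\t1e-50\t250.0"
    print(blast_m6_line_to_gff(hit, source="blastn", min_bitscore=100))
```

A hit on the minus strand (subject start greater than subject end) is reported with swapped coordinates and a `-` strand, which is what GFF expects.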
This script takes the vulgar format from the exonerate analysis and transforms the vulgar syntax into a table of condensed results.
vulgar_to_table.R --file [FILE]
Options | Description |
---|---|
file | The filtered exonerate result file containing only the records with vulgar formats. |
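
As an illustration of what condensing a vulgar line can look like, here is a minimal Python sketch. It relies on the standard exonerate vulgar layout (nine header fields followed by label/query-length/target-length triplets). The chosen summary columns (`n_introns`, `target_match_len`) and the function name `parse_vulgar` are assumptions, and the actual table written by `vulgar_to_table.R` may differ.

```python
def parse_vulgar(line):
    """Condense one exonerate 'vulgar:' line into a flat record (dict)."""
    tokens = line.split()
    if not tokens or tokens[0] != "vulgar:":
        raise ValueError("not a vulgar line")

    # Header: query id/start/end/strand, target id/start/end/strand, score.
    header, ops = tokens[1:10], tokens[10:]
    record = {
        "query": header[0],
        "q_start": int(header[1]),
        "q_end": int(header[2]),
        "q_strand": header[3],
        "target": header[4],
        "t_start": int(header[5]),
        "t_end": int(header[6]),
        "t_strand": header[7],
        "score": int(header[8]),
    }

    # The rest comes in triplets: <label> <query_length> <target_length>.
    triplets = [
        (ops[i], int(ops[i + 1]), int(ops[i + 2]))
        for i in range(0, len(ops), 3)
    ]
    # Hypothetical summary columns: intron count and matched target length.
    record["n_introns"] = sum(1 for label, _, _ in triplets if label == "I")
    record["target_match_len"] = sum(t for label, _, t in triplets if label == "M")
    return record


if __name__ == "__main__":
    example = ("vulgar: IGHV1 0 98 . scaffold_7 1000 1378 + 450 "
               "M 15 45 S 0 1 5 0 2 I 0 80 3 0 2 S 1 2 M 82 246")
    print(parse_vulgar(example))
```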
This script transforms hmmer tbl format into a gff file.
hmmer_tbl_to_gff.py --file [FILE]
Options | Description |
---|---|
file | The file produced by HMMER analysis, in tbl format. |
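
A possible shape for this conversion is sketched below in Python. It assumes the table was produced by `nhmmer --tblout` (DNA search, with the alignment coordinates in columns 7 and 8 and the strand in column 12), which may not match the setup used by AIRLoM. The `RSS` feature type, the attribute keys and the function name are hypothetical.

```python
def nhmmer_tbl_to_gff(lines, source="nhmmer"):
    """Turn nhmmer --tblout lines into GFF lines, skipping '#' comments."""
    gff = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Assumed nhmmer column order: target, accession, query, accession,
        # hmmfrom, hmmto, alifrom, alito, envfrom, envto, sqlen, strand,
        # E-value, score, bias, description.
        f = line.split(maxsplit=15)
        target, query = f[0], f[2]
        ali_from, ali_to = int(f[6]), int(f[7])
        strand, evalue, score = f[11], f[12], f[13]

        # Minus-strand hits are reported with alifrom > alito; GFF wants start <= end.
        start, end = sorted((ali_from, ali_to))
        gff.append("\t".join(
            [target, source, "RSS", str(start), str(end),
             score, strand, ".", f"Name={query};evalue={evalue}"]
        ))
    return gff


if __name__ == "__main__":
    table = [
        "# target name  accession  query name ...",
        "scaffold_7 - RSS_V - 1 28 5340 5313 5345 5308 100000 - 1.2e-05 25.3 0.1 -",
    ]
    print("\n".join(nhmmer_tbl_to_gff(table)))
```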
This script takes the filtered exonerate files (genes or exons) and reduces the overlapping sequences, in order to remove the ambiguity produced by matches located at the same coordinates. The result is one genomic range per group of overlapping ranges.
gff_disambiguation.R --file [FILE]
Options | Description |
---|---|
file | The filtered exonerate result file (genes or exons) to be reduced. |
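
The reduction step can be pictured with the short Python sketch below, which merges overlapping ranges on the same sequence and strand into a single range. It is an illustration of the idea only: the function name `reduce_ranges` and the tuple layout are hypothetical, and whether the R script also merges directly adjacent ranges or treats strands separately is not stated here.

```python
def reduce_ranges(ranges):
    """Merge overlapping (seqid, start, end, strand) ranges into one range each."""
    merged = []
    # Sort so that ranges on the same sequence and strand are visited by start.
    for seqid, start, end, strand in sorted(
        ranges, key=lambda r: (r[0], r[3], r[1], r[2])
    ):
        last = merged[-1] if merged else None
        same_group = last is not None and last[0] == seqid and last[3] == strand
        if same_group and start <= last[2]:
            # Overlap with the previous range: extend it instead of adding a new one.
            merged[-1] = (seqid, last[1], max(last[2], end), strand)
        else:
            merged.append((seqid, start, end, strand))
    return merged


if __name__ == "__main__":
    hits = [
        ("s1", 100, 200, "+"),
        ("s1", 150, 300, "+"),
        ("s1", 400, 500, "+"),
        ("s1", 450, 600, "-"),
    ]
    for merged_range in reduce_ranges(hits):
        print(merged_range)
```

In this example the first two hits collapse into one range (100 to 300), while the hit on the minus strand stays separate from the overlapping plus-strand hit.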
This script takes two filtered files: the reduced exon and gene records from the EXONERATE results. For a V segment to be considered to have both its structural exon and its signal peptide, at least two exons must be detected in its gene record. To check this, the number of exons per gene is counted: every gene with two or more exons is annotated as a gene, whereas genes with fewer than two associated exons are annotated as pseudogenes. Every record is numbered so that each segment receives a unique name.
SCRIPTS/SUBSCRIPTS/predict_ighv_by_overlaps.R --query [FILE] --subject [FILE]
Options | Description |
---|---|
query | The reduced gff file from filtered exon annotated records from EXONERATE |
subject | The reduced gff file from filtered gene annotated records from EXONERATE |
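
The counting rule described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: records are plain `(seqid, start, end)` tuples, an exon is assigned to a gene when the two ranges overlap, and the `IGHV_<n>` naming scheme is hypothetical rather than the one used by `predict_ighv_by_overlaps.R`.

```python
def classify_v_segments(genes, exons):
    """Label each gene range as 'gene' (>= 2 exons) or 'pseudogene' (< 2 exons)."""
    annotations = []
    for index, (g_seq, g_start, g_end) in enumerate(genes, start=1):
        # Count exons on the same sequence whose range overlaps the gene range.
        n_exons = sum(
            1
            for e_seq, e_start, e_end in exons
            if e_seq == g_seq and e_start <= g_end and e_end >= g_start
        )
        annotations.append({
            "name": f"IGHV_{index}",  # hypothetical unique name
            "seqid": g_seq,
            "start": g_start,
            "end": g_end,
            "n_exons": n_exons,
            "type": "gene" if n_exons >= 2 else "pseudogene",
        })
    return annotations


if __name__ == "__main__":
    genes = [("s1", 1000, 1500), ("s1", 3000, 3400)]
    exons = [("s1", 1000, 1045), ("s1", 1130, 1500), ("s1", 3100, 3400)]
    for record in classify_v_segments(genes, exons):
        print(record["name"], record["type"], record["n_exons"])
```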
This script detects which RSS is located near each V/J segment. This is achieved by extending the coordinates of each RSS match at both its start and its end, in order to create an artificial overlap with the nearby V/J segment.
locate_nearby_rss.R -n [FILE] -t [FILE] -v [FILE] -r [INT] -m [STRING]
Options | Description |
---|---|
n | The HMMER gff file |
t | The HMMER tbl file |
v | The overlap gff file prediction containing the names of genes and pseudogenes |
r | The number of positions to increase in both sides of the HMMER matches, in which to detect possible overlaps with the V segments |
m | Mode. Depends on the input files. One of the following: [V_segments/J_segments] |
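
The coordinate-extension idea can be sketched in Python as below. The `flank` argument plays the role of the `-r` option; the tuple layout and the function name `find_nearby_rss` are hypothetical, and strand handling and the mode switch of the real script are left out.

```python
def find_nearby_rss(rss_hits, segments, flank):
    """Pair each RSS hit with the segments it overlaps after being extended.

    rss_hits and segments are lists of (name, seqid, start, end) tuples.
    """
    pairs = []
    for rss_name, r_seq, r_start, r_end in rss_hits:
        # Widen the RSS match on both sides (coordinates stay >= 1).
        ext_start = max(1, r_start - flank)
        ext_end = r_end + flank
        for seg_name, s_seq, s_start, s_end in segments:
            # Standard closed-interval overlap test on the same sequence.
            if r_seq == s_seq and ext_start <= s_end and ext_end >= s_start:
                pairs.append((seg_name, rss_name))
    return pairs


if __name__ == "__main__":
    rss = [("rss_1", "s1", 1510, 1537)]
    v_segments = [("IGHV_1", "s1", 1000, 1500), ("IGHV_2", "s1", 3000, 3400)]
    print(find_nearby_rss(rss, v_segments, flank=0))   # no overlap without extension
    print(find_nearby_rss(rss, v_segments, flank=20))  # RSS now reaches IGHV_1
```

Without extension the RSS sits 10 bp downstream of `IGHV_1` and does not overlap it; a flank of 20 positions is enough to create the artificial overlap.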
This script detects the probable location of true D segments. The reasoning is that a true D segment is flanked by D RSS signals at both its 5' and 3' ends. Every genomic region flanked by both signals that is shorter than 50 bp is considered a potential true D segment.
predict_ighd_by_rss.R -f [FILE] -t [FILE]
Options | Description |
---|---|
f | The HMMER gff file produced with the 5' RSS database
t | The HMMER gff file produced with the 3' RSS database
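
The flanking rule can be illustrated with the following Python sketch. It assumes that the candidate D segment is the stretch between the end of a 5' RSS hit and the start of the next 3' RSS hit on the same sequence, and it only handles the forward orientation; the function name `predict_d_segments` and the tuple layout are hypothetical.

```python
def predict_d_segments(rss5_hits, rss3_hits, max_len=50):
    """Return (seqid, start, end) regions lying between a 5' and a 3' RSS hit.

    Both inputs are lists of (seqid, start, end) tuples. A region is kept
    when it is non-empty and shorter than max_len bp.
    """
    candidates = []
    for seq5, start5, end5 in rss5_hits:
        for seq3, start3, end3 in rss3_hits:
            # The 3' RSS must lie downstream of the 5' RSS on the same sequence.
            if seq5 != seq3 or start3 <= end5:
                continue
            d_start, d_end = end5 + 1, start3 - 1
            length = d_end - d_start + 1
            if 0 < length < max_len:
                candidates.append((seq5, d_start, d_end))
    return candidates


if __name__ == "__main__":
    rss5 = [("s1", 100, 127), ("s1", 900, 927)]
    rss3 = [("s1", 150, 177), ("s1", 2000, 2027)]
    print(predict_d_segments(rss5, rss3))
```

In this example only the first pair of signals encloses a short enough region (22 bp, positions 128 to 149); the other combinations are either in the wrong order or too far apart.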
The pipeline has many dependencies that can be tedious to install, test, and place in the required locations. To avoid this, we built a Docker image that provides a clean, ready-to-use environment for running the genomic analysis.
These are the dependencies, and their versions, needed to run the full analysis. Only programs that are not preinstalled in a fresh Linux installation are listed.
Programming Language | Library, package | Use in | Version |
---|---|---|---|
Conda | --- | --- | --- |
python | pandas | --- | --- |
R | rtracklayer | --- | --- |
R | GenomicRanges | --- | --- |