Allow --reference-file to use compressed BGZIP files #347

Open
edsu7 opened this issue Apr 21, 2023 · 4 comments
Labels: new-feature (Request is a new feature)

edsu7 commented Apr 21, 2023

As per Linda's comment in #229 (comment):

The following doesn't work:

score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:10000-20000 --reference-file /reference/GRCh38_hla_decoy_ebv/GRCh38_hla_decoy_ebv.fa.gz

But the uncompressed version does:

score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:1000-20000 --reference-file /reference/GRCh38_hla_decoy_ebv/GRCh38_hla_decoy_ebv.fa

We suspect this is due to the HTSJDK library being at version 2.1.0.
The release notes for 2.16.0 (https://github.com/samtools/htsjdk/releases/tag/2.16.0) mention support for compressed files.

Updating HTSJDK may resolve the issue.
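
Since the request is specifically about BGZIP (BGZF) references, it may be worth confirming that a given `.fa.gz` is block-compressed and not plain gzip before testing. A minimal sketch, assuming the `BC` extra subfield comes first in the gzip header (which is how the BGZF section of the SAM specification lays it out); `is_bgzf` is a hypothetical helper name and not part of score-client:

```shell
# Hypothetical helper: return 0 if the file starts with a BGZF block header.
# Assumes the 'BC' subfield is the first gzip extra subfield.
is_bgzf() {
    # First 16 bytes as lowercase hex with no separators
    header=$(od -An -tx1 -N16 "$1" | tr -d ' \n')
    case "$header" in
        # Bytes 0-3: 1f 8b 08 04 (gzip magic, deflate, FEXTRA flag set)
        # Bytes 4-11: mtime, XFL, OS, XLEN (any value)
        # Bytes 12-13: 42 43 ('B' 'C')
        1f8b0804????????????????4243*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_bgzf /reference/GRCh38_hla_decoy_ebv/GRCh38_hla_decoy_ebv.fa.gz && echo BGZF` would print `BGZF` only for a block-compressed file. A plain `gzip` archive normally fails the check because its FEXTRA flag is not set.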

edsu7 added the new-feature label Apr 21, 2023

edsu7 commented Apr 21, 2023

Found this Dependabot issue as well, so we may already have a branch underway that just needs testing and merging:
overture-stack/score#342

dahiyaAD commented

Link to PR - overture-stack/score#363


edsu7 commented May 2, 2023

Ran the following test; the fix looks great.
Setup cmd:

docker run -d -it --name prod-score-client -e ACCESSTOKEN=${token} -e STORAGE_URL=https://api.platform.icgc-argo.org/storage-api -e METADATA_URL=https://api.platform.icgc-argo.org/storage-api --mount type=bind,source=/home/ubuntu/downloads/example_sarek/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/,target=/references overture/score:75ca0a3

Compressed:

docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:1000-20000 --reference-file /references/Homo_sapiens_assembly38.fasta.gz" | wc -l

Uncompressed:

docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:1000-20000 --reference-file /references/Homo_sapiens_assembly38.fasta" | wc -l

Both had the same output:

Running...Viewing...                                                            
Validating repository connection...
5782 
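
Matching line counts are a good sign, but they do not prove the two outputs are byte-identical. A minimal sketch of a stricter comparison, using a hypothetical helper named `summarize` that reports both the line count and a checksum of whatever is piped into it:

```shell
# Hypothetical helper: print "<line count> <CRC>" for the data on stdin.
summarize() {
    tmp=$(mktemp)
    cat > "$tmp"
    # wc pads its output on some platforms, so strip the spaces
    lines=$(wc -l < "$tmp" | tr -d ' ')
    # cksum prints "<CRC> <size>" when reading stdin; keep only the CRC
    crc=$(cksum < "$tmp" | cut -d' ' -f1)
    rm -f "$tmp"
    printf '%s %s\n' "$lines" "$crc"
}
```

Piping each `docker exec ... score-client view ...` command into `summarize` instead of `wc -l` should give the same pair of values for the compressed and uncompressed reference if the outputs truly match. Any status lines the client writes to stdout are included in the checksum.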

One minor issue: since the upgrade, the execution time has gone from:

real	0m32.943s
user	0m0.038s
sys	0m0.168s

to

real	4m58.370s
user	0m0.065s
sys	0m0.179s

Command used for timing:

time docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:1000-20000 --reference-file /references/Homo_sapiens_assembly38.fasta"
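
A single timing can be skewed by caching or network conditions, so repeating the measurement may help confirm the slowdown. A minimal sketch, where `time_runs` is a hypothetical helper that runs a command a given number of times and prints the wall-clock seconds for each run (whole-second resolution only):

```shell
# Hypothetical helper: time_runs <count> <command...>
# Runs the command <count> times and prints elapsed wall-clock seconds per run.
time_runs() {
    n=$1
    shift
    i=1
    while [ "$i" -le "$n" ]; do
        start=$(date +%s)
        "$@" > /dev/null      # discard stdout; only the duration matters here
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}
```

For example, `time_runs 3 docker exec prod-score-client sh -c "score-client view ..."` with the same arguments as above would print three durations, which could then be compared between the old and new images.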

lindaxiang commented

Tested with overture/score:75ca0a3 and observed the same behavior that Edmund documented above.
Apart from the much longer execution time, no other major issues were found.
