-
Notifications
You must be signed in to change notification settings - Fork 739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No output vcf obtained from custom model #869
Comments
Hi! Thanks for the kind words. The issue seems to be that
Can you confirm that this file exists under |
@bioinformaticaomicalabs , yes it should contain some information like this: gsutil cat gs://deepvariant/models/DeepVariant/1.6.1/checkpoints/wgs/example_info.json
{"version": "1.6.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}⏎ |
Can you check if you have |
Thank you very much, that finally resolved the issue. What happens now is that after performing the evaluation (the training, test, and validation sets are the sequencing of the NA12878 sample from the Coriell Institute sequenced three times), I’m getting low recall and precision values. For indels, recall is 0.41 and precision is 0.24, and for SNPs, recall is 0.57 and precision is 0.72. I tried the model you provided for exome sequencing, but I didn’t achieve better results (which is why I decided to create my own model). However, with typical tools (like GATK HaplotypeCaller), I get much better results (indels with 0.8 recall and 0.62 precision, and SNPs with 0.89 recall and 0.97 precision). Do you have any idea why this might be happening and any advice on how to solve it? I really believe that using a variant caller trained with my data should yield better results than GATK HaplotypeCaller |
@bioinformaticaomicalabs , if you have a benchmarking bam file that you can share then I can try to run it on our default model and see if it matches your expectations. However, why training didn't work would require some more investigation on how exactly you are training and preparing your data. |
Hello, the bam file es 13 GB, do you know how can i send it to you? |
@bioinformaticaomicalabs if you have GCP then that would be the best way for us. Any public bucket that you currently use to share data works for us. If the data is uploaded to SRA then you can link it to me but it might take a bit longer in that case. However you prefer, please send it to [email protected] |
@bioinformaticaomicalabs , closing this due to no activity. If you think you need further help, please reopen or send an email. |
ISSUE
First of all, I found DeepVariant to be a very good and innovative tool. I'm considering including it in my exome analysis pipeline. I followed the tutorial (DeepVariant worked correctly with the Complete Genomics model), and I created my own model using Genome in a Bottle samples. To do this, I sequenced the same reference sample three times to use each BAM file for training, validation, and testing. I didn't encounter any errors during the model creation process, but when I tried to test it, the process got stuck at the call_variants step.
Setup
Steps to reproduce:
Command:
Create examples for trainning set
sudo docker run -v "${PWD}/input":"/input" -v "${PWD}/REF":"/ref" -v "${PWD}"/output:"/output" google/deepvariant:"1.6.1" make_examples --mode training --ref "/ref/GRCh38.p14.genome.fa" --reads "/input/26_r_groups.bam" --examples "/output/training_set.gz" --truth_variants "/ref/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.vcf.gz" --confident_regions "/ref/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.bed"
Create examples for validation set
sudo docker run -v "${PWD}/input":"/input" -v "${PWD}/REF":"/ref" -v "${PWD}"/output:"/output" google/deepvariant:"1.6.1" make_examples --mode training --ref "/ref/GRCh38.p14.genome.fa" --reads "/input/27_r_groups.bam" --examples "/output/validation_set.gz" --truth_variants "/ref/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.vcf.gz" --confident_regions "/ref/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.bed"
Trainning Shuffling
python3 scripts/shuffle_tfrecords_beam.py --input_pattern_list=output/training_set.gz --output_pattern_prefix="output/training_shuffled" --output_dataset_name="26" --output_dataset_config_pbtxt="output/training.pbtxt" --job_name=shuffle-tfrecords
Validation Shuffling
python3 scripts/shuffle_tfrecords_beam.py --input_pattern_list=output/validation_set.gz --output_pattern_prefix="output/validation_shuffled" --output_dataset_name="27" --output_dataset_config_pbtxt="output/validation.pbtxt" --job_name=shuffle-tfrecords
Model trainning
sudo docker run -v "${PWD}/input":"/input" -v "${PWD}/REF":"/ref" -v "${PWD}"/output:"/output" google/deepvariant:"1.6.1" train --config=/input/dv_config.py:base --config.train_dataset_pbtxt="/output/training.pbtxt" --config.tune_dataset_pbtxt="/output/validation.pbtxt" --config.num_epochs=10 --config.learning_rate=0.0001 --config.num_validation_examples=0 --strategy=mirrored --experiment_dir="/output/" --config.batch_size=512
Model test
sudo docker run -v "${PWD}/input":"/input" -v "${PWD}/REF":"/ref" -v "${PWD}"/output:"/output" google/deepvariant:"1.6.1" /opt/deepvariant/bin/run_deepvariant --model_type WES --customized_model "/output/checkpoints/ckpt-679" --ref "/ref/GRCh38.p14.genome.fa" --reads "/input/33_r_groups.bam" --output_vcf "/output/33.vcf.gz" --output_gvcf "/output/33.g.vcf.gz" -intermediate_results_dir "/output/intermediate_results_dir" --num_shards 10
Error trace:
` - I0822 07:51:54.272576 127450974123840 make_examples_core.py:301] Task 17/20: Writing example info to /output/intermediate_results_dir/make_examples.tfrecord-00017-of-00020.gz.example_info.json
I0822 07:51:54.272647 127450974123840 make_examples_core.py:2958] example_shape = [100, 221, 7]
I0822 07:51:54.272740 127450974123840 make_examples_core.py:2959] example_channels = [1, 2, 3, 4, 5, 6, 19]
I0822 07:51:54.272911 127450974123840 make_examples_core.py:301] Task 17/20: Found 17451 candidate variants
I0822 07:51:54.272932 127450974123840 make_examples_core.py:301] Task 17/20: Created 18817 examples
I0822 07:52:09.283522 133276175411008 make_examples_core.py:301] Task 8/20: Writing example info to /output/intermediate_results_dir/make_examples.tfrecord-00008-of-00020.gz.example_info.json
I0822 07:52:09.283617 133276175411008 make_examples_core.py:2958] example_shape = [100, 221, 7]
I0822 07:52:09.283712 133276175411008 make_examples_core.py:2959] example_channels = [1, 2, 3, 4, 5, 6, 19]
I0822 07:52:09.283882 133276175411008 make_examples_core.py:301] Task 8/20: Found 17371 candidate variants
I0822 07:52:09.283904 133276175411008 make_examples_core.py:301] Task 8/20: Created 18820 examples
real 34m15.728s
user 624m43.553s
sys 2m24.932s
***** Running the command:*****
time /opt/deepvariant/bin/call_variants --outfile "/output/intermediate_results_dir/call_variants_output.tfrecord.gz" --examples "/output/intermediate_results_dir/[email protected]" --checkpoint "/output/checkpoints/ckpt-679"
/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:
TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).
For more information see: tensorflow/addons#2807
warnings.warn(
I0822 07:52:10.812179 127086447671104 call_variants.py:563] Total 1 writing processes started.
I0822 07:52:10.813103 127086447671104 dv_utils.py:370] From /output/intermediate_results_dir/make_examples.tfrecord-00000-of-00020.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
I0822 07:52:10.813141 127086447671104 call_variants.py:588] Shape of input examples: [100, 221, 7]
I0822 07:52:10.813338 127086447671104 call_variants.py:592] Use saved model: True
Traceback (most recent call last):
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 789, in
app.run(main)
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/absl_py/absl/app.py", line 312, in run
_run_main(main, args)
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/absl_py/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 768, in main
call_variants(
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 598, in call_variants
model_example_shape = dv_utils.get_shape_and_channels_from_json(
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/dv_utils.py", line 367, in get_shape_and_channels_from_json
example_info = json.load(f)
File "/usr/lib/python3.8/json/init.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.8/json/init.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Process ForkProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 454, in post_processing
item = output_queue.get(timeout=180)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 108, in get
raise Empty
_queue.Empty
real 3m2.335s
user 0m7.450s
sys 0m4.274s`
Does the quick start test work on your system?
Yes
The text was updated successfully, but these errors were encountered: