No output vcf obtained from custom model #869

Closed
bioinformaticaomicalabs opened this issue Aug 22, 2024 · 9 comments

@bioinformaticaomicalabs

ISSUE
First of all, I found DeepVariant to be a very good and innovative tool. I'm considering including it in my exome analysis pipeline. I followed the tutorial (DeepVariant worked correctly with the Complete Genomics model), and I created my own model using Genome in a Bottle samples. To do this, I sequenced the same reference sample three times to use each BAM file for training, validation, and testing. I didn't encounter any errors during the model creation process, but when I tried to test it, the process got stuck at the call_variants step.

Setup

  • Operating system: Ubuntu 22.04.4 LTS
  • DeepVariant version: 1.6.1
  • Installation method: Docker
  • Type of data: MGI DNBSEQ 400, exome sequencing.

Steps to reproduce:

  • Command:
    Create examples for the training set
    sudo docker run -v "${PWD}/input":"/input" -v "${PWD}/REF":"/ref" -v "${PWD}"/output:"/output" google/deepvariant:"1.6.1" make_examples --mode training --ref "/ref/GRCh38.p14.genome.fa" --reads "/input/26_r_groups.bam" --examples "/output/training_set.gz" --truth_variants "/ref/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.vcf.gz" --confident_regions "/ref/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.bed"
    Create examples for the validation set
    sudo docker run -v "${PWD}/input":"/input" -v "${PWD}/REF":"/ref" -v "${PWD}"/output:"/output" google/deepvariant:"1.6.1" make_examples --mode training --ref "/ref/GRCh38.p14.genome.fa" --reads "/input/27_r_groups.bam" --examples "/output/validation_set.gz" --truth_variants "/ref/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.vcf.gz" --confident_regions "/ref/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.bed"
    Training set shuffling
    python3 scripts/shuffle_tfrecords_beam.py --input_pattern_list=output/training_set.gz --output_pattern_prefix="output/training_shuffled" --output_dataset_name="26" --output_dataset_config_pbtxt="output/training.pbtxt" --job_name=shuffle-tfrecords
    Validation set shuffling
    python3 scripts/shuffle_tfrecords_beam.py --input_pattern_list=output/validation_set.gz --output_pattern_prefix="output/validation_shuffled" --output_dataset_name="27" --output_dataset_config_pbtxt="output/validation.pbtxt" --job_name=shuffle-tfrecords
    Model training
    sudo docker run -v "${PWD}/input":"/input" -v "${PWD}/REF":"/ref" -v "${PWD}"/output:"/output" google/deepvariant:"1.6.1" train --config=/input/dv_config.py:base --config.train_dataset_pbtxt="/output/training.pbtxt" --config.tune_dataset_pbtxt="/output/validation.pbtxt" --config.num_epochs=10 --config.learning_rate=0.0001 --config.num_validation_examples=0 --strategy=mirrored --experiment_dir="/output/" --config.batch_size=512
    Model test
    sudo docker run -v "${PWD}/input":"/input" -v "${PWD}/REF":"/ref" -v "${PWD}"/output:"/output" google/deepvariant:"1.6.1" /opt/deepvariant/bin/run_deepvariant --model_type WES --customized_model "/output/checkpoints/ckpt-679" --ref "/ref/GRCh38.p14.genome.fa" --reads "/input/33_r_groups.bam" --output_vcf "/output/33.vcf.gz" --output_gvcf "/output/33.g.vcf.gz" -intermediate_results_dir "/output/intermediate_results_dir" --num_shards 10

  • Error trace:
    I0822 07:51:54.272576 127450974123840 make_examples_core.py:301] Task 17/20: Writing example info to /output/intermediate_results_dir/make_examples.tfrecord-00017-of-00020.gz.example_info.json
    I0822 07:51:54.272647 127450974123840 make_examples_core.py:2958] example_shape = [100, 221, 7]
    I0822 07:51:54.272740 127450974123840 make_examples_core.py:2959] example_channels = [1, 2, 3, 4, 5, 6, 19]
    I0822 07:51:54.272911 127450974123840 make_examples_core.py:301] Task 17/20: Found 17451 candidate variants
    I0822 07:51:54.272932 127450974123840 make_examples_core.py:301] Task 17/20: Created 18817 examples
    I0822 07:52:09.283522 133276175411008 make_examples_core.py:301] Task 8/20: Writing example info to /output/intermediate_results_dir/make_examples.tfrecord-00008-of-00020.gz.example_info.json
    I0822 07:52:09.283617 133276175411008 make_examples_core.py:2958] example_shape = [100, 221, 7]
    I0822 07:52:09.283712 133276175411008 make_examples_core.py:2959] example_channels = [1, 2, 3, 4, 5, 6, 19]
    I0822 07:52:09.283882 133276175411008 make_examples_core.py:301] Task 8/20: Found 17371 candidate variants
    I0822 07:52:09.283904 133276175411008 make_examples_core.py:301] Task 8/20: Created 18820 examples

real 34m15.728s
user 624m43.553s
sys 2m24.932s

***** Running the command:*****
time /opt/deepvariant/bin/call_variants --outfile "/output/intermediate_results_dir/call_variants_output.tfrecord.gz" --examples "/output/intermediate_results_dir/[email protected]" --checkpoint "/output/checkpoints/ckpt-679"

/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).

For more information see: tensorflow/addons#2807

warnings.warn(
I0822 07:52:10.812179 127086447671104 call_variants.py:563] Total 1 writing processes started.
I0822 07:52:10.813103 127086447671104 dv_utils.py:370] From /output/intermediate_results_dir/make_examples.tfrecord-00000-of-00020.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
I0822 07:52:10.813141 127086447671104 call_variants.py:588] Shape of input examples: [100, 221, 7]
I0822 07:52:10.813338 127086447671104 call_variants.py:592] Use saved model: True
Traceback (most recent call last):
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 789, in
app.run(main)
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/absl_py/absl/app.py", line 312, in run
_run_main(main, args)
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/absl_py/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 768, in main
call_variants(
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 598, in call_variants
model_example_shape = dv_utils.get_shape_and_channels_from_json(
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/dv_utils.py", line 367, in get_shape_and_channels_from_json
example_info = json.load(f)
File "/usr/lib/python3.8/json/init.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.8/json/init.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Process ForkProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 454, in post_processing
item = output_queue.get(timeout=180)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 108, in get
raise Empty
_queue.Empty

real 3m2.335s
user 0m7.450s
sys 0m4.274s

Does the quick start test work on your system?
Yes

@lucasbrambrink (Collaborator)

Hi!

Thanks for the kind words. The issue seems to be that call_variants expects an example_info.json file alongside the checkpoint:

model_example_info_json = f'{checkpoint_path}/example_info.json'
model_example_shape = dv_utils.get_shape_and_channels_from_json(
    model_example_info_json
)

Can you confirm that this file exists under "/output/checkpoints/ckpt-679"? If it exists, could you paste the content of it here?
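
A quick way to check from the host (a minimal sketch; the host-side paths below mirror the ${PWD}/output mount used in your commands and may need adjusting for your layout):

# Print any example_info.json found near the checkpoints; it should contain non-empty JSON.
find output/checkpoints -name "example_info.json" -print -exec cat {} \;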

@bioinformaticaomicalabs (Author)

> Can you confirm that this file exists under "/output/checkpoints/ckpt-679"? If it exists, could you paste the content of it here?

Hello, first, thank you very much for your help. Yes, the example_info.json exists, but it does not have any content. Should it contain some information?
(Screenshot attached: "Captura desde 2024-08-27 08-57-05")

@kishwarshafin (Collaborator)

@bioinformaticaomicalabs , yes it should contain some information like this:

gsutil cat gs://deepvariant/models/DeepVariant/1.6.1/checkpoints/wgs/example_info.json
{"version": "1.6.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}⏎

@kishwarshafin (Collaborator)

Can you check whether an example_info.json was written alongside your training and validation example sets, and whether the two files are identical? If they are the same, you can copy one into the checkpoint directory and use it; see the sketch below. The training loop is supposed to copy example_info.json from the training folder to the checkpoint output directory, so I'm not sure why it was missing in your setup.
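
A rough sketch of that copy step (the example_info.json file names below are assumed from the --examples paths in your make_examples commands, and the checkpoint directory from your run_deepvariant command; adjust if your layout differs):

# Compare the JSON files written next to the training and validation examples;
# if they are identical, place a copy next to the checkpoints so call_variants
# can read the expected shape/channels from it.
# (file names assumed from the --examples flags above)
diff output/training_set.gz.example_info.json output/validation_set.gz.example_info.json \
  && cp output/training_set.gz.example_info.json output/checkpoints/example_info.json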

@bioinformaticaomicalabs (Author)

Thank you very much, that finally resolved the issue. What happens now is that after performing the evaluation (the training, validation, and test sets are three separate sequencing runs of the NA12878 sample from the Coriell Institute), I'm getting low recall and precision values: for indels, recall is 0.41 and precision is 0.24; for SNPs, recall is 0.57 and precision is 0.72. I tried the model you provide for exome sequencing, but I didn't achieve better results (which is why I decided to train my own model). However, with typical tools (like GATK HaplotypeCaller) I get much better results: indels with 0.8 recall and 0.62 precision, and SNPs with 0.89 recall and 0.97 precision. Do you have any idea why this might be happening, and any advice on how to solve it? I really believe that a variant caller trained on my own data should yield better results than GATK HaplotypeCaller.
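
For reference, one common way to compute such precision/recall numbers against the GIAB truth set is hap.py; below is a minimal sketch under that assumption (hap.py is not discussed elsewhere in this thread, and exome_targets.bed is a placeholder for the capture-kit regions; the other paths reuse files from the commands above):

# Benchmark the DeepVariant VCF against the GIAB truth set, restricted to the
# confident regions and (placeholder) exome target regions.
sudo docker run -v "${PWD}":"/data" jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
  /data/REF/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.vcf.gz \
  /data/output/33.vcf.gz \
  -f /data/REF/HG001_GRCh38_1_22_v4.2.1_benchmark_ROCHE.bed \
  -T /data/REF/exome_targets.bed \
  -r /data/REF/GRCh38.p14.genome.fa \
  -o /data/output/happy_33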

@kishwarshafin (Collaborator)

@bioinformaticaomicalabs, if you have a benchmarking BAM file that you can share, I can try running it on our default model and see if it matches your expectations. However, understanding why training didn't work would require more investigation into how exactly you are training and preparing your data.

@bioinformaticaomicalabs (Author)

Hello, the BAM file is 13 GB. Do you know how I can send it to you?

@kishwarshafin (Collaborator)

@bioinformaticaomicalabs, if you have GCP, that would be the best way for us; any public bucket that you currently use to share data works. If the data is uploaded to SRA, you can link it to me, but it might take a bit longer in that case. However you prefer, please send it to [email protected]

@kishwarshafin (Collaborator)

@bioinformaticaomicalabs , closing this due to no activity. If you think you need further help, please reopen or send an email.
