Methane oxidation case study fails to train via "dp train methane_param.json" #3

@damonturney

Installing DeePMD-kit went smoothly with:
conda create -n deepmd deepmd-kit lammps horovod -c conda-forge

Downloading the methane oxidation case study dataset from github.com/tongzhugroup/Chapter13-tutorial also went smoothly with:
git clone https://github.com/tongzhugroup/Chapter13-tutorial

But "dp train methane_param.json > logfile 2> logfile" has crashed each of the 2 or 3 times I've tried it.
Here are the final 100 lines of the log file:

[2025-05-26 20:28:09,140] DEEPMD INFO batch 14600: total wall time = 7.12 s
[2025-05-26 20:28:16,083] DEEPMD INFO batch 14700: trn: rmse = 8.38e+00, rmse_e = 1.01e-01, rmse_f = 2.70e-01, lr = 9.64e-04
[2025-05-26 20:28:16,084] DEEPMD INFO batch 14700: total wall time = 6.94 s
[2025-05-26 20:28:23,089] DEEPMD INFO batch 14800: trn: rmse = 9.34e+00, rmse_e = 1.18e-01, rmse_f = 3.01e-01, lr = 9.63e-04
[2025-05-26 20:28:23,089] DEEPMD INFO batch 14800: total wall time = 7.01 s
[2025-05-26 20:28:30,309] DEEPMD INFO batch 14900: trn: rmse = 2.54e+01, rmse_e = 3.58e-01, rmse_f = 8.18e-01, lr = 9.63e-04
[2025-05-26 20:28:30,309] DEEPMD INFO batch 14900: total wall time = 7.22 s
[2025-05-26 20:28:37,339] DEEPMD INFO batch 15000: trn: rmse = 2.64e+01, rmse_e = 2.52e-01, rmse_f = 8.51e-01, lr = 9.63e-04
[2025-05-26 20:28:37,339] DEEPMD INFO batch 15000: total wall time = 7.03 s
[2025-05-26 20:28:37,433] DEEPMD INFO saved checkpoint model.ckpt
[2025-05-26 20:28:44,569] DEEPMD INFO batch 15100: trn: rmse = 7.51e+00, rmse_e = 1.53e-01, rmse_f = 2.42e-01, lr = 9.63e-04
[2025-05-26 20:28:44,569] DEEPMD INFO batch 15100: total wall time = 7.23 s
2025-05-26 20:28:49.670824: F external/local_xla/xla/tsl/lib/monitoring/counter.h:205] Check failed: 0 <= step (0 vs. -31781)Must not decrement cumulative metrics.
[Mac:07228] *** Process received signal ***
[Mac:07228] Signal: Abort trap: 6 (6)
[Mac:07228] Signal code: (0)
[Mac:07228] [ 0] 0 libsystem_platform.dylib 0x000000018f053624 _sigtramp + 56
[Mac:07228] [ 1] 0 libsystem_pthread.dylib 0x000000018f01988c pthread_kill + 296
[Mac:07228] [ 2] 0 libsystem_c.dylib 0x000000018ef22c60 abort + 124
[Mac:07228] [ 3] 0 libtensorflow_framework.2.dylib 0x00000001117ea3a8 _ZN3tsl8internal15LogMessageFatalD2Ev + 36
[Mac:07228] [ 4] 0 libtensorflow_framework.2.dylib 0x00000001117ea3c4 _ZTv0_n24_N3tsl8internal15LogMessageFatalD1Ev + 0
[Mac:07228] [ 5] 0 libtensorflow_framework.2.dylib 0x00000001112e822c _ZN10tensorflow7metrics19UpdateGraphExecTimeEy + 284
[Mac:07228] [ 6] 0 libtensorflow_cc.2.dylib 0x000000030e2b764c _ZN10tensorflow13DirectSession11RunInternalExRKNS_10RunOptionsEPNS_18CallFrameInterfaceEPNS0_16ExecutorsAndKeysEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE + 4044
[Mac:07228] [ 7] 0 libtensorflow_cc.2.dylib 0x000000030e2b84cc _ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKNSt3__16vectorINS4_4pairINS4_12basic_stringIcNS4_11char_traitsIcEENS4_9allocatorIcEEEENS_6TensorEEENSA_ISE_EEEERKNS5_ISC_NSA_ISC_EEEESM_PNS5_ISD_NSA_ISD_EEEEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE + 1280
[Mac:07228] [ 8] 0 libtensorflow_cc.2.dylib 0x000000030e2b7fa4 _ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKNSt3__16vectorINS4_4pairINS4_12basic_stringIcNS4_11char_traitsIcEENS4_9allocatorIcEEEENS_6TensorEEENSA_ISE_EEEERKNS5_ISC_NSA_ISC_EEEESM_PNS5_ISD_NSA_ISD_EEEEPNS_11RunMetadataE + 48
[Mac:07228] [ 9] 0 _pywrap_tensorflow_internal.so 0x000000010c2fecd0 _ZN10tensorflow10SessionRef3RunERKNS_10RunOptionsERKNSt3__16vectorINS4_4pairINS4_12basic_stringIcNS4_11char_traitsIcEENS4_9allocatorIcEEEENS_6TensorEEENSA_ISE_EEEERKNS5_ISC_NSA_ISC_EEEESM_PNS5_ISD_NSA_ISD_EEEEPNS_11RunMetadataE + 316
[Mac:07228] [10] 0 libtensorflow_cc.2.dylib 0x0000000302089d5c _ZL13TF_Run_HelperPN10tensorflow7SessionEPKcPK9TF_BufferRKNSt3__16vectorINS7_4pairINS7_12basic_stringIcNS7_11char_traitsIcEENS7_9allocatorIcEEEENS_6TensorEEENSD_ISH_EEEERKNS8_ISF_NSD_ISF_EEEEPP9TF_TensorSP_PS4_P10TSL_Status + 1508
[Mac:07228] [11] 0 libtensorflow_cc.2.dylib 0x0000000302094f84 TF_SessionRun + 908
[Mac:07228] [12] 0 _pywrap_tensorflow_internal.so 0x000000010c2fbfd4 _ZN10tensorflow28TF_SessionRun_wrapper_helperEP10TF_SessionPKcPK9TF_BufferRKNSt3__16vectorI9TF_OutputNS7_9allocatorIS9_EEEERKNS8_IP7_objectNSA_ISG_EEEESE_RKNS8_IP12TF_OperationNSA_ISM_EEEEPS4_P10TSL_StatusPSI_ + 1132
[Mac:07228] [13] 0 _pywrap_tensorflow_internal.so 0x000000010c2fc8ec _ZN10tensorflow21TF_SessionRun_wrapperEP10TF_SessionPK9TF_BufferRKNSt3__16vectorI9TF_OutputNS5_9allocatorIS7_EEEERKNS6_IP7_objectNS8_ISE_EEEESC_RKNS6_IP12TF_OperationNS8_ISK_EEEEPS2_P10TSL_StatusPSG_ + 56
[Mac:07228] [14] 0 _pywrap_tf_session.so 0x0000000116067080 _ZNO8pybind116detail15argument_loaderIJP10TF_SessionP9TF_BufferRKNS_6handleERKNSt3__16vectorI9TF_OutputNS9_9allocatorISB_EEEERKNSA_IP12TF_OperationNSC_ISI_EEEES5_EE4callINS_6objectENS0_9void_typeERZL32pybind11_init__pywrap_tf_sessionRNS_7module_EE4$_46EENS9_9enable_ifIXntsr3std7is_voidIT_EE5valueESW_E4typeEOT1_ + 728
[Mac:07228] [15] 0 _pywrap_tf_session.so 0x0000000116066c80 _ZZN8pybind1112cpp_function10initializeIZL32pybind11_init__pywrap_tf_sessionRNS_7module_EE4$_46NS_6objectEJP10TF_SessionP9TF_BufferRKNS_6handleERKNSt3__16vectorI9TF_OutputNSD_9allocatorISF_EEEERKNSE_IP12TF_OperationNSG_ISM_EEEES9_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE_8__invokeES17_ + 172
[Mac:07228] [16] 0 _pywrap_tf_session.so 0x000000011601e160 _ZN8pybind1112cpp_function10dispatcherEP7_objectS2_S2_ + 4508
[Mac:07228] [17] 0 python3.11 0x0000000100422938 cfunction_call + 124
[Mac:07228] [18] 0 python3.11 0x00000001003cabc0 _PyObject_MakeTpCall + 332
[Mac:07228] [19] 0 python3.11 0x00000001004ce228 _PyEval_EvalFrameDefault + 45376
[Mac:07228] [20] 0 python3.11 0x00000001004d2708 _PyEval_Vector + 184
[Mac:07228] [21] 0 python3.11 0x00000001004d005c _PyEval_EvalFrameDefault + 53108
[Mac:07228] [22] 0 python3.11 0x00000001004d2708 _PyEval_Vector + 184
[Mac:07228] [23] 0 python3.11 0x00000001003ce6a8 method_vectorcall + 172
[Mac:07228] [24] 0 python3.11 0x00000001003cb358 _PyVectorcall_Call + 132
[Mac:07228] [25] 0 python3.11 0x00000001004d005c _PyEval_EvalFrameDefault + 53108
[Mac:07228] [26] 0 python3.11 0x00000001004d2708 _PyEval_Vector + 184
[Mac:07228] [27] 0 python3.11 0x00000001003cb358 _PyVectorcall_Call + 132
[Mac:07228] [28] 0 python3.11 0x00000001004d005c _PyEval_EvalFrameDefault + 53108
[Mac:07228] [29] 0 python3.11 0x00000001004c210c PyEval_EvalCode + 204
[Mac:07228] *** End of error message ***
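
A note on how the log above was captured: "> logfile 2> logfile" opens the same file twice, so stdout and stderr each write from their own offset and can overwrite one another. This is unrelated to the crash itself, but it may scramble or drop lines in the captured log. A minimal sketch of a safer capture follows; fake_train is a hypothetical stand-in for "dp train methane_param.json" so the snippet runs anywhere, and the temporary file stands in for logfile.

```shell
#!/bin/sh
# Hypothetical stand-in for "dp train methane_param.json": it writes one
# line to stdout and one line to stderr, as a real training run would.
fake_train() {
    echo "DEEPMD INFO batch 100: total wall time = 7.00 s"
    echo "fatal: simulated crash message" >&2
}

# Temporary file used here in place of "logfile".
log=$(mktemp)

# "> file 2>&1" opens the file once and points stderr at the same open
# file, so both streams share one offset and no output is overwritten.
fake_train > "$log" 2>&1

# Print the last 100 lines of the combined log.
tail -n 100 "$log"
```

With the real command, the equivalent would be "dp train methane_param.json > logfile 2>&1", followed by "tail -n 100 logfile".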
