Commit ce3d8f9

1.5.0 release (#575)
* 1.5.0 release
* Clean up unrolled integral kernels
* Add sm 80 and 120 real code for cuda13 release
1 parent: 3231824

7 files changed: +446 −476 lines


.github/workflows/pypi_wheel.yml

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ jobs:
       - name: Build wheels
         run: |
           docker run --rm -v ${{ github.workspace }}:/gpu4pyscf:rw --workdir=/gpu4pyscf \
-            -e CMAKE_CONFIGURE_ARGS="-DCUDA_ARCHITECTURES=80-virtual -DBUILD_LIBXC=OFF" \
+            -e CMAKE_CONFIGURE_ARGS="-DCUDA_ARCHITECTURES=80;120-real -DBUILD_LIBXC=OFF" \
             ${{ env.img }} \
             bash -exc 'sh /gpu4pyscf/builder/build_wheels.sh'
       - name: List available wheels

CHANGELOG

Lines changed: 45 additions & 0 deletions
@@ -1,3 +1,48 @@
+v1.5.0 (2025-11-24)
+-------------------
+* New Features (PBC systems)
+  - PBC GDF extended to k-mesh computations; k-point GDF integrals stored in host memory with compression.
+  - Multigrid algorithm supports PBC k-point SCF and band structure calculations.
+  - Added the .analyze() method for PBC gamma-point and k-mesh DFT to summarize results and charge populations.
+  - Fermi and Gaussian smearing for PBC and molecular DFT.
+  - PBC RSJK algorithm for J/K matrix evaluation (J via the MD J-engine; K via Rys quadrature).
+  - Analytical nuclear gradients using RSJK and AFTDF for PBC gamma-point HF, k-mesh HF, and hybrid DFT.
+  - Stress tensor evaluation using RSJK and AFTDF for PBC gamma-point HF, k-mesh HF, and hybrid DFT.
+  - Geometry optimizer for PBC DFT.
+* New Features (molecular systems)
+  - Support for QMMM point charges and external electric fields.
+  - 3c2e integrals contracted with density matrices and auxiliary vectors for memory-efficient DF Coulomb matrix evaluation.
+  - DFT Hessian second-derivative grid response.
+  - Minimum energy crossing point (MECP) search functionality.
+  - PCM support for TDDFT derivative coupling calculations.
+  - Basic GKS and two-component numerical integration, including GPU-accelerated multi-collinear functionals.
+  - Multi-collinear spin-flip TDA/TDDFT excitation energies and analytical gradients.
+* Improvements (PBC systems)
+  - Linear-dependency handling for basis functions in molecular and PBC DFT calculations.
+  - Refactored PBC nuclear gradients for more efficient GTH pseudopotential evaluation.
+  - Faster GTH pseudopotential evaluation using the multigrid algorithm on large systems.
+* Improvements (molecular systems)
+  - Optimized DFT numerical integration memory usage, achieving ~20% performance gains.
+  - Refactored and optimized the molecular four-center-integral J/K builder, achieving a 50-100% speed-up.
+  - Improved phase determination method for NACV.
+  - More numerically stable Hessian integrals for large-exponent GTOs.
+  - MD J-engine optimized with reduced CUDA register pressure.
+  - Third-order XC derivatives can be evaluated on GPU (requires gpu4pyscf-libxc 0.7).
+  - Default auxbasis_response level increased to 2 for Hessian calculations with DF integrals.
+  - Dimension checks for eigh, enabling a scipy fallback for large arrays (size > 21350).
+* Fixes
+  - Handle eps=inf in solvent models.
+  - Fixed an edge case in EDA electrostatics when cross-fragment nocc is 1.
+  - Fixed EDA crash caused by fragments accessing JK matrices after DF 3-index tensors were freed.
+  - Workaround for CUDA 13 compiler bugs affecting ECP kernels (compiler optimizations disabled).
+  - Fixed molecular and PBC 3c2e integral dimension issues for generally contracted basis sets.
+  - Removed pre-allocated streams that caused inconsistent synchronization.
+  - Fixed the SMD Hessian.
+  - Fixed a UHF crash when level_shift is enabled.
+* API updates
+  - Fixed the to_gpu/to_cpu interface in SMD, TDDFT, and PCM-TDDFT.
+  - Added a from_cpu hook on the GPU side, allowing PySCF to invoke this hook in its to_gpu method.
+
 v1.4.3 (2025-08-20)
 -------------------
 * New Features

README.md

Lines changed: 4 additions & 23 deletions
@@ -19,6 +19,7 @@ Then, install the appropriate package based on your CUDA version:
 ----------------| --------------------------------------|----------------------------------|
 | **CUDA 11.x** | ```pip3 install gpu4pyscf-cuda11x``` | ```pip3 install cutensor-cu11``` |
 | **CUDA 12.x** | ```pip3 install gpu4pyscf-cuda12x``` | ```pip3 install cutensor-cu12``` |
+| **CUDA 13.x** | ```pip3 install gpu4pyscf-cuda13x``` | ```pip3 install cutensor-cu13``` |
 
 The versions of CuPy and cuTENSOR are strongly interdependent and should not be combined arbitrarily.
 The recommended combinations include:
@@ -80,19 +81,17 @@ The following features are still in the experimental stage
 
 Limitations
 --------
-- Rys roots up to 9 for density fitting scheme and direct scf scheme;
 - Atomic basis up to g orbitals;
 - Auxiliary basis up to i orbitals;
 - Density fitting scheme up to ~168 atoms with def2-tzvpd basis, bounded by CPU memory;
-- meta-GGA without density laplacian;
+- meta-GGA with density laplacian;
 - Double hybrid functionals are not supported;
 - Hessian of TDDFT is not supported;
 
 Examples
 --------
 ```python
 import pyscf
-from gpu4pyscf.dft import rks
 
 atom ='''
 O 0.0000000000 -0.0000000000 0.1174000000
@@ -101,34 +100,16 @@ H 0.7570000000 0.0000000000 -0.4696000000
 '''
 
 mol = pyscf.M(atom=atom, basis='def2-tzvpp')
-mf = rks.RKS(mol, xc='LDA').density_fit()
+mf = rks.RKS(mol, xc='b3lyp').density_fit().to_gpu() # move PySCF object to GPU4PySCF object
 
 e_dft = mf.kernel() # compute total energy
 print(f"total energy = {e_dft}")
 
-g = mf.nuc_grad_method()
+g = mf.Gradients()
 g_dft = g.kernel() # compute analytical gradient
 
 h = mf.Hessian()
 h_dft = h.kernel() # compute analytical Hessian
-
-```
-
-`to_gpu` is supported since PySCF 2.5.0
-```python
-import pyscf
-from pyscf.dft import rks
-
-atom ='''
-O 0.0000000000 -0.0000000000 0.1174000000
-H -0.7570000000 -0.0000000000 -0.4696000000
-H 0.7570000000 0.0000000000 -0.4696000000
-'''
-
-mol = pyscf.M(atom=atom, basis='def2-tzvpp')
-mf = rks.RKS(mol, xc='LDA').density_fit().to_gpu() # move PySCF object to GPU4PySCF object
-e_dft = mf.kernel() # compute total energy
-
 ```
 
 Find more examples in [gpu4pyscf/examples](https://github.com/pyscf/gpu4pyscf/tree/master/examples)

gpu4pyscf/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-__version__ = '1.4.3'
+__version__ = '1.5.0'
 
 from . import _patch_pyscf
 
gpu4pyscf/lib/gvhf-md/unrolled_md_j_4dm.cu

Lines changed: 17 additions & 17 deletions
@@ -136,10 +136,10 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
 for (int batch_kl = 0; batch_kl < 21; ++batch_kl) {
     int task_kl0 = blockIdx.y * 336 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     if (pair_ij_mapping == pair_kl_mapping && task_ij0+16 <= task_kl0) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -159,7 +159,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
     int kl_loc0 = pair_kl_loc[task_kl];
     if (pair_ij_mapping == pair_kl_mapping) {
         if (task_ij == task_kl) fac *= .5;
-        if (task_ij < task_kl) fac = 0.;
+        else if (task_ij < task_kl) fac = 0.;
     }
     __syncthreads();
     double xij = Rp_cache[tx+0];
@@ -443,7 +443,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
 for (int batch_kl = 0; batch_kl < 21; ++batch_kl) {
     int task_kl0 = blockIdx.y * 336 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -821,10 +821,10 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
 for (int batch_kl = 0; batch_kl < 6; ++batch_kl) {
     int task_kl0 = blockIdx.y * 96 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     if (pair_ij_mapping == pair_kl_mapping && task_ij0+16 <= task_kl0) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -844,7 +844,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
     int kl_loc0 = pair_kl_loc[task_kl];
     if (pair_ij_mapping == pair_kl_mapping) {
         if (task_ij == task_kl) fac *= .5;
-        if (task_ij < task_kl) fac = 0.;
+        else if (task_ij < task_kl) fac = 0.;
     }
     __syncthreads();
     double xij = Rp_cache[tx+0];
@@ -1593,7 +1593,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
 for (int batch_kl = 0; batch_kl < 16; ++batch_kl) {
     int task_kl0 = blockIdx.y * 256 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -2127,7 +2127,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
 for (int batch_kl = 0; batch_kl < 10; ++batch_kl) {
     int task_kl0 = blockIdx.y * 160 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -3160,10 +3160,10 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
 for (int batch_kl = 0; batch_kl < 4; ++batch_kl) {
     int task_kl0 = blockIdx.y * 64 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     if (pair_ij_mapping == pair_kl_mapping && task_ij0+16 <= task_kl0) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -3183,7 +3183,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
     int kl_loc0 = pair_kl_loc[task_kl];
     if (pair_ij_mapping == pair_kl_mapping) {
         if (task_ij == task_kl) fac *= .5;
-        if (task_ij < task_kl) fac = 0.;
+        else if (task_ij < task_kl) fac = 0.;
     }
     __syncthreads();
     double xij = Rp_cache[tx+0];
@@ -5337,7 +5337,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
 for (int batch_kl = 0; batch_kl < 21; ++batch_kl) {
     int task_kl0 = blockIdx.y * 336 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -5988,7 +5988,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
 for (int batch_kl = 0; batch_kl < 6; ++batch_kl) {
     int task_kl0 = blockIdx.y * 96 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -7884,7 +7884,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 2) {
 for (int batch_kl = 0; batch_kl < 24; ++batch_kl) {
     int task_kl0 = blockIdx.y * 384 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -8470,7 +8470,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 2) {
 for (int batch_kl = 0; batch_kl < 9; ++batch_kl) {
     int task_kl0 = blockIdx.y * 144 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -10111,7 +10111,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 2) {
 for (int batch_kl = 0; batch_kl < 12; ++batch_kl) {
     int task_kl0 = blockIdx.y * 192 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
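The kernel cleanup above repeatedly swaps `continue` for `break` and guards the symmetry-factor comparison with `else if`. A minimal sketch of the reasoning, under the assumption suggested by the diff that `task_kl0 = blockIdx.y * TILE + batch_kl * 16` grows monotonically with `batch_kl`: once one batch index runs past `npairs_kl`, every later batch does too, so the block can leave the loop instead of re-testing each remaining iteration; likewise the `task_ij < task_kl` test only matters when `task_ij != task_kl`. The code below is illustrative only, not the generated kernel; names such as `batch_loop_sketch` and `symmetry_factor` are made up here, while `npairs_kl`, `pair_kl_mapping`, `task_ij`, and `task_kl` follow the diff.

```cuda
// Illustrative sketch of the batch-loop pattern in the unrolled MD-J kernels,
// showing why `break` can replace `continue` when the batch index is monotone.
__global__ void batch_loop_sketch(const int *pair_kl_mapping, int npairs_kl)
{
    const int NBATCH = 21;   // matches the "< 21" unrolled variant above
    const int TILE   = 336;  // 21 batches * 16 pairs per batch
    for (int batch_kl = 0; batch_kl < NBATCH; ++batch_kl) {
        int task_kl0 = blockIdx.y * TILE + batch_kl * 16;
        if (task_kl0 >= npairs_kl) {
            break;  // was `continue`: every remaining batch_kl would also be
                    // out of range, so continuing only wastes loop iterations
        }
        int pair_kl0 = pair_kl_mapping[task_kl0];
        // ... evaluate integrals for this batch ...
        (void)pair_kl0;
    }
}

// Symmetry factor: when task_ij == task_kl the pair is double-counted and
// halved; the task_ij < task_kl zeroing only applies otherwise, hence `else if`
// avoids a redundant comparison.
__device__ double symmetry_factor(int task_ij, int task_kl, double fac)
{
    if (task_ij == task_kl)     fac *= .5;
    else if (task_ij < task_kl) fac = 0.;
    return fac;
}
```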
