Commit ce3d8f9

1.5.0 release (#575)
* 1.5.0 release
* Clean up unrolled integral kernels
* Add sm 80 and 120 real code for cuda13 release
1 parent: 3231824

7 files changed: +446 −476 lines


.github/workflows/pypi_wheel.yml

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ jobs:
       - name: Build wheels
         run: |
           docker run --rm -v ${{ github.workspace }}:/gpu4pyscf:rw --workdir=/gpu4pyscf \
-            -e CMAKE_CONFIGURE_ARGS="-DCUDA_ARCHITECTURES=80-virtual -DBUILD_LIBXC=OFF" \
+            -e CMAKE_CONFIGURE_ARGS="-DCUDA_ARCHITECTURES=80;120-real -DBUILD_LIBXC=OFF" \
             ${{ env.img }} \
             bash -exc 'sh /gpu4pyscf/builder/build_wheels.sh'
       - name: List available wheels

CHANGELOG

Lines changed: 45 additions & 0 deletions
@@ -1,3 +1,48 @@
+v1.5.0 (2025-11-24)
+-------------------
+* New Features (PBC systems)
+  - PBC GDF extended to k-mesh computations; k-point GDF integrals stored in host memory with compression.
+  - Multigrid algorithm supports PBC k-point SCF and band structure calculations.
+  - Added the .analyze() method for PBC gamma-point and k-mesh DFT to summarize results and charge populations.
+  - Fermi and Gaussian smearing for PBC and molecular DFT.
+  - PBC RSJK algorithm for J/K matrix evaluation (J via the MD J-engine; K via Rys quadrature).
+  - Analytical nuclear gradients using RSJK and AFTDF for PBC gamma-point HF, k-mesh HF, and hybrid DFT.
+  - Stress tensor evaluation using RSJK and AFTDF for PBC gamma-point HF, k-mesh HF, and hybrid DFT.
+  - Geometry optimizer for PBC DFT.
+* New Features (molecular systems)
+  - Support for QMMM point charges and external electric fields.
+  - 3c2e integrals contracted with density matrices and auxiliary vectors for memory-efficient DF Coulomb matrix evaluation.
+  - DFT Hessian second-derivative grid response.
+  - Minimum energy crossing point (MECP) search functionality.
+  - PCM support for TDDFT derivative coupling calculations.
+  - Basic GKS and two-component numerical integration, including GPU-accelerated multi-collinear functionals.
+  - Multi-collinear spin-flip TDA/TDDFT excitation energies and analytical gradients.
+* Improvements (PBC systems)
+  - Linear-dependency handling for basis functions in molecular and PBC DFT calculations.
+  - Refactored PBC nuclear gradients for more efficient GTH pseudopotential evaluation.
+  - Faster GTH pseudopotential evaluation using the multigrid algorithm on large systems.
+* Improvements (molecular systems)
+  - Optimized DFT numerical integration memory usage, achieving ~20% performance gains.
+  - Refactored and optimized the molecular four-center-integral J/K builder, achieving a 50-100% speed-up.
+  - Improved phase determination method for NACV.
+  - More numerically stable Hessian integrals for large-exponent GTOs.
+  - MD J-engine optimized with reduced CUDA register pressure.
+  - Third-order XC derivatives can be evaluated on GPU (requires gpu4pyscf-libxc 0.7).
+  - Default auxbasis_response level increased to 2 for Hessian calculations with DF integrals.
+  - Dimension checks for eigh, enabling a scipy fallback for large arrays (size > 21350).
+* Fixes
+  - Handle eps=inf in solvent models.
+  - Fixed an edge case in EDA electrostatics when cross-fragment nocc is 1.
+  - Fixed EDA crash caused by fragments accessing JK matrices after DF 3-index tensors were freed.
+  - Workaround for CUDA 13 compiler bugs affecting ECP kernels (compiler optimizations disabled).
+  - Fixed molecular and PBC 3c2e integral dimension issues for generally contracted basis sets.
+  - Removed pre-allocated streams that caused inconsistent synchronization.
+  - Fixed the SMD Hessian.
+  - Fixed a UHF crash when level_shift is enabled.
+* API updates
+  - Fixed the to_gpu/to_cpu interface in SMD, TDDFT, and PCM-TDDFT.
+  - Added a from_cpu hook on the GPU side, allowing PySCF to invoke this hook in its to_gpu method.
+
 v1.4.3 (2025-08-20)
 -------------------
 * New Features

README.md

Lines changed: 4 additions & 23 deletions
@@ -19,6 +19,7 @@ Then, install the appropriate package based on your CUDA version:
 ----------------| --------------------------------------|----------------------------------|
 | **CUDA 11.x** | ```pip3 install gpu4pyscf-cuda11x``` | ```pip3 install cutensor-cu11``` |
 | **CUDA 12.x** | ```pip3 install gpu4pyscf-cuda12x``` | ```pip3 install cutensor-cu12``` |
+| **CUDA 13.x** | ```pip3 install gpu4pyscf-cuda13x``` | ```pip3 install cutensor-cu13``` |
 
 The versions of CuPy and cuTENSOR are strongly interdependent and should not be combined arbitrarily.
 The recommended combinations include:
@@ -80,19 +81,17 @@ The following features are still in the experimental stage
 
 Limitations
 --------
-- Rys roots up to 9 for density fitting scheme and direct scf scheme;
 - Atomic basis up to g orbitals;
 - Auxiliary basis up to i orbitals;
 - Density fitting scheme up to ~168 atoms with def2-tzvpd basis, bounded by CPU memory;
-- meta-GGA without density laplacian;
+- meta-GGA with density laplacian;
 - Double hybrid functionals are not supported;
 - Hessian of TDDFT is not supported;
 
 Examples
 --------
 ```python
 import pyscf
-from gpu4pyscf.dft import rks
 
 atom ='''
 O 0.0000000000 -0.0000000000 0.1174000000
@@ -101,34 +100,16 @@ H 0.7570000000 0.0000000000 -0.4696000000
 '''
 
 mol = pyscf.M(atom=atom, basis='def2-tzvpp')
-mf = rks.RKS(mol, xc='LDA').density_fit()
+mf = rks.RKS(mol, xc='b3lyp').density_fit().to_gpu() # move PySCF object to GPU4PySCF object
 
 e_dft = mf.kernel() # compute total energy
 print(f"total energy = {e_dft}")
 
-g = mf.nuc_grad_method()
+g = mf.Gradients()
 g_dft = g.kernel() # compute analytical gradient
 
 h = mf.Hessian()
 h_dft = h.kernel() # compute analytical Hessian
-
-```
-
-`to_gpu` is supported since PySCF 2.5.0
-```python
-import pyscf
-from pyscf.dft import rks
-
-atom ='''
-O 0.0000000000 -0.0000000000 0.1174000000
-H -0.7570000000 -0.0000000000 -0.4696000000
-H 0.7570000000 0.0000000000 -0.4696000000
-'''
-
-mol = pyscf.M(atom=atom, basis='def2-tzvpp')
-mf = rks.RKS(mol, xc='LDA').density_fit().to_gpu() # move PySCF object to GPU4PySCF object
-e_dft = mf.kernel() # compute total energy
-
 ```
 
 Find more examples in [gpu4pyscf/examples](https://github.com/pyscf/gpu4pyscf/tree/master/examples)

gpu4pyscf/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-__version__ = '1.4.3'
+__version__ = '1.5.0'
 
 from . import _patch_pyscf
 
gpu4pyscf/lib/gvhf-md/unrolled_md_j_4dm.cu

Lines changed: 17 additions & 17 deletions
@@ -136,10 +136,10 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
 for (int batch_kl = 0; batch_kl < 21; ++batch_kl) {
     int task_kl0 = blockIdx.y * 336 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     if (pair_ij_mapping == pair_kl_mapping && task_ij0+16 <= task_kl0) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -159,7 +159,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
     int kl_loc0 = pair_kl_loc[task_kl];
     if (pair_ij_mapping == pair_kl_mapping) {
         if (task_ij == task_kl) fac *= .5;
-        if (task_ij < task_kl) fac = 0.;
+        else if (task_ij < task_kl) fac = 0.;
     }
     __syncthreads();
     double xij = Rp_cache[tx+0];
@@ -443,7 +443,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
 for (int batch_kl = 0; batch_kl < 21; ++batch_kl) {
     int task_kl0 = blockIdx.y * 336 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -821,10 +821,10 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
 for (int batch_kl = 0; batch_kl < 6; ++batch_kl) {
     int task_kl0 = blockIdx.y * 96 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     if (pair_ij_mapping == pair_kl_mapping && task_ij0+16 <= task_kl0) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -844,7 +844,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
     int kl_loc0 = pair_kl_loc[task_kl];
     if (pair_ij_mapping == pair_kl_mapping) {
         if (task_ij == task_kl) fac *= .5;
-        if (task_ij < task_kl) fac = 0.;
+        else if (task_ij < task_kl) fac = 0.;
     }
     __syncthreads();
     double xij = Rp_cache[tx+0];
@@ -1593,7 +1593,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 8) {
 for (int batch_kl = 0; batch_kl < 16; ++batch_kl) {
     int task_kl0 = blockIdx.y * 256 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -2127,7 +2127,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
 for (int batch_kl = 0; batch_kl < 10; ++batch_kl) {
     int task_kl0 = blockIdx.y * 160 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -3160,10 +3160,10 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
 for (int batch_kl = 0; batch_kl < 4; ++batch_kl) {
     int task_kl0 = blockIdx.y * 64 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     if (pair_ij_mapping == pair_kl_mapping && task_ij0+16 <= task_kl0) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -3183,7 +3183,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
     int kl_loc0 = pair_kl_loc[task_kl];
     if (pair_ij_mapping == pair_kl_mapping) {
         if (task_ij == task_kl) fac *= .5;
-        if (task_ij < task_kl) fac = 0.;
+        else if (task_ij < task_kl) fac = 0.;
     }
     __syncthreads();
     double xij = Rp_cache[tx+0];
@@ -5337,7 +5337,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
 for (int batch_kl = 0; batch_kl < 21; ++batch_kl) {
     int task_kl0 = blockIdx.y * 336 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -5988,7 +5988,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 4) {
 for (int batch_kl = 0; batch_kl < 6; ++batch_kl) {
     int task_kl0 = blockIdx.y * 96 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -7884,7 +7884,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 2) {
 for (int batch_kl = 0; batch_kl < 24; ++batch_kl) {
     int task_kl0 = blockIdx.y * 384 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -8470,7 +8470,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 2) {
 for (int batch_kl = 0; batch_kl < 9; ++batch_kl) {
     int task_kl0 = blockIdx.y * 144 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
@@ -10111,7 +10111,7 @@ for (int dm_offset = 0; dm_offset < jk.n_dm; dm_offset += 2) {
 for (int batch_kl = 0; batch_kl < 12; ++batch_kl) {
     int task_kl0 = blockIdx.y * 192 + batch_kl * 16;
     if (task_kl0 >= npairs_kl) {
-        continue;
+        break;
     }
     int pair_ij0 = pair_ij_mapping[task_ij0];
     int pair_kl0 = pair_kl_mapping[task_kl0];
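The kernel cleanup above repeatedly swaps `continue` for `break` and guards the symmetry-factor comparison with `else if`. A minimal sketch of the reasoning, under the assumption suggested by the diff that `task_kl0 = blockIdx.y * TILE + batch_kl * 16` grows monotonically with `batch_kl`: once one batch index runs past `npairs_kl`, every later batch does too, so the block can leave the loop instead of re-testing each remaining iteration; likewise the `task_ij < task_kl` test only matters when `task_ij != task_kl`. The code below is illustrative only, not the generated kernel; names such as `batch_loop_sketch` and `symmetry_factor` are made up here, while `npairs_kl`, `pair_kl_mapping`, `task_ij`, and `task_kl` follow the diff.

```cuda
// Illustrative sketch of the batch-loop pattern in the unrolled MD-J kernels,
// showing why `break` can replace `continue` when the batch index is monotone.
__global__ void batch_loop_sketch(const int *pair_kl_mapping, int npairs_kl)
{
    const int NBATCH = 21;   // matches the "< 21" unrolled variant above
    const int TILE   = 336;  // 21 batches * 16 pairs per batch
    for (int batch_kl = 0; batch_kl < NBATCH; ++batch_kl) {
        int task_kl0 = blockIdx.y * TILE + batch_kl * 16;
        if (task_kl0 >= npairs_kl) {
            break;  // was `continue`: every remaining batch_kl would also be
                    // out of range, so continuing only wastes loop iterations
        }
        int pair_kl0 = pair_kl_mapping[task_kl0];
        // ... evaluate integrals for this batch ...
        (void)pair_kl0;
    }
}

// Symmetry factor: when task_ij == task_kl the pair is double-counted and
// halved; the task_ij < task_kl zeroing only applies otherwise, hence `else if`
// avoids a redundant comparison.
__device__ double symmetry_factor(int task_ij, int task_kl, double fac)
{
    if (task_ij == task_kl)     fac *= .5;
    else if (task_ij < task_kl) fac = 0.;
    return fac;
}
```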
