Conversation

@Kh4ster (Contributor) commented Jul 24, 2025

This is still very much a work in progress.

It is opened as a draft to allow preliminary reviews.

Kh4ster added 30 commits July 2, 2025 18:33
…r of primal step size and dual step size, update the kernels to launch multiple threads and support a very wide batch size accordingly
… if batch is called with trust region restart
@Kh4ster Kh4ster added this to the 25.08 milestone Jul 24, 2025
@Kh4ster Kh4ster requested a review from a team as a code owner July 24, 2025 16:48
@Kh4ster Kh4ster added feature request New feature or request non-breaking Introduces a non-breaking change labels Jul 24, 2025
@Kh4ster Kh4ster requested review from hlinsen and kaatish July 24, 2025 16:48
@Kh4ster Kh4ster added the pdlp label Jul 24, 2025
@Kh4ster Kh4ster marked this pull request as draft July 24, 2025 16:49
copy-pr-bot bot commented Jul 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@Kh4ster Kh4ster removed request for hlinsen and kaatish July 24, 2025 16:50
@Kh4ster Kh4ster self-assigned this Jul 24, 2025
namespace cuopt::linear_programming::detail {

// This class is used to launch a batched dot product.
// For large problem sizes (>10K) and small batch sizes (<100), this is faster than using Segmented Reduce.
Contributor

Come to think of it, I'm not surprised: IIRC SegmentedReduce uses a 1 block : 1 segment mapping, and in your case (few, very large segments) that's pretty terrible. It makes sense that parallel device-wide BLAS dot calls beat it.
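For context, here is a minimal sketch of the pattern being discussed: one device-wide cuBLAS dot per batch entry, issued round-robin over several streams so the calls can overlap. The function name, buffer layout (contiguous segments of length n), and stream handling are illustrative assumptions, not code from this PR.

```cpp
// Sketch only: batched dot products via parallel device-wide cublasDdot calls.
// Assumes d_x and d_y each hold batch_size contiguous segments of length n,
// and d_results is a device array of batch_size doubles.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void batched_dot_cublas(cublasHandle_t handle,
                        const double* d_x,
                        const double* d_y,
                        double* d_results,
                        int n,
                        int batch_size,
                        const std::vector<cudaStream_t>& streams)
{
  // Write each result straight to device memory so no per-dot synchronization is needed.
  cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

  for (int b = 0; b < batch_size; ++b) {
    // Round-robin the dots over the available streams so they can run concurrently.
    cublasSetStream(handle, streams[b % streams.size()]);
    cublasDdot(handle, n,
               d_x + static_cast<size_t>(b) * n, 1,
               d_y + static_cast<size_t>(b) * n, 1,
               d_results + b);
  }
}
```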

Contributor

Although, I just realized they added a new overload optimized for fixed-size segments; I wasn't aware of it. Maybe this performs better?
NVIDIA/cccl#3969

Contributor Author

Very good catch!! I will test that right away. It might make my life way simpler

Contributor Author

This is still slower than using multiple dot products :(

Contributor

Dang .-.
Looking at their benchmarks, they only test segment sizes up to 1024, so I guess they don't optimize at all for few-segment scenarios with very large segments. Would be nice if they said so in their docs!
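For reference, here is a minimal sketch of the offset-based cub::DeviceSegmentedReduce path that the batched-dot approach is being compared against, with one segment per batch entry. Buffer names, the contiguous segment layout, and the helper function are assumptions for illustration, not the PR's actual code.

```cpp
// Sketch only: batched dot products via a single segmented reduce over
// precomputed element-wise products, one segment per batch entry.
#include <cub/device/device_segmented_reduce.cuh>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <cuda_runtime.h>

void batched_dot_segmented_reduce(const double* d_x,
                                  const double* d_y,
                                  double* d_results,  // batch_size outputs (device)
                                  int n,
                                  int batch_size,
                                  cudaStream_t stream)
{
  const size_t total = static_cast<size_t>(n) * batch_size;

  // 1) Element-wise products x[i] * y[i].
  thrust::device_vector<double> products(total);
  thrust::transform(thrust::cuda::par.on(stream),
                    d_x, d_x + total, d_y,
                    products.begin(), thrust::multiplies<double>());

  // 2) Segment offsets: 0, n, 2n, ..., batch_size * n.
  thrust::device_vector<int> offsets(batch_size + 1);
  thrust::sequence(thrust::cuda::par.on(stream), offsets.begin(), offsets.end(), 0, n);

  // 3) One segmented reduce over all segments
  //    (per the discussion above, CUB maps roughly one block per segment).
  void* d_temp   = nullptr;
  size_t temp_sz = 0;
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_sz, products.data().get(), d_results,
                                  batch_size, offsets.data().get(),
                                  offsets.data().get() + 1, stream);
  cudaMallocAsync(&d_temp, temp_sz, stream);
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_sz, products.data().get(), d_results,
                                  batch_size, offsets.data().get(),
                                  offsets.data().get() + 1, stream);
  cudaFreeAsync(d_temp, stream);
}
```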

@tmckayus (Contributor)

This is possibly a candidate for 25.10 but may still make 25.08

@rgsl888prabhu (Collaborator)

@Kh4ster Shall we move this to 25.10?

@tmckayus (Contributor)

> @Kh4ster Shall we move this to 25.10?

I'm going to move this to 25.10; we can move it back if it gets finished.

@tmckayus tmckayus modified the milestones: 25.08, 25.10 Jul 31, 2025
@anandhkb anandhkb modified the milestones: 25.10, 25.12 Sep 17, 2025
@anandhkb (Contributor)

De-prioritized from 25.10 and slated for the 25.12 release.

@rgsl888prabhu rgsl888prabhu changed the base branch from branch-25.08 to main October 22, 2025 17:02