Skip to content

Feat launch concurrency#44

Merged
hi-liang merged 13 commits intomasterfrom
feat-launch-concurrency
Apr 7, 2026
Merged

Feat launch concurrency#44
hi-liang merged 13 commits intomasterfrom
feat-launch-concurrency

Conversation

@hi-liang
Copy link
Copy Markdown
Member

@hi-liang hi-liang commented Mar 13, 2026

Replace the port assignment mutex with a semaphore so we can launch 2 instances at a time. this gives us some minor pipelining gains with docker daemon (see #41)

- Adds is_finalized to instances table
- Implements transaction-safe reservePort
- Tracks IsFinalized and launchSemaphore in Manager
- Marks instances finalized via batch query
- Allow up to 2 simultaneous launches, as testing shows no improvement
  beyond 2, probably due to docker global locks on network provisioning
  allowing improvement only on 1 pipeline stage
- Removes heavy global portLock bottleneck
- Implements NCPU-sized semaphore restricting docker load
- Pre-claims open ports via reservePort before starting containers
- Adds aggressive 5-min GC pruning for crashed unfinalized instances
- Adds 50-routines parallel checkout reservePort suite
- Refactors profile and db integration tests removing getFreePort
- Updates mock DB queries preventing SQLite foreign key errors
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR increases instance launch throughput by allowing limited parallel container startups and moving host-port allocation to an atomic, DB-backed reservation scheme to avoid global serialization.

Changes:

  • Replace the port-assignment mutex approach with (a) atomic port reservation in SQLite and (b) a launch semaphore to allow up to 2 concurrent container launches.
  • Extend the SQLite schema with instances.is_finalized plus several indexes, and add GC for stale unfinalized instances.
  • Add new tests/benchmarks around port reservation/bitset logic and build-instance lookup batching.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
README.md Documents CMGR_CONCURRENT_LAUNCHES configuration.
cmgr/types.go Adds launchSemaphore, switches prune gate to atomic.Int64, adds IsFinalized to instance metadata.
cmgr/prune_test.go Updates tests for the new atomic.Int64 API.
cmgr/profile_test_timestamps.go Adds timestamp helpers (currently not test-only due to filename).
cmgr/profile_common_test.go Adds profiling test/bench harness for launch preparation vs DB size.
cmgr/docker.go Adds docker ping retries, initializes launch semaphore, and uses pre-reserved host ports during container create.
cmgr/database.go Adds is_finalized, multiple indexes, _busy_timeout/_synchronous, and a usedPortBitset() helper.
cmgr/database_operations_test.go Adds tests for usedPortBitset, reservePort, and lookupBuildInstances.
cmgr/database_instances.go Implements reservePort, updates finalizeInstance, and adds lookupBuildInstances.
cmgr/concurrency_test.go Adds concurrent port reservation test.
cmgr/api.go Reserves ports earlier in newInstance, batches build instance lookup, updates prune gate usage, and GC’s unfinalized instances.
cmd/cmgrd/main.go Documents CMGR_CONCURRENT_LAUNCHES and CMGR_PRUNE_AGE.
cmd/cmgr/main.go Documents CMGR_CONCURRENT_LAUNCHES.
Comments suppressed due to low confidence (1)

cmgr/api.go:283

  • On error paths after openInstance(iMeta), the function returns without cleaning up the newly-created instance row (and, with the new logic, any reserved portAssignments). For example, if GetChallengeMetadata or startNetwork fails, this leaks an unfinalized instance until the later GC runs, temporarily consuming ports and DB rows. Consider calling m.stopInstance(iMeta) (or at least removeInstanceMetadata) before returning on these errors to release reservations immediately.
	m.checkPrune()

	cMeta, err := m.GetChallengeMetadata(build.Challenge)
	if err != nil {
		return 0, err
	}

	if m.portLow != 0 {
		revPortMap, err := m.getReversePortMap(build.Challenge)
		if err != nil {
			return 0, err
		}

		for _, image := range build.Images {
			if image.Host == "builder" {
				continue
			}
			for _, portStr := range image.Ports {
				portName := revPortMap[portStr]
				hostPort, err := m.reservePort(iMeta.Id, portName)
				if err != nil {
					m.stopInstance(iMeta)
					return 0, err
				}
				iMeta.Ports[portName] = hostPort
			}
		}
	}

	err = m.startNetwork(iMeta, cMeta.ChallengeOptions.NetworkOptions)
	if err != nil {
		return 0, err
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cmgr/profile_test_timestamps_test.go
Comment thread cmgr/profile_common_test.go
Comment on lines +33 to +37
// Set port range to test bitset optimization (10000-63000)
// This ensures getFreePort uses the bitset logic instead of ephemeral ports
mgr.portLow = 10000
mgr.portHigh = 63000

Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setup sets mgr.portLow/portHigh to enable port-range logic, but the instance population loop relies on finalizeInstance() persisting InstanceMetadata.Ports. With the new logic, finalizeInstance() only inserts portAssignments when portLow==0, so these preloaded instances won’t actually create portAssignments rows, skewing the DB-load/profile results. Consider either reserving ports via reservePort() during setup or temporarily disabling port reservation for the population phase (as other tests do) so portAssignments reflect reality.

Suggested change
// Set port range to test bitset optimization (10000-63000)
// This ensures getFreePort uses the bitset logic instead of ephemeral ports
mgr.portLow = 10000
mgr.portHigh = 63000

Copilot uses AI. Check for mistakes.
Comment on lines +292 to +307
// Output to CSV for easy comparison
csvFile := fmt.Sprintf("profile_results_%d.csv", instanceCount)
f, err := os.Create(csvFile)
if err != nil {
t.Logf("Warning: could not create CSV file: %s", err)
} else {
defer f.Close()
fmt.Fprintf(f, "metric,duration_ms\n")
fmt.Fprintf(f, "lookupBuildMetadata,%f\n", step1Duration.Seconds()*1000)
fmt.Fprintf(f, "openInstance,%f\n", step2Duration.Seconds()*1000)
fmt.Fprintf(f, "GetChallengeMetadata,%f\n", step3Duration.Seconds()*1000)
fmt.Fprintf(f, "getReversePortMap,%f\n", step4Duration.Seconds()*1000)
fmt.Fprintf(f, "reservePort_total,%f\n", step5Duration.Seconds()*1000)
fmt.Fprintf(f, "total,%f\n", totalDuration.Seconds()*1000)
t.Logf("\nResults written to %s", csvFile)
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test writes CSV files (e.g. profile_results_*.csv) into the current working directory. That can pollute the repo checkout and/or fail in read-only CI environments. Prefer writing to t.TempDir() (or skipping file output entirely) so artifacts are isolated and automatically cleaned up.

Copilot uses AI. Check for mistakes.
Comment on lines +13 to +23
numPorts := m.portHigh - m.portLow + 1

for attempt := 0; attempt < 50; attempt++ {
// Get currently used ports to find candidates
bitset, err := m.usedPortBitset()
if err != nil {
return 0, err
}

port := rand.Intn(numPorts) + m.portLow
var candidate int
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reservePort uses the package-global math/rand (rand.Intn) which is never seeded anywhere in this repo (no rand.Seed found). That makes port selection deterministic across process starts, increasing collision probability under concurrent launches and making behavior harder to reason about. Prefer using a properly-seeded RNG here (e.g., seed the global source once at startup, or use a per-Manager RNG with appropriate locking).

Copilot uses AI. Check for mistakes.
Comment thread cmgr/concurrency_test.go Outdated
Comment thread cmgr/database_operations_test.go Outdated
Comment on lines +1026 to +1033
// Need a dummy instance ID from setup
var instId InstanceId
mgr.db.QueryRow("SELECT id FROM instances LIMIT 1").Scan(&instId)

port, err := mgr.reservePort(instId, "test")
if err != nil {
t.Fatalf("reservePort failed: %s", err)
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests use QueryRow(...).Scan(&instId) without checking the returned error. If the query returns no rows (or the scan fails), instId will remain zero and the test may exercise reservePort with an unintended instance id. Please capture and assert the Scan error.

Copilot uses AI. Check for mistakes.
@hi-liang
Copy link
Copy Markdown
Member Author

hi-liang commented Apr 1, 2026

@copilot apply changes based on the comments in this thread

Copy link
Copy Markdown

Copilot AI commented Apr 1, 2026

@copilot apply changes based on the comments in this thread

All changes from the review have been applied in commit 7e3dfad:

  • profile_test_timestamps.go → renamed to profile_test_timestamps_test.go so it's excluded from non-test builds
  • TestProfileInstanceLaunchWithDBLoad now skips unless CMGR_TEST_PROFILE=1 is set (opt-in instead of just testing.Short())
  • CSV output now writes to t.TempDir() instead of the current working directory
  • reservePort rand seeding: moved the per-Manager RNG init into initDatabase (so it's always seeded even in tests that bypass NewManager), added a randMu sync.Mutex to Manager for safe concurrent access, and changed reservePort to use m.rand.Intn under that mutex instead of the package-global rand.Intn
  • Error checks added for addChallenges, openBuild, and openInstance in concurrency_test.go and database_operations_test.go
  • QueryRow(...).Scan errors are now checked and fatal in both places in database_operations_test.go

@hi-liang hi-liang merged commit 8d250ff into master Apr 7, 2026
1 check passed
@hi-liang hi-liang deleted the feat-launch-concurrency branch April 7, 2026 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants