-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Description
Pre-Submission Checklist
- I have searched the existing issues and believe this is a new bug.
- I am not asking a question about how to use lnd, but reporting a bug (otherwise open a discussion).
LND Version
0.20 rc3
LND Configuration
native sql storage active.
Backend Version
neutrino
Backend Configuration
OS/Distribution
Bug Details & Steps to Reproduce
IsPublicNode Performance Bottleneck Analysis
Executive Summary
CPU profiling reveals that IsPublicNode queries consume 62.36% of total CPU time (21.77s out of 34.91s) during gossip message processing. This represents a critical performance bottleneck in LND's gossip subsystem that significantly impacts node synchronization and network message processing throughput.
Problem Description
The IsPublicNode function is called for every node announcement received from the Lightning Network gossip protocol. During the profiling period of 10.10 seconds, this function was sampled 2,177 times, indicating extremely high call frequency.
Call Stack
discovery.(*AuthenticatedGossiper).handleNodeAnnouncement (62.50% CPU)
↓
graph.(*Builder).IsPublicNode (62.36% CPU)
↓
graph/db.(*SQLStore).IsPublicNode (62.36% CPU)
↓
SQLite query execution (IsPublicV1Node)
Profile Evidence
Duration: 10.10s, Total samples = 34.91s (345.47%)
flat flat% sum% cum cum%
0 0% 0% 21.86s 62.62% discovery.(*AuthenticatedGossiper).processNetworkAnnouncement
0 0% 0% 21.82s 62.50% discovery.(*AuthenticatedGossiper).handleNodeAnnouncement
0 0% 0% 21.77s 62.36% graph.(*Builder).IsPublicNode
0 0% 0% 21.77s 62.36% graph/db.(*SQLStore).IsPublicNode
Root Cause Analysis
1. Inefficient SQL Query Structure
Location: sqldb/sqlc/queries/graph.sql:55-71
-- name: IsPublicV1Node :one
SELECT EXISTS (
SELECT 1
FROM graph_channels c
JOIN graph_nodes n ON n.id = c.node_id_1 OR n.id = c.node_id_2
WHERE c.version = 1
AND c.bitcoin_1_signature IS NOT NULL
AND n.pub_key = $1
);Problem: The OR condition in the JOIN clause (n.id = c.node_id_1 OR n.id = c.node_id_2) prevents SQLite from efficiently utilizing indexes. This forces a full or near-full table scan of graph_channels for each query.
Existing Indexes:
graph_channels_node_id_1_idxonnode_id_1graph_channels_node_id_2_idxonnode_id_2graph_nodes_uniqueon(pub_key, version)
Despite these indexes, the OR condition in the JOIN prevents their effective use.
2. High Call Frequency
Location: discovery/gossiper.go:2519
func (d *AuthenticatedGossiper) handleNodeAnnouncement(...) {
// ... node already added to database at line 2499
if err := d.addNode(ctx, nodeAnn, ops...); err != nil {
// error handling
}
// IsPublicNode called AFTER node is added
isPublic, err := d.cfg.Graph.IsPublicNode(nodeAnn.NodeID)
if err != nil {
// error handling
}
// Only used to decide whether to broadcast
if isPublic {
announcements = append(announcements, networkMsg{...})
}
}This function is called for every node announcement received, including:
- Initial gossip sync (thousands of announcements)
- Real-time gossip updates
- Re-announcements from peers
3. Redundant Database Operations
The query is executed after the node has already been added to the database. The check is only used to determine if the announcement should be rebroadcast to peers, but it requires scanning the entire channel graph each time.
4. Lock Contention
Secondary performance impact from database locking:
12.81s (36.69%) - syscall.syscall6
8.75s (25.06%) - pthread_cond_wait
5.67s (16.24%) - pthread_cond_signal
5.34s (15.30%) - SQLite mutex operations
Multiple goroutines competing for database access during high-frequency queries exacerbates the problem.
Performance Impact
Quantitative Analysis
- 62.36% of CPU time spent in a single query type
- 2,177 calls in 10 seconds = ~217 calls/second
- ~10ms average per call under load
- During initial sync with thousands of node announcements, this becomes a significant synchronization bottleneck
User Impact
- Slower initial blockchain/graph synchronization
- Reduced gossip message processing throughput
- Increased CPU usage during normal operation
- Potential for gossip message queue backlog
Proposed Solutions
Solution 1: Query Rewrite with UNION (Immediate Fix)
Complexity: Low
Impact: High
Risk: Low
Rewrite the query to use UNION instead of OR, allowing index usage:
-- name: IsPublicV1Node :one
SELECT EXISTS (
SELECT 1 FROM (
SELECT c.id
FROM graph_channels c
JOIN graph_nodes n ON n.id = c.node_id_1
WHERE c.version = 1
AND c.bitcoin_1_signature IS NOT NULL
AND n.pub_key = $1
UNION
SELECT c.id
FROM graph_channels c
JOIN graph_nodes n ON n.id = c.node_id_2
WHERE c.version = 1
AND c.bitcoin_1_signature IS NOT NULL
AND n.pub_key = $1
)
);Benefits:
- Allows SQLite to use
graph_channels_node_id_1_idxandgraph_channels_node_id_2_idx - No code changes required, only SQL modification
- Backward compatible
Estimated Impact: 10-50x performance improvement
Solution 2: In-Memory Cache (Medium-term)
Complexity: Medium
Impact: Very High
Risk: Medium
Implement an LRU cache for IsPublicNode results:
type PublicNodeCache struct {
cache *lru.Cache
mu sync.RWMutex
}
func (s *SQLStore) IsPublicNode(pubKey [33]byte) (bool, error) {
// Check cache first
if cached, ok := s.publicNodeCache.Get(pubKey); ok {
return cached.(bool), nil
}
// Query database
isPublic, err := s.isPublicNodeQuery(pubKey)
if err != nil {
return false, err
}
// Cache result
s.publicNodeCache.Add(pubKey, isPublic)
return isPublic, nil
}Cache Invalidation Strategy:
- Invalidate on channel addition/removal for affected nodes
- TTL of 5-10 minutes as safety mechanism
- Size limit of 10,000-50,000 entries (manageable memory footprint)
Benefits:
- Near-zero latency for repeated lookups
- Dramatically reduces database load
- Node public status rarely changes
Challenges:
- Cache invalidation complexity
- Memory overhead (acceptable: ~1-2MB for 50K entries)
Estimated Impact: 100-1000x improvement for cache hits
Solution 3: Denormalize Public Status (Long-term)
Complexity: High
Impact: Very High
Risk: Medium-High
Add is_public boolean column to graph_nodes table:
ALTER TABLE graph_nodes ADD COLUMN is_public BOOLEAN DEFAULT FALSE;
CREATE INDEX graph_nodes_is_public_idx ON graph_nodes(is_public, pub_key);Update logic:
- Compute
is_publicwhen channels are added/removed - Simple
SELECT is_public FROM graph_nodes WHERE pub_key = $1query - No JOIN required
Benefits:
- O(1) lookup via primary key
- Eliminates JOIN entirely
- Most efficient long-term solution
Challenges:
- Schema migration required
- Complex update logic when channels change
- Need to backfill existing data
- Increased write complexity
Estimated Impact: 1000x+ improvement
Solution 4: Batch Processing
Complexity: Medium
Impact: Medium
Risk: Low
Accumulate node announcements and check public status in batches:
func (d *AuthenticatedGossiper) batchCheckPublicNodes(nodes [][33]byte) (map[[33]byte]bool, error) {
// Single query with WHERE pub_key IN (?, ?, ...)
return d.cfg.Graph.IsPublicNodeBatch(nodes)
}Benefits:
- Reduces query count
- Better database query planner utilization
- Lower transaction overhead
Challenges:
- Increases latency for individual announcements
- Requires refactoring of gossip processing pipeline
Estimated Impact: 5-10x improvement
Recommended Approach
Phase 1 (Immediate): Implement Solution 1 (Query Rewrite)
- Low risk, high reward
- Can be deployed immediately
- No architectural changes
Phase 2 (Short-term): Implement Solution 2 (LRU Cache)
- Combines with Phase 1 for maximum impact
- Cache hit rate likely >90% during normal operation
- Proven pattern already used in LND codebase
Phase 3 (Long-term): Consider Solution 3 (Denormalization)
- Evaluate based on Phase 1 & 2 results
- If performance is still inadequate, pursue denormalization
- Coordinate with graph database refactoring efforts
Testing Recommendations
-
Benchmark Testing:
- Create benchmark comparing current vs. optimized query
- Test with realistic graph sizes (80K+ channels)
- Measure under concurrent load
-
Integration Testing:
- Test during initial sync
- Test with high gossip message rate
- Verify cache invalidation correctness (if implemented)
-
Performance Validation:
- CPU profiling before/after
- Gossip throughput measurement
- Sync time comparison
References
- CPU Profile:
cpu.prof(Duration: 10.10s, Total samples: 34.91s) - Code Locations:
- Query:
sqldb/sqlc/queries/graph.sql:55-71 - Caller:
discovery/gossiper.go:2519 - DB Schema:
sqldb/sqlc/migrations/000008_graph.up.sql
- Query:
Appendix: Alternative Query Plans Tested
Option A: Separate Lookups
SELECT EXISTS (
SELECT 1 FROM graph_channels c
WHERE c.version = 1
AND c.bitcoin_1_signature IS NOT NULL
AND (c.node_id_1 = (SELECT id FROM graph_nodes WHERE pub_key = $1)
OR c.node_id_2 = (SELECT id FROM graph_nodes WHERE pub_key = $1))
);Status: Still uses OR, similar performance
Option B: IN clause
SELECT EXISTS (
SELECT 1 FROM graph_channels c
WHERE c.version = 1
AND c.bitcoin_1_signature IS NOT NULL
AND (SELECT id FROM graph_nodes WHERE pub_key = $1) IN (c.node_id_1, c.node_id_2)
);Status: Better than OR, worse than UNION
Expected Behavior
Debug Information
No response
Environment
No response
Metadata
Metadata
Assignees
Labels
Type
Projects
Status