[bug]: Graph SQL implementation results in some performance issues #10337

@ziggie1984

Description

Pre-Submission Checklist

  • I have searched the existing issues and believe this is a new bug.
  • I am not asking a question about how to use lnd, but reporting a bug (otherwise open a discussion).

LND Version

0.20 rc3

LND Configuration

Native SQL storage active.

Backend Version

neutrino

Backend Configuration

OS/Distribution

Bug Details & Steps to Reproduce

IsPublicNode Performance Bottleneck Analysis

Executive Summary

CPU profiling reveals that IsPublicNode queries consume 62.36% of total CPU time (21.77s out of 34.91s) during gossip message processing. This represents a critical performance bottleneck in LND's gossip subsystem that significantly impacts node synchronization and network message processing throughput.

Problem Description

The IsPublicNode function is called for every node announcement received from the Lightning Network gossip protocol. During the profiling period of 10.10 seconds, this function was sampled 2,177 times, indicating extremely high call frequency.

Call Stack

discovery.(*AuthenticatedGossiper).handleNodeAnnouncement (62.50% CPU)
  ↓
graph.(*Builder).IsPublicNode (62.36% CPU)
  ↓
graph/db.(*SQLStore).IsPublicNode (62.36% CPU)
  ↓
SQLite query execution (IsPublicV1Node)

Profile Evidence

Duration: 10.10s, Total samples = 34.91s (345.47%)
      flat  flat%   sum%        cum   cum%
         0     0%     0%     21.86s 62.62%  discovery.(*AuthenticatedGossiper).processNetworkAnnouncement
         0     0%     0%     21.82s 62.50%  discovery.(*AuthenticatedGossiper).handleNodeAnnouncement
         0     0%     0%     21.77s 62.36%  graph.(*Builder).IsPublicNode
         0     0%     0%     21.77s 62.36%  graph/db.(*SQLStore).IsPublicNode

Root Cause Analysis

1. Inefficient SQL Query Structure

Location: sqldb/sqlc/queries/graph.sql:55-71

-- name: IsPublicV1Node :one
SELECT EXISTS (
    SELECT 1
    FROM graph_channels c
    JOIN graph_nodes n ON n.id = c.node_id_1 OR n.id = c.node_id_2
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND n.pub_key = $1
);

Problem: The OR condition in the JOIN clause (n.id = c.node_id_1 OR n.id = c.node_id_2) prevents SQLite from efficiently utilizing indexes. This forces a full or near-full table scan of graph_channels for each query.

Existing Indexes:

  • graph_channels_node_id_1_idx on node_id_1
  • graph_channels_node_id_2_idx on node_id_2
  • graph_nodes_unique on (pub_key, version)

Despite these indexes, the OR condition in the JOIN prevents their effective use.
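The planner's choice can be sanity-checked outside of lnd with EXPLAIN QUERY PLAN. The following standalone sketch uses Python's stdlib sqlite3 module with the schema reduced to just the columns the query touches (table, column, and index names taken from the migration referenced above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_nodes (
    id INTEGER PRIMARY KEY,
    pub_key BLOB NOT NULL,
    version INTEGER NOT NULL
);
CREATE UNIQUE INDEX graph_nodes_unique ON graph_nodes(pub_key, version);

CREATE TABLE graph_channels (
    id INTEGER PRIMARY KEY,
    version INTEGER NOT NULL,
    node_id_1 INTEGER NOT NULL,
    node_id_2 INTEGER NOT NULL,
    bitcoin_1_signature BLOB
);
CREATE INDEX graph_channels_node_id_1_idx ON graph_channels(node_id_1);
CREATE INDEX graph_channels_node_id_2_idx ON graph_channels(node_id_2);
""")

OR_QUERY = """
SELECT EXISTS (
    SELECT 1
    FROM graph_channels c
    JOIN graph_nodes n ON n.id = c.node_id_1 OR n.id = c.node_id_2
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND n.pub_key = ?
)
"""

pub = lambda i: bytes([i]) * 33
conn.execute("INSERT INTO graph_nodes VALUES (1, ?, 1)", (pub(2),))
conn.execute("INSERT INTO graph_nodes VALUES (2, ?, 1)", (pub(3),))
conn.execute("INSERT INTO graph_channels VALUES (1, 1, 1, 2, ?)", (b"sig",))

# A "SCAN c" row in the plan (rather than "SEARCH c USING INDEX ...")
# means graph_channels is walked for every IsPublicNode call.
plan = [row[-1] for row in
        conn.execute("EXPLAIN QUERY PLAN " + OR_QUERY, (pub(2),))]
print("\n".join(plan))

is_public = conn.execute(OR_QUERY, (pub(2),)).fetchone()[0]
```

The exact plan text varies by SQLite version, but the key signal is whether the graph_channels step reads "SCAN" or "SEARCH ... USING INDEX".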

2. High Call Frequency

Location: discovery/gossiper.go:2519

func (d *AuthenticatedGossiper) handleNodeAnnouncement(...) {
    // ... node already added to database at line 2499
    if err := d.addNode(ctx, nodeAnn, ops...); err != nil {
        // error handling
    }

    // IsPublicNode called AFTER node is added
    isPublic, err := d.cfg.Graph.IsPublicNode(nodeAnn.NodeID)
    if err != nil {
        // error handling
    }

    // Only used to decide whether to broadcast
    if isPublic {
        announcements = append(announcements, networkMsg{...})
    }
}

This function is called for every node announcement received, including:

  • Initial gossip sync (thousands of announcements)
  • Real-time gossip updates
  • Re-announcements from peers

3. Redundant Database Operations

The query is executed after the node has already been added to the database. The check is only used to determine if the announcement should be rebroadcast to peers, but it requires scanning the entire channel graph each time.

4. Lock Contention

Secondary performance impact from database locking:

12.81s (36.69%) - syscall.syscall6
 8.75s (25.06%) - pthread_cond_wait
 5.67s (16.24%) - pthread_cond_signal
 5.34s (15.30%) - SQLite mutex operations

Contention between multiple goroutines for database access during these high-frequency queries exacerbates the problem.

Performance Impact

Quantitative Analysis

  • 62.36% of CPU time spent in a single query type
  • 2,177 calls in 10 seconds = ~217 calls/second
  • ~10ms average per call under load
  • During initial sync with thousands of node announcements, this becomes a significant synchronization bottleneck

User Impact

  • Slower initial blockchain/graph synchronization
  • Reduced gossip message processing throughput
  • Increased CPU usage during normal operation
  • Potential for gossip message queue backlog

Proposed Solutions

Solution 1: Query Rewrite with UNION (Immediate Fix)

Complexity: Low
Impact: High
Risk: Low

Rewrite the query to use UNION instead of OR, allowing index usage:

-- name: IsPublicV1Node :one
SELECT EXISTS (
    SELECT 1 FROM (
        SELECT c.id
        FROM graph_channels c
        JOIN graph_nodes n ON n.id = c.node_id_1
        WHERE c.version = 1
          AND c.bitcoin_1_signature IS NOT NULL
          AND n.pub_key = $1
        UNION
        SELECT c.id
        FROM graph_channels c
        JOIN graph_nodes n ON n.id = c.node_id_2
        WHERE c.version = 1
          AND c.bitcoin_1_signature IS NOT NULL
          AND n.pub_key = $1
    ) AS matches
);

Benefits:

  • Allows SQLite to use graph_channels_node_id_1_idx and graph_channels_node_id_2_idx
  • No code changes required, only SQL modification
  • Backward compatible

Estimated Impact: 10-50x performance improvement
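To check that the rewrite is semantically equivalent to the OR form, both queries can be compared over handcrafted edge cases (signed v1 channel, unsigned v1, v2-only, node appearing only on the node_id_2 side). A standalone harness using Python's stdlib sqlite3 (not lnd code; the derived table carries an alias, which Postgres requires, and `?1` stands in for sqlc's `$1`):

```python
import sqlite3

SCHEMA = """
CREATE TABLE graph_nodes (
    id INTEGER PRIMARY KEY, pub_key BLOB NOT NULL, version INTEGER NOT NULL);
CREATE UNIQUE INDEX graph_nodes_unique ON graph_nodes(pub_key, version);
CREATE TABLE graph_channels (
    id INTEGER PRIMARY KEY, version INTEGER NOT NULL,
    node_id_1 INTEGER NOT NULL, node_id_2 INTEGER NOT NULL,
    bitcoin_1_signature BLOB);
CREATE INDEX graph_channels_node_id_1_idx ON graph_channels(node_id_1);
CREATE INDEX graph_channels_node_id_2_idx ON graph_channels(node_id_2);
"""

OR_QUERY = """
SELECT EXISTS (
    SELECT 1 FROM graph_channels c
    JOIN graph_nodes n ON n.id = c.node_id_1 OR n.id = c.node_id_2
    WHERE c.version = 1 AND c.bitcoin_1_signature IS NOT NULL
      AND n.pub_key = ?1
)
"""

UNION_QUERY = """
SELECT EXISTS (
    SELECT 1 FROM (
        SELECT c.id FROM graph_channels c
        JOIN graph_nodes n ON n.id = c.node_id_1
        WHERE c.version = 1 AND c.bitcoin_1_signature IS NOT NULL
          AND n.pub_key = ?1
        UNION
        SELECT c.id FROM graph_channels c
        JOIN graph_nodes n ON n.id = c.node_id_2
        WHERE c.version = 1 AND c.bitcoin_1_signature IS NOT NULL
          AND n.pub_key = ?1
    ) AS matches
)
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

pub = lambda i: bytes([i]) * 33
for node_id in range(1, 6):
    conn.execute("INSERT INTO graph_nodes VALUES (?, ?, 1)",
                 (node_id, pub(node_id)))

# (id, version, node_id_1, node_id_2, signature)
conn.executemany("INSERT INTO graph_channels VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 2, b"sig"),  # signed v1: nodes 1 and 2 are public
    (2, 1, 3, 4, None),    # unsigned v1: does not make 3 or 4 public
    (3, 2, 4, 5, b"sig"),  # v2 channel: ignored by the v1 check
])

results = {}
for node_id in range(1, 6):
    key = (pub(node_id),)
    results[node_id] = (conn.execute(OR_QUERY, key).fetchone()[0],
                        conn.execute(UNION_QUERY, key).fetchone()[0])
```

Node 2 exercises the node_id_2 branch, which is the case an over-eager simplification of the OR would silently break.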

Solution 2: In-Memory Cache (Medium-term)

Complexity: Medium
Impact: Very High
Risk: Medium

Implement an LRU cache for IsPublicNode results:

func (s *SQLStore) IsPublicNode(pubKey [33]byte) (bool, error) {
    // Fast path: serve repeated lookups from an LRU cache kept on the
    // store (e.g. a thread-safe hashicorp/golang-lru cache, so no extra
    // mutex is needed here).
    if cached, ok := s.publicNodeCache.Get(pubKey); ok {
        return cached.(bool), nil
    }

    // Slow path: fall back to the database query.
    isPublic, err := s.isPublicNodeQuery(pubKey)
    if err != nil {
        return false, err
    }

    // Cache the result for subsequent announcements.
    s.publicNodeCache.Add(pubKey, isPublic)
    return isPublic, nil
}

Cache Invalidation Strategy:

  • Invalidate on channel addition/removal for affected nodes
  • TTL of 5-10 minutes as safety mechanism
  • Size limit of 10,000-50,000 entries (manageable memory footprint)
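The invalidation strategy above can be sketched as a size-bounded LRU map with a TTL safety net. The sketch below is language-agnostic pseudocode written as runnable Python; all names are illustrative, not lnd's actual types:

```python
import time
from collections import OrderedDict

class PublicNodeCache:
    """LRU map of pub_key -> is_public with a TTL as a safety net."""

    def __init__(self, max_entries=50_000, ttl_seconds=600,
                 clock=time.monotonic):
        self._entries = OrderedDict()  # pub_key -> (is_public, inserted_at)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, pub_key):
        item = self._entries.get(pub_key)
        if item is None:
            return None
        is_public, inserted_at = item
        if self._clock() - inserted_at > self._ttl:
            # TTL safety net: treat stale entries as misses.
            del self._entries[pub_key]
            return None
        self._entries.move_to_end(pub_key)  # refresh LRU position
        return is_public

    def put(self, pub_key, is_public):
        self._entries[pub_key] = (is_public, self._clock())
        self._entries.move_to_end(pub_key)
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used

    def invalidate(self, pub_key):
        # Called when a channel involving pub_key is added or removed.
        self._entries.pop(pub_key, None)
```

In lnd this would naturally use an existing thread-safe LRU implementation rather than a hand-rolled map.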

Benefits:

  • Near-zero latency for repeated lookups
  • Dramatically reduces database load
  • Node public status rarely changes

Challenges:

  • Cache invalidation complexity
  • Memory overhead (acceptable: ~1-2MB for 50K entries)

Estimated Impact: 100-1000x improvement for cache hits

Solution 3: Denormalize Public Status (Long-term)

Complexity: High
Impact: Very High
Risk: Medium-High

Add is_public boolean column to graph_nodes table:

ALTER TABLE graph_nodes ADD COLUMN is_public BOOLEAN DEFAULT FALSE;
CREATE INDEX graph_nodes_is_public_idx ON graph_nodes(is_public, pub_key);

Update logic:

  • Compute is_public when channels are added/removed
  • Simple SELECT is_public FROM graph_nodes WHERE pub_key = $1 query
  • No JOIN required
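The migration plus backfill can be illustrated end to end with the same stdlib-sqlite3 harness; the backfill UPDATE below is one possible approach, not lnd's actual migration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_nodes (
    id INTEGER PRIMARY KEY, pub_key BLOB NOT NULL, version INTEGER NOT NULL);
CREATE UNIQUE INDEX graph_nodes_unique ON graph_nodes(pub_key, version);
CREATE TABLE graph_channels (
    id INTEGER PRIMARY KEY, version INTEGER NOT NULL,
    node_id_1 INTEGER NOT NULL, node_id_2 INTEGER NOT NULL,
    bitcoin_1_signature BLOB);
""")

pub = lambda i: bytes([i]) * 33
conn.executemany("INSERT INTO graph_nodes VALUES (?, ?, 1)",
                 [(i, pub(i)) for i in range(1, 4)])
conn.execute("INSERT INTO graph_channels VALUES (1, 1, 1, 2, ?)", (b"sig",))

conn.executescript("""
ALTER TABLE graph_nodes ADD COLUMN is_public BOOLEAN DEFAULT FALSE;
CREATE INDEX graph_nodes_is_public_idx ON graph_nodes(is_public, pub_key);

-- Backfill existing rows from the channel table (one-time cost).
UPDATE graph_nodes SET is_public = EXISTS (
    SELECT 1 FROM graph_channels c
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND (c.node_id_1 = graph_nodes.id OR c.node_id_2 = graph_nodes.id)
);
""")

# The hot-path lookup collapses to a single indexed point query.
LOOKUP = "SELECT is_public FROM graph_nodes WHERE pub_key = ?"
flags = {i: conn.execute(LOOKUP, (pub(i),)).fetchone()[0]
         for i in range(1, 4)}
```

The expensive OR scan survives only in the write path, where it runs once per channel change instead of once per node announcement.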

Benefits:

  • O(1) lookup via the graph_nodes_unique index on pub_key
  • Eliminates JOIN entirely
  • Most efficient long-term solution

Challenges:

  • Schema migration required
  • Complex update logic when channels change
  • Need to backfill existing data
  • Increased write complexity

Estimated Impact: 1000x+ improvement

Solution 4: Batch Processing

Complexity: Medium
Impact: Medium
Risk: Low

Accumulate node announcements and check public status in batches:

func (d *AuthenticatedGossiper) batchCheckPublicNodes(nodes [][33]byte) (map[[33]byte]bool, error) {
    // Single query with WHERE pub_key IN (?, ?, ...)
    return d.cfg.Graph.IsPublicNodeBatch(nodes)
}
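A hypothetical IsPublicNodeBatch (the name is illustrative; no such query exists in lnd today) could resolve many keys in one round trip. A sketch in stdlib-sqlite3 Python over the same reduced schema:

```python
import sqlite3

def is_public_node_batch(conn, pub_keys):
    """Return {pub_key: is_public} for all requested keys in one query."""
    if not pub_keys:
        return {}
    placeholders = ",".join("?" for _ in pub_keys)
    rows = conn.execute(f"""
        SELECT n.pub_key FROM graph_nodes n
        WHERE n.pub_key IN ({placeholders})
          AND EXISTS (
              SELECT 1 FROM graph_channels c
              WHERE c.version = 1
                AND c.bitcoin_1_signature IS NOT NULL
                AND (c.node_id_1 = n.id OR c.node_id_2 = n.id)
          )
    """, list(pub_keys))
    public = {row[0] for row in rows}
    return {pk: pk in public for pk in pub_keys}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_nodes (
    id INTEGER PRIMARY KEY, pub_key BLOB NOT NULL, version INTEGER NOT NULL);
CREATE TABLE graph_channels (
    id INTEGER PRIMARY KEY, version INTEGER NOT NULL,
    node_id_1 INTEGER NOT NULL, node_id_2 INTEGER NOT NULL,
    bitcoin_1_signature BLOB);
""")

pub = lambda i: bytes([i]) * 33
conn.executemany("INSERT INTO graph_nodes VALUES (?, ?, 1)",
                 [(i, pub(i)) for i in (1, 2, 3)])
conn.execute("INSERT INTO graph_channels VALUES (1, 1, 1, 2, ?)", (b"sig",))

batch = is_public_node_batch(conn, [pub(1), pub(3)])
```

Note that building an IN list dynamically is fine here because only placeholders are interpolated, never values; sqlc would need a generated variant of the query for this.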

Benefits:

  • Reduces query count
  • Better database query planner utilization
  • Lower transaction overhead

Challenges:

  • Increases latency for individual announcements
  • Requires refactoring of gossip processing pipeline

Estimated Impact: 5-10x improvement

Recommended Approach

Phase 1 (Immediate): Implement Solution 1 (Query Rewrite)

  • Low risk, high reward
  • Can be deployed immediately
  • No architectural changes

Phase 2 (Short-term): Implement Solution 2 (LRU Cache)

  • Combines with Phase 1 for maximum impact
  • Cache hit rate likely >90% during normal operation
  • Proven pattern already used in LND codebase

Phase 3 (Long-term): Consider Solution 3 (Denormalization)

  • Evaluate based on Phase 1 & 2 results
  • If performance is still inadequate, pursue denormalization
  • Coordinate with graph database refactoring efforts

Testing Recommendations

  1. Benchmark Testing:

    • Create benchmark comparing current vs. optimized query
    • Test with realistic graph sizes (80K+ channels)
    • Measure under concurrent load
  2. Integration Testing:

    • Test during initial sync
    • Test with high gossip message rate
    • Verify cache invalidation correctness (if implemented)
  3. Performance Validation:

    • CPU profiling before/after
    • Gossip throughput measurement
    • Sync time comparison

References

  • CPU Profile: cpu.prof (Duration: 10.10s, Total samples: 34.91s)
  • Code Locations:
    • Query: sqldb/sqlc/queries/graph.sql:55-71
    • Caller: discovery/gossiper.go:2519
    • DB Schema: sqldb/sqlc/migrations/000008_graph.up.sql

Appendix: Alternative Query Plans Tested

Option A: Separate Lookups

SELECT EXISTS (
    SELECT 1 FROM graph_channels c
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND (c.node_id_1 = (SELECT id FROM graph_nodes WHERE pub_key = $1)
           OR c.node_id_2 = (SELECT id FROM graph_nodes WHERE pub_key = $1))
);

Status: Still uses OR, similar performance

Option B: IN clause

SELECT EXISTS (
    SELECT 1 FROM graph_channels c
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND (SELECT id FROM graph_nodes WHERE pub_key = $1) IN (c.node_id_1, c.node_id_2)
);

Status: Better than OR, worse than UNION

Expected Behavior

Debug Information

No response

Environment

No response

Labels

bug (Unintended code behaviour), graph, sql, sqlite (Issues related to SQLite)
