[bug]: Graph SQL implementation results in some performance issues #10337

@ziggie1984

Description

Pre-Submission Checklist

  • I have searched the existing issues and believe this is a new bug.
  • I am not asking a question about how to use lnd, but reporting a bug (otherwise open a discussion).

LND Version

0.20 rc3

LND Configuration

Native SQL storage active.

Backend Version

neutrino

Backend Configuration

OS/Distribution

Bug Details & Steps to Reproduce

IsPublicNode Performance Bottleneck Analysis

Executive Summary

CPU profiling reveals that IsPublicNode queries consume 62.36% of total CPU time (21.77s out of 34.91s) during gossip message processing. This represents a critical performance bottleneck in LND's gossip subsystem that significantly impacts node synchronization and network message processing throughput.

Problem Description

The IsPublicNode function is called for every node announcement received from the Lightning Network gossip protocol. During the profiling period of 10.10 seconds, this function was sampled 2,177 times, indicating extremely high call frequency.

Call Stack

discovery.(*AuthenticatedGossiper).handleNodeAnnouncement (62.50% CPU)
  ↓
graph.(*Builder).IsPublicNode (62.36% CPU)
  ↓
graph/db.(*SQLStore).IsPublicNode (62.36% CPU)
  ↓
SQLite query execution (IsPublicV1Node)

Profile Evidence

Duration: 10.10s, Total samples = 34.91s (345.47%)
      flat  flat%   sum%        cum   cum%
         0     0%     0%     21.86s 62.62%  discovery.(*AuthenticatedGossiper).processNetworkAnnouncement
         0     0%     0%     21.82s 62.50%  discovery.(*AuthenticatedGossiper).handleNodeAnnouncement
         0     0%     0%     21.77s 62.36%  graph.(*Builder).IsPublicNode
         0     0%     0%     21.77s 62.36%  graph/db.(*SQLStore).IsPublicNode

Root Cause Analysis

1. Inefficient SQL Query Structure

Location: sqldb/sqlc/queries/graph.sql:55-71

-- name: IsPublicV1Node :one
SELECT EXISTS (
    SELECT 1
    FROM graph_channels c
    JOIN graph_nodes n ON n.id = c.node_id_1 OR n.id = c.node_id_2
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND n.pub_key = $1
);

Problem: The OR condition in the JOIN clause (n.id = c.node_id_1 OR n.id = c.node_id_2) prevents SQLite from efficiently utilizing indexes. This forces a full or near-full table scan of graph_channels for each query.

Existing Indexes:

  • graph_channels_node_id_1_idx on node_id_1
  • graph_channels_node_id_2_idx on node_id_2
  • graph_nodes_unique on (pub_key, version)

Despite these indexes, the OR condition in the JOIN prevents their effective use.
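The planner's choice can be sanity-checked outside of lnd with EXPLAIN QUERY PLAN. The following standalone sketch uses Python's stdlib sqlite3 module with the schema reduced to just the columns the query touches (table, column, and index names taken from the migration referenced above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_nodes (
    id INTEGER PRIMARY KEY,
    pub_key BLOB NOT NULL,
    version INTEGER NOT NULL
);
CREATE UNIQUE INDEX graph_nodes_unique ON graph_nodes(pub_key, version);

CREATE TABLE graph_channels (
    id INTEGER PRIMARY KEY,
    version INTEGER NOT NULL,
    node_id_1 INTEGER NOT NULL,
    node_id_2 INTEGER NOT NULL,
    bitcoin_1_signature BLOB
);
CREATE INDEX graph_channels_node_id_1_idx ON graph_channels(node_id_1);
CREATE INDEX graph_channels_node_id_2_idx ON graph_channels(node_id_2);
""")

OR_QUERY = """
SELECT EXISTS (
    SELECT 1
    FROM graph_channels c
    JOIN graph_nodes n ON n.id = c.node_id_1 OR n.id = c.node_id_2
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND n.pub_key = ?
)
"""

pub = lambda i: bytes([i]) * 33
conn.execute("INSERT INTO graph_nodes VALUES (1, ?, 1)", (pub(2),))
conn.execute("INSERT INTO graph_nodes VALUES (2, ?, 1)", (pub(3),))
conn.execute("INSERT INTO graph_channels VALUES (1, 1, 1, 2, ?)", (b"sig",))

# A "SCAN c" row in the plan (rather than "SEARCH c USING INDEX ...")
# means graph_channels is walked for every IsPublicNode call.
plan = [row[-1] for row in
        conn.execute("EXPLAIN QUERY PLAN " + OR_QUERY, (pub(2),))]
print("\n".join(plan))

is_public = conn.execute(OR_QUERY, (pub(2),)).fetchone()[0]
```

The exact plan text varies by SQLite version, but the key signal is whether the graph_channels step reads "SCAN" or "SEARCH ... USING INDEX".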

2. High Call Frequency

Location: discovery/gossiper.go:2519

func (d *AuthenticatedGossiper) handleNodeAnnouncement(...) {
    // ... node already added to database at line 2499
    if err := d.addNode(ctx, nodeAnn, ops...); err != nil {
        // error handling
    }

    // IsPublicNode called AFTER node is added
    isPublic, err := d.cfg.Graph.IsPublicNode(nodeAnn.NodeID)
    if err != nil {
        // error handling
    }

    // Only used to decide whether to broadcast
    if isPublic {
        announcements = append(announcements, networkMsg{...})
    }
}

This function is called for every node announcement received, including:

  • Initial gossip sync (thousands of announcements)
  • Real-time gossip updates
  • Re-announcements from peers

3. Redundant Database Operations

The query is executed after the node has already been added to the database. The check is only used to determine if the announcement should be rebroadcast to peers, but it requires scanning the entire channel graph each time.

4. Lock Contention

Secondary performance impact from database locking:

12.81s (36.69%) - syscall.syscall6
 8.75s (25.06%) - pthread_cond_wait
 5.67s (16.24%) - pthread_cond_signal
 5.34s (15.30%) - SQLite mutex operations

Contention between multiple goroutines for database access during these high-frequency queries exacerbates the problem.

Performance Impact

Quantitative Analysis

  • 62.36% of CPU time spent in a single query type
  • 2,177 calls in 10 seconds = ~217 calls/second
  • ~10ms average per call under load
  • During initial sync with thousands of node announcements, this becomes a significant synchronization bottleneck

User Impact

  • Slower initial blockchain/graph synchronization
  • Reduced gossip message processing throughput
  • Increased CPU usage during normal operation
  • Potential for gossip message queue backlog

Proposed Solutions

Solution 1: Query Rewrite with UNION (Immediate Fix)

Complexity: Low
Impact: High
Risk: Low

Rewrite the query to use UNION instead of OR, allowing index usage:

-- name: IsPublicV1Node :one
SELECT EXISTS (
    SELECT 1 FROM (
        SELECT c.id
        FROM graph_channels c
        JOIN graph_nodes n ON n.id = c.node_id_1
        WHERE c.version = 1
          AND c.bitcoin_1_signature IS NOT NULL
          AND n.pub_key = $1
        UNION
        SELECT c.id
        FROM graph_channels c
        JOIN graph_nodes n ON n.id = c.node_id_2
        WHERE c.version = 1
          AND c.bitcoin_1_signature IS NOT NULL
          AND n.pub_key = $1
    ) AS matches
);

Benefits:

  • Allows SQLite to use graph_channels_node_id_1_idx and graph_channels_node_id_2_idx
  • No code changes required, only SQL modification
  • Backward compatible

Estimated Impact: 10-50x performance improvement
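To check that the rewrite is semantically equivalent to the OR form, both queries can be compared over handcrafted edge cases (signed v1 channel, unsigned v1, v2-only, node appearing only on the node_id_2 side). A standalone harness using Python's stdlib sqlite3 (not lnd code; the derived table carries an alias, which Postgres requires, and `?1` stands in for sqlc's `$1`):

```python
import sqlite3

SCHEMA = """
CREATE TABLE graph_nodes (
    id INTEGER PRIMARY KEY, pub_key BLOB NOT NULL, version INTEGER NOT NULL);
CREATE UNIQUE INDEX graph_nodes_unique ON graph_nodes(pub_key, version);
CREATE TABLE graph_channels (
    id INTEGER PRIMARY KEY, version INTEGER NOT NULL,
    node_id_1 INTEGER NOT NULL, node_id_2 INTEGER NOT NULL,
    bitcoin_1_signature BLOB);
CREATE INDEX graph_channels_node_id_1_idx ON graph_channels(node_id_1);
CREATE INDEX graph_channels_node_id_2_idx ON graph_channels(node_id_2);
"""

OR_QUERY = """
SELECT EXISTS (
    SELECT 1 FROM graph_channels c
    JOIN graph_nodes n ON n.id = c.node_id_1 OR n.id = c.node_id_2
    WHERE c.version = 1 AND c.bitcoin_1_signature IS NOT NULL
      AND n.pub_key = ?1
)
"""

UNION_QUERY = """
SELECT EXISTS (
    SELECT 1 FROM (
        SELECT c.id FROM graph_channels c
        JOIN graph_nodes n ON n.id = c.node_id_1
        WHERE c.version = 1 AND c.bitcoin_1_signature IS NOT NULL
          AND n.pub_key = ?1
        UNION
        SELECT c.id FROM graph_channels c
        JOIN graph_nodes n ON n.id = c.node_id_2
        WHERE c.version = 1 AND c.bitcoin_1_signature IS NOT NULL
          AND n.pub_key = ?1
    ) AS matches
)
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

pub = lambda i: bytes([i]) * 33
for node_id in range(1, 6):
    conn.execute("INSERT INTO graph_nodes VALUES (?, ?, 1)",
                 (node_id, pub(node_id)))

# (id, version, node_id_1, node_id_2, signature)
conn.executemany("INSERT INTO graph_channels VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 2, b"sig"),  # signed v1: nodes 1 and 2 are public
    (2, 1, 3, 4, None),    # unsigned v1: does not make 3 or 4 public
    (3, 2, 4, 5, b"sig"),  # v2 channel: ignored by the v1 check
])

results = {}
for node_id in range(1, 6):
    key = (pub(node_id),)
    results[node_id] = (conn.execute(OR_QUERY, key).fetchone()[0],
                        conn.execute(UNION_QUERY, key).fetchone()[0])
```

Node 2 exercises the node_id_2 branch, which is the case an over-eager simplification of the OR would silently break.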

Solution 2: In-Memory Cache (Medium-term)

Complexity: Medium
Impact: Very High
Risk: Medium

Implement an LRU cache for IsPublicNode results:

func (s *SQLStore) IsPublicNode(pubKey [33]byte) (bool, error) {
    // Fast path: serve repeated lookups from an LRU cache kept on the
    // store (e.g. a thread-safe hashicorp/golang-lru cache, so no extra
    // mutex is needed here).
    if cached, ok := s.publicNodeCache.Get(pubKey); ok {
        return cached.(bool), nil
    }

    // Slow path: fall back to the database query.
    isPublic, err := s.isPublicNodeQuery(pubKey)
    if err != nil {
        return false, err
    }

    // Cache the result for subsequent announcements.
    s.publicNodeCache.Add(pubKey, isPublic)
    return isPublic, nil
}

Cache Invalidation Strategy:

  • Invalidate on channel addition/removal for affected nodes
  • TTL of 5-10 minutes as safety mechanism
  • Size limit of 10,000-50,000 entries (manageable memory footprint)
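The invalidation strategy above can be sketched as a size-bounded LRU map with a TTL safety net. The sketch below is language-agnostic pseudocode written as runnable Python; all names are illustrative, not lnd's actual types:

```python
import time
from collections import OrderedDict

class PublicNodeCache:
    """LRU map of pub_key -> is_public with a TTL as a safety net."""

    def __init__(self, max_entries=50_000, ttl_seconds=600,
                 clock=time.monotonic):
        self._entries = OrderedDict()  # pub_key -> (is_public, inserted_at)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, pub_key):
        item = self._entries.get(pub_key)
        if item is None:
            return None
        is_public, inserted_at = item
        if self._clock() - inserted_at > self._ttl:
            # TTL safety net: treat stale entries as misses.
            del self._entries[pub_key]
            return None
        self._entries.move_to_end(pub_key)  # refresh LRU position
        return is_public

    def put(self, pub_key, is_public):
        self._entries[pub_key] = (is_public, self._clock())
        self._entries.move_to_end(pub_key)
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used

    def invalidate(self, pub_key):
        # Called when a channel involving pub_key is added or removed.
        self._entries.pop(pub_key, None)
```

In lnd this would naturally use an existing thread-safe LRU implementation rather than a hand-rolled map.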

Benefits:

  • Near-zero latency for repeated lookups
  • Dramatically reduces database load
  • Node public status rarely changes

Challenges:

  • Cache invalidation complexity
  • Memory overhead (acceptable: ~1-2MB for 50K entries)

Estimated Impact: 100-1000x improvement for cache hits

Solution 3: Denormalize Public Status (Long-term)

Complexity: High
Impact: Very High
Risk: Medium-High

Add is_public boolean column to graph_nodes table:

ALTER TABLE graph_nodes ADD COLUMN is_public BOOLEAN DEFAULT FALSE;
CREATE INDEX graph_nodes_is_public_idx ON graph_nodes(is_public, pub_key);

Update logic:

  • Compute is_public when channels are added/removed
  • Simple SELECT is_public FROM graph_nodes WHERE pub_key = $1 query
  • No JOIN required
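The migration plus backfill can be illustrated end to end with the same stdlib-sqlite3 harness; the backfill UPDATE below is one possible approach, not lnd's actual migration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_nodes (
    id INTEGER PRIMARY KEY, pub_key BLOB NOT NULL, version INTEGER NOT NULL);
CREATE UNIQUE INDEX graph_nodes_unique ON graph_nodes(pub_key, version);
CREATE TABLE graph_channels (
    id INTEGER PRIMARY KEY, version INTEGER NOT NULL,
    node_id_1 INTEGER NOT NULL, node_id_2 INTEGER NOT NULL,
    bitcoin_1_signature BLOB);
""")

pub = lambda i: bytes([i]) * 33
conn.executemany("INSERT INTO graph_nodes VALUES (?, ?, 1)",
                 [(i, pub(i)) for i in range(1, 4)])
conn.execute("INSERT INTO graph_channels VALUES (1, 1, 1, 2, ?)", (b"sig",))

conn.executescript("""
ALTER TABLE graph_nodes ADD COLUMN is_public BOOLEAN DEFAULT FALSE;
CREATE INDEX graph_nodes_is_public_idx ON graph_nodes(is_public, pub_key);

-- Backfill existing rows from the channel table (one-time cost).
UPDATE graph_nodes SET is_public = EXISTS (
    SELECT 1 FROM graph_channels c
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND (c.node_id_1 = graph_nodes.id OR c.node_id_2 = graph_nodes.id)
);
""")

# The hot-path lookup collapses to a single indexed point query.
LOOKUP = "SELECT is_public FROM graph_nodes WHERE pub_key = ?"
flags = {i: conn.execute(LOOKUP, (pub(i),)).fetchone()[0]
         for i in range(1, 4)}
```

The expensive OR scan survives only in the write path, where it runs once per channel change instead of once per node announcement.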

Benefits:

  • O(1) lookup via the graph_nodes_unique index on pub_key
  • Eliminates JOIN entirely
  • Most efficient long-term solution

Challenges:

  • Schema migration required
  • Complex update logic when channels change
  • Need to backfill existing data
  • Increased write complexity

Estimated Impact: 1000x+ improvement

Solution 4: Batch Processing

Complexity: Medium
Impact: Medium
Risk: Low

Accumulate node announcements and check public status in batches:

func (d *AuthenticatedGossiper) batchCheckPublicNodes(nodes [][33]byte) (map[[33]byte]bool, error) {
    // Single query with WHERE pub_key IN (?, ?, ...)
    return d.cfg.Graph.IsPublicNodeBatch(nodes)
}
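A hypothetical IsPublicNodeBatch (the name is illustrative; no such query exists in lnd today) could resolve many keys in one round trip. A sketch in stdlib-sqlite3 Python over the same reduced schema:

```python
import sqlite3

def is_public_node_batch(conn, pub_keys):
    """Return {pub_key: is_public} for all requested keys in one query."""
    if not pub_keys:
        return {}
    placeholders = ",".join("?" for _ in pub_keys)
    rows = conn.execute(f"""
        SELECT n.pub_key FROM graph_nodes n
        WHERE n.pub_key IN ({placeholders})
          AND EXISTS (
              SELECT 1 FROM graph_channels c
              WHERE c.version = 1
                AND c.bitcoin_1_signature IS NOT NULL
                AND (c.node_id_1 = n.id OR c.node_id_2 = n.id)
          )
    """, list(pub_keys))
    public = {row[0] for row in rows}
    return {pk: pk in public for pk in pub_keys}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_nodes (
    id INTEGER PRIMARY KEY, pub_key BLOB NOT NULL, version INTEGER NOT NULL);
CREATE TABLE graph_channels (
    id INTEGER PRIMARY KEY, version INTEGER NOT NULL,
    node_id_1 INTEGER NOT NULL, node_id_2 INTEGER NOT NULL,
    bitcoin_1_signature BLOB);
""")

pub = lambda i: bytes([i]) * 33
conn.executemany("INSERT INTO graph_nodes VALUES (?, ?, 1)",
                 [(i, pub(i)) for i in (1, 2, 3)])
conn.execute("INSERT INTO graph_channels VALUES (1, 1, 1, 2, ?)", (b"sig",))

batch = is_public_node_batch(conn, [pub(1), pub(3)])
```

Note that building an IN list dynamically is fine here because only placeholders are interpolated, never values; sqlc would need a generated variant of the query for this.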

Benefits:

  • Reduces query count
  • Better database query planner utilization
  • Lower transaction overhead

Challenges:

  • Increases latency for individual announcements
  • Requires refactoring of gossip processing pipeline

Estimated Impact: 5-10x improvement

Recommended Approach

Phase 1 (Immediate): Implement Solution 1 (Query Rewrite)

  • Low risk, high reward
  • Can be deployed immediately
  • No architectural changes

Phase 2 (Short-term): Implement Solution 2 (LRU Cache)

  • Combines with Phase 1 for maximum impact
  • Cache hit rate likely >90% during normal operation
  • Proven pattern already used in LND codebase

Phase 3 (Long-term): Consider Solution 3 (Denormalization)

  • Evaluate based on Phase 1 & 2 results
  • If performance is still inadequate, pursue denormalization
  • Coordinate with graph database refactoring efforts

Testing Recommendations

  1. Benchmark Testing:

    • Create benchmark comparing current vs. optimized query
    • Test with realistic graph sizes (80K+ channels)
    • Measure under concurrent load
  2. Integration Testing:

    • Test during initial sync
    • Test with high gossip message rate
    • Verify cache invalidation correctness (if implemented)
  3. Performance Validation:

    • CPU profiling before/after
    • Gossip throughput measurement
    • Sync time comparison

References

  • CPU Profile: cpu.prof (Duration: 10.10s, Total samples: 34.91s)
  • Code Locations:
    • Query: sqldb/sqlc/queries/graph.sql:55-71
    • Caller: discovery/gossiper.go:2519
    • DB Schema: sqldb/sqlc/migrations/000008_graph.up.sql

Appendix: Alternative Query Plans Tested

Option A: Separate Lookups

SELECT EXISTS (
    SELECT 1 FROM graph_channels c
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND (c.node_id_1 = (SELECT id FROM graph_nodes WHERE pub_key = $1)
           OR c.node_id_2 = (SELECT id FROM graph_nodes WHERE pub_key = $1))
);

Status: Still uses OR, similar performance

Option B: IN clause

SELECT EXISTS (
    SELECT 1 FROM graph_channels c
    WHERE c.version = 1
      AND c.bitcoin_1_signature IS NOT NULL
      AND (SELECT id FROM graph_nodes WHERE pub_key = $1) IN (c.node_id_1, c.node_id_2)
);

Status: Better than OR, worse than UNION

Expected Behavior

Debug Information

No response

Environment

No response

Labels

bug (Unintended code behaviour), graph, sql, sqlite (Issues related to SQLite)
