diff --git a/projects/auth/PROJ-148-auth-sidecar.md b/projects/auth/PROJ-148-auth-sidecar.md new file mode 100644 index 00000000..35115638 --- /dev/null +++ b/projects/auth/PROJ-148-auth-sidecar.md @@ -0,0 +1,551 @@ + + +# Authorization Sidecar (authz_sidecar) + +**Author**: @RyaliNvidia
+**PIC**: @RyaliNvidia
+**Proposal Issue**: [#148](https://github.com/NVIDIA/OSMO/issues/148) + +## Overview + +The Authorization Sidecar project implements a high-performance Golang gRPC service that provides +centralized role-based access control (RBAC) for all OSMO services through Envoy's External +Authorization API. This replaces the Python `AccessControlMiddleware` with a more efficient, +scalable, and maintainable solution. + +### Motivation + +- **Centralized Logic**: OSMO is slowly converting microservices to Go. By introducing a sidecar, OSMO wouldn't need separate auth mechanisms for the Python and Go services +- **Scalability**: Rate limiting is now applied only to authorized requests. Currently, requests from unauthorized users count against rate limits, so heavy APIs with lower rate-limiting thresholds can be severely impacted. +- **Performance**: Go is **2-3x faster** at low load and **8-10x faster** at high load (performance numbers are detailed in a section below) +- **Maintainability**: Separates authorization logic from application code. Many services, such as logger and router, do not need access to PostgreSQL, but currently PostgreSQL connection information must be injected into their containers because of the middleware. + +### Problem + +Currently, authorization is implemented as Python ASGI middleware (`AccessControlMiddleware`) embedded in each service. This approach has several limitations: + +1. **Performance Overhead**: The Python middleware has higher latency and lower throughput than an equivalent Go service +2. **Limited Caching**: No role cache, leading to redundant database queries +3. 
**Tight Coupling**: Authorization logic intertwined with service logic + +## Use Cases + +| Use Case | Description | +|---|---| +| User accesses workflow API | User with `osmo-user` role requests `/api/workflow` - authz_sidecar checks role policies, queries cache/database, and allows access | +| Admin accesses all endpoints | User with `osmo-admin` role can access all endpoints except those with deny patterns (e.g., `!/api/agent/*`) | +| Unauthenticated public access | Anonymous user accesses `/api/version` or `/health` - allowed via `osmo-default` role | +| Unauthorized access attempt | User without proper roles attempts `/api/workflow` - authz_sidecar denies with 403 Forbidden | +| Policy cache hit | Repeated requests for same role combination served from cache in <5ms | +| Policy update | Database role policies updated - reflected in authorization after cache TTL expires (5 minutes) | + +## Requirements + +| Title | Description | Type | +|---|---|---| +| Envoy External Authorization API | authz_sidecar shall implement Envoy's External Authorization v3 gRPC API | Functional | +| Role-based access control | Given user roles in `x-osmo-roles` header, authz_sidecar shall query PostgreSQL and validate access per role policies | Functional | +| Path pattern matching | authz_sidecar shall support glob patterns (`/api/workflow/*`) and deny patterns (`!/api/admin/*`) for path matching | Functional | +| Authorization latency (cached) | authz_sidecar shall respond to cached authorization requests in <5ms at p99 | KPI | +| Authorization latency (uncached) | authz_sidecar shall respond to uncached authorization requests in <30ms at p99 | KPI | +| Throughput | authz_sidecar shall handle 1000+ authorization requests per second per instance | KPI | +| Role caching | authz_sidecar shall cache role policies in memory with configurable TTL (default: 5 minutes) | Performance | +| Connection pooling | authz_sidecar shall maintain PostgreSQL connection pool (default: 10 connections) 
| Performance | +| Secure secrets | Database passwords shall be loaded from Kubernetes secrets, never hardcoded | Security | +| Least privilege | authz_sidecar shall run as non-root user with all capabilities dropped | Security | +| Health checks | authz_sidecar shall expose gRPC health check endpoint for Kubernetes probes | Reliability | +| Graceful degradation | On authz_sidecar failure, Envoy shall deny requests (`failure_mode_allow: false`) | Security | + +## Architectural Details + +### High-Level Architecture + +```mermaid +flowchart LR + Client --> Envoy["Envoy Proxy
(Port 80)"] + Envoy --> AuthzSidecar["authz_sidecar
(Port 50052)"] + AuthzSidecar --> PostgreSQL["PostgreSQL
(roles table)"] + AuthzSidecar -->|ALLOW / DENY| Envoy + Envoy --> Service["Service
(Port 5000)"] +``` + +### Key Components + +1. **Envoy Proxy**: HTTP/gRPC filter chain with External Authorization filter +2. **authz_sidecar**: Golang gRPC service implementing authorization logic +3. **PostgreSQL**: Stores role definitions and policies (JSONB array) +4. **Role Cache**: In-memory LRU cache with TTL for role policies + +### Request Flow + +1. **Client Request** → Envoy Proxy (port 80/443) +2. **JWT Validation** → Extract user roles to `x-osmo-roles` header (Lua filter) +3. **External Authorization** → Envoy calls authz_sidecar via gRPC: + - Sends `CheckRequest` with path, method, headers + - authz_sidecar checks role cache + - On cache miss: queries PostgreSQL for role policies + - Evaluates policies against request + - Returns `CheckResponse` (OK or PERMISSION_DENIED) +4. **Continue or Reject** → Envoy routes to service or returns 403 + +## Detailed Design + +### 1. Authorization Service (Golang) + +**Location**: `external/src/service/authz_sidecar/` + +**Structure**: +``` +authz_sidecar/ +├── main.go # Server initialization, gRPC setup +├── server/ +│ ├── authz_server.go # Envoy External Authorization implementation +│ └── role_cache.go # In-memory role cache with TTL +├── integration_test.go # Integration tests +utils_go/ +└── postgres/ + ├── postgres.go # PostgreSQL connector and role query logic +``` + +**Key Interfaces**: + +```go +// Envoy External Authorization API +func (s *AuthzServer) Check(ctx context.Context, req *CheckRequest) (*CheckResponse, error) + +// Authorization logic +func (s *AuthzServer) checkAccess(ctx context.Context, path, method string, roles []string) (bool, error) +func (s *AuthzServer) hasAccess(role *Role, path, method string) bool +func (s *AuthzServer) matchPathPattern(pattern, path string) bool +``` + +### 2. 
PostgreSQL Integration + +**Location**: `external/src/service/utils_go/postgres/` + +**Database Schema**: +```sql +CREATE TABLE roles ( + name VARCHAR PRIMARY KEY, + description TEXT, + policies JSONB[], + immutable BOOLEAN DEFAULT FALSE +); +``` + +**Policy Structure**: +```json +{ + "actions": [ + {"base": "http", "path": "/api/workflow/*", "method": "GET"}, + {"base": "http", "path": "!/api/admin/*", "method": "*"} + ] +} +``` + +**Query Pattern**: +```sql +SELECT name, description, array_to_json(policies)::text, immutable +FROM roles +WHERE name = ANY($1) +ORDER BY name; +``` + +### 3. Role Caching + +**Implementation**: Thread-safe in-memory cache with LRU eviction + +**Cache Key**: Sorted, comma-separated role names (e.g., `"osmo-default,osmo-user"`) + +**Features**: +- TTL-based expiration (default: 5 minutes) +- Max size limit (default: 1000 entries) +- Thread-safe with `sync.RWMutex` +- Cache statistics for monitoring + +### 4. Envoy Configuration + +**Filter Position**: After JWT/roles extraction, before rate limiting + +**External Authorization Filter**: +```yaml +- name: envoy.filters.http.ext_authz + typed_config: + "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz + transport_api_version: V3 + failure_mode_allow: false + grpc_service: + envoy_grpc: + cluster_name: authz-sidecar + timeout: 0.5s +``` + +**Cluster Definition**: +```yaml +- name: authz-sidecar + connect_timeout: 0.25s + type: STRICT_DNS + lb_policy: ROUND_ROBIN + load_assignment: + cluster_name: authz-sidecar + endpoints: + - lb_endpoints: + - endpoint: + address: + socket_address: + address: 127.0.0.1 + port_value: 50052 +``` + +### 5. 
 Helm Chart Integration + +**Sidecar Container**: Defined in `_sidecar-helpers.tpl` as `osmo.authz-sidecar-container` + +**Configuration**: +```yaml +sidecars: + authz: + enabled: true + image: {{ .Values.global.osmoImageLocation }}/authz-sidecar:{{ .Values.global.osmoImageTag }} + grpcPort: 50052 + postgres: + host: postgres + database: osmo_db + passwordSecretName: postgres-secret + cache: + enabled: true + ttl: 300s + maxSize: 1000 +``` + +### Alternatives Considered + +#### Alternative 1: Keep Python Middleware + +* **Pros**: No migration needed, well-tested +* **Cons**: + * Authorization logic must be written and maintained separately for Python and Go services + * The middleware runs after rate limiting, so unauthorized requests count against rate limits and can degrade performance for legitimate users + * Services that do not otherwise need PostgreSQL must still be given database access for the middleware +* **Why not chosen**: The cons above outweigh the benefit of avoiding a migration + +#### Alternative 2: Centralized Authorization Service + +* **Pros**: Single service to manage, easier updates +* **Cons**: Single point of failure, network latency, doesn't scale horizontally +* **Why not chosen**: Sidecar pattern provides better availability and scales with services + +#### Alternative 3: Open Policy Agent (OPA) + +* **Pros**: Industry-standard, feature-rich policy engine +* **Cons**: Requires [OPAL](https://github.com/permitio/opal) for on-the-fly configuration updates between the config service and OPA +* **Why not chosen**: Avoids adding a dependency on third-party tools + +### Backwards Compatibility + +**Migration Strategy**: +1. **Phase 1**: Deploy authz_sidecar alongside Python middleware (both enabled) +2. **Phase 2**: Monitor and validate authz_sidecar authorization decisions +3. **Phase 3**: Disable Python middleware once authz_sidecar is proven +4. **Phase 4**: Remove Python middleware code from services + +**No Breaking Changes**: Authorization behavior remains identical; only the implementation changes. 
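Validating that authorization behavior remains identical during Phase 2 requires a precise statement of the policy semantics. The following is a minimal Go sketch of the evaluation rules described in the Detailed Design (glob path patterns, `!` deny patterns, and the sorted role cache key). It is not the actual `authz_server.go` code: the `Action` and `Role` types and all function bodies are hypothetical, and it assumes that a trailing `*` matches any remainder of the path, that a matching deny pattern overrides any allow within the same role, and that access is granted if any one role grants it.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Action mirrors one entry of the "actions" array in a role policy.
type Action struct {
	Base   string
	Path   string // a leading "!" marks a deny pattern
	Method string // "*" matches any method
}

// Role is a hypothetical in-memory form of a row in the roles table.
type Role struct {
	Name    string
	Actions []Action
}

// cacheKey builds the role-cache key: sorted, comma-separated role names.
func cacheKey(roles []string) string {
	sorted := append([]string(nil), roles...) // copy so the caller's slice keeps its order
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

// matchPath assumes a trailing "*" matches any remainder of the path,
// including further "/" segments; any other pattern must match exactly.
func matchPath(pattern, path string) bool {
	if strings.HasSuffix(pattern, "*") {
		return strings.HasPrefix(path, strings.TrimSuffix(pattern, "*"))
	}
	return pattern == path
}

// matchMethod treats "*" as a wildcard and compares methods case-insensitively.
func matchMethod(pattern, method string) bool {
	return pattern == "*" || strings.EqualFold(pattern, method)
}

// hasAccess assumes a matching deny pattern overrides any allow in the same role.
func hasAccess(role Role, path, method string) bool {
	allowed := false
	for _, a := range role.Actions {
		if !matchMethod(a.Method, method) {
			continue
		}
		if strings.HasPrefix(a.Path, "!") {
			if matchPath(strings.TrimPrefix(a.Path, "!"), path) {
				return false
			}
			continue
		}
		if matchPath(a.Path, path) {
			allowed = true
		}
	}
	return allowed
}

// checkAccess grants the request if any of the user's roles grants it.
func checkAccess(roles []Role, path, method string) bool {
	for _, r := range roles {
		if hasAccess(r, path, method) {
			return true
		}
	}
	return false
}

func main() {
	admin := Role{Name: "osmo-admin", Actions: []Action{
		{Base: "http", Path: "*", Method: "*"},
		{Base: "http", Path: "!/api/agent/*", Method: "*"},
	}}
	fmt.Println(cacheKey([]string{"osmo-user", "osmo-default"}))        // osmo-default,osmo-user
	fmt.Println(checkAccess([]Role{admin}, "/api/workflow/42", "GET"))  // true
	fmt.Println(checkAccess([]Role{admin}, "/api/agent/status", "GET")) // false
}
```

Running the same table of (roles, path, method) cases against both the Python middleware and the sidecar is one way to check decisions for parity before Phase 3. If the real matcher uses different glob semantics (for example, `*` not crossing `/`), only `matchPath` would need to change.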
+ +### Performance + +**Benchmark Overview**: + +The performance comparison between the Python `AccessControlMiddleware` and Go `authz_sidecar` was conducted under two scenarios to provide a complete performance picture: + +1. **Scenario 1 (WITHOUT Connection Pooling)**: New connection created for each request - demonstrates protocol overhead +2. **Scenario 2 (WITH Connection Pooling)**: Connections reused - reflects production deployment patterns + +All tests measure authorization checks with cache hits (worst-case includes database query overhead). + +--- + +#### Scenario 1: WITHOUT Connection Pooling + +**Low Load (Sequential Requests)**: + +| Metric | Python | Go | Speedup | +|--------|--------|-----|---------| +| Avg Latency | 893µs | 1.6ms | 0.6x | +| P50 Latency | 731µs | 1.5ms | 0.5x | +| P95 Latency | 1.7ms | 2.6ms | 0.6x | +| P99 Latency | 1.9ms | 3.2ms | 0.6x | + +**High Load (Concurrent Requests)**: + +| Clients | Python Throughput | Go Throughput | Speedup | Python Latency (Avg) | Go Latency (Avg) | +|---------|------------------|---------------|---------|---------------------|------------------| +| 50 | 2,250 req/s | 8,586 req/s | 3.8x | 22.2ms | 5.8ms | +| 100 | 1,993 req/s | 12,320 req/s | 6.2x | 50.2ms | 8.1ms | +| 200 | 1,943 req/s | 18,233 req/s | 9.4x | 103.5ms | 11.0ms | + +**Why Python appears faster at low load**: +- HTTP/1.1 has lower connection setup cost (~300µs) than gRPC/HTTP2 (~700µs) +- gRPC uses HTTP/2 which requires more complex handshaking (SETTINGS frames, binary framing) +- This is due to the tradeoff: **HTTP = fast to connect, gRPC = fast once connected** +- ⚠️ **This scenario does NOT represent production usage** (Envoy maintains persistent connections) + +--- + +#### Scenario 2: WITH Connection Pooling (Production Performance) + +**Low Load (Sequential Requests)**: + +| Metric | Python | Go | Speedup | +|--------|--------|-----|---------| +| Avg Latency | 670µs | 303µs | **2.2x** | +| P50 Latency | 552µs | 290µs | **1.9x** | +| 
P95 Latency | 1.3ms | 404µs | **3.3x** | +| P99 Latency | 1.6ms | 600µs | **2.6x** | + +**High Load (Concurrent Requests)**: + +| Clients | Python Throughput | Go Throughput | Speedup | Python Latency (Avg) | Go Latency (Avg) | +|---------|------------------|---------------|---------|---------------------|------------------| +| 50 | 2,986 req/s | 24,158 req/s | **8.1x** | 16.8ms | 2.1ms | +| 100 | 2,796 req/s | 22,050 req/s | **7.9x** | 35.8ms | 4.6ms | +| 200 | 2,444 req/s | 25,069 req/s | **10.3x** | 82.1ms | 8.0ms | + +--- + +#### Performance Analysis + +**Connection Pooling Impact**: +- Eliminating connection overhead reveals Go's true performance advantage +- Go authz_sidecar is **2-3x faster** at low load and **8-10x faster** at high load +- Production deployments use persistent connections (Scenario 2) + +**Concurrency Scaling**: + +| Implementation | 50 Clients | 100 Clients | 200 Clients | Scaling Behavior | +|----------------|-----------|------------|------------|------------------| +| Python | 2,986 req/s | 2,796 req/s | 2,444 req/s | **Plateaus & degrades** | +| Go | 24,158 req/s | 22,050 req/s | 25,069 req/s | **Remains consistent** | + +- **Python**: Throughput plateaus/degrades due to: + - Global Interpreter Lock (GIL) limiting CPU parallelism + - asyncio event loop contention under high concurrency + - Higher memory pressure and context switching overhead + +- **Go**: Throughput remains stable/scales due to: + - True parallelism with goroutines (no GIL) + - Efficient M:N scheduling (goroutines on OS threads) + - Lower memory footprint per concurrent request + +**Production Performance (With Connection Pooling)**: +- **Latency**: + - P50: 290µs (Go) vs 552µs (Python) - Go is 1.9x faster + - P99: 600µs (Go) vs 1.6ms (Python) - Go is 2.6x faster (under 1ms SLO) +- **Throughput**: + - Low load (50 clients): 24,158 req/s (Go) vs 2,986 req/s (Python) - 8x improvement + - High load (200 clients): 25,069 req/s (Go) vs 2,444 req/s (Python) - 10x improvement 
+- **Memory usage**: ~100-200Mi per sidecar (Go) vs ~300-500Mi (Python) +- **CPU usage**: ~100-200m per sidecar (Go) vs ~400-800m (Python) + +--- + +#### Key Takeaways + +✅ **Scenario 2 (WITH pooling) reflects production performance** - Envoy maintains persistent gRPC connections + +✅ **Go authz_sidecar significantly outperforms Python** when tested fairly: + - 2-3x faster latency at low load + - 8-10x higher throughput at high load + - Better tail latencies (P95, P99) + +✅ **Go's advantage increases with load** - Critical for high-traffic services: + - Python throughput degrades: 2,986 → 2,796 → 2,444 req/s (50→100→200 clients) + - Go throughput remains stable: 24,158 → 22,050 → 25,069 req/s (50→100→200 clients) + +✅ **Resource efficiency** - Go uses 50-60% less CPU and memory than Python + +⚠️ **Connection overhead matters** - But only in unpooled scenarios (not production): + - gRPC has higher connection cost (~700µs) than HTTP (~300µs) + - This is why Scenario 1 shows Python faster at low load + - Production deployments always use connection pooling + +**Optimization Strategies**: +1. **Caching**: 5-minute TTL reduces database load by >99% (cache hit rate) +2. **Connection Pooling**: Reuse PostgreSQL connections (10 max connections) +3. **gRPC Efficiency**: HTTP/2 multiplexing, protobuf binary serialization +4. **Short-circuit Evaluation**: Return on first matching policy (no wasted checks) +5. 
**Goroutine Concurrency**: Handles 200+ concurrent requests with minimal overhead + +### Operations + +**Deployment**: +- Deployed as sidecar container alongside Envoy in service pods +- Configured via Helm chart values +- No separate deployment or service needed + +**Monitoring**: +- Structured JSON logs with slog +- gRPC health checks for Kubernetes probes +- Authorization decision logging (allow/deny with context) + +**Configuration**: +- All settings via command-line flags +- Database password from Kubernetes secret +- No configuration files needed + +### Security + +**Threat Mitigations**: +1. **Header Injection**: Prevented by Envoy's `strip-unauthorized-headers` filter +2. **Database Access**: Read-only access to roles table; credentials in K8s secrets +3. **Cache Poisoning**: Cache keys are deterministic; not user-controllable +4. **Bypass Attempts**: `failure_mode_allow: false` denies on authz errors + +**Security Controls**: +- Runs as non-root user (UID 10001) +- All capabilities dropped +- Communicates with Envoy over localhost only +- Database SSL enabled in production + +### Documentation + +**Created**: +- ✅ `external/deployments/charts/service/AUTHZ_SIDECAR.md` - Helm integration guide +- ✅ `external/authz-sidecar-design.md` - Detailed design reference +- ✅ BUILD file comments for integration tests + +**To Update**: +- Service deployment guides to reference authz_sidecar option +- Architecture diagrams showing sidecar pattern +- Runbooks for troubleshooting authorization issues + +### Testing + +**Unit Tests**: +- `authz_server_test.go`: Path matching, method matching, policy evaluation +- `role_cache_test.go`: Cache operations, TTL expiration, LRU eviction +- `postgres_client_test.go`: Data structure validation + +**Integration Tests**: +- `authz_sidecar/integration_test.go`: Live service health and authorization tests +- `postgres/postgres_integration_test.go`: Database connectivity and role fetching +- Tags: `manual`, `external` (requires running 
services) + +**Test Commands**: +```bash +# Unit tests +bazel test //src/service/authz_sidecar/server:server_test +bazel test //src/service/utils_go/postgres:postgres_test + +# Integration tests (requires running PostgreSQL and authz_sidecar) +bazel test //src/service/utils_go/postgres:postgres_integration_test +bazel test //src/service/authz_sidecar:authz_sidecar_integration_test + +# Race detection +bazel test ... --@io_bazel_rules_go//go/config:race +``` + +**Test Metrics**: +- Unit test coverage: >80% for authorization logic +- Integration test success rate: 100% +- Performance test: p99 latency <5ms (cached), <30ms (uncached) + +### Dependencies + +**Upstream Dependencies** (impacts this project): +- Envoy Proxy: External Authorization filter configuration +- PostgreSQL: Database schema and role data +- Helm charts: Sidecar injection templates + +**Downstream Dependencies** (impacted by this project): +- All OSMO services: Authorization now handled by sidecar instead of middleware +- Monitoring/logging systems: New log format and metrics +- Deployment pipelines: New Docker image to build and deploy + +## Implementation Plan + +### Phase 1: Core Service ✅ **COMPLETED** +- [x] Implement Golang gRPC server with External Authorization API +- [x] Implement PostgreSQL client with connection pooling +- [x] Implement role cache with TTL and LRU eviction +- [x] Add unit tests for all components +- [x] Create integration tests for service and database + +### Phase 2: Helm Integration ✅ **COMPLETED** +- [x] Add authz configuration to `values.yaml` +- [x] Create sidecar helper template +- [x] Update Envoy configuration with ext_authz filter +- [x] Add authz-sidecar cluster definition +- [x] Inject authz_sidecar into service deployments + +### Phase 3: Testing & Validation (IN PROGRESS) +- [ ] Deploy to development environment +- [ ] Integration testing with real workloads +- [ ] Monitor cache hit rates (target >95%) + +### Phase 4: Cleanup (FUTURE) +- [ ] Remove Python 
`AccessControlMiddleware` from services +- [ ] Remove `check_user_access()` function calls +- [ ] Update documentation to reflect new architecture + +## Open Questions + +- [x] ~~Should we support both OPA and authz_sidecar simultaneously?~~ + - **Decision**: Yes, controlled by `sidecars.authz.enabled` flag + +- [x] ~~What cache TTL provides the best balance between freshness and performance?~~ + - **Decision**: 5 minutes (configurable); allows policy updates without restart while maintaining >95% hit rate + +- [ ] Should we implement active cache invalidation (PostgreSQL LISTEN/NOTIFY)? + - **Status**: Deferred to future enhancement; TTL-based expiration sufficient for initial release + +- [ ] Do we need to support attribute-based access control (ABAC) in the future? + - **Status**: Out of scope for initial release; can be added as enhancement + +## Appendix + +### File Locations + +**Service Code**: +- `external/src/service/authz_sidecar/main.go` - Entry point +- `external/src/service/authz_sidecar/server/authz_server.go` - gRPC implementation +- `external/src/service/authz_sidecar/server/role_cache.go` - Cache implementation +- `external/src/service/utils_go/postgres/postgres_client.go` - Database client + +**Helm Charts**: +- `external/deployments/charts/service/values.yaml` - Configuration (sidecars.authz) +- `external/deployments/charts/service/templates/_sidecar-helpers.tpl` - Sidecar template +- `external/deployments/charts/service/templates/_envoy-config.tpl` - Envoy filter config + +**Tests**: +- `external/src/service/authz_sidecar/integration_test.go` - Service integration tests +- `external/src/service/utils_go/postgres/postgres_integration_test.go` - Database tests + +### Build Commands + +```bash +# Build binary +bazel build //src/service/authz_sidecar:authz_sidecar_bin + +# Build Docker image +bazel build //src/service/authz_sidecar:authz_sidecar_image + +# Run locally +bazel run //src/service/authz_sidecar:authz_sidecar_bin -- \ + 
--postgres-host=localhost --postgres-db=osmo_db --postgres-password=osmo +``` diff --git a/projects/auth/PROJ-148-direct-idp-integration.md b/projects/auth/PROJ-148-direct-idp-integration.md new file mode 100644 index 00000000..29d55520 --- /dev/null +++ b/projects/auth/PROJ-148-direct-idp-integration.md @@ -0,0 +1,1094 @@ + + +# Direct IDP Integration: Removing Keycloak + +**Author**: @RyaliNvidia
+**PIC**: @RyaliNvidia
+**Proposal Issue**: [#148](https://github.com/NVIDIA/OSMO/issues/148) + +## Overview + +This document describes how to configure Envoy to authenticate users directly with external +identity providers (Microsoft Entra ID, Google, and Amazon Cognito) without using Keycloak as +an intermediary. It also covers the new Role Management APIs for assigning and removing users from roles. + +### Motivation + +- **Reduce complexity** — Eliminate Keycloak as a dependency, reducing deployment and maintenance overhead +- **Direct integration** — Use organization's existing identity provider directly for SSO +- **Simplified architecture** — Fewer moving parts means easier debugging and operations +- **Cost reduction** — One less service to deploy, scale, and maintain + +### Problem + +Using Keycloak as an identity broker introduces several issues that complicate the authentication and authorization flow: + +1. **Duplicated Role Management** — Roles must be defined in two places: + - In Keycloak (clients, realm roles, group mappings) + - In OSMO's database (role policies, user-role assignments) + + This duplication leads to synchronization challenges and inconsistent state when roles are updated in one system but not the other. + +2. **Complex Role Mapping** — Keycloak requires roles to be created in both `osmo-browser-flow` and `osmo-device` clients, then mapped to groups, then groups assigned to users. A single role assignment requires: + - Create role in `osmo-browser-flow` client + - Create same role in `osmo-device` client + - Create a group + - Assign both roles to the group + - Add user to the group + - Create matching role policy in OSMO database + +3. **Opaque Token Claims** — Keycloak transforms and relays claims from the upstream IDP, making it difficult to: + - Debug authentication issues (which system rejected the token?) 
+ - Understand what claims are available from the original IDP + - Leverage IDP-specific features (Azure AD groups, Google Workspace domains, Cognito custom attributes) + +4. **Operational Overhead** — Keycloak adds significant operational burden: + - Separate database (PostgreSQL) for Keycloak state + - Realm configuration management and backup + - Version upgrades and security patches + - Additional monitoring and alerting + - Troubleshooting authentication flows across three systems (IDP → Keycloak → OSMO) + +5. **User Experience Friction** — The extra hop through Keycloak adds: + - Additional redirect latency during login + - Potential for session mismatches between Keycloak and OSMO + - Confusing logout behavior (must log out of both systems) + +**By removing Keycloak and connecting Envoy directly to the IDP**, we achieve: + +- **Single source of truth** for role assignments (OSMO database only) +- **Simplified role management** via REST APIs +- **Direct access** to IDP claims without transformation +- **Reduced latency** in authentication flow +- **Easier debugging** with fewer components in the chain +- **Lower operational costs** with one less service to maintain + +### Architecture Comparison + +**Before (with Keycloak):** +```mermaid +flowchart LR + Browser --> Envoy["Envoy
(OAuth2)"] + Envoy --> Keycloak["Keycloak
(Broker)"] + Keycloak --> IDP["IDP
(MS/GG/AMZ)"] +``` + +**After (Direct IDP):** +```mermaid +flowchart LR + Browser --> Envoy["Envoy
(OAuth2)"] + Envoy --> IDP["IDP
(MS/GG/AMZ)"] +``` + +--- + +## Table of Contents + +1. [Prerequisites](#prerequisites) +2. [Microsoft Entra ID (Azure AD)](#microsoft-entra-id-azure-ad) +3. [Google OAuth2](#google-oauth2) +4. [AWS IAM Identity Center (AWS SSO)](#aws-iam-identity-center-aws-sso) +5. [Envoy Configuration](#envoy-configuration) +6. [Role Management APIs](#role-management-apis) +7. [Verification Steps](#verification-steps) +8. [Migration from Keycloak](#migration-from-keycloak) +9. [Troubleshooting](#troubleshooting) + +--- + +### Common Values + +Throughout this guide, replace these placeholders: + +| Placeholder | Description | Example | +|-------------|-------------|---------| +| `` | Your OSMO service hostname | `osmo.example.com` | +| `` | Microsoft tenant ID | `12345678-1234-1234-1234-123456789abc` | +| `` | OAuth2 client/application ID | `abcd1234-...` | +| `` | OAuth2 client secret | `xxx...` | +| `` | AWS Identity Center instance ID | `ssoins-abc123def456` | +| `` | AWS region | `us-east-1`, `us-west-2` | + +--- + +## Microsoft Entra ID (Azure AD) + +### Step 1: Register an Application + +1. Go to [Azure Portal](https://portal.azure.com) → **Microsoft Entra ID** → **App registrations** +2. Click **New registration** +3. Configure the application: + - **Name**: `OSMO Service` + - **Supported account types**: Select based on your requirements + - Single tenant: Only accounts in your organization + - Multi-tenant: Accounts in any organization + - **Redirect URI**: + - Platform: **Web** + - URI: `https:///api/auth/getAToken` +4. Click **Register** +5. Note the **Application (client) ID** and **Directory (tenant) ID** + +### Step 2: Create Client Secret + +1. In your app registration, go to **Certificates & secrets** +2. Click **New client secret** +3. Add a description (e.g., `OSMO OAuth Secret`) +4. Set expiration (recommended: 24 months) +5. Click **Add** +6. 
**Copy the secret value immediately** (you won't be able to see it again) + +### Step 3: Configure API Permissions + +1. Go to **API permissions** +2. Click **Add a permission** → **Microsoft Graph** +3. Select **Delegated permissions** +4. Add these permissions: + - `openid` + - `profile` + - `email` + - `User.Read` +5. Click **Add permissions** +6. If you're an admin, click **Grant admin consent** + +### Step 4: Configure Token Claims (Optional) + +To include group/role information in tokens: + +1. Go to **Token configuration** +2. Click **Add groups claim** +3. Select **Security groups** (or **Groups assigned to the application**) +4. Under **ID** and **Access** tokens, select **Group ID** +5. Click **Add** + +### Step 5: Gather Endpoint URLs + +| Endpoint | URL | +|----------|-----| +| Token Endpoint | `https://login.microsoftonline.com//oauth2/v2.0/token` | +| Authorization Endpoint | `https://login.microsoftonline.com//oauth2/v2.0/authorize` | +| JWKS URI | `https://login.microsoftonline.com//discovery/v2.0/keys` | +| Issuer | `https://login.microsoftonline.com//v2.0` | + +### Microsoft Entra Values Configuration + +```yaml +sidecars: + envoy: + enabled: true + + service: + hostname: + + # OAuth2 filter for browser-based authentication + oauth2Filter: + enabled: true + tokenEndpoint: https://login.microsoftonline.com//oauth2/v2.0/token + authEndpoint: https://login.microsoftonline.com//oauth2/v2.0/authorize + clientId: + redirectPath: api/auth/getAToken + logoutPath: logout + forwardBearerToken: true + secretName: oidc-secrets + clientSecretKey: client_secret + hmacSecretKey: hmac_secret + + # JWT validation for API authentication + jwt: + user_header: x-osmo-user + providers: + # For browser-based authentication (uses v2.0 endpoint) + - issuer: https://login.microsoftonline.com//v2.0 + audience: + jwks_uri: https://login.microsoftonline.com//discovery/v2.0/keys + user_claim: preferred_username + cluster: oauth + # For service-to-service / device 
authentication (if using v1.0 tokens) + - issuer: https://sts.windows.net// + audience: + jwks_uri: https://login.microsoftonline.com//discovery/v2.0/keys + user_claim: unique_name + cluster: oauth +``` + +--- + +## Google OAuth2 + +### Step 1: Create OAuth 2.0 Credentials + +1. Go to [Google Cloud Console](https://console.cloud.google.com) +2. Select or create a project +3. Navigate to **APIs & Services** → **Credentials** +4. Click **Create Credentials** → **OAuth client ID** + +### Step 2: Configure OAuth Consent Screen + +If prompted, configure the OAuth consent screen first: + +1. Go to **OAuth consent screen** +2. Select **Internal** (for G Suite/Workspace) or **External** (for any Google account) +3. Fill in the required fields: + - **App name**: `OSMO Service` + - **User support email**: Your email + - **Developer contact information**: Your email +4. Click **Save and Continue** +5. Add scopes: + - `openid` + - `email` + - `profile` +6. Click **Save and Continue** + +### Step 3: Create OAuth Client ID + +1. Return to **Credentials** → **Create Credentials** → **OAuth client ID** +2. Select **Web application** +3. Configure: + - **Name**: `OSMO Web Client` + - **Authorized JavaScript origins**: `https://` + - **Authorized redirect URIs**: `https:///api/auth/getAToken` +4. Click **Create** +5. 
**Copy the Client ID and Client Secret** + +### Step 4: Gather Endpoint URLs + +| Endpoint | URL | +|----------|-----| +| Token Endpoint | `https://oauth2.googleapis.com/token` | +| Authorization Endpoint | `https://accounts.google.com/o/oauth2/v2/auth` | +| JWKS URI | `https://www.googleapis.com/oauth2/v3/certs` | +| Issuer | `https://accounts.google.com` | + +### Google OAuth2 Values Configuration + +```yaml +sidecars: + envoy: + enabled: true + + service: + hostname: + + # OAuth2 filter for browser-based authentication + oauth2Filter: + enabled: true + tokenEndpoint: https://oauth2.googleapis.com/token + authEndpoint: https://accounts.google.com/o/oauth2/v2/auth + clientId: .apps.googleusercontent.com + redirectPath: api/auth/getAToken + logoutPath: logout + forwardBearerToken: true + secretName: oidc-secrets + clientSecretKey: client_secret + hmacSecretKey: hmac_secret + + # JWT validation for API authentication + jwt: + user_header: x-osmo-user + providers: + - issuer: https://accounts.google.com + audience: .apps.googleusercontent.com + jwks_uri: https://www.googleapis.com/oauth2/v3/certs + user_claim: email + cluster: oauth +``` + +### Important Notes for Google OAuth2 + +1. **User Claim**: Google uses `email` as the user identifier (not `preferred_username`) +2. **Audience**: Must include the full client ID with `.apps.googleusercontent.com` suffix +3. **Domain Restriction**: For Workspace/G Suite, you can restrict to your domain in the OAuth consent screen + +--- + +## AWS IAM Identity Center (AWS SSO) + +AWS IAM Identity Center (formerly AWS SSO) is AWS's centralized identity management service. It differs from Amazon Cognito in that it's designed as an enterprise identity broker that integrates with your corporate identity provider (Microsoft Entra ID, Okta, Google Workspace, etc.) rather than a standalone user directory. 
+ +### Key Differences from Cognito + +| Aspect | Amazon Cognito | AWS IAM Identity Center | +|--------|---------------|-------------------------| +| Primary Use Case | Customer-facing apps, mobile apps | Enterprise workforce access | +| User Directory | Managed user pool | Federated from corporate IdP | +| OIDC Support | Native | Customer managed applications | +| Typical Users | End customers | Employees, contractors | + +### Step 1: Enable AWS IAM Identity Center + +1. Go to [AWS IAM Identity Center Console](https://console.aws.amazon.com/singlesignon) +2. Click **Enable** if not already enabled +3. Choose your **Identity source**: + - **Identity Center directory**: Create and manage users directly in AWS + - **Active Directory**: Connect to on-premises or AWS Managed Microsoft AD + - **External identity provider**: Federate with Okta, Microsoft Entra ID, Google Workspace, etc. +4. Note your **Identity Center instance ARN** (format: `arn:aws:sso:::instance/ssoins-`) +5. Note your **Access Portal URL** (format: `https://.awsapps.com/start`) + +### Step 2: Configure External Identity Provider (If Using Federation) + +If federating with an external IdP: + +1. Go to **Settings** → **Identity source** → **Actions** → **Change identity source** +2. Select **External identity provider** +3. Download the **IAM Identity Center SAML metadata** file +4. In your external IdP (e.g., Okta, Microsoft Entra ID): + - Create a new SAML 2.0 application + - Upload the IAM Identity Center metadata + - Configure attribute mappings: + - `email` → user email + - `firstName` → given name + - `lastName` → family name +5. Download the IdP metadata and upload it to IAM Identity Center +6. Optionally, enable **Automatic provisioning (SCIM)** for user/group sync + +### Step 3: Create a Customer Managed Application + +1. In IAM Identity Center, go to **Applications** → **Customer managed** +2. Click **Add application** +3. Select **I have an application I want to set up** → **OAuth 2.0** +4. 
Configure the application: + - **Display name**: `OSMO Service` + - **Description**: OSMO workflow orchestration platform + - **Application URL**: `https://` +5. Under **OAuth 2.0 configuration**: + - **Redirect URIs**: `https:///api/auth/getAToken` + - **Grant types**: Authorization code + - **Scopes**: `openid`, `email`, `profile` +6. Click **Submit** +7. Note the **Application ARN** and **Client ID** +8. Generate and save the **Client secret** + +### Step 4: Assign Users and Groups + +1. In your application settings, go to **Assigned users and groups** +2. Click **Assign users and groups** +3. Select the users or groups who should have access to OSMO +4. Click **Assign** + +### Step 5: Gather Endpoint URLs + +| Endpoint | URL | +|----------|-----| +| Token Endpoint | `https://oidc..amazonaws.com/token` | +| Authorization Endpoint | `https://.awsapps.com/start/authorize` | +| JWKS URI | `https://oidc..amazonaws.com/keys` | +| Issuer | `https://identitycenter..amazonaws.com/ssoins-` | +| OpenID Configuration | `https://identitycenter..amazonaws.com/ssoins-/.well-known/openid-configuration` | + +> **Note**: Replace `` with your AWS region (e.g., `us-east-1`) and `` with your Identity Center instance ID. 
+ +### AWS IAM Identity Center Values Configuration + +```yaml +sidecars: + envoy: + enabled: true + + service: + hostname: + + # OAuth2 filter for browser-based authentication + oauth2Filter: + enabled: true + tokenEndpoint: https://oidc..amazonaws.com/token + authEndpoint: https://.awsapps.com/start/authorize + clientId: + redirectPath: api/auth/getAToken + logoutPath: logout + forwardBearerToken: true + secretName: oidc-secrets + clientSecretKey: client_secret + hmacSecretKey: hmac_secret + + # JWT validation for API authentication + jwt: + user_header: x-osmo-user + providers: + - issuer: https://identitycenter..amazonaws.com/ssoins- + audience: + jwks_uri: https://oidc..amazonaws.com/keys + user_claim: email + cluster: oauth +``` + +### Important Notes for AWS IAM Identity Center + +1. **User Claim**: Identity Center uses `email` or `sub` as user identifiers +2. **Federation**: Most enterprises use Identity Center as a broker to their corporate IdP (Okta, Microsoft Entra ID, etc.) +3. **Access Portal**: Users can access `https://.awsapps.com/start` to see all assigned applications +4. **Groups**: Groups synced from your corporate IdP are available for RBAC +5. **SCIM Provisioning**: Enable automatic user/group provisioning from your corporate IdP for seamless user lifecycle management +6. **Region-Specific**: Unlike Cognito, Identity Center endpoints are region-specific + +### Migrating from Cognito to Identity Center + +If you're migrating from Amazon Cognito: + +1. **User Migration**: Export Cognito users and import them to your corporate IdP or Identity Center directory +2. **Update Envoy Configuration**: Replace Cognito endpoints with Identity Center endpoints +3. **Update Redirect URIs**: Ensure the new Identity Center application has the correct callback URLs +4. 
**Test Federation**: Verify that users from your corporate IdP can authenticate successfully + +--- + +## Envoy Configuration + +### Create Kubernetes Secrets + +Before deploying, create the required secrets: + +```bash +# Generate a random HMAC secret (256 bits / 32 bytes) +HMAC_SECRET=$(openssl rand -base64 32) + +# Create the secret +kubectl create secret generic oidc-secrets \ + --namespace \ + --from-literal=client_secret='' \ + --from-literal=hmac_secret="${HMAC_SECRET}" +``` + +### Complete Envoy Sidecar Configuration + +Here's the full configuration structure in your Helm values: + +```yaml +sidecars: + envoy: + enabled: true + useKubernetesSecrets: true + + image: envoyproxy/envoy:v1.29.0 + imagePullPolicy: IfNotPresent + + # Paths that skip authentication + skipAuthPaths: + - /health + - /api/router/version + - /api/version + + # Listener configuration + listenerPort: 8080 + maxHeadersSizeKb: 128 + logLevel: info + + # Service configuration + service: + port: 8000 + hostname: + address: 127.0.0.1 + + # OAuth2 filter configuration (browser flow) + oauth2Filter: + enabled: true + tokenEndpoint: + authEndpoint: + clientId: + redirectPath: api/auth/getAToken + logoutPath: logout + forwardBearerToken: true + secretName: oidc-secrets + clientSecretKey: client_secret + hmacSecretKey: hmac_secret + + # JWT validation configuration + jwt: + user_header: x-osmo-user + providers: + - issuer: + audience: + jwks_uri: + user_claim: + cluster: oauth + + # Internal OSMO auth (for service-issued tokens) + osmoauth: + enabled: true + port: 80 +``` + +### OAuth Cluster Definition + +Envoy needs a cluster definition to reach the IDP's token endpoint. 
This is typically auto-generated by the Helm chart, but if manual configuration is needed: + +```yaml +# In Envoy config +clusters: +- name: oauth + connect_timeout: 10s + type: LOGICAL_DNS + lb_policy: ROUND_ROBIN + dns_lookup_family: V4_ONLY + load_assignment: + cluster_name: oauth + endpoints: + - lb_endpoints: + - endpoint: + address: + socket_address: + # For Microsoft: + address: login.microsoftonline.com + # For Google: + # address: oauth2.googleapis.com + # For AWS Identity Center: + # address: oidc..amazonaws.com + port_value: 443 + transport_socket: + name: envoy.transport_sockets.tls + typed_config: + "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext + sni: login.microsoftonline.com # Match the address above +``` + +--- + +## Role Management APIs + +To manage user-to-role assignments without Keycloak, OSMO provides REST APIs for role management. These APIs replace the role assignment functionality previously handled through Keycloak's admin console or IdP group mappings. + +### Why New APIs Are Needed + +With Keycloak removed, we lose the following capabilities that must be replaced: + +| Keycloak Feature | OSMO Replacement | +|-----------------|------------------| +| Assign role to user via admin console | `POST /api/users/{username}/roles` | +| Remove role from user | `DELETE /api/users/{username}/roles/{role_name}` | +| View user's roles | `GET /api/users/{username}/roles` | +| List users with a role | `GET /api/roles/{role_name}/users` | +| Bulk role assignment via groups | `POST /api/roles/{role_name}/users` | +| Role expiration | `expires_at` field in role assignment | + +### Key Design Decisions + +1. **User-centric and Role-centric APIs** — Provide both `/api/users/{username}/roles` and `/api/roles/{role_name}/users` endpoints to support different admin workflows + +2. **Audit Trail** — Every assignment records `assigned_by` (who made the change) and `assigned_at` (when) for compliance + +3. 
**Time-bound Roles** — Support `expires_at` for temporary access (contractors, trials, incident response) + +4. **Idempotent Operations** — Assigning an already-assigned role returns success (not error) + +5. **Authorization Required** — Only users with `role:Manage` action can modify role assignments + +### Required Permissions + +To use the role management APIs, the caller must have the `role:Manage` action. This is included in the `osmo-admin` role by default. + +```json +{ + "statements": [ + { + "effect": "Allow", + "actions": ["role:Manage"], + "resources": ["*"] + } + ] +} +``` + +For fine-grained control, you can restrict which roles an admin can assign: + +```json +{ + "statements": [ + { + "effect": "Allow", + "actions": ["role:Manage"], + "resources": ["role/osmo-team-*"] + } + ] +} +``` + +### Database Schema for User-Role Assignments + +```sql +-- User-role assignment table +CREATE TABLE user_roles ( + id SERIAL PRIMARY KEY, + username VARCHAR(255) NOT NULL, + role_name VARCHAR(255) NOT NULL REFERENCES roles(name) ON DELETE CASCADE, + assigned_by VARCHAR(255) NOT NULL, + assigned_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(), + expires_at TIMESTAMP WITH TIME ZONE, + UNIQUE(username, role_name) +); + +-- Index for fast lookups +CREATE INDEX idx_user_roles_username ON user_roles(username); +CREATE INDEX idx_user_roles_role_name ON user_roles(role_name); +``` + +### API Endpoints + +#### List User Roles + +Get all roles assigned to a user. + +``` +GET /api/users/{username}/roles +``` + +**Response:** +```json +{ + "username": "user@example.com", + "roles": [ + { + "name": "osmo-user", + "assigned_by": "admin@example.com", + "assigned_at": "2025-01-15T10:30:00Z", + "expires_at": null + }, + { + "name": "osmo-ml-team", + "assigned_by": "admin@example.com", + "assigned_at": "2025-01-15T10:30:00Z", + "expires_at": "2025-07-15T10:30:00Z" + } + ] +} +``` + +#### Assign Role to User + +Add a role to a user. 
+ +``` +POST /api/users/{username}/roles +``` + +**Request Body** (`expires_at` is optional): +```json +{ + "role_name": "osmo-ml-team", + "expires_at": "2025-07-15T10:30:00Z" +} +``` + +**Response:** +```json +{ + "username": "user@example.com", + "role_name": "osmo-ml-team", + "assigned_by": "admin@example.com", + "assigned_at": "2025-01-15T10:30:00Z", + "expires_at": "2025-07-15T10:30:00Z" +} +``` + +**Errors:** +- `400 Bad Request`: Invalid role name or malformed request body (assigning a role the user already has is idempotent and returns success) +- `403 Forbidden`: Caller doesn't have permission to assign roles +- `404 Not Found`: Role doesn't exist + +#### Remove Role from User + +Remove a role assignment from a user. + +``` +DELETE /api/users/{username}/roles/{role_name} +``` + +**Response:** Success + +**Errors:** +- `403 Forbidden`: Caller doesn't have permission to remove roles +- `404 Not Found`: User doesn't have this role + +#### List All Role Assignments + +Get all users with a specific role. + +``` +GET /api/roles/{role_name}/users +``` + +**Response:** +```json +{ + "role_name": "osmo-ml-team", + "users": [ + { + "username": "user1@example.com", + "assigned_by": "admin@example.com", + "assigned_at": "2025-01-15T10:30:00Z" + }, + { + "username": "user2@example.com", + "assigned_by": "admin@example.com", + "assigned_at": "2025-01-16T09:00:00Z" + } + ] +} +``` + +#### Bulk Role Assignment + +Assign a role to multiple users at once. + +``` +POST /api/roles/{role_name}/users +``` + +**Request Body** (`expires_at` is optional): +```json +{ + "usernames": [ + "user1@example.com", + "user2@example.com", + "user3@example.com" + ], + "expires_at": "2025-12-31T23:59:59Z" +} +``` + +**Response:** +```json +{ + "role_name": "osmo-ml-team", + "assigned": [ + "user1@example.com", + "user2@example.com" + ], + "already_assigned": [ + "user3@example.com" + ], + "failed": [] +} +``` + +### Role Resolution in authz_sidecar + +When a request comes in, the authz_sidecar resolves user roles from multiple sources: + +1. 
**JWT Token Claims**: Roles from the IDP (e.g., `groups` claim in Azure AD) +2. **Database Lookup**: Roles from the `user_roles` table +3. **Default Roles**: `osmo-default` for unauthenticated, `osmo-user` for authenticated + +```go +func (s *AuthzServer) resolveRoles(ctx context.Context, username string, jwtRoles []string) ([]string, error) { + // Start with roles from JWT + roles := make(map[string]bool) + for _, r := range jwtRoles { + roles[r] = true + } + + // Add roles from database + dbRoles, err := s.postgresClient.GetUserRoles(ctx, username) + if err != nil { + return nil, err + } + for _, r := range dbRoles { + roles[r.RoleName] = true + } + + // Convert to slice + result := make([]string, 0, len(roles)) + for r := range roles { + result = append(result, r) + } + return result, nil +} +``` + +### Action Registry Addition + +Add the new role management actions to the action registry: + +**File**: `external/src/service/authz_sidecar/server/action_registry.go` + +```go +// Add to ActionRegistry map +"role:Manage": { + {Path: "/api/users/*/roles", Methods: []string{"POST", "DELETE"}}, + {Path: "/api/users/*/roles/*", Methods: []string{"DELETE"}}, + {Path: "/api/roles/*/users", Methods: []string{"POST"}}, +}, +"role:Read": { + {Path: "/api/users/*/roles", Methods: []string{"GET"}}, + {Path: "/api/roles/*/users", Methods: []string{"GET"}}, +}, +``` + +### CLI Commands [WIP] + +The OSMO CLI provides commands for managing user roles: + +```bash +# List roles for a user +osmo user roles list user@example.com + +# Assign a role to a user +osmo user roles add user@example.com --role osmo-ml-team + +# Assign a role with expiration +osmo user roles add user@example.com --role osmo-ml-team --expires 2025-12-31 + +# Remove a role from a user +osmo user roles remove user@example.com --role osmo-ml-team + +# List all users with a role +osmo role users list osmo-ml-team + +# Bulk assign a role to multiple users +osmo role users add osmo-ml-team --users 
user1@example.com,user2@example.com +``` + +### Database Migration + +WIP + +--- + +## Verification Steps + +### Step 1: Verify OAuth2 Endpoints + +Test that the IDP endpoints are reachable from your cluster: + +```bash +# Test Microsoft Entra ID +curl -s "https://login.microsoftonline.com//v2.0/.well-known/openid-configuration" | jq . + +# Test Google +curl -s "https://accounts.google.com/.well-known/openid-configuration" | jq . + +# Test AWS IAM Identity Center +curl -s "https://identitycenter..amazonaws.com/ssoins-/.well-known/openid-configuration" | jq . +``` + +**Expected Output**: JSON document with `authorization_endpoint`, `token_endpoint`, `jwks_uri`, etc. + +### Step 2: Verify JWKS Endpoint + +```bash +# Test that JWKS is accessible +curl -s "" | jq '.keys[0].kid' +``` + +**Expected Output**: A key ID string (e.g., `"nOo3ZDrODXEK1jKWhXslHR_KXEg"`) + +### Step 3: Test Browser Authentication Flow + +1. Open a browser in incognito/private mode +2. Navigate to `https://` +3. You should be redirected to your IDP login page +4. After logging in, you should be redirected back to OSMO +5. Check browser developer tools → Network tab for: + - Redirect to IDP authorization endpoint + - Callback to `/api/auth/getAToken` with authorization code + - Cookie set with session token + +### Step 4: Verify JWT Token + +After logging in, inspect your token: + +```bash +# Decode manually +echo "" | cut -d. -f2 | base64 -d 2>/dev/null | jq . 
+``` + +**Expected Claims:** +```json +{ + "iss": "https://login.microsoftonline.com//v2.0", + "sub": "user-id", + "aud": "", + "preferred_username": "user@example.com", + "exp": 1705320000 +} +``` + +### Step 5: Test API Authentication + +```bash +# Test authenticated API call +curl -H "Authorization: Bearer " \ + "https:///api/version" + +# Should return version info if authenticated +``` + +### Step 6: Verify Role Resolution + +```bash +# Check user roles via API +curl -H "Authorization: Bearer " \ + "https:///api/users//roles" +``` + +--- + +## Migration from Keycloak + +### Phase 1: Parallel Operation + +1. **Configure direct IDP** alongside Keycloak +2. Add a second JWT provider in Envoy configuration: + +```yaml +jwt: + providers: + # Existing Keycloak provider + - issuer: https://auth-/realms/osmo + audience: osmo-browser-flow + jwks_uri: https://auth-/realms/osmo/protocol/openid-connect/certs + user_claim: preferred_username + cluster: keycloak + # New direct IDP provider + - issuer: https://login.microsoftonline.com//v2.0 + audience: + jwks_uri: https://login.microsoftonline.com//discovery/v2.0/keys + user_claim: preferred_username + cluster: oauth +``` + +## Future IDPs + +We plan to support the following IDPs in the future, but they may be beyond the scope of the initial feature. + +* Okta +* Auth0 + +## Troubleshooting + +### Authentication Fails with "Invalid Token" + +**Symptoms**: 401 Unauthorized with "JWT verification failed" + +**Solutions**: + +1. **Check issuer mismatch**: + ```bash + # Decode token and compare issuer + echo "" | cut -d. -f2 | base64 -d | jq .iss + ``` + The issuer in the token must exactly match the `issuer` in your Envoy configuration. + +2. **Check audience mismatch**: + ```bash + echo "" | cut -d. -f2 | base64 -d | jq .aud + ``` + The audience must match your `audience` configuration. + +3. 
**Check JWKS connectivity**: + ```bash + kubectl exec -it -c envoy -- \ + curl -v "" + ``` + +### OAuth2 Redirect Fails + +**Symptoms**: Browser shows error after IDP login, redirect back fails + +**Solutions**: + +1. **Verify redirect URI** in IDP matches exactly: + - Check for trailing slashes + - Check for HTTP vs HTTPS + - Check hostname matches + +2. **Check Envoy logs**: + ```bash + kubectl logs -c envoy | grep -i oauth + ``` + +3. **Verify secrets exist**: + ```bash + kubectl get secret oidc-secrets -o yaml + ``` + +### User Has No Roles + +**Symptoms**: User authenticates but gets 403 Forbidden + +**Solutions**: + +1. **Check x-osmo-user header**: + ```bash + # In Envoy access logs, verify the user header is set + kubectl logs -c envoy | grep x-osmo-user + ``` + +2. **Verify user_claim configuration**: + - Microsoft: `preferred_username` or `unique_name` + - Google: `email` + - AWS Identity Center: `email` or `sub` + +3. **Check database for role assignments**: + ```sql + SELECT * FROM user_roles WHERE username = 'user@example.com'; + ``` + +### Session Cookie Not Set + +**Symptoms**: User has to log in on every request + +**Solutions**: + +1. **Check cookie settings**: + - Verify `SameSite` attribute allows cross-site if needed + - Verify `Secure` attribute matches HTTPS usage + +2. **Check HMAC secret**: + - Verify the secret hasn't changed + - Verify the secret is properly base64 encoded + +### Envoy Can't Reach IDP + +**Symptoms**: Timeouts or connection refused errors + +**Solutions**: + +1. **Check network policies**: + ```bash + kubectl get networkpolicies + ``` + +2. **Test from pod**: + ```bash + kubectl exec -it -- curl -v https://login.microsoftonline.com + ``` + +3. 
**Check DNS resolution**: + ```bash + kubectl exec -it -- nslookup login.microsoftonline.com + ``` + +--- + +## Quick Reference + +### Endpoint Summary + +| Provider | Token Endpoint | Auth Endpoint | JWKS URI | Issuer | +|----------|---------------|---------------|----------|--------| +| Microsoft | `https://login.microsoftonline.com//oauth2/v2.0/token` | `https://login.microsoftonline.com//oauth2/v2.0/authorize` | `https://login.microsoftonline.com//discovery/v2.0/keys` | `https://login.microsoftonline.com//v2.0` | +| Google | `https://oauth2.googleapis.com/token` | `https://accounts.google.com/o/oauth2/v2/auth` | `https://www.googleapis.com/oauth2/v3/certs` | `https://accounts.google.com` | +| AWS Identity Center | `https://oidc..amazonaws.com/token` | `https://.awsapps.com/start/authorize` | `https://oidc..amazonaws.com/keys` | `https://identitycenter..amazonaws.com/ssoins-` | + +### User Claim Mapping + +| Provider | Common Claims | +|----------|--------------| +| Microsoft | `preferred_username`, `unique_name`, `email`, `upn` | +| Google | `email`, `name`, `sub` | +| AWS Identity Center | `email`, `sub`, `name` | diff --git a/projects/auth/PROJ-148-resource-action-model.md b/projects/auth/PROJ-148-resource-action-model.md new file mode 100644 index 00000000..413cd510 --- /dev/null +++ b/projects/auth/PROJ-148-resource-action-model.md @@ -0,0 +1,837 @@ + + +# Resource-Action Permission Model Design + +**Author**: @RyaliNvidia
+**PIC**: @RyaliNvidia
+**Proposal Issue**: [#148](https://github.com/NVIDIA/OSMO/issues/148) + +## Overview + +This document describes the design for a resource-action permission model for OSMO authorization. +The model decouples authorization policies from specific API paths by using semantic actions similar +to AWS IAM policies, improving maintainability, auditability, and developer experience. + +### Motivation + +- **Simplify policy management** — Define permissions in terms of what users can do (e.g., "create workflows") rather than which URLs they can access +- **Improve auditability** — Make it easy to understand what actions a role grants without tracing API paths +- **Future proofing** — The same action can have the underlying apis change without requiring the role actions to change + +### Problem + +Currently, role policies directly reference API paths: + +```python +role.RoleAction(base='http', path='/api/workflow/*', method='*') +role.RoleAction(base='http', path='/api/bucket', method='*') +``` + +This approach has several limitations: + +1. **Tight coupling** — Policies break when API paths change (e.g., `/api/workflow` → `/api/v2/workflow`) +2. **Redundancy** — Multiple paths often represent the same logical action (e.g., `GET /api/workflow` and `GET /api/workflow/*` both mean "read workflows") +3. **Complexity** — Path patterns become complex with wildcards and deny patterns (`!/api/agent/*`) +4. **Maintainability** — Hard to audit what actions a role can actually perform; requires tracing paths to understand permissions +5. 
**No semantic meaning** — `/api/workflow/*` doesn't convey whether it allows create, read, update, delete, or all operations + +## Use Cases + +| Use Case | Description | +|---|---| +| Define a read-only role | Admin creates a role that can view workflows and tasks but cannot create, modify, or delete them using `workflow:Read` action | +| Grant workflow management | User role includes `workflow:*` to allow all workflow operations (create, read, update, delete, cancel, clone) | +| Restrict pool deletion | Admin creates a policy with `Deny` effect on `pool:Delete` for production pools to prevent accidental deletion | +| Backend service access | Internal services use `internal:Operator` action to access agent APIs without exposing those endpoints to regular users | +| Audit role permissions | Admin reviews a role's policy and immediately understands what it allows (e.g., `bucket:Read`, `bucket:Write`) without tracing API paths | +| Add new API endpoint | Developer adds `POST /api/workflow/{id}/archive` and corresponding `workflow:Archive` action in the same PR | + +## Requirements + +| Title | Description | Type | +|---|---|---| +| Semantic action model | Policies shall reference semantic actions (e.g., `workflow:Create`) instead of API paths | Functional | +| AWS IAM-style policies | Policies shall support Allow/Deny effects with action and resource matching | Functional | +| Wildcard support | Policies shall support wildcards for actions (`workflow:*`, `*:Read`, `*:*`) and resources (`*`) | Functional | +| Deny precedence | If any policy statement denies an action, access shall be denied regardless of Allow statements | Functional | +| Code-defined action registry | Action-to-path mappings shall be defined in code (not database) for compile-time safety | Functional | +| Dynamic policy updates | Role policies shall be updatable at runtime via API without requiring deployment | Functional | +| Backward compatibility | The new model shall support running alongside the 
existing path-based model during migration | Functional | +| Authorization latency | Authorization checks shall complete in <5ms at p99 (cached) | KPI | +| Policy validation | When creating/updating a role, the system shall validate that referenced actions exist in the registry | Security | +| Immutable default roles | Default system roles (osmo-admin, osmo-user, etc.) shall be protected from modification | Security | + +--- + +## Architectural Details + +### High-Level Architecture + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ RESOURCE-ACTION PERMISSION MODEL │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ LAYER 1: Action Registry (Static, Code-defined) │ +│ ───────────────────────────────────────────────────────────────────────── │ +│ Immutable mapping of actions → API paths │ +│ Changes require code update + deployment │ +│ Example: workflow:Create → POST /api/workflow │ +│ │ +│ LAYER 2: Policy Engine (Dynamic, DB-stored) │ +│ ───────────────────────────────────────────────────────────────────────── │ +│ AWS-style policies granting actions on resources │ +│ Can be updated at runtime via API │ +│ Format: { "effect": "Allow", "actions": [...], "resources": [...] } │ +│ │ +│ LAYER 3: Role Assignments (Dynamic, DB-stored) │ +│ ───────────────────────────────────────────────────────────────────────── │ +│ Maps users to roles │ +│ Can be updated at runtime via API │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### Why Code-Defined Action Registry? + +The action registry is defined in code (not database) for several reasons: + +1. **Actions are tied to API code** — When you add a new API endpoint, you add the action in the same PR +2. **Prevents accidental/malicious action creation** — Only developers with code access can define new actions +3. 
**Compile-time safety** — Go constants for actions provide autocomplete, type-checking, and catch typos +4. **Audit trail via Git** — All changes tracked in git history with commit messages +5. **Simpler implementation** — No need for a separate `action_registry` table + +--- + +## Resource-Action Model + +### Resources and Actions + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ RESOURCE-ACTION MODEL │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ RESOURCES ACTIONS SCOPED TO │ +│ ───────── ─────── ───────── │ +│ pool List (global) │ +│ │ +│ workflow Create, Read, Update, pool / user │ +│ Delete, Cancel, Clone, │ +│ List, Execute │ +│ │ +│ task Read, Update, Cancel, pool / user │ +│ Exec, PortForward, Rsync │ +│ │ +│ bucket Create, Read, Write, bucket │ +│ Delete, List │ +│ │ +│ credentials Create, Read, Update, (global) │ +│ Delete, List │ +│ │ +│ profile Read, Update user │ +│ │ +│ user List (global) │ +│ │ +│ app Create, Read, Update, (global) │ +│ Delete, List │ +│ │ +│ config Read, Update config │ +│ │ +│ system Health, Version (public) │ +│ │ +│ internal Operator, Logger, Router backend / workflow │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### Action Constants + +Actions are defined as Go constants for compile-time safety: + +```go +const ( + // Workflow actions + ActionWorkflowCreate = "workflow:Create" + ActionWorkflowRead = "workflow:Read" + ActionWorkflowUpdate = "workflow:Update" + ActionWorkflowDelete = "workflow:Delete" + ActionWorkflowCancel = "workflow:Cancel" + ActionWorkflowList = "workflow:List" + ActionWorkflowExecute = "workflow:Execute" + ActionWorkflowExec = "workflow:Exec" + ActionWorkflowPortForward = "workflow:PortForward" + ActionWorkflowRsync = "workflow:Rsync" + + // Bucket actions + ActionBucketCreate = "bucket:Create" + ActionBucketRead = "bucket:Read" + ActionBucketWrite = "bucket:Write" + 
ActionBucketDelete = "bucket:Delete" + ActionBucketList = "bucket:List" + + // Pool actions + ActionPoolList = "pool:List" + + // Internal/Backend actions (restricted) + ActionInternalOperator = "internal:Operator" + ActionInternalLogger = "internal:Logger" + ActionInternalRouter = "internal:Router" + + // Config actions + ActionConfigRead = "config:Read" + ActionConfigUpdate = "config:Update" + + // System actions (public) + ActionSystemHealth = "system:Health" + ActionSystemVersion = "system:Version" +) +``` + +### Resource Naming Convention + +``` +/ + +Examples: + workflow/* - All workflows + workflow/abc123 - Specific workflow + pool/default - Default pool + pool/production/* - Production pool and children + bucket/data-generation - Bucket for storing data generation datasets + backend/gb200-testing - Backend called gb200-testing + config/service - Service Config + * - All resources +``` + +--- + +## Policy Format + +Policies use AWS IAM-style JSON format with Allow/Deny statements: + +```json +{ + "statements": [ + { + "effect": "Allow", + "actions": [ + "workflow:Create", + "workflow:Read", + "workflow:Update", + "workflow:Delete", + "workflow:Cancel" + ], + "resources": ["*"] + }, + { + "effect": "Allow", + "actions": ["bucket:*"], + "resources": ["pool/default/*"] + }, + { + "effect": "Deny", + "actions": ["pool:Delete"], + "resources": ["pool/production"] + } + ] +} +``` + +### Wildcard Support + +- `workflow:*` — All workflow actions +- `*:Read` — Read action on all resources +- `*:*` — All actions on all resources +- `*` in resources — All resources + +### Effect Precedence + +**Deny always wins.** If any statement denies an action, access is denied regardless of Allow statements. 
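
The wildcard and precedence rules above can be illustrated with a minimal sketch. This is not the sidecar's actual implementation: the `Statement` type and the `matchAction`, `matchResource`, and `Evaluate` helpers are hypothetical names chosen for illustration. It also assumes that a resource pattern ending in `/*` matches both the named resource and its children, which this document does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// Statement mirrors one entry of the "statements" array in the policy format.
// (Hypothetical type for illustration only.)
type Statement struct {
	Effect    string   // "Allow" or "Deny"
	Actions   []string // e.g. "workflow:*", "*:Read", "*:*"
	Resources []string // e.g. "*", "pool/production/*"
}

// matchAction handles exact matches plus the "*:*", "resource:*" and "*:Verb" wildcards.
func matchAction(pattern, action string) bool {
	if pattern == "*:*" || pattern == action {
		return true
	}
	if strings.HasSuffix(pattern, ":*") {
		return strings.HasPrefix(action, strings.TrimSuffix(pattern, "*"))
	}
	if strings.HasPrefix(pattern, "*:") {
		return strings.HasSuffix(action, strings.TrimPrefix(pattern, "*"))
	}
	return false
}

// matchResource handles "*", exact matches, and a trailing "/*".
// Assumption: "pool/production/*" covers "pool/production" and anything below it.
func matchResource(pattern, resource string) bool {
	if pattern == "*" || pattern == resource {
		return true
	}
	if strings.HasSuffix(pattern, "/*") {
		base := strings.TrimSuffix(pattern, "/*")
		return resource == base || strings.HasPrefix(resource, base+"/")
	}
	return false
}

// matches reports whether a statement applies to the action/resource pair.
func matches(s Statement, action, resource string) bool {
	actionOK, resourceOK := false, false
	for _, p := range s.Actions {
		if matchAction(p, action) {
			actionOK = true
			break
		}
	}
	for _, p := range s.Resources {
		if matchResource(p, resource) {
			resourceOK = true
			break
		}
	}
	return actionOK && resourceOK
}

// Evaluate applies "Deny always wins": any matching Deny rejects immediately,
// otherwise at least one matching Allow is required (implicit deny).
func Evaluate(stmts []Statement, action, resource string) bool {
	allowed := false
	for _, s := range stmts {
		if !matches(s, action, resource) {
			continue
		}
		if s.Effect == "Deny" {
			return false
		}
		if s.Effect == "Allow" {
			allowed = true
		}
	}
	return allowed
}

func main() {
	// Allow everything, but deny all internal actions.
	admin := []Statement{
		{Effect: "Allow", Actions: []string{"*:*"}, Resources: []string{"*"}},
		{Effect: "Deny", Actions: []string{"internal:*"}, Resources: []string{"*"}},
	}
	fmt.Println(Evaluate(admin, "workflow:Create", "workflow/abc123"))         // true
	fmt.Println(Evaluate(admin, "internal:Operator", "backend/gb200-testing")) // false
}
```

Because the Deny check short-circuits, statement order does not matter in this sketch: a broad `*:*` Allow cannot override a narrower Deny.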
+ +--- + +## Default Roles + +### osmo-admin + +Full access except internal backend endpoints: + +```json +{ + "name": "osmo-admin", + "description": "Administrator with full access except internal endpoints", + "policy": { + "statements": [ + {"effect": "Allow", "actions": ["*:*"], "resources": ["*"]}, + {"effect": "Deny", "actions": ["internal:*"], "resources": ["*"]} + ] + }, + "immutable": true +} +``` + +### osmo-user + +Standard user role: + +```json +{ + "name": "osmo-user", + "description": "Standard user role", + "policy": { + "statements": [ + { + "effect": "Allow", + "actions": [ + "workflow:*", + "bucket:*", + "credentials:*", + "profile:Read", "profile:Update", + "pool:Read", + "user:List", + "app:*", + "resources:Read", + "config:Read", + "auth:Token", + "router:Client", + "system:*" + ], + "resources": ["*"] + } + ] + }, + "immutable": false +} +``` + +### osmo-viewer (example new role) + +Read-only access: + +```json +{ + "name": "osmo-viewer", + "description": "Read-only access to workflows", + "policy": { + "statements": [ + { + "effect": "Allow", + "actions": [ + "workflow:Read", "workflow:List", + "bucket:Read", "bucket:List", + "system:*" + ], + "resources": ["*"] + } + ] + }, + "immutable": false +} +``` + +### osmo-backend + +For backend agents: + +```json +{ + "name": "osmo-backend", + "description": "For backend agents", + "policy": { + "statements": [ + { + "effect": "Allow", + "actions": ["internal:Operator", "pool:Read", "config:Read"], + "resources": ["backend/*", "pool/*", "config/backend"] + } + ] + }, + "immutable": true +} +``` + +### osmo-ctrl + +For workflow pods: + +```json +{ + "name": "osmo-ctrl", + "description": "For workflow pods", + "policy": { + "statements": [ + { + "effect": "Allow", + "actions": ["internal:Logger", "internal:Router"], + "resources": ["*"] + } + ] + }, + "immutable": true +} +``` + +### osmo-default + +Minimal access for unauthenticated users: + +```json +{ + "name": "osmo-default", + "description": 
"Default role for unauthenticated access", + "policy": { + "statements": [ + { + "effect": "Allow", + "actions": ["system:Health", "system:Version", "auth:Login", "auth:Refresh", "auth:Token"], + "resources": ["*"] + } + ] + }, + "immutable": true +} +``` + +--- + +## Authorization Flow + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ AUTHORIZATION FLOW │ +└─────────────────────────────────────────────────────────────────────────────┘ + + 1. Request arrives: POST /api/workflow/abc123/cancel + + 2. Path Resolution: + ┌───────────────────────────────────────────────────────────────────┐ + │ Request: POST /api/workflow/abc123/cancel │ + │ ↓ │ + │ Match against ActionRegistry patterns │ + │ ↓ │ + │ Resolved: workflow:Cancel on resource workflow/abc123 │ + └───────────────────────────────────────────────────────────────────┘ + + 3. Policy Evaluation: + ┌───────────────────────────────────────────────────────────────────┐ + │ User roles: [osmo-user, custom-role] │ + │ ↓ │ + │ Collect all policy statements from roles │ + │ ↓ │ + │ Evaluate workflow:Cancel against statements │ + │ ↓ │ + │ Check: Does any Deny statement match? (if yes → DENY) │ + │ Check: Does any Allow statement match? 
(if yes → ALLOW) │ + │ ↓ │ + │ No match → DENY (implicit deny) │ + └───────────────────────────────────────────────────────────────────┘ +``` + +### Authorization Algorithm + +```go +func (e *PolicyEvaluator) CheckAccess(ctx context.Context, req AuthzRequest) (bool, error) { + // Step 1: Resolve API path to action(s) + actions := e.registry.ResolvePathToActions(req.Path, req.Method) + if len(actions) == 0 { + return false, nil // No action mapping = deny + } + + // Step 2: Get user's roles and their policies + policies, err := e.getPoliciesForRoles(ctx, req.Roles) + if err != nil { + return false, err + } + + // Step 3: Evaluate each resolved action + for _, action := range actions { + resource := e.extractResource(req.Path, action) + + // Check for explicit Deny first (Deny always wins) + for _, policy := range policies { + if e.matchesDenyStatement(policy, action, resource) { + return false, nil + } + } + + // Check for Allow + allowed := false + for _, policy := range policies { + if e.matchesAllowStatement(policy, action, resource) { + allowed = true + break + } + } + + if !allowed { + return false, nil + } + } + + return true, nil +} + +func (e *PolicyEvaluator) actionMatches(patterns []string, action string) bool { + for _, pattern := range patterns { + if pattern == "*:*" || pattern == action { + return true + } + // Handle wildcards: "workflow:*" matches "workflow:Create" + if strings.HasSuffix(pattern, ":*") { + prefix := strings.TrimSuffix(pattern, ":*") + if strings.HasPrefix(action, prefix+":") { + return true + } + } + // Handle resource wildcards: "*:Read" matches "workflow:Read" + if strings.HasPrefix(pattern, "*:") { + suffix := strings.TrimPrefix(pattern, "*:") + if strings.HasSuffix(action, ":"+suffix) { + return true + } + } + } + return false +} +``` + +--- + +## Implementation + +### New Files + +| File | Description | +|------|-------------| +| `external/src/service/authz_sidecar/server/action_registry.go` | Action constants and path mappings 
| +| `external/src/service/authz_sidecar/server/policy_evaluator.go` | Policy evaluation logic | +| `external/src/service/authz_sidecar/server/action_registry_test.go` | Unit tests for registry | +| `external/src/service/authz_sidecar/server/policy_evaluator_test.go` | Unit tests for evaluator | + +### Database Schema Changes + +```sql +-- Option 1: Modify existing column +ALTER TABLE roles +ALTER COLUMN policies TYPE JSONB +USING policies[1]; + +-- Option 2: Add new column for v2 policies (safer for migration) +ALTER TABLE roles +ADD COLUMN policy_v2 JSONB; + +-- Add index for policy queries +CREATE INDEX idx_roles_policy_v2 ON roles USING GIN (policy_v2); +``` + +### Changes to authz_server.go + +1. Add path-to-action resolution using ActionRegistry +2. Replace direct path matching with action-based policy evaluation +3. Support wildcard matching for actions (`workflow:*`, `*:*`) +4. Implement Deny precedence logic + +--- + +## Migration Strategy + +### Phase 1: Add New System (Parallel) + +1. Implement action registry in authz_sidecar +2. Add policy evaluator supporting both old and new formats +3. Store new-format policies in `policy_v2` column +4. Default roles continue using old format + +### Phase 2: Migrate Default Roles + +1. Create new-format policies for all default roles +2. Add feature flag to switch between old/new evaluation +3. Test extensively in development + +### Phase 3: Migrate Custom Roles + +1. Write migration script to convert existing policies: + ```python + # Convert old format + {"actions": [{"base": "http", "path": "/api/workflow/*", "method": "*"}]} + + # To new format (using path-to-action reverse lookup) + {"statements": [{"effect": "Allow", "actions": ["workflow:*"], "resources": ["*"]}]} + ``` +2. Run migration +3. Validate all access patterns unchanged + +### Phase 4: Deprecate Old Format + +1. Remove old policy evaluation code +2. Drop old `policies` column +3. Rename `policy_v2` to `policy` +4. 
Update documentation + + --- + + ## Backwards Compatibility + + The new resource-action model is designed to run alongside the existing path-based model during migration: + + - **Parallel evaluation** — Both old and new policy formats can be evaluated simultaneously using a feature flag + - **No breaking changes to existing roles** — Existing path-based policies continue to work until explicitly migrated + - **Gradual migration** — Roles can be migrated one at a time; mixed environments are supported + - **Rollback capability** — Feature flag allows instant rollback to old evaluation if issues arise + + Once migration is complete (Phase 4), the old format will be removed. This is a one-way migration with no long-term backward compatibility for the old format. + + --- + + ## Performance + + No significant performance implications are expected. + + --- + + ## Operations + + No significant operational changes. + + --- + + ## Security + + No new security concerns introduced. + + --- + + ## Documentation + + The following documentation will need to be created or updated: + + | Document | Action | + |---|---| + | Role management guide | Update to describe new policy format with examples | + | API reference for `/api/roles` | Update request/response schemas for new policy format | + | Default roles reference | Document what actions each default role grants | + | Action reference | New document listing all available actions and their meanings | + | Migration guide | New document for customers migrating custom roles | + + --- + + ## Testing + + ### Unit Tests [Not added yet] + + - `action_registry_test.go` — Test path-to-action resolution for all registered actions + - `policy_evaluator_test.go` — Test Allow/Deny evaluation, wildcard matching, deny precedence + + ### Integration Tests + + - Verify all existing access patterns work identically with the new model + - Test that the migration script converts policies correctly + - Test feature flag switching between old and new evaluation + + ### Test Metrics + + | Metric |
Target | +|---|---| +| Unit test coverage | >90% for action_registry.go and policy_evaluator.go | +| Integration test pass rate | 100% | +| Migration accuracy | 100% of existing roles produce identical access decisions | + +--- + +## Complete Action Registry + +```go +package server + +// ActionRegistry maps resource:action pairs to API endpoint patterns +var ActionRegistry = map[string][]EndpointPattern{ + // ==================== WORKFLOW ==================== + "workflow:Create": { + {Path: "/api/workflow", Methods: []string{"POST"}}, + }, + "workflow:Read": { + {Path: "/api/workflow", Methods: []string{"GET"}}, + {Path: "/api/workflow/*", Methods: []string{"GET"}}, + {Path: "/api/task", Methods: []string{"GET"}}, + {Path: "/api/task/*", Methods: []string{"GET"}}, + {Path: "/api/tag", Methods: []string{"GET"}}, + }, + "workflow:Update": { + {Path: "/api/workflow/*", Methods: []string{"PUT", "PATCH"}}, + }, + "workflow:Delete": { + {Path: "/api/workflow/*", Methods: []string{"DELETE"}}, + }, + "workflow:Cancel": { + {Path: "/api/workflow/*/cancel", Methods: []string{"POST"}}, + }, + "workflow:Exec": { + {Path: "/api/workflow/*/exec", Methods: []string{"POST", "WEBSOCKET"}}, + }, + "workflow:PortForward": { + {Path: "/api/workflow/*/portforward/*", Methods: []string{"*"}}, + }, + "workflow:Rsync": { + {Path: "/api/workflow/*/rsync", Methods: []string{"POST"}}, + }, + + // ==================== BUCKET ==================== + "bucket:Create": { + {Path: "/api/bucket", Methods: []string{"POST"}}, + }, + "bucket:Read": { + {Path: "/api/bucket", Methods: []string{"GET"}}, + {Path: "/api/bucket/*", Methods: []string{"GET"}}, + }, + "bucket:Write": { + {Path: "/api/bucket/*", Methods: []string{"POST", "PUT"}}, + }, + "bucket:Delete": { + {Path: "/api/bucket/*", Methods: []string{"DELETE"}}, + }, + + // ==================== POOL ==================== + "pool:Delete": { + {Path: "/api/pool/*", Methods: []string{"DELETE"}}, + }, + + // ==================== CREDENTIALS 
==================== + "credentials:Create": { + {Path: "/api/credentials", Methods: []string{"POST"}}, + }, + "credentials:Read": { + {Path: "/api/credentials", Methods: []string{"GET"}}, + {Path: "/api/credentials/*", Methods: []string{"GET"}}, + }, + "credentials:Update": { + {Path: "/api/credentials/*", Methods: []string{"PUT", "PATCH"}}, + }, + "credentials:Delete": { + {Path: "/api/credentials/*", Methods: []string{"DELETE"}}, + }, + + // ==================== PROFILE ==================== + "profile:Read": { + {Path: "/api/profile/*", Methods: []string{"GET"}}, + }, + "profile:Update": { + {Path: "/api/profile/*", Methods: []string{"PUT", "PATCH"}}, + }, + + // ==================== USER ==================== + "user:List": { + {Path: "/api/users", Methods: []string{"GET"}}, + }, + + // ==================== APP ==================== + "app:Create": { + {Path: "/api/app", Methods: []string{"POST"}}, + }, + "app:Read": { + {Path: "/api/app", Methods: []string{"GET"}}, + {Path: "/api/app/*", Methods: []string{"GET"}}, + }, + "app:Update": { + {Path: "/api/app/*", Methods: []string{"PUT", "PATCH"}}, + }, + "app:Delete": { + {Path: "/api/app/*", Methods: []string{"DELETE"}}, + }, + + // ==================== RESOURCES ==================== + "resources:Read": { + {Path: "/api/resources", Methods: []string{"GET"}}, + {Path: "/api/resources/*", Methods: []string{"GET"}}, + }, + + // ==================== CONFIG ==================== + "config:Read": { + {Path: "/api/configs/*", Methods: []string{"GET"}}, + }, + "config:Update": { + {Path: "/api/configs/*", Methods: []string{"PUT", "PATCH"}}, + }, + + // ==================== AUTH ==================== + "auth:Login": { + {Path: "/api/auth/login", Methods: []string{"GET"}}, + {Path: "/api/auth/keys", Methods: []string{"GET"}}, + }, + "auth:Refresh": { + {Path: "/api/auth/refresh_token", Methods: []string{"*"}}, + {Path: "/api/auth/jwt/refresh_token", Methods: []string{"*"}}, + {Path: "/api/auth/jwt/access_token", Methods: 
[]string{"*"}}, + }, + "auth:Token": { + {Path: "/api/auth/access_token", Methods: []string{"*"}}, + {Path: "/api/auth/access_token/user", Methods: []string{"*"}}, + {Path: "/api/auth/access_token/user/*", Methods: []string{"*"}}, + }, + "auth:ServiceToken": { + {Path: "/api/auth/access_token", Methods: []string{"*"}}, + {Path: "/api/auth/access_token/service", Methods: []string{"*"}}, + {Path: "/api/auth/access_token/service/*", Methods: []string{"*"}}, + }, + + // ==================== ROUTER ==================== + // This one is still WIP to see if it is needed or can be merged + "router:Client": { + {Path: "/api/router/webserver/*/", Methods: []string{"*"}}, + {Path: "/api/router/webserver_enabled", Methods: []string{"*"}}, + {Path: "/api/router/*/*/client/*", Methods: []string{"*"}}, + }, + + // ==================== SYSTEM (PUBLIC) ==================== + "system:Health": { + {Path: "/health", Methods: []string{"*"}}, + }, + "system:Version": { + {Path: "/api/version", Methods: []string{"*"}}, + {Path: "/api/router/version", Methods: []string{"*"}}, + {Path: "/client/version", Methods: []string{"*"}}, + }, + + // ==================== INTERNAL (RESTRICTED) ==================== + "internal:Operator": { + {Path: "/api/agent/listener/*", Methods: []string{"*"}}, + {Path: "/api/agent/worker/*", Methods: []string{"*"}}, + }, + "internal:Logger": { + {Path: "/api/logger/workflow/*", Methods: []string{"*"}}, + }, + "internal:Router": { + {Path: "/api/router/*/*/backend/*", Methods: []string{"*"}}, + }, +} +``` + +--- + +## Next Steps + +1. **Implement action registry** — Create `action_registry.go` with constants and mappings +2. **Implement policy evaluator** — Create `policy_evaluator.go` with AWS-style evaluation +3. **Add unit tests** — Test path resolution and policy matching +4. **Update authz_server.go** — Integrate new evaluator with feature flag +5. **Create migration script** — Convert existing roles to new format +6. 
**Test in development** — Validate all existing access patterns +7. **Roll out to staging** — Enable feature flag and monitor +8. **Document** — User-facing documentation for new policy format
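
As a starting point for step 1, the path-to-action resolution that `CheckAccess` relies on might look like the sketch below. It is an illustration under stated assumptions, not the final `ResolvePathToActions`: `resolveActions`, `pathMatches`, and `methodMatches` are hypothetical helpers, and the glob semantics (a `*` matches exactly one path segment, except a trailing `*`, which matches the rest of the path) are assumed rather than taken from the existing middleware.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// EndpointPattern matches the shape used in the registry above.
type EndpointPattern struct {
	Path    string
	Methods []string
}

// pathMatches compares a registry pattern with a request path segment by
// segment. Assumption: "*" matches one segment, and a trailing "*" matches
// one or more remaining segments.
func pathMatches(pattern, path string) bool {
	pSegs := strings.Split(strings.Trim(pattern, "/"), "/")
	segs := strings.Split(strings.Trim(path, "/"), "/")
	for i, p := range pSegs {
		if i >= len(segs) {
			return false
		}
		if p == "*" {
			if i == len(pSegs)-1 {
				return true // trailing wildcard covers the rest of the path
			}
			continue
		}
		if p != segs[i] {
			return false
		}
	}
	return len(pSegs) == len(segs)
}

// methodMatches treats "*" as any method and compares case-insensitively.
func methodMatches(methods []string, method string) bool {
	for _, m := range methods {
		if m == "*" || strings.EqualFold(m, method) {
			return true
		}
	}
	return false
}

// resolveActions returns every action whose patterns match the request,
// sorted so the result is deterministic despite map iteration order.
func resolveActions(registry map[string][]EndpointPattern, method, path string) []string {
	var actions []string
	for action, patterns := range registry {
		for _, ep := range patterns {
			if methodMatches(ep.Methods, method) && pathMatches(ep.Path, path) {
				actions = append(actions, action)
				break
			}
		}
	}
	sort.Strings(actions)
	return actions
}

func main() {
	// A small excerpt of the registry, copied from the entries above.
	registry := map[string][]EndpointPattern{
		"workflow:Read": {
			{Path: "/api/workflow", Methods: []string{"GET"}},
			{Path: "/api/workflow/*", Methods: []string{"GET"}},
		},
		"workflow:Cancel": {
			{Path: "/api/workflow/*/cancel", Methods: []string{"POST"}},
		},
		"auth:Token": {
			{Path: "/api/auth/access_token", Methods: []string{"*"}},
		},
		"auth:ServiceToken": {
			{Path: "/api/auth/access_token", Methods: []string{"*"}},
		},
	}
	fmt.Println(resolveActions(registry, "POST", "/api/workflow/abc123/cancel")) // [workflow:Cancel]
	fmt.Println(resolveActions(registry, "GET", "/api/auth/access_token"))       // [auth:ServiceToken auth:Token]
}
```

The second call shows that one path can resolve to more than one action. Because the `CheckAccess` loop requires every resolved action to be allowed, overlapping patterns such as `/api/auth/access_token` are worth covering explicitly in the registry unit tests.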