Add Early Input Validation in CLI #16

@GenCEO

Description

The CLI accepts a config file but doesn't validate it until late in the execution pipeline. This means:

  • Users wait through expensive setup (model loading, parsing) only to fail on invalid config
  • Error messages appear after significant processing
  • No quick feedback loop

Current Flow

1. User runs: sdg generate config.yaml
2. Config file loaded
3. Document parsing starts (slow!)
4. Model initialization begins
5. ❌ ERROR: Invalid generation config

The user waits 2-3 minutes before seeing the error.

Proposed Flow

1. User runs: sdg generate config.yaml
2. ✅ Config validation (< 1 second)
3. ❌ ERROR: Invalid generation config
   Exit immediately

Implementation

Add Validation Command

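# In cli.py
from pathlib import Path

import typer
from pydantic import ValidationError  # assumption: SDGConfig is a pydantic model

# `app`, `console`, SDGConfig and estimate_token_usage are the project's own
# objects and are expected to be defined or imported elsewhere in cli.py.


@app.command()
def validate(config_path: Path):
    """Validate configuration file without running."""
    try:
        config = SDGConfig.from_yaml(config_path)
        console.print("[green]✓ Configuration is valid[/green]")

        # Show warnings
        if config.generation.num_samples > 10000:
            console.print("[yellow]⚠ Large sample count may take hours[/yellow]")

        # Estimate resources
        estimated_tokens = estimate_token_usage(config)
        estimated_cost = estimated_tokens * 0.00001  # Example pricing
        console.print(f"[blue]Estimated cost: ${estimated_cost:.2f}[/blue]")

    except ValidationError as e:
        console.print("[red]✗ Invalid configuration:[/red]")
        for error in e.errors():
            # error['loc'] is a tuple of path segments; join it for readability
            field = " → ".join(str(loc) for loc in error['loc'])
            console.print(f"  {field}: {error['msg']}")
        # Exit through Typer, consistent with the generate command
        raise typer.Exit(1)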
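
`estimate_token_usage` is called above but not defined in this issue. A minimal sketch of what it could look like, assuming a hypothetical `max_tokens_per_sample` field next to `num_samples` (the real `SDGConfig` layout may differ) and the example price of $0.00001 per token from the snippet:

```python
from dataclasses import dataclass

PRICE_PER_TOKEN = 0.00001  # example pricing from the issue, not a real rate


@dataclass
class GenerationSettings:
    """Hypothetical stand-in for the `generation` section of SDGConfig."""
    num_samples: int
    max_tokens_per_sample: int  # assumed field, not confirmed by the issue


@dataclass
class Config:
    """Hypothetical stand-in for SDGConfig."""
    generation: GenerationSettings


def estimate_token_usage(config: Config) -> int:
    """Upper bound: assume every sample uses its full token budget."""
    gen = config.generation
    return gen.num_samples * gen.max_tokens_per_sample


def estimate_cost(config: Config) -> float:
    """Convert the token estimate to a cost at the example price."""
    return estimate_token_usage(config) * PRICE_PER_TOKEN


config = Config(GenerationSettings(num_samples=10_000, max_tokens_per_sample=50))
tokens = estimate_token_usage(config)
cost = estimate_cost(config)
print(f"Tokens: ~{tokens:,}")
print(f"Estimated cost: ${cost:.2f}")
```

Because this only multiplies two config values, it runs without loading any model, which keeps the validation step well under a second.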

Enhance Main Command

@app.command()
def generate(
    config_path: Path,
    validate_only: bool = typer.Option(False, "--validate-only", help="Only validate config")
):
    """Generate synthetic data."""

    # Early validation
    try:
        config = SDGConfig.from_yaml(config_path)
    except ValidationError as e:
        console.print("[red]Configuration errors:[/red]")
        for error in e.errors():
            field = " → ".join(str(loc) for loc in error['loc'])
            console.print(f"  [yellow]{field}[/yellow]: {error['msg']}")
        raise typer.Exit(1)

    # Check prerequisites before reporting success,
    # so that --validate-only catches these problems too
    if config.task.method == "local":
        if not config.task.document_path:
            console.print("[red]Error: document_path required for local method[/red]")
            raise typer.Exit(1)

        # File existence check
        doc_path = Path(config.task.document_path)
        if not doc_path.exists():
            console.print(f"[red]Error: Document not found: {doc_path}[/red]")
            raise typer.Exit(1)

    if config.task.method == "web" and not config.task.dataset_id:
        console.print("[red]Error: dataset_id required for web method[/red]")
        raise typer.Exit(1)

    if validate_only:
        console.print("[green]✓ Configuration is valid[/green]")
        raise typer.Exit(0)

    # Continue with execution...
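
Both commands format validation errors in slightly different ways. A small shared helper could keep them consistent. This is a sketch only: `format_validation_errors` is a hypothetical name, and the sample errors are hand-written to mimic the list-of-dicts shape (`loc`, `msg`) that the snippets above read from `e.errors()`.

```python
from typing import Any


def format_validation_errors(errors: list[dict[str, Any]]) -> list[str]:
    """Turn error dicts into 'path → to → field: message' lines."""
    lines = []
    for err in errors:
        # loc is a tuple of keys and list indices; it may be empty for
        # errors that apply to the whole document rather than one field
        field = " → ".join(str(part) for part in err["loc"]) or "<root>"
        lines.append(f"{field}: {err['msg']}")
    return lines


# Hand-written sample input, not real validator output
sample_errors = [
    {"loc": ("generation", "num_samples"), "msg": "Input should be a valid integer"},
    {"loc": ("task", "sources", 0, "path"), "msg": "Field required"},
    {"loc": (), "msg": "Config is empty"},
]

formatted = format_validation_errors(sample_errors)
for line in formatted:
    print(line)
```

Returning plain strings keeps the helper independent of the console, so it can be unit-tested without capturing terminal output.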

Add Pre-flight Checks

import os
import shutil
from pathlib import Path

import torch


def validate_environment(config: SDGConfig) -> list[str]:
    """Check environment prerequisites.

    Returns non-fatal warnings; raises ValueError for a missing API key.
    """
    warnings = []

    # Check GPU availability
    if config.model.device == "cuda" and not torch.cuda.is_available():
        warnings.append("CUDA requested but not available. Falling back to CPU.")

    # Check API keys
    if config.model.provider == "openai" and not os.getenv("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY not set")

    if config.model.provider == "anthropic" and not os.getenv("ANTHROPIC_API_KEY"):
        raise ValueError("ANTHROPIC_API_KEY not set")

    # Check disk space. shutil.disk_usage is portable (os.statvfs is Unix-only),
    # and we fall back to the working directory if .cache does not exist yet.
    cache_dir = Path(".cache")
    usage_root = cache_dir if cache_dir.exists() else Path.cwd()
    free_gb = shutil.disk_usage(usage_root).free / (1024**3)
    if free_gb < 10:
        warnings.append(f"Low disk space: {free_gb:.1f}GB free")

    return warnings
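
`validate_environment` stops at the first missing API key while returning every other problem as a warning. One possible alternative, sketched here under assumptions, collects errors and warnings in a single report so the CLI can print everything at once. `PreflightReport`, `check_api_key` and `check_disk` are hypothetical names; only the environment variable names and the 10 GB threshold come from the snippet above.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping


@dataclass
class PreflightReport:
    """Hypothetical container separating fatal errors from warnings."""
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Warnings alone do not block execution
        return not self.errors


# provider name -> required environment variable (names from the issue)
REQUIRED_KEYS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def check_api_key(provider: str, report: PreflightReport,
                  env: Mapping[str, str] = os.environ) -> None:
    """Record an error if the provider needs a key that is missing or empty."""
    var = REQUIRED_KEYS.get(provider)
    if var and not env.get(var):
        report.errors.append(f"{var} not set")


def check_disk(free_gb: float, report: PreflightReport,
               min_gb: float = 10.0) -> None:
    """Record a warning if free space is below the threshold."""
    if free_gb < min_gb:
        report.warnings.append(f"Low disk space: {free_gb:.1f}GB free")


report = PreflightReport()
check_api_key("openai", report, env={})  # empty env simulates a missing key
check_disk(3.5, report)
print("ok:", report.ok)
print("errors:", report.errors)
print("warnings:", report.warnings)
```

Passing `env` and `free_gb` in as arguments is a testing convenience: the checks can be exercised without touching the real environment or filesystem.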

Example Output

$ sdg generate config.yaml

⚙ Validating configuration...
✓ Config structure valid
✓ Model configuration valid
✓ Task configuration valid

⚠ Warnings:
  • Large sample count (10000) - estimated 2-3 hours
  • GPU not available - using CPU (slower)

💰 Cost Estimate:
  • Tokens: ~500,000
  • Estimated cost: $5.00 (gpt-4)

📊 Resource Estimate:
  • Time: 2-3 hours
  • Disk space: ~2GB

Continue? [y/N]:
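
The `Continue? [y/N]:` prompt would block forever in CI or when output is piped. A minimal sketch of a prompt that defaults to "no" and refuses to wait on non-interactive input follows. The `assume_yes` parameter (for example behind a `--yes` flag) is an assumption and not part of this proposal; Typer's built-in `typer.confirm` could replace the `input` call.

```python
import sys


def parse_confirmation(reply: str, default: bool = False) -> bool:
    """Interpret a y/N reply; an empty reply returns the default."""
    answer = reply.strip().lower()
    if not answer:
        return default
    return answer in ("y", "yes")


def confirm(prompt: str = "Continue? [y/N]: ", assume_yes: bool = False,
            stdin=None, reader=input) -> bool:
    """Ask for confirmation, without blocking when stdin is not a terminal."""
    if assume_yes:
        return True  # hypothetical --yes flag: skip the prompt entirely
    stream = sys.stdin if stdin is None else stdin
    if stream is None or not stream.isatty():
        return False  # non-interactive run: take the safe default
    return parse_confirmation(reader(prompt))
```

Treating a non-interactive run as "no" matches the capital `N` in the prompt: nothing expensive starts unless the user opts in explicitly.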
