Add Early Input Validation in CLI #16
Open
Description
The CLI accepts a config file but doesn't validate it until late in the execution pipeline. This means:
- Users wait through expensive setup (model loading, parsing) only to fail on invalid config
- Error messages appear after significant processing
- No quick feedback loop
Current Flow
1. User runs: sdg generate config.yaml
2. Config file loaded
3. Document parsing starts (slow!)
4. Model initialization begins
5. ❌ ERROR: Invalid generation config
The user waits 2-3 minutes before seeing the error.
Proposed Flow
1. User runs: sdg generate config.yaml
2. ✅ Config validation (< 1 second)
3. ❌ ERROR: Invalid generation config
Exit immediately
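The proposed ordering can be sketched without any dependencies. This is only an illustration of the fail-fast idea, not the project's code: `find_config_errors` and `run` are hypothetical names, and the single `generation.num_samples` check stands in for whatever `SDGConfig` actually validates.

```python
from typing import Callable


def find_config_errors(raw: dict) -> list[str]:
    """Return human-readable problems found in a raw config mapping.

    Hypothetical stand-in for the schema checks SDGConfig would perform.
    """
    errors = []
    generation = raw.get("generation")
    if not isinstance(generation, dict):
        errors.append("generation: section is missing")
    else:
        num_samples = generation.get("num_samples")
        # bool is a subclass of int, so exclude it explicitly
        is_int = isinstance(num_samples, int) and not isinstance(num_samples, bool)
        if not is_int or num_samples <= 0:
            errors.append("generation.num_samples: must be a positive integer")
    return errors


def run(raw: dict, expensive_setup: Callable[[], None]) -> int:
    """Validate first; start the slow steps only when the config is clean."""
    errors = find_config_errors(raw)
    if errors:
        for message in errors:
            print(f"ERROR: {message}")
        return 1  # non-zero exit code; the expensive setup never started
    expensive_setup()  # document parsing, model initialization, ...
    return 0
```

Passing the slow work in as a callable makes the ordering easy to test: an invalid config should return a non-zero code without the callable ever being invoked.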
Implementation
Add Validation Command
# In cli.py
# Assumption: SDGConfig is a pydantic model and `console` is a rich Console.
from pathlib import Path

import typer
from pydantic import ValidationError

@app.command()
def validate(config_path: Path):
    """Validate configuration file without running."""
    try:
        config = SDGConfig.from_yaml(config_path)
    except ValidationError as e:
        console.print("[red]✗ Invalid configuration:[/red]")
        for error in e.errors():
            console.print(f"  {error['loc']}: {error['msg']}")
        # Use typer.Exit (not sys.exit) for consistency with the generate command
        raise typer.Exit(1)

    console.print("[green]✓ Configuration is valid[/green]")
    # Show warnings
    if config.generation.num_samples > 10000:
        console.print("[yellow]⚠ Large sample count may take hours[/yellow]")
    # Estimate resources (estimate_token_usage is a helper still to be written)
    estimated_tokens = estimate_token_usage(config)
    estimated_cost = estimated_tokens * 0.00001  # Example pricing
    console.print(f"[blue]Estimated cost: ${estimated_cost:.2f}[/blue]")
Enhance Main Command
@app.command()
def generate(
    config_path: Path,
    validate_only: bool = typer.Option(False, "--validate-only", help="Only validate config")
):
    """Generate synthetic data."""
    # Early validation
    try:
        config = SDGConfig.from_yaml(config_path)
    except ValidationError as e:
        console.print("[red]Configuration errors:[/red]")
        for error in e.errors():
            field = " → ".join(str(loc) for loc in error['loc'])
            console.print(f"  [yellow]{field}[/yellow]: {error['msg']}")
        raise typer.Exit(1)
    if validate_only:
        console.print("[green]✓ Configuration is valid[/green]")
        raise typer.Exit(0)
    # Check prerequisites
    if config.task.method == "local" and not config.task.document_path:
        console.print("[red]Error: document_path required for local method[/red]")
        raise typer.Exit(1)
    if config.task.method == "web" and not config.task.dataset_id:
        console.print("[red]Error: dataset_id required for web method[/red]")
        raise typer.Exit(1)
    # File existence checks
    if config.task.method == "local":
        doc_path = Path(config.task.document_path)
        if not doc_path.exists():
            console.print(f"[red]Error: Document not found: {doc_path}[/red]")
            raise typer.Exit(1)
    # Continue with execution...
Add Pre-flight Checks
import os
import shutil
from pathlib import Path

import torch

def validate_environment(config: SDGConfig) -> list[str]:
    """Check environment prerequisites.

    Returns non-fatal warnings; raises ValueError for fatal problems.
    """
    warnings = []
    # Check GPU availability
    if config.model.device == "cuda" and not torch.cuda.is_available():
        warnings.append("CUDA requested but not available. Falling back to CPU.")
    # Check API keys
    if config.model.provider == "openai" and not os.getenv("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY not set")
    if config.model.provider == "anthropic" and not os.getenv("ANTHROPIC_API_KEY"):
        raise ValueError("ANTHROPIC_API_KEY not set")
    # Check disk space (shutil.disk_usage is portable; os.statvfs is Unix-only)
    cache_dir = Path(".cache")
    # Fall back to the current directory if the cache directory does not exist yet
    usage = shutil.disk_usage(cache_dir if cache_dir.exists() else Path.cwd())
    free_gb = usage.free / (1024**3)
    if free_gb < 10:
        warnings.append(f"Low disk space: {free_gb:.1f}GB free")
    return warnings
Example Output
$ sdg generate config.yaml
⚙ Validating configuration...
✓ Config structure valid
✓ Model configuration valid
✓ Task configuration valid
⚠ Warnings:
• Large sample count (10000) - estimated 2-3 hours
• GPU not available - using CPU (slower)
💰 Cost Estimate:
• Tokens: ~500,000
• Estimated cost: $5.00 (gpt-4)
📊 Resource Estimate:
• Time: 2-3 hours
• Disk space: ~2GB
Continue? [y/N]:
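The cost line and the final prompt in the output above could be backed by something like the following. It is a rough sketch under stated assumptions: `estimate_cost`, `format_estimate`, and `confirm` are hypothetical helpers, the 50 tokens per sample is an invented figure chosen only so the numbers match the example, and the per-token price is the example pricing used in the validate command.

```python
def estimate_cost(num_samples: int, tokens_per_sample: int,
                  price_per_token: float) -> tuple[int, float]:
    """Return (total tokens, estimated cost) for a generation run."""
    tokens = num_samples * tokens_per_sample
    return tokens, tokens * price_per_token


def format_estimate(tokens: int, cost: float) -> list[str]:
    """Render the two cost lines shown in the example output."""
    return [f"• Tokens: ~{tokens:,}", f"• Estimated cost: ${cost:.2f}"]


def confirm(answer: str) -> bool:
    """Interpret a reply to 'Continue? [y/N]:'; anything but yes means no."""
    return answer.strip().lower() in {"y", "yes"}


if __name__ == "__main__":
    # Assumed figures: 10,000 samples at 50 tokens each, example pricing
    tokens, cost = estimate_cost(10_000, 50, 0.00001)
    for line in format_estimate(tokens, cost):
        print(line)
    if not confirm(input("Continue? [y/N]: ")):
        raise SystemExit(1)
```

Defaulting to "no" on an empty reply matches the capital N in the prompt, so pressing Enter by accident never starts an hours-long run.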