This module provides comprehensive statistical analysis, performance profiling, and model evaluation capabilities for GNN models and pipeline components.
src/analysis/
├── __init__.py # Module initialization and exports
├── README.md # This documentation
├── processor.py # Main analysis processor
├── analyzer.py # Statistical analysis functions
├── post_simulation.py # Post-simulation analysis
└── mcp.py # Model Context Protocol integration
graph TB
subgraph "Input Processing"
GNNFiles[GNN Files]
ExecResults[Execution Results]
Processor[processor.py]
end
subgraph "Analysis Components"
Statistical[Statistical Analysis]
Complexity[Complexity Metrics]
Performance[Performance Benchmarks]
Comparison[Model Comparison]
end
subgraph "Post-Simulation Analysis"
Traces[Simulation Traces]
FreeEnergy[Free Energy Analysis]
Policy[Policy Convergence]
StateDist[State Distributions]
end
subgraph "Output Generation"
Summary[Analysis Summary]
Reports[Comparison Reports]
Metrics[Performance Metrics]
Visualizations[Visualizations]
end
GNNFiles --> Processor
ExecResults --> Processor
Processor --> Statistical
Processor --> Complexity
Processor --> Performance
Processor --> Comparison
ExecResults --> Traces
Traces --> FreeEnergy
Traces --> Policy
Traces --> StateDist
Statistical --> Summary
Complexity --> Summary
Performance --> Metrics
Comparison --> Reports
FreeEnergy --> Visualizations
Policy --> Visualizations
flowchart LR
subgraph "Pipeline Step 16"
Step16[16_analysis.py Orchestrator]
end
subgraph "Analysis Module"
Processor[processor.py]
Analyzer[analyzer.py]
PostSim[post_simulation.py]
end
subgraph "Input Sources"
Step3[Step 3: GNN]
Step12[Step 12: Execute]
Step13[Step 13: LLM]
end
subgraph "Downstream Steps"
Step20[Step 20: Website]
Step23[Step 23: Report]
end
Step16 --> Processor
Processor --> Analyzer
Processor --> PostSim
Step3 -->|Model Data| Processor
Step12 -->|Execution Results| Processor
Step13 -->|LLM Insights| Processor
Processor -->|Analysis Results| Step20
Processor -->|Analysis Results| Step23
Performs comprehensive statistical analysis on GNN model files.
Features:
- Variable distribution analysis
- Connection pattern analysis
- Complexity metrics calculation
- Performance benchmarking
- Model comparison capabilities
Returns:
- Dictionary containing comprehensive analysis results
- Statistical summaries and metrics
- Performance benchmarks
- Model comparison data
Extracts and analyzes variables from GNN content.
Features:
- Variable type classification
- Dimension analysis
- Data type validation
- Complexity assessment
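A minimal sketch of the kind of extraction described above, not the module's actual parser. It assumes variables are declared on their own line as `name[dim, dim, type=...]`; the function name `sketch_extract_variables` and the output keys `name`, `dimensions`, and `type` are hypothetical.

```python
import re
from typing import Any, Dict, List

# Assumed declaration shape: name[dim, dim, ..., type=...] (illustrative only)
_DECLARATION = re.compile(r"^\s*(\w+)\[([^\]]*)\]")


def sketch_extract_variables(content: str) -> List[Dict[str, Any]]:
    """Collect name, dimensions, and declared type from declaration-like lines."""
    variables: List[Dict[str, Any]] = []
    for line in content.splitlines():
        match = _DECLARATION.match(line)
        if match is None:
            continue  # not a declaration line
        name, body = match.groups()
        dimensions: List[int] = []
        data_type = "unknown"
        for part in (piece.strip() for piece in body.split(",")):
            if part.startswith("type="):
                data_type = part[len("type="):]
            elif part.isdigit():
                dimensions.append(int(part))
        variables.append({"name": name, "dimensions": dimensions, "type": data_type})
    return variables


sample = "A[3,3,type=float]  # likelihood\ns[3,1,type=float]\n## Connections\ns>A"
for variable in sketch_extract_variables(sample):
    print(variable)
```

Lines that do not start with a bracketed declaration (headers, connections, prose) are skipped rather than reported as errors; a real validator would likely want to flag them.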
Extracts and analyzes connections from GNN content.
Features:
- Connection pattern analysis
- Dependency mapping
- Graph structure analysis
- Connectivity metrics
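As a rough illustration of connection extraction, the sketch below assumes one connection per line, written as `source>target` (directed) or `source-target` (undirected), with optional `#` comments. The function name and the `source`, `target`, and `kind` keys are assumptions made for this example, not the module's confirmed output format.

```python
import re
from typing import Dict, List

# Assumed edge syntax: "A>B" (directed) or "A-B" (undirected), one per line
_EDGE = re.compile(r"^\s*(\w+)\s*(>|-)\s*(\w+)\s*$")


def sketch_extract_connections(content: str) -> List[Dict[str, str]]:
    """Return one dict per edge-like line, ignoring everything else."""
    connections: List[Dict[str, str]] = []
    for line in content.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments and header lines
        match = _EDGE.match(line)
        if match is None:
            continue
        source, operator, target = match.groups()
        kind = "directed" if operator == ">" else "undirected"
        connections.append({"source": source, "target": target, "kind": kind})
    return connections


sample = "## Connections\nD>s\ns-A  # coupled\nA>o"
for connection in sketch_extract_connections(sample):
    print(connection)
```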
Extracts and analyzes GNN sections for comprehensive analysis.
Features:
- Section type classification
- Content structure analysis
- Semantic analysis
- Validation metrics
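Section extraction could look roughly like the following, assuming GNN files are Markdown documents whose sections start with a `## ` header. The function name and the `name` and `non_empty_lines` keys are hypothetical.

```python
from typing import Any, Dict, List


def sketch_extract_sections(content: str) -> List[Dict[str, Any]]:
    """Split content on '## ' headers and count non-empty lines per section."""
    sections: List[Dict[str, Any]] = []
    current = None
    for line in content.splitlines():
        if line.startswith("## "):
            current = {"name": line[3:].strip(), "non_empty_lines": 0}
            sections.append(current)
        elif current is not None and line.strip():
            current["non_empty_lines"] += 1
    return sections


sample = "## ModelName\nDemo\n\n## Connections\nD>s\ns>A\n"
for section in sketch_extract_sections(sample):
    print(section)
```

Text that appears before the first header is ignored in this sketch.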
Calculates comprehensive statistics for variables.
Metrics:
- Type distribution
- Dimension statistics
- Complexity measures
- Memory usage estimates
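To make these metrics concrete, here is a small sketch that computes them from a list of variable dicts. The input keys (`type`, `dimensions`), the output keys other than `type_distribution`, and the 8-bytes-per-element memory estimate are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Any, Dict, List


def sketch_variable_statistics(
    variables: List[Dict[str, Any]], bytes_per_element: int = 8
) -> Dict[str, Any]:
    """Summarize types, ranks, and a rough memory estimate for variables."""
    # Element count per variable is the product of its dimensions (1 for scalars)
    element_counts = [math.prod(v.get("dimensions", [])) for v in variables]
    return {
        "count": len(variables),
        "type_distribution": dict(Counter(v.get("type", "unknown") for v in variables)),
        "max_rank": max((len(v.get("dimensions", [])) for v in variables), default=0),
        "total_elements": sum(element_counts),
        "estimated_bytes": sum(element_counts) * bytes_per_element,
    }


variables = [
    {"name": "A", "dimensions": [3, 3], "type": "float"},
    {"name": "t", "dimensions": [1], "type": "int"},
]
print(sketch_variable_statistics(variables))
```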
Calculates statistics for model connections.
Metrics:
- Connection density
- Graph metrics
- Dependency patterns
- Structural complexity
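One common definition of connection density for a directed graph is the number of distinct edges divided by `N * (N - 1)`, the number of possible edges without self-loops. The sketch below uses that definition; whether the module uses the same one is not confirmed here, and the function name and dict keys other than `density` are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def sketch_connection_statistics(
    variable_names: List[str], connections: List[Dict[str, str]]
) -> Dict[str, Any]:
    """Compute directed density and degree counts, ignoring self-loops."""
    node_count = len(set(variable_names))
    edges = {
        (c["source"], c["target"])
        for c in connections
        if c["source"] != c["target"]
    }
    possible = node_count * (node_count - 1)
    return {
        "edge_count": len(edges),
        "density": len(edges) / possible if possible else 0.0,
        "in_degree": dict(Counter(target for _, target in edges)),
        "out_degree": dict(Counter(source for source, _ in edges)),
    }


chain = [
    {"source": "D", "target": "s"},
    {"source": "s", "target": "A"},
    {"source": "A", "target": "o"},
]
print(sketch_connection_statistics(["D", "s", "A", "o"], chain)["density"])
```

For the four-variable chain above there are 3 edges out of 12 possible, so the density is 0.25.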
Calculates statistics for GNN sections.
Metrics:
- Section distribution
- Content analysis
- Validation status
- Quality metrics
calculate_cyclomatic_complexity(variables: List[Dict[str, Any]], connections: List[Dict[str, Any]]) -> float
Calculates cyclomatic complexity of the model.
Formula:
Complexity = E - N + 2P
Where:
- E = Number of edges (connections)
- N = Number of nodes (variables)
- P = Number of connected components
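The formula can be sketched directly: count edges and nodes, then count connected components with a small union-find. This treats connections as undirected when counting components and assumes `name`, `source`, and `target` keys, which may differ from the module's real data layout.

```python
from typing import Any, Dict, List


def sketch_cyclomatic_complexity(
    variables: List[Dict[str, Any]], connections: List[Dict[str, Any]]
) -> float:
    """Return E - N + 2P for the graph formed by variables and connections."""
    edges = [(c["source"], c["target"]) for c in connections]
    nodes = {v["name"] for v in variables}
    for source, target in edges:
        nodes.update((source, target))  # include endpoints not declared as variables

    parent = {node: node for node in nodes}

    def find(node: str) -> str:
        # Follow parent links to the component root, compressing the path
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        parent[find(source)] = find(target)  # merge the two components

    components = len({find(node) for node in nodes})
    return float(len(edges) - len(nodes) + 2 * components)


variables = [{"name": n} for n in ("D", "s", "A", "o")]
chain = [
    {"source": "D", "target": "s"},
    {"source": "s", "target": "A"},
    {"source": "A", "target": "o"},
]
print(sketch_cyclomatic_complexity(variables, chain))  # 3 - 4 + 2*1 = 1.0
```

Adding an edge that closes a cycle (for example `o` back to `D`) raises the result to 2.0, and an isolated variable adds one component and one node, which also yields 2.0.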
calculate_cognitive_complexity(variables: List[Dict[str, Any]], connections: List[Dict[str, Any]]) -> float
Calculates cognitive complexity based on model structure.
Factors:
- Variable type diversity
- Connection patterns
- Nesting levels
- Semantic complexity
calculate_structural_complexity(variables: List[Dict[str, Any]], connections: List[Dict[str, Any]]) -> float
Calculates structural complexity metrics.
Metrics:
- Graph density
- Clustering coefficient
- Path length analysis
- Modularity measures
Runs comprehensive performance benchmarks.
Benchmarks:
- Processing time analysis
- Memory usage profiling
- CPU utilization
- I/O performance
- Scalability testing
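A benchmark of processing time and memory can be approximated with the standard library alone. The sketch below times a callable with `time.perf_counter` and records peak allocations with `tracemalloc`; the function name and result keys are hypothetical, and CPU utilization and I/O are left out because they would need something like `psutil`.

```python
import time
import tracemalloc
from statistics import mean
from typing import Any, Callable, Dict


def sketch_benchmark(func: Callable[..., Any], *args: Any, iterations: int = 5) -> Dict[str, float]:
    """Run func repeatedly and report timing and peak traced memory."""
    durations = []
    peak_bytes = 0
    for _ in range(iterations):
        tracemalloc.start()
        started = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - started)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak_bytes = max(peak_bytes, peak)
    return {
        "iterations": iterations,
        "mean_seconds": mean(durations),
        "min_seconds": min(durations),
        "peak_memory_mb": peak_bytes / (1024 * 1024),
    }


print(sketch_benchmark(lambda n: sorted(range(n, 0, -1)), 10_000, iterations=3))
```

Tracing allocations slows execution, so timings taken this way are best compared against each other rather than read as absolute figures.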
Calculates comprehensive complexity metrics.
Metrics:
- Cyclomatic complexity
- Cognitive complexity
- Structural complexity
- Maintainability index
- Technical debt assessment
calculate_maintainability_index(content: str, variables: List[Dict[str, Any]], connections: List[Dict[str, Any]]) -> float
Calculates maintainability index for the model.
Formula:
MI = 171 - 5.2 * ln(HV) - 0.23 * ln(CC) - 16.2 * ln(LOC)
Where:
- HV = Halstead Volume
- CC = Cyclomatic Complexity
- LOC = Lines of Code
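The formula as written above can be evaluated as in the sketch below. Inputs are clamped to 1 so every logarithm is defined; that clamping and the function name are assumptions of this example. Note that the commonly cited form of the maintainability index applies the 0.23 coefficient to CC directly rather than to ln(CC); the sketch follows the formula as stated here.

```python
import math


def sketch_maintainability_index(
    halstead_volume: float, cyclomatic_complexity: float, lines_of_code: int
) -> float:
    """Evaluate MI = 171 - 5.2*ln(HV) - 0.23*ln(CC) - 16.2*ln(LOC)."""
    # Clamp to 1 so that ln() is defined and contributes 0 for trivial inputs
    hv = max(float(halstead_volume), 1.0)
    cc = max(float(cyclomatic_complexity), 1.0)
    loc = max(float(lines_of_code), 1.0)
    return 171 - 5.2 * math.log(hv) - 0.23 * math.log(cc) - 16.2 * math.log(loc)


print(f"{sketch_maintainability_index(1000.0, 5.0, 200):.2f}")
```

Because LOC carries the largest coefficient, file length dominates the score: doubling LOC lowers the index by about 11.2 points (16.2 × ln 2), whatever the other inputs are.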
calculate_technical_debt(content: str, variables: List[Dict[str, Any]], connections: List[Dict[str, Any]]) -> float
Calculates technical debt for the model.
Factors:
- Code quality issues
- Complexity penalties
- Documentation gaps
- Testing coverage
- Performance bottlenecks
perform_model_comparisons(statistical_analyses: List[Dict[str, Any]], verbose: bool = False) -> Dict[str, Any]
Performs comparative analysis across multiple models.
Comparisons:
- Performance benchmarking
- Complexity comparison
- Quality assessment
- Feature analysis
- Best practices evaluation
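As one possible shape for the complexity comparison, the sketch below ranks models by a single complexity metric. It assumes each analysis dict carries a `complexity_metrics` mapping (as in the usage examples in this document) plus a hypothetical `model` key identifying the model; the function name and result layout are illustrative.

```python
from typing import Any, Dict, List


def sketch_compare_models(
    analyses: List[Dict[str, Any]], metric: str = "cyclomatic"
) -> Dict[str, Any]:
    """Rank models from lowest to highest value of one complexity metric."""
    scored = [
        (analysis["model"], float(analysis["complexity_metrics"][metric]))
        for analysis in analyses
        if metric in analysis.get("complexity_metrics", {})
    ]
    ranked = sorted(scored, key=lambda pair: pair[1])
    values = [value for _, value in ranked]
    return {
        "metric": metric,
        "ranking": [name for name, _ in ranked],
        "min": values[0] if values else None,
        "max": values[-1] if values else None,
        "mean": sum(values) / len(values) if values else None,
    }


analyses = [
    {"model": "large", "complexity_metrics": {"cyclomatic": 3.0}},
    {"model": "small", "complexity_metrics": {"cyclomatic": 1.0}},
]
print(sketch_compare_models(analyses))
```

Analyses that lack the requested metric are skipped rather than treated as zero, so a missing value cannot make a model look simpler than it is.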
Generates comprehensive analysis summary.
Content:
- Executive summary
- Key metrics
- Recommendations
- Risk assessment
- Improvement suggestions
from pathlib import Path

from analysis import perform_statistical_analysis
# Analyze a GNN model file
results = perform_statistical_analysis(
    file_path=Path("models/my_model.md"),
    verbose=True
)
print(f"Model complexity: {results['complexity_metrics']['cyclomatic']}")
print(f"Variable count: {results['statistics']['variable_count']}")
print(f"Connection count: {results['statistics']['connection_count']}")

from analysis import (
    extract_variables_for_analysis,
    extract_connections_for_analysis,
    calculate_variable_statistics,
    calculate_connection_statistics
)
# Extract and analyze components
variables = extract_variables_for_analysis(gnn_content)
connections = extract_connections_for_analysis(gnn_content)
# Calculate statistics
var_stats = calculate_variable_statistics(variables)
conn_stats = calculate_connection_statistics(connections)
print(f"Variable types: {var_stats['type_distribution']}")
print(f"Connection density: {conn_stats['density']}")

from pathlib import Path

from analysis import run_performance_benchmarks
# Run performance benchmarks
benchmarks = run_performance_benchmarks(
    file_path=Path("models/large_model.md"),
    verbose=True
)
print(f"Processing time: {benchmarks['processing_time']:.3f}s")
print(f"Memory usage: {benchmarks['memory_usage']:.2f}MB")
print(f"CPU utilization: {benchmarks['cpu_utilization']:.1f}%")

from analysis import (
    calculate_cyclomatic_complexity,
    calculate_cognitive_complexity,
    calculate_structural_complexity
)
# Calculate complexity metrics
cyclomatic = calculate_cyclomatic_complexity(variables, connections)
cognitive = calculate_cognitive_complexity(variables, connections)
structural = calculate_structural_complexity(variables, connections)
print(f"Cyclomatic complexity: {cyclomatic:.2f}")
print(f"Cognitive complexity: {cognitive:.2f}")
print(f"Structural complexity: {structural:.2f}")

from analysis import (
    calculate_maintainability_index,
    calculate_technical_debt
)
# Assess model quality
maintainability = calculate_maintainability_index(content, variables, connections)
tech_debt = calculate_technical_debt(content, variables, connections)
print(f"Maintainability index: {maintainability:.2f}")
print(f"Technical debt: {tech_debt:.2f}")

graph TD
Input[GNN Model] --> Extract[Data Extraction]
Extract --> Vars[Variables]
Extract --> Conns[Connections]
Extract --> Sections[Sections]
Vars & Conns & Sections --> Stats[Statistical Analysis]
Vars & Conns --> Complex[Complexity Assessment]
Stats --> StatsRep[Statistical Report]
Complex --> ComplexRep[Complexity Report]
Input --> Perf[Performance Benchmarks]
Input --> Quality[Quality Assessment]
Perf --> PerfRep[Performance Report]
Quality --> QualRep[Quality Report]
StatsRep & ComplexRep & PerfRep & QualRep --> Summary[Analysis Summary]
# Extract model components
variables = extract_variables_for_analysis(content)
connections = extract_connections_for_analysis(content)
sections = extract_sections_for_analysis(content)

# Calculate comprehensive statistics
var_stats = calculate_variable_statistics(variables)
conn_stats = calculate_connection_statistics(connections)
section_stats = calculate_section_statistics(sections)

# Assess model complexity
complexity_metrics = {
    'cyclomatic': calculate_cyclomatic_complexity(variables, connections),
    'cognitive': calculate_cognitive_complexity(variables, connections),
    'structural': calculate_structural_complexity(variables, connections)
}

# Evaluate performance characteristics
performance = run_performance_benchmarks(file_path)

# Assess model quality
quality_metrics = {
    'maintainability': calculate_maintainability_index(content, variables, connections),
    'technical_debt': calculate_technical_debt(content, variables, connections)
}

# Called from 16_analysis.py
def process_analysis(target_dir, output_dir, verbose=False, **kwargs):
    # Assumption: target_dir is a pathlib.Path and GNN files are Markdown files in it
    for file_path in sorted(target_dir.glob("*.md")):
        # Perform comprehensive analysis
        results = perform_statistical_analysis(file_path, verbose)
        # Generate analysis report
        summary = generate_analysis_summary(results)
        # Save results
        save_analysis_results(results, output_dir)
    return True

output/16_analysis_output/
├── statistical_analysis.json # Comprehensive analysis results
├── performance_benchmarks.json # Performance metrics
├── complexity_metrics.json # Complexity analysis
├── quality_assessment.json # Quality metrics
├── model_comparison.json # Comparative analysis
└── analysis_summary.md # Human-readable summary
- Variable Count: Total number of variables
- Connection Count: Total number of connections
- Type Distribution: Distribution of variable types
- Dimension Analysis: Variable dimension statistics
- Density Metrics: Connection density and patterns
- Cyclomatic Complexity: Graph-based complexity measure
- Cognitive Complexity: Human comprehension difficulty
- Structural Complexity: Model structure complexity
- Maintainability Index: Code maintainability score
- Technical Debt: Quality and maintainability debt
- Processing Time: Model processing duration
- Memory Usage: Memory consumption during processing
- CPU Utilization: CPU usage patterns
- I/O Performance: Input/output performance
- Scalability: Performance scaling characteristics
- Code Quality: Overall code quality assessment
- Documentation Coverage: Documentation completeness
- Testing Coverage: Test coverage metrics
- Best Practices: Adherence to best practices
- Risk Assessment: Potential risk factors
# Configuration options
config = {
    'verbose': True,              # Enable detailed logging
    'include_performance': True,  # Include performance analysis
    'include_complexity': True,   # Include complexity analysis
    'include_quality': True,      # Include quality assessment
    'benchmark_iterations': 5,    # Number of benchmark iterations
    'memory_profiling': True,     # Enable memory profiling
    'cpu_profiling': True         # Enable CPU profiling
}

# Define custom analysis metrics
custom_metrics = {
    'custom_complexity': lambda v, c: custom_complexity_calculation(v, c),
    'custom_quality': lambda content, v, c: custom_quality_assessment(content, v, c)
}

# Handle analysis failures gracefully
try:
    results = perform_statistical_analysis(file_path)
except AnalysisError as e:
    logger.error(f"Analysis failed: {e}")
    # Provide recovery analysis or error reporting

# Validate input data before analysis
if not validate_gnn_content(content):
    raise ValueError("Invalid GNN content for analysis")

- Caching: Cache analysis results for repeated analysis
- Parallel Processing: Use parallel processing for large models
- Memory Management: Optimize memory usage for large datasets
- Incremental Analysis: Support incremental analysis for large models
- Large Models: Handle models with thousands of variables
- Batch Processing: Process multiple models efficiently
- Resource Management: Manage CPU and memory resources
- Progress Tracking: Track analysis progress for long-running operations
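The caching strategy listed above could be realized in many ways; one simple option is to key results by a hash of the model content so that unchanged models are not re-analyzed. The class below is a hypothetical in-memory sketch, not the module's implementation, and `analyze` stands in for any analysis function that takes the model text.

```python
import hashlib
from typing import Any, Callable, Dict


class SketchAnalysisCache:
    """Memoize analysis results by the SHA-256 digest of the model content."""

    def __init__(self, analyze: Callable[[str], Any]) -> None:
        self._analyze = analyze
        self._results: Dict[str, Any] = {}
        self.hits = 0

    def get(self, content: str) -> Any:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key in self._results:
            self.hits += 1  # same content seen before, reuse the stored result
        else:
            self._results[key] = self._analyze(content)
        return self._results[key]


cache = SketchAnalysisCache(lambda text: {"line_count": len(text.splitlines())})
print(cache.get("A[3,3]\ns[3,1]"))
print(cache.get("A[3,3]\ns[3,1]"))  # served from the cache
print(cache.hits)
```

Hashing the content rather than the file path means a renamed but unchanged model still hits the cache, while any edit to the text forces a fresh analysis.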
# Test individual analysis functions
def test_variable_statistics():
    variables = extract_variables_for_analysis(test_content)
    stats = calculate_variable_statistics(variables)
    assert 'type_distribution' in stats
    assert 'count' in stats

# Test complete analysis pipeline
def test_analysis_pipeline():
    results = perform_statistical_analysis(test_file)
    assert 'statistics' in results
    assert 'complexity_metrics' in results
    assert 'performance_benchmarks' in results

- numpy: Numerical computations
- pandas: Data manipulation and analysis
- networkx: Graph analysis and metrics
- matplotlib: Statistical plotting
- scipy: Statistical functions
- psutil: System resource monitoring
- memory_profiler: Memory usage profiling
- line_profiler: Line-by-line profiling
- Small Models (< 100 variables): < 0.1 seconds
- Medium Models (100-1000 variables): 0.1-1.0 seconds
- Large Models (> 1000 variables): 1.0-10.0 seconds
- Base Memory: ~20MB
- Per Model: ~5-50MB depending on complexity
- Peak Memory: 1.5-2x base usage during analysis
Error: MemoryError during large model analysis
Solution: Enable memory optimization or process in chunks
Error: Analysis taking too long for large models
Solution: Enable parallel processing or use sampling
Error: Invalid GNN content for analysis
Solution: Validate input data before analysis
# Enable debug mode for detailed analysis
results = perform_statistical_analysis(file_path, verbose=True, debug=True)

- Machine Learning Analysis: ML-based model assessment
- Predictive Analytics: Performance prediction capabilities
- Real-time Analysis: Live analysis during model development
- Advanced Visualizations: Interactive analysis visualizations
- GPU Acceleration: GPU-accelerated analysis for large models
- Distributed Processing: Distributed analysis for very large models
- Streaming Analysis: Real-time streaming analysis capabilities
The Analysis module provides statistical analysis, performance profiling, and model evaluation for GNN models. It includes complexity metrics, quality assessment tools, and performance benchmarks to support Active Inference research and development.
This module is part of the GeneralizedNotationNotation project. See the main repository for license and citation information.
- Project overview: ../../README.md
- Comprehensive docs: ../../DOCS.md
- Architecture guide: ../../ARCHITECTURE.md
- Pipeline details: ../../doc/pipeline/README.md