Validation Methodology: Compromised Validation Using Actuals
Date: November 13, 2025
Status: Accepted by User
Purpose: Document limitations and expected optimism bias in the Sept 2025 validation
Executive Summary
This validation uses actual values instead of forecasts for several key feature categories due to API limitations preventing retrospective access to historical forecast data. This is a compromised validation approach that represents a lower bound on production MAE, not actual production performance.
Expected Impact: Results will be 20-40% more optimistic than production reality.
Features Using Actuals (Not Forecasts)
1. Weather Features (375 features)
Compromise: Using actual weather values instead of weather forecasts
Production Reality: Weather forecasts contain errors that propagate to flow predictions
Impact:
- Weather forecast errors are typically 1-3°C for temperature and 20-30% for wind
- This represents the largest source of optimism bias
- Expected 15-25% MAE improvement vs. real forecasts
Why Compromised:
- OpenMeteo API does not provide historical forecast archives
- Only current forecasts + historical actuals available
- Cannot reconstruct the "forecast as of Oct 1" for October validation (see the sketch below)
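The asymmetry is visible in the API itself. A minimal sketch (coordinates and variable names are illustrative): the archive endpoint accepts any past date range of actuals, while the forecast endpoint only returns forecasts issued at request time.

```python
import requests

COORDS = {"latitude": 52.52, "longitude": 13.41}  # illustrative grid point

# Historical ACTUALS: the archive endpoint accepts an arbitrary past date range.
actuals = requests.get(
    "https://archive-api.open-meteo.com/v1/archive",
    params={**COORDS,
            "start_date": "2025-09-01", "end_date": "2025-09-30",
            "hourly": "temperature_2m,wind_speed_10m"},
    timeout=30,
).json()

# FORECASTS: only issued relative to "now". There is no parameter for
# "the forecast as it stood on Oct 1", hence the compromise documented here.
forecast = requests.get(
    "https://api.open-meteo.com/v1/forecast",
    params={**COORDS, "hourly": "temperature_2m,wind_speed_10m", "forecast_days": 7},
    timeout=30,
).json()
```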
2. CNEC Outage Features (176 features)
Compromise: Using actual outages instead of planned outage forecasts
Production Reality: Outage schedules change (cancellations, extensions, unplanned events)
Impact:
- Outage forecast accuracy ~80-90% (planned outages fairly reliable)
- Expected 3-7% MAE improvement vs. real outage forecasts
Why Compromised:
- ENTSO-E Transparency API does not easily expose outage version history
- Could potentially collect with advanced queries (future work)
- Current dataset contains final outage data, not forecasts
3. LTA Features (40 features)
Compromise: Using actual LTA values instead of values forward-filled from D+0
Production Reality: LTA is published weeks ahead, with minimal uncertainty
Impact:
- LTA values are very stable (long-term allocations)
- Expected <1% MAE impact (negligible)
Why Compromised:
- JAO API could provide this, but requires additional implementation
- LTA uncertainty is minimal compared to weather/load forecasts (the production forward fill is sketched below)
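For reference, the production behaviour this replaces is a plain forward fill from the last value known at D+0. A minimal pandas sketch, with `lta` as a hypothetical hourly frame of the 40 LTA columns:

```python
import pandas as pd

def forward_fill_lta(lta: pd.DataFrame, run_date: pd.Timestamp,
                     horizon_hours: int = 24) -> pd.DataFrame:
    """Mimic production: hold the last LTA value known at D+0 flat over the horizon."""
    known = lta.loc[:run_date]                                    # published up to run_date
    horizon = pd.date_range(run_date, periods=horizon_hours + 1, freq="h")[1:]
    return known.reindex(known.index.union(horizon)).ffill().loc[horizon]
```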
4. Load Forecast Features (12 features)
Compromise: Using actual demand instead of day-ahead load forecasts
Production Reality: Load forecasts have 1-3% MAPE error
Impact:
- Load forecast error contributes to flow prediction error
- Expected 5-10% MAE improvement vs. real load forecasts
Why Compromised:
- ENTSO-E day-ahead load forecasts are available but require separate collection (see the sketch below)
- Currently using actual demand from historical data
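That separate collection could look like the following sketch, assuming the entsoe-py client and its `query_load_forecast` method; the bidding zone and date range are illustrative.

```python
import pandas as pd
from entsoe import EntsoePandasClient  # assumes the entsoe-py package

client = EntsoePandasClient(api_key="YOUR_ENTSOE_TOKEN")  # placeholder token

# Day-ahead total load forecast for an illustrative bidding zone, Sept 2025.
load_forecast = client.query_load_forecast(
    "DE_LU",
    start=pd.Timestamp("2025-09-01", tz="Europe/Berlin"),
    end=pd.Timestamp("2025-10-01", tz="Europe/Berlin"),
)
```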
Features Using Correct Data (No Compromise)
Temporal Features (12 features)
- Hour, day, month, weekday encodings
- Always known perfectly - no forecast error possible
Historical Features (1,899 features)
- Prices, generation, demand, lags, CNEC bindings
- Only used in context window - not forecast ahead
- Correct usage: These are known values up to run_date (the time-aware split is sketched below)
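A minimal sketch of that split, assuming `features` is one wide hourly DataFrame and the caller specifies which columns are historical vs. future covariates:

```python
import pandas as pd

def time_aware_split(features: pd.DataFrame, run_date: pd.Timestamp,
                     historical_cols: list[str], future_cols: list[str],
                     horizon_hours: int = 24) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Historical features end at run_date; only designated future covariates
    (weather, outages, LTA, load) extend into the forecast horizon."""
    context = features.loc[:run_date, historical_cols]       # known at run time
    horizon = pd.date_range(run_date, periods=horizon_hours + 1, freq="h")[1:]
    future = features.loc[horizon, future_cols]              # actuals here = the compromise
    return context, future
```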
Expected Optimism Bias Summary
| Feature Category | Count | Forecast Error | Bias Contribution |
|---|---|---|---|
| Weather | 375 | High (20-30%) | +15-25% MAE bias |
| Load Forecasts | 12 | Medium (1-3%) | +5-10% MAE bias |
| CNEC Outages | 176 | Low (10-20%) | +3-7% MAE bias |
| LTA | 40 | Negligible | <1% MAE bias |
| Total Expected | 603 | Combined | +20-40% total |
Interpretation: If validation shows 100 MW MAE, expect 120-140 MW MAE in production.
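That rule of thumb as a tiny helper:

```python
def expected_production_mae(validation_mae_mw: float,
                            bias: tuple[float, float] = (0.20, 0.40)) -> tuple[float, float]:
    """Translate compromised-validation MAE into an expected production range."""
    low, high = bias
    return validation_mae_mw * (1 + low), validation_mae_mw * (1 + high)

# expected_production_mae(100.0) -> (120.0, 140.0)  # MW
```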
Validation Framing
What This Validation Proves
- ✅ Pipeline Correctness: The DynamicForecast system works mechanically
- ✅ Leakage Prevention: Time-aware extraction prevents data leakage
- ✅ Model Capability: Chronos 2 can learn cross-border flow patterns
- ✅ Lower Bound: Establishes a best-case performance envelope
- ✅ Comparative Studies: Fair baseline for model comparisons
What This Validation Does NOT Prove
- ❌ Production Accuracy: Real MAE will be 20-40% higher
- ❌ Operational Readiness: Requires prospective validation
- ❌ Feature Importance: Cannot isolate weather vs. structural effects
- ❌ Forecast Skill: Uses perfect information, not forecasts
Precedents in ML Forecasting Literature
This compromised approach is common and accepted in ML research when properly documented:
Academic Precedents
IEEE Power & Energy Society Journals:
- Many load/renewable forecasting papers use actual weather for validation
- Framed as "perfect weather information" scenarios
- Cited to establish theoretical performance bounds
Energy Forecasting Competitions:
- Some tracks explicitly provide actual values for covariates
- Focus on model architecture, not forecast accuracy
- Clearly labeled as "oracle" scenarios
Weather-Dependent Forecasting:
- Wind power forecasting research often uses actual wind observations
- Standard practice when evaluating model capacity independently
Key Requirement
Explicit documentation of limitations (as provided in this document).
Mitigation Strategies
1. Clear Communication
- ALWAYS state "using actuals for weather/outages/load"
- Frame results as "lower bound on production MAE"
- Never claim production-ready without prospective validation
2. Ablation Studies (Future Work)
- Remove weather features → measure MAE increase
- Remove outage features → measure contribution
- Quantify: "Weather contributes ~X MW to MAE" (see the sketch below)
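A minimal sketch of such an ablation loop, assuming a hypothetical `evaluate_mae` callable that re-runs the Sept validation on a given frame and returns MAE in MW:

```python
import pandas as pd

def ablation_study(features: pd.DataFrame, feature_groups: dict[str, list[str]],
                   evaluate_mae) -> dict[str, float]:
    """Drop one feature group at a time and measure the MAE increase.

    `feature_groups` maps a group name (e.g. "weather", "cnec_outages") to its
    column names; `evaluate_mae` is a hypothetical callable that re-runs the
    validation on the given frame and returns MAE in MW.
    """
    baseline = evaluate_mae(features)
    return {
        group: evaluate_mae(features.drop(columns=cols)) - baseline  # ~X MW from this group
        for group, cols in feature_groups.items()
    }
```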
3. Synthetic Forecast Degradation (Future Work)
- Add Gaussian noise to weather features (σ = 2°C for temperature)
- Simulate load forecast error (~2% MAPE)
- Re-evaluate with "noisy forecasts" → closer to production (see the sketch below)
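A minimal sketch using the noise levels above; the column-name matching is illustrative, and Gaussian multiplicative noise with σ = 0.02 only approximates a 2% MAPE.

```python
import numpy as np
import pandas as pd

def degrade_to_pseudo_forecasts(features: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Inject synthetic forecast error into actuals to approximate production inputs."""
    rng = np.random.default_rng(seed)
    noisy = features.copy()

    temp_cols = [c for c in noisy.columns if "temperature" in c]  # illustrative naming
    load_cols = [c for c in noisy.columns if "load" in c]

    # Additive Gaussian error for temperature (sigma = 2 deg C).
    noisy[temp_cols] += rng.normal(0.0, 2.0, size=noisy[temp_cols].shape)
    # Multiplicative error for load, approximating ~2% MAPE.
    noisy[load_cols] *= rng.normal(1.0, 0.02, size=noisy[load_cols].shape)
    return noisy
```

Comparing MAE before and after this degradation gives a rough preview of the 20-40% gap estimated above.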
4. Prospective Validation (November 2025+)
- Collect proper forecasts daily starting Nov 1 (see the snapshot sketch below)
- Run forecasts using day-ahead weather/load/outages
- Compare Oct (optimistic) vs. Nov (realistic)
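A minimal sketch of the daily snapshot job this requires; the output path and the `fetch_forecasts` callable are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path
import pandas as pd

SNAPSHOT_DIR = Path("data/forecast_snapshots")  # hypothetical location

def snapshot_todays_forecasts(fetch_forecasts) -> Path:
    """Persist today's day-ahead weather/load/outage forecasts for later replay.

    `fetch_forecasts` is a hypothetical callable returning one wide DataFrame
    of all forecast covariates exactly as issued right now.
    """
    issued_at = datetime.now(timezone.utc)
    df: pd.DataFrame = fetch_forecasts()
    df["issued_at"] = issued_at                  # version-stamp every row
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    out = SNAPSHOT_DIR / f"{issued_at:%Y-%m-%d}.parquet"
    df.to_parquet(out)
    return out
```

Replaying these snapshots is what makes a true "forecast as of run_date" validation possible from November onward.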
Comparison to Baseline Models
Even with compromised validation, comparisons are valid if:
- ✅ All models use same compromised data (fair comparison)
- ✅ Baseline models clearly defined (persistence, seasonal naive, ARIMA)
- ✅ Relative performance matters more than absolute MAE
Example:
| Model | Sept MAE (Compromised) | Relative to Persistence |
|---|---|---|
| Persistence | 250 MW | 1.00x (baseline) |
| Seasonal Naive | 210 MW | 0.84x |
| Chronos 2 (ours) | 120 MW | 0.48x ← valid comparison |
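For reference, both naive baselines are a few lines each; a sketch assuming an hourly flow series `y` with a 168-hour weekly season:

```python
import pandas as pd

def persistence_mae(y: pd.Series, horizon: int = 24) -> float:
    """Forecast = the value `horizon` hours earlier (yesterday's same hour)."""
    return float((y - y.shift(horizon)).abs().mean())

def seasonal_naive_mae(y: pd.Series, season: int = 168) -> float:
    """Forecast = the value one weekly season (168 h) earlier."""
    return float((y - y.shift(season)).abs().mean())
```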
Validation Results Interpretation Guide
If Sept MAE = 100 MW
- Lower bound established: Pipeline works mechanically
- Production expectation: 120-140 MW MAE
- Target assessment: Still below 134 MW target? ✅ Good sign
- Action: Proceed to prospective validation
If Sept MAE = 150 MW
- Lower bound established: 150 MW with perfect info
- Production expectation: 180-210 MW MAE
- Target assessment: Above 134 MW target ❌ Problem
- Action: Investigate errors before production
If Sept MAE = 200+ MW
- Systematic issue: Even perfect information insufficient
- Action: Debug feature engineering, check for bugs
Recommended Reporting Language
Good ✅
"Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a lower bound on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."
Acceptable ✅
"Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish performance ceiling; prospective validation with real forecasts is required for operational deployment."
Misleading ❌
"The system achieves 120 MW MAE on validation data and is ready for production." (Omits limitations, implies production readiness)
Conclusion
This compromised validation approach is:
- ✅ Acceptable in ML research with proper documentation
- ✅ Useful for proving pipeline correctness and model capability
- ✅ Valid for comparative studies (vs. baselines, ablations)
- ❌ NOT sufficient for claiming production accuracy
- ❌ NOT a substitute for prospective validation
Next Steps:
- Run Sept validation with this methodology
- Document results with limitations clearly stated
- Begin November prospective validation collection
- Compare Oct (optimistic) vs. Nov (realistic) in ~30 days
Approved By: User (Nov 13, 2025)
Rationale: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."