# Validation Methodology: Compromised Validation Using Actuals
**Date**: November 13, 2025
**Status**: Accepted by User
**Purpose**: Document limitations and expected optimism bias in Sept 2025 validation
---
## Executive Summary
This validation uses **actual values instead of forecasts** for several key feature categories due to API limitations preventing retrospective access to historical forecast data. This is a **compromised validation** approach that represents a **lower bound** on production MAE, not actual production performance.
**Expected Impact**: Results will be **20-40% more optimistic** than production reality.
---
## Features Using Actuals (Not Forecasts)
### 1. Weather Features (375 features)
**Compromise**: Using actual weather values instead of weather forecasts
**Production Reality**: Weather forecasts contain errors that propagate to flow predictions
**Impact**:
- Weather forecast errors typically 1-3°C for temperature, 20-30% for wind
- This represents the **largest source of optimism bias**
- Expected 15-25% MAE improvement vs. real forecasts
**Why Compromised**:
- OpenMeteo API does not provide historical forecast archives
- Only current forecasts + historical actuals available
- Cannot reconstruct "forecast as of Oct 1" for October validation
### 2. CNEC Outage Features (176 features)
**Compromise**: Using actual outages instead of planned outage forecasts
**Production Reality**: Outage schedules change (cancellations, extensions, unplanned events)
**Impact**:
- Outage forecast accuracy ~80-90% (planned outages fairly reliable)
- Expected 3-7% MAE improvement vs. real outage forecasts
**Why Compromised**:
- ENTSO-E Transparency API does not easily expose outage version history
- Could potentially collect with advanced queries (future work)
- Current dataset contains final outage data, not forecasts
### 3. LTA Features (40 features)
**Compromise**: Using actual LTA values instead of values forward-filled from D+0
**Production Reality**: LTA published weeks ahead, minimal uncertainty
**Impact**:
- LTA values are very stable (long-term allocations)
- Expected <1% MAE impact (negligible)
**Why Compromised**:
- JAO API could provide this, but requires additional implementation
- LTA uncertainty minimal compared to weather/load forecasts
### 4. Load Forecast Features (12 features)
**Compromise**: Using actual demand instead of day-ahead load forecasts
**Production Reality**: Load forecasts typically have 1-3% MAPE
**Impact**:
- Load forecast error contributes to flow prediction error
- Expected 5-10% MAE improvement vs. real load forecasts
**Why Compromised**:
- ENTSO-E day-ahead load forecasts are available but require separate collection
- Currently using actual demand from historical data
---
## Features Using Correct Data (No Compromise)
### Temporal Features (12 features)
- Hour, day, month, weekday encodings
- **Always known perfectly** - no forecast error possible
### Historical Features (1,899 features)
- Prices, generation, demand, lags, CNEC bindings
- **Only used in context window** - not forecast ahead
- Correct usage: these values are known up to run_date (see the sketch below)
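As a minimal sketch of how this time-aware cut can be enforced (assuming a pandas DataFrame with a DatetimeIndex and a hypothetical `run_date`; column handling and the actual DynamicForecast interface may differ):
```python
import pandas as pd

def split_context_and_horizon(features: pd.DataFrame, run_date: pd.Timestamp,
                              horizon_hours: int = 24):
    """Split features so historical values are only visible up to run_date.

    Rows at or before run_date form the context window; rows after it form
    the forecast horizon and must only contain covariates known in advance
    (temporal encodings, forecasts), never realized historical values.
    """
    context = features.loc[:run_date]
    horizon = features.loc[run_date + pd.Timedelta(hours=1):
                           run_date + pd.Timedelta(hours=horizon_hours)]
    return context, horizon
```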
---
## Expected Optimism Bias Summary
| Feature Category | Count | Forecast Error | Bias Contribution |
|-----------------|-------|----------------|-------------------|
| Weather | 375 | High (20-30%) | +15-25% MAE bias |
| Load Forecasts | 12 | Medium (1-3%) | +5-10% MAE bias |
| CNEC Outages | 176 | Low (10-20%) | +3-7% MAE bias |
| LTA | 40 | Negligible | <1% MAE bias |
| **Total Expected** | **603** | **Combined** | **+20-40% total** |
**Interpretation**: If validation shows 100 MW MAE, expect **120-140 MW MAE in production**.
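As a rough illustration, treating the per-category biases as approximately additive (a simplifying assumption; the headline +20-40% rounds this combination):
```python
# Assumed-additive combination of per-category optimism bias ranges (fractions of MAE).
bias_ranges = {
    "weather": (0.15, 0.25),
    "load_forecasts": (0.05, 0.10),
    "cnec_outages": (0.03, 0.07),
    "lta": (0.00, 0.01),
}

low = sum(lo for lo, _ in bias_ranges.values())   # ~0.23
high = sum(hi for _, hi in bias_ranges.values())  # ~0.43

validation_mae = 100.0  # MW, hypothetical validation result
print(f"Expected production MAE: {validation_mae * (1 + low):.0f}"
      f"-{validation_mae * (1 + high):.0f} MW")   # roughly 123-143 MW
```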
---
## Validation Framing
### What This Validation Proves
- ✅ **Pipeline Correctness**: DynamicForecast system works mechanically
- ✅ **Leakage Prevention**: Time-aware extraction prevents data leakage
- ✅ **Model Capability**: Chronos 2 can learn cross-border flow patterns
- ✅ **Lower Bound**: Establishes best-case performance envelope
- ✅ **Comparative Studies**: Fair baseline for model comparisons
### What This Validation Does NOT Prove
- ❌ **Production Accuracy**: Real MAE will be 20-40% higher
- ❌ **Operational Readiness**: Requires prospective validation
- ❌ **Feature Importance**: Cannot isolate weather vs. structural effects
- ❌ **Forecast Skill**: Using perfect information, not forecasts
---
## Precedents in ML Forecasting Literature
This compromised approach is **common and accepted** in ML research when properly documented:
### Academic Precedents
1. **IEEE Power & Energy Society Journals**:
- Many load/renewable forecasting papers use actual weather for validation
- Framed as "perfect weather information" scenarios
- Cited to establish theoretical performance bounds
2. **Energy Forecasting Competitions**:
- Some tracks explicitly provide actual values for covariates
- Focus on model architecture, not forecast accuracy
- Clearly labeled as "oracle" scenarios
3. **Weather-Dependent Forecasting**:
- Wind power forecasting research often uses actual wind observations
- Standard practice when evaluating model capacity independently
### Key Requirement
**Explicit documentation** of limitations (as provided in this document).
---
## Mitigation Strategies
### 1. Clear Communication
- **ALWAYS** state "using actuals for weather/outages/load"
- Frame results as "lower bound on production MAE"
- Never claim production-ready without prospective validation
### 2. Ablation Studies (Future Work)
- Remove weather features → measure MAE increase
- Remove outage features → measure contribution
- Quantify: "Weather contributes ~X MW to MAE" (see the ablation sketch below)
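A minimal sketch of such an ablation loop; the feature-group column prefixes and the `evaluate_mae` callable are hypothetical placeholders, not the project's actual API:
```python
import pandas as pd

# Hypothetical column-name prefixes identifying each feature group.
FEATURE_GROUPS = {
    "weather": "weather_",
    "cnec_outages": "outage_",
    "lta": "lta_",
}

def ablation_study(features: pd.DataFrame, evaluate_mae) -> dict:
    """Measure how much each feature group contributes to accuracy.

    evaluate_mae(features) is assumed to run the full validation and
    return MAE in MW.
    """
    baseline = evaluate_mae(features)
    results = {"baseline_mae_mw": baseline}
    for name, prefix in FEATURE_GROUPS.items():
        ablated = features.drop(columns=[c for c in features.columns
                                         if c.startswith(prefix)])
        results[name] = evaluate_mae(ablated) - baseline  # MAE increase in MW
    return results
```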
### 3. Synthetic Forecast Degradation (Future Work)
- Add Gaussian noise to weather features (σ = 2°C for temperature)
- Simulate load forecast error (~2% MAPE)
- Re-evaluate with "noisy forecasts" → closer to production (see the sketch below)
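A minimal sketch of the degradation step, assuming temperature columns in °C and load columns in MW identified by hypothetical prefixes; the noise levels follow the bullets above:
```python
import numpy as np
import pandas as pd

def degrade_to_pseudo_forecasts(features: pd.DataFrame,
                                temp_sigma_c: float = 2.0,
                                load_rel_error: float = 0.02,
                                seed: int = 42) -> pd.DataFrame:
    """Add synthetic forecast error to actuals so they behave more like real forecasts."""
    rng = np.random.default_rng(seed)
    noisy = features.copy()

    temp_cols = [c for c in noisy.columns if c.startswith("weather_temp_")]  # hypothetical prefix
    load_cols = [c for c in noisy.columns if c.startswith("load_")]          # hypothetical prefix

    # Additive Gaussian error for temperature (sigma = 2 degC).
    noisy[temp_cols] += rng.normal(0.0, temp_sigma_c, size=noisy[temp_cols].shape)
    # Multiplicative Gaussian error for load (~2% relative, approximating a 2% MAPE).
    noisy[load_cols] *= 1.0 + rng.normal(0.0, load_rel_error, size=noisy[load_cols].shape)
    return noisy
```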
### 4. Prospective Validation (November 2025+)
- Collect proper forecasts daily starting Nov 1 (see the archiving sketch below)
- Run forecasts using day-ahead weather/load/outages
- Compare Oct (optimistic) vs. Nov (realistic)
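One simple way to support the prospective run is to snapshot each day's forecast inputs keyed by issue date. The sketch below assumes a hypothetical `fetch_forecasts` callable; in practice it would wrap the OpenMeteo/ENTSO-E/JAO collectors already in the pipeline:
```python
from datetime import date
from pathlib import Path

def archive_daily_forecasts(fetch_forecasts, out_dir: str = "forecast_snapshots") -> Path:
    """Store today's day-ahead forecasts so prospective validation can replay them later.

    fetch_forecasts() is assumed to return a DataFrame of weather/load/outage
    forecasts as available right now (the "as-of" view for today's run).
    """
    issue_date = date.today().isoformat()
    snapshot = fetch_forecasts()
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    file = path / f"forecasts_asof_{issue_date}.parquet"
    snapshot.to_parquet(file)
    return file
```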
---
## Comparison to Baseline Models
Even with compromised validation, comparisons are **valid** if:
- ✅ **All models use same compromised data** (fair comparison)
- ✅ **Baseline models clearly defined** (persistence, seasonal naive, ARIMA)
- ✅ **Relative performance** matters more than absolute MAE
Example:
```
Model | Sept MAE (Compromised) | Relative to Persistence
-------------------|------------------------|------------------------
Persistence | 250 MW | 1.00x (baseline)
Seasonal Naive | 210 MW | 0.84x
Chronos 2 (ours) | 120 MW | 0.48x ← Valid comparison
```
---
## Validation Results Interpretation Guide
### If Sept MAE = 100 MW
- **Lower bound established**: Pipeline works mechanically
- **Production expectation**: 120-140 MW MAE
- **Target assessment**: Still below 134 MW target? ✅ Good sign
- **Action**: Proceed to prospective validation
### If Sept MAE = 150 MW
- **Lower bound established**: 150 MW with perfect info
- **Production expectation**: 180-210 MW MAE
- **Target assessment**: Above 134 MW target ❌ Problem
- **Action**: Investigate errors before production
### If Sept MAE = 200+ MW
- **Systematic issue**: Even perfect information insufficient
- **Action**: Debug feature engineering, check for bugs
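The guide above can be codified in a small helper; the 1.2-1.4 multiplier and the 134 MW target come from this document, while the function itself is illustrative:
```python
def interpret_validation_mae(sept_mae_mw: float, target_mw: float = 134.0) -> str:
    """Translate a compromised-validation MAE into an expected production range and verdict."""
    prod_low, prod_high = sept_mae_mw * 1.2, sept_mae_mw * 1.4
    if sept_mae_mw >= 200:
        verdict = "systematic issue even with perfect information: debug features/pipeline"
    elif prod_low > target_mw:
        verdict = "expected production MAE above target: investigate before production"
    else:
        verdict = "within reach of target: proceed to prospective validation"
    return (f"Validation MAE {sept_mae_mw:.0f} MW -> expected production "
            f"{prod_low:.0f}-{prod_high:.0f} MW; {verdict}")
```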
---
## Recommended Reporting Language
### Good ✅
> "Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a **lower bound** on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."
### Acceptable ✅
> "Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish performance ceiling; prospective validation with real forecasts is required for operational deployment."
### Misleading ❌
> "The system achieves 120 MW MAE on validation data and is ready for production."
*(Omits limitations, implies production readiness)*
---
## Conclusion
This compromised validation approach is:
- ✅ **Acceptable** in ML research with proper documentation
- ✅ **Useful** for proving pipeline correctness and model capability
- ✅ **Valid** for comparative studies (vs. baselines, ablations)
- ❌ **NOT sufficient** for claiming production accuracy
- ❌ **NOT a substitute** for prospective validation
**Next Steps**:
1. Run Sept validation with this methodology
2. Document results with limitations clearly stated
3. Begin November prospective validation collection
4. Compare Oct (optimistic) vs. Nov (realistic) in ~30 days
---
**Approved By**: User (Nov 13, 2025)
**Rationale**: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."