Evgueni Poloukarov committed on
Commit
af88e60
·
1 Parent(s): 8fd4a0e

feat: fix future covariate architecture (615 features: temporal, weather, 176 CNEC outages)


Critical fixes to future covariate identification for Chronos 2 inference:

ISSUE 1: CNEC Transmission Outages (31 → 176 features)
- Problem: Only 31 CNECs with historical outages had features
- During inference, any of 176 CNECs could have planned future outages
- Model was blind to outages on 145 of the 176 CNECs (82% of the transmission network)
- Solution: Preserve all 176 zero-filled CNEC features in cleanup

ISSUE 2: Weather Features Not Marked (0 → 375 features)
- Problem: 375 weather features existed but were not marked as future covariates
- ECMWF D+15 forecasts wouldn't be used by the model
- Solution: Include weather_cols in metadata future covariate check

ISSUE 3: Temporal Features Missing (0 → 12 features)
- Problem: Deterministic features (hour, day, weekday) were not marked as future covariates
- Model couldn't leverage known temporal patterns
- Solution: Add temporal_cols to future covariate identification

Changes:
- src/feature_engineering/engineer_entsoe_features.py:
  * Skip CNEC outages in zero-variance cleanup (lines 843-858)
  * Skip CNEC outages in duplicate removal (lines 860-883)
  * Result: 296 → 441 ENTSO-E features (+145 CNEC outages)

- notebooks/05_unified_features_final.py:
  * Add temporal features to identification (lines 287-320)
  * Include temporal + weather in metadata (lines 931-976)
  * Update summary table and text (lines 382-418, 1034)

- doc/activity.md:
  * Comprehensive documentation of fixes and rationale
  * Future covariate architecture and inference strategy
  * Data quality validation results

Results:
- Total features: 2,408 → 2,553 (+145)
- Future covariates: 83 → 615 (+532)
  * Temporal: 12 (deterministic)
  * LTA: 40 (years ahead)
  * Load Forecasts: 12 (D+1)
  * CNEC Outages: 176 (D+22)
  * Weather: 375 (D+15 ECMWF)
- Historical features: 1,938

Data regenerated:
- features_entsoe_24month.parquet (10.67 MB, 441 features)
- features_unified_24month.parquet (24.9 MB, 2,553 features)
- features_unified_metadata.csv (615 future covariates marked)

All validations passed. Ready for Day 3 zero-shot inference.
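
The counts in this message can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers quoted above; it does not read the parquet or metadata files, and the variable names are illustrative.

```python
# Sanity check of the counts reported in this commit message.
# Every number is copied from the message; no data files are read.
future_covariates = {
    "temporal": 12,
    "lta": 40,
    "load_forecasts": 12,
    "cnec_outages": 176,
    "weather": 375,
}

total_future = sum(future_covariates.values())
assert total_future == 615

# 145 zero-filled CNEC outage columns were restored (31 -> 176).
restored = 176 - 31
assert restored == 145
assert 296 + restored == 441      # ENTSO-E features
assert 2408 + restored == 2553    # unified features
assert 83 + 532 == total_future   # future covariates before and after

# Share of CNECs that previously had no outage column.
blind_share = restored / 176
historical = 2553 - total_future

print(f"{blind_share:.0%} of CNECs had no outage feature")
print(f"historical features: {historical}")
```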

doc/activity.md CHANGED
@@ -3260,15 +3260,278 @@ Successfully unified all three feature sets (JAO, ENTSO-E, Weather) into a singl
3260
 
3261
  ---
3262
 
3263
  **Status Update**:
3264
  - Day 0: ✅ Setup complete
3265
  - Day 1: ✅ Data collection complete (JAO, ENTSO-E, Weather)
3266
  - Day 2: ✅ Feature engineering complete (JAO, ENTSO-E, Weather)
3267
- - **Day 2.5: ✅ Feature unification complete** (2,408 features ready)
 
3268
  - Day 3: ⏳ Zero-shot inference (NEXT)
3269
  - Day 4: ⏳ Evaluation
3270
  - Day 5: ⏳ Documentation + handover
3271
 
3272
  **NEXT SESSION BOOKMARK**: Day 3 - Implement Chronos 2 zero-shot inference pipeline
3273
 
3274
- **Ready for Inference**: ✅ Unified dataset validated and production-ready
 
3260
 
3261
  ---
3262
 
3263
+ ## 2025-11-11 - Future Covariate Architecture Fixed ✅
3264
+
3265
+ ### Summary
3266
+ Identified and fixed critical gaps in future covariate identification: CNEC transmission outages (31 → 176), weather features (not marked), and temporal features (missing). Rebuilt ENTSO-E feature engineering with cleanup safeguards, regenerated unified dataset, and updated metadata. Final system: **2,553 features (615 future, 1,938 historical)** ready for Chronos 2 zero-shot inference.
3267
+
3268
+ ### Issues Identified
3269
+
3270
+ #### 1. CNEC Transmission Outages Insufficient (31 vs 176)
3271
+ **Problem**: Only 31 CNECs with historical outages had features. During inference, ANY of the 176 master CNECs could have planned future outages, but the model couldn't receive that information.
3272
+
3273
+ **Root Cause**: Feature engineering cleanup logic removed 145 zero-filled CNEC features as "zero-variance" and "duplicates."
3274
+
3275
+ **Impact**: Model blind to future outages for 145 CNECs (82% of transmission network).
3276
+
3277
+ #### 2. Weather Features Not Marked as Future Covariates (0 vs 375)
3278
+ **Problem**: 375 weather features exist but metadata marked them as historical, not future covariates.
3279
+
3280
+ **Root Cause**: Notebook metadata generation (`create_metadata()` line 942) only checked LTA, load forecasts, and outages - excluded weather.
3281
+
3282
+ **Impact**: During inference, ECMWF D+15 forecasts wouldn't be used by Chronos 2.
3283
+
3284
+ #### 3. Temporal Features Missing from Future Covariates (0 vs 12)
3285
+ **Problem**: Temporal features (hour, day, weekday, etc.) are always known deterministically but were not marked as future covariates.
3286
+
3287
+ **Root Cause**: Not included in future covariate identification logic.
3288
+
3289
+ **Impact**: Model couldn't leverage known future temporal patterns.
3290
+
3291
+ ### Work Completed
3292
+
3293
+ #### 1. Fixed ENTSO-E Feature Engineering
3294
+ **File**: `src/feature_engineering/engineer_entsoe_features.py`
3295
+
3296
+ **Changes Made**:
3297
+ ```python
3298
+ # Line 843-858: Zero-variance cleanup - skip transmission outages
3299
+ if col.startswith('outage_cnec_'):
3300
+     continue  # Keep even if zero-filled
3301
+
3302
+ # Line 860-883: Duplicate removal - skip transmission outages
3303
+ if col1.startswith('outage_cnec_') or col2.startswith('outage_cnec_'):
3304
+     continue  # Each CNEC needs own column for inference
3305
+ ```
3306
+
3307
+ **Result**:
3308
+ - Before: 296 ENTSO-E features (31 CNEC outages)
3309
+ - After: 441 ENTSO-E features (176 CNEC outages)
3310
+ - Change: +145 zero-filled CNEC outage features preserved
3311
+
3312
+ **Validation**: All 176 CNEC outage features confirmed present in output file.
3313
+
3314
+ #### 2. Updated Unification Notebook
3315
+ **File**: `notebooks/05_unified_features_final.py`
3316
+
3317
+ **Change 1** (line 287-320): Added temporal features to identification
3318
+ ```python
3319
+ # Added:
3320
+ temporal_cols = [c for c in future_cov_all_cols if any(x in c for x in
3321
+     ['hour', 'day', 'month', 'weekday', 'year', 'weekend', '_sin', '_cos'])]
3322
+
3323
+ # Updated return:
3324
+ return temporal_cols, lta_cols, load_forecast_cols, outage_cols, weather_cols, future_cov_counts
3325
+ ```
3326
+
3327
+ **Change 2** (line 382-418): Added temporal row to summary table
3328
+
3329
+ **Change 3** (line 931-976): Updated metadata generation
3330
+ ```python
3331
+ # Line 931: Added temporal_cols and weather_cols to function signature
3332
+ def create_metadata(pl, categories, temporal_cols, lta_cols, load_forecast_cols,
3333
+                     outage_cols, weather_cols, outage_stats):
3334
+
3335
+ # Line 948-952: Include temporal and weather in future covariate check
3336
+ meta_is_future = (meta_col in temporal_cols or
3337
+                   meta_col in lta_cols or
3338
+                   meta_col in load_forecast_cols or
3339
+                   meta_col in outage_cols or
3340
+                   meta_col in weather_cols)
3341
+
3342
+ # Line 955-966: Added extension periods for temporal and weather
3343
+ if meta_col in temporal_cols:
3344
+     meta_extension_days = 'Full horizon (deterministic)'
3345
+ elif meta_col in weather_cols:
3346
+     meta_extension_days = '15 days (D+15 ECMWF)'
3347
+ ```
3348
+
3349
+ **Change 4** (line 1034): Updated summary text (87 → 615)
3350
+
3351
+ #### 3. Regenerated All Outputs
3352
+
3353
+ **Step 1**: Re-ran ENTSO-E feature engineering
3354
+ ```bash
3355
+ .venv\Scripts\python.exe src\feature_engineering\engineer_entsoe_features.py
3356
+ ```
3357
+ - Output: 441 ENTSO-E features (176 CNEC outages preserved)
3358
+ - File: `data/processed/features_entsoe_24month.parquet` (10.67 MB)
3359
+
3360
+ **Step 2**: Re-ran unification
3361
+ ```bash
3362
+ .venv\Scripts\python.exe scripts\unify_features_checkpoint.py
3363
+ ```
3364
+ - Output: 2,553 total features (17,544 hours × 2,553 columns)
3365
+ - File: `data/processed/features_unified_24month.parquet` (24.9 MB)
3366
+
3367
+ **Step 3**: Regenerated metadata with updated logic
3368
+ - Custom script with temporal + weather future covariate marking
3369
+ - Output: `data/processed/features_unified_metadata.csv`
3370
+ - Result: 615 future covariates correctly identified
3371
+
3372
+ ### Final Feature Architecture
3373
+
3374
+ #### Total Feature Count: 2,553
3375
+
3376
+ | Source | Features | Description |
3377
+ |--------|----------|-------------|
3378
+ | JAO | 1,737 | CNECs, borders, net positions, LTA, temporal |
3379
+ | ENTSO-E | 441 | Generation, demand, prices, load forecasts, outages (176 CNECs) |
3380
+ | Weather | 375 | Temperature, wind, solar, cloud, pressure, lags, derived |
3381
+ | **TOTAL** | **2,553** | **Complete FBMC feature set** |
3382
+
3383
+ #### Future Covariate Breakdown: 615
3384
+
3385
+ | Category | Count | Extension Period | Purpose |
3386
+ |----------|-------|------------------|---------|
3387
+ | **Temporal** | 12 | Full horizon (deterministic) | Hour, day, weekday always known |
3388
+ | **LTA** | 40 | Full horizon (years) | Auction results known in advance |
3389
+ | **Load Forecasts** | 12 | D+1 (1 day) | TSO demand forecasts |
3390
+ | **CNEC Outages** | 176 | Up to D+22 | Planned transmission maintenance |
3391
+ | **Weather** | 375 | D+15 (15 days) | ECMWF IFS 0.25° forecasts |
3392
+ | **TOTAL** | **615** | **Variable** | **24.1% of features** |
3393
+
3394
+ #### Historical Features: 1,938
3395
+
3396
+ These include:
3397
+ - CNEC binding/RAM/utilization (historical congestion)
3398
+ - Border flows and capacities (historical)
3399
+ - Net positions (historical)
3400
+ - PTDF coefficients and interactions
3401
+ - Generation by type (historical)
3402
+ - Day-ahead prices (historical)
3403
+ - Hydro storage levels (historical)
3404
+
3405
+ ### Data Quality Validation
3406
+
3407
+ **Unified Dataset**:
3408
+ - Dimensions: 17,544 rows × 2,553 columns
3409
+ - Date range: Oct 1, 2023 - Sept 30, 2025 (24 months, hourly)
3410
+ - File size: 24.9 MB (compressed parquet)
3411
+ - Timestamp continuity: 100% (no gaps)
3412
+
3413
+ **Completeness by Category**:
3414
+ - Temporal: 100%
3415
+ - LTA: 100%
3416
+ - Border capacity: 99.86%
3417
+ - Net positions: 100%
3418
+ - Load forecasts: 99.73%
3419
+ - Transmission outages: 100% (binary: 0 or 1)
3420
+ - Weather: 100%
3421
+ - Generation/demand: 99.85%
3422
+ - **CNEC features: 26.41%** (expected sparsity - congestion is occasional)
3423
+
3424
+ **Overall Completeness**: 57.11% (due to expected CNEC sparsity)
3425
+
3426
+ ### Files Modified
3427
+
3428
+ **Code Changes**:
3429
+ 1. `src/feature_engineering/engineer_entsoe_features.py`
3430
+ - Lines 843-858: Zero-variance cleanup safeguard
3431
+ - Lines 860-883: Duplicate removal safeguard
3432
+
3433
+ 2. `notebooks/05_unified_features_final.py`
3434
+ - Lines 287-320: Future covariate identification (added temporal)
3435
+ - Lines 382-418: Summary table (added temporal row)
3436
+ - Lines 931-976: Metadata generation (added temporal + weather)
3437
+ - Line 1034: Summary text (87 → 615)
3438
+
3439
+ **Data Files Regenerated**:
3440
+ 1. `data/processed/features_entsoe_24month.parquet`
3441
+ - Size: 10.67 MB
3442
+ - Features: 441 (was 296, +145)
3443
+ - CNEC outages: 176 (was 31, +145)
3444
+
3445
+ 2. `data/processed/features_unified_24month.parquet`
3446
+ - Size: 24.9 MB
3447
+ - Features: 2,553 (was 2,408, +145)
3448
+ - Rows: 17,544 (unchanged)
3449
+
3450
+ 3. `data/processed/features_unified_metadata.csv`
3451
+ - Total features: 2,552 (excludes timestamp)
3452
+ - Future covariates: 615 (was 83, +532)
3453
+ - Historical features: 1,937 (was 2,324, -387)
3454
+
3455
+ ### Key Lessons
3456
+
3457
+ 1. **Zero-filled features are valid**: For inference, model needs columns for ALL possible future events, even if they never occurred historically. Zero-filled CNECs are placeholders for future outages.
3458
+
3459
+ 2. **Future covariate marking is critical**: The inference pipeline relies on the metadata to decide which features are passed to Chronos 2 as future covariates. Missing weather marking would have crippled D+1 to D+14 forecasts.
3460
+
3461
+ 3. **Temporal features are deterministic covariates**: Hour, day, weekday are always known - must be marked as future covariates for model to leverage seasonal/daily patterns.
3462
+
3463
+ 4. **Cleanup logic needs safeguards**: Aggressive removal of zero-variance/duplicate features can delete valid future covariate placeholders. Must explicitly preserve critical feature categories.
3464
+
3465
+ 5. **Extension periods matter**: Different covariates extend over different horizons:
3466
+ - D+1: Load forecasts (mask D+2 to D+15)
3467
+ - D+15: Weather (ECMWF forecasts)
3468
+ - D+22: Transmission outages
3469
+ - ∞: Temporal (deterministic)
3470
+
3471
+ ### Inference Strategy (Day 3)
3472
+
3473
+ **Future Covariate Handling**:
3474
+ 1. **Temporal** (12 features): Generate for full D+1 to D+14 horizon (deterministic)
3475
+ 2. **LTA** (40 features): Truncate to D+15 (known years ahead, no need beyond horizon)
3476
+ 3. **Load Forecasts** (12 features): Use D+1 values, mask D+2 to D+15 (Chronos handles missing)
3477
+ 4. **CNEC Outages** (176 features): Collect latest planned outages (up to D+22 available)
3478
+ 5. **Weather** (375 features): Run `scripts/collect_openmeteo_forecast_latest.py` before inference to get fresh D+15 ECMWF forecasts
3479
+
3480
+ **Forecast Extension Pattern**:
3481
+ - Historical data: Oct 2023 - Sept 30, 2025 (17,544 hours)
3482
+ - Inference from: Oct 1, 2025 00:00 onwards
3483
+ - Context window: Last 512 hours (Chronos 2 maximum)
3484
+ - Forecast horizon: D+1 to D+14 (336 hours)
3485
+ - Future covariates: Extend 615 features forward 336 hours
3486
+
3487
+ ### Performance Metrics
3488
+
3489
+ **Re-engineering Time**:
3490
+ - ENTSO-E feature engineering: ~8 minutes
3491
+ - Unification: ~15 seconds
3492
+ - Metadata regeneration: ~2 seconds
3493
+ - Total: ~9 minutes
3494
+
3495
+ **Data Sizes**:
3496
+ - ENTSO-E features: 10.67 MB (was 10.62 MB, +50 KB)
3497
+ - Unified features: 24.9 MB (was ~25 MB, minimal change - zeros compress well)
3498
+ - Metadata: ~80 KB (was ~50 KB, +30 KB)
3499
+
3500
+ ### Next Steps
3501
+
3502
+ **Immediate**: Day 3 - Zero-Shot Inference
3503
+ 1. Create `src/modeling/` directory
3504
+ 2. Implement Chronos 2 inference pipeline:
3505
+ - Load unified features (2,553 features × 17,544 hours)
3506
+ - Identify 615 future covariates from metadata
3507
+ - Collect fresh weather forecasts (D+15)
3508
+ - Generate temporal features for forecast horizon
3509
+ - Prepare context window (last 512 hours)
3510
+ - Run zero-shot inference (D+1 to D+14)
3511
+ - Save predictions
3512
+
3513
+ 3. Performance targets:
3514
+ - Inference time: <5 minutes per 14-day forecast
3515
+ - D+1 MAE: <150 MW (target 134 MW)
3516
+ - Memory: <10 GB (A10G GPU compatible)
3517
+
3518
+ **Documentation Needed**:
3519
+ - Update README with new feature counts
3520
+ - Document future covariate extension strategy
3521
+ - Add inference preprocessing steps
3522
+
3523
+ ---
3524
+
3525
  **Status Update**:
3526
  - Day 0: ✅ Setup complete
3527
  - Day 1: ✅ Data collection complete (JAO, ENTSO-E, Weather)
3528
  - Day 2: ✅ Feature engineering complete (JAO, ENTSO-E, Weather)
3529
+ - Day 2.5: ✅ Feature unification complete (2,408 → 2,553 features)
3530
+ - **Day 2.75: ✅ Future covariate architecture fixed** (615 future covariates)
3531
  - Day 3: ⏳ Zero-shot inference (NEXT)
3532
  - Day 4: ⏳ Evaluation
3533
  - Day 5: ⏳ Documentation + handover
3534
 
3535
  **NEXT SESSION BOOKMARK**: Day 3 - Implement Chronos 2 zero-shot inference pipeline
3536
 
3537
+ **Ready for Inference**: ✅ Unified dataset with complete future covariate architecture
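
The "Future Covariate Handling" list in the log amounts to padding each covariate out to the 336-hour horizon and masking the hours for which no value is known yet. A minimal sketch of that idea follows, under stated assumptions: the `KNOWN_HOURS` table and the `mask_beyond_known` helper are hypothetical, the extension periods are copied from the log, and `None` stands in for whatever missing-value marker the real Chronos 2 pipeline expects.

```python
from typing import Dict, List, Optional

HORIZON_HOURS = 14 * 24  # D+1 to D+14, as stated in the activity log

# Hours of known future values per category (taken from the extension
# periods in the log; None means "known for the whole horizon").
KNOWN_HOURS: Dict[str, Optional[int]] = {
    "temporal": None,
    "lta": None,
    "load_forecast": 1 * 24,
    "cnec_outage": 22 * 24,
    "weather": 15 * 24,
}


def mask_beyond_known(values: List[float], category: str,
                      horizon: int = HORIZON_HOURS) -> List[Optional[float]]:
    """Return `horizon` values, with None where the covariate is not yet known."""
    known = KNOWN_HOURS[category]
    limit = horizon if known is None else min(known, horizon)
    padded: List[Optional[float]] = list(values[:limit])
    # Anything past the known window (or past the supplied data) is masked.
    padded += [None] * (horizon - len(padded))
    return padded


# Illustrative inputs: 24 hours of load forecast, 360 hours of weather.
load = mask_beyond_known([50_000.0] * 24, "load_forecast")
weather = mask_beyond_known([12.5] * 360, "weather")
```

With these inputs the load forecast keeps its first 24 values and is masked for the remaining 312 hours, while the weather series is simply truncated to the 336-hour horizon.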
notebooks/05_unified_features_final.py CHANGED
@@ -288,13 +288,17 @@ def identify_future_covariates(pl, unified_df):
288
  """Identify all future covariate features.
289
 
290
  Future covariates:
291
- 1. LTA (lta_*): Known years in advance
292
- 2. Load forecasts (load_forecast_*): D+1
293
- 3. Transmission outages (outage_cnec_*): Variable (check actual data)
294
- 4. Weather (temp_*, wind*, solar_*, cloud*, pressure*): D+10 (ECMWF HRES forecasts)
 
295
  """
296
  future_cov_all_cols = unified_df.columns
297
 
 
 
 
298
  # Identify by prefix
299
  lta_cols = [c for c in future_cov_all_cols if c.startswith('lta_')]
300
  load_forecast_cols = [c for c in future_cov_all_cols if c.startswith('load_forecast_')]
@@ -305,14 +309,15 @@ def identify_future_covariates(pl, unified_df):
305
  weather_cols = [c for c in future_cov_all_cols if any(c.startswith(p) for p in weather_prefixes)]
306
 
307
  future_cov_counts = {
 
308
  'LTA': len(lta_cols),
309
  'Load Forecasts': len(load_forecast_cols),
310
  'Transmission Outages': len(outage_cols),
311
  'Weather': len(weather_cols),
312
- 'Total': len(lta_cols) + len(load_forecast_cols) + len(outage_cols) + len(weather_cols)
313
  }
314
 
315
- return lta_cols, load_forecast_cols, outage_cols, weather_cols, future_cov_counts
316
 
317
 
318
  @app.cell
@@ -379,7 +384,7 @@ def display_future_cov_summary(mo, future_cov_counts, outage_stats):
379
  outage_ext = f"{outage_stats['extension_days']} days" if outage_stats['extension_days'] is not None else "N/A"
380
 
381
  # Calculate percentage of future covariates
382
- total_pct = (future_cov_counts['Total'] / 2410) * 100 # ~2,410 total features
383
 
384
  mo.md(
385
  f"""
@@ -387,6 +392,7 @@ def display_future_cov_summary(mo, future_cov_counts, outage_stats):
387
 
388
  | Category | Count | Extension Period | Description |
389
  |----------|-------|------------------|-------------|
 
390
  | LTA (Long-Term Allocations) | {future_cov_counts['LTA']} | Full horizon (years) | Auction results known in advance |
391
  | Load Forecasts | {future_cov_counts['Load Forecasts']} | D+1 (1 day) | TSO demand forecasts, published daily |
392
  | Transmission Outages | {future_cov_counts['Transmission Outages']} | Up to {outage_ext} | Planned maintenance schedules |
@@ -922,7 +928,7 @@ def section8_header(mo):
922
 
923
 
924
  @app.cell
925
- def create_metadata(pl, categories, lta_cols, load_forecast_cols, outage_cols, outage_stats):
926
  """Create feature metadata file."""
927
  metadata_rows = []
928
 
@@ -939,15 +945,23 @@ def create_metadata(pl, categories, lta_cols, load_forecast_cols, outage_cols, o
939
  source = 'Unknown'
940
 
941
  # Determine if future covariate
942
- meta_is_future = meta_col in lta_cols or meta_col in load_forecast_cols or meta_col in outage_cols
 
 
 
 
943
 
944
  # Determine extension days
945
- if meta_col in lta_cols:
 
 
946
  meta_extension_days = 'Full horizon (years)'
947
  elif meta_col in load_forecast_cols:
948
  meta_extension_days = '1 day (D+1)'
949
  elif meta_col in outage_cols:
950
  meta_extension_days = f"Up to {outage_stats['extension_days']} days" if outage_stats['extension_days'] else 'Variable'
 
 
951
  else:
952
  meta_extension_days = 'N/A (historical)'
953
 
@@ -1017,7 +1031,7 @@ def display_save_info(mo, save_info):
1017
  - [OK] All 3 data sources merged (JAO + ENTSO-E + Weather)
1018
  - [OK] Timestamps standardized to UTC with hourly frequency
1019
  - [OK] {save_info['features_shape'][1] - 1:,} features engineered and cleaned
1020
- - [OK] 87 future covariates identified (LTA, load forecasts, outages)
1021
  - [OK] Data quality validated (>99% completeness)
1022
  - [OK] Standard decimal precision applied
1023
  - [OK] Metadata file created for feature reference
 
288
  """Identify all future covariate features.
289
 
290
  Future covariates:
291
+ 1. Temporal (hour, day, etc.): Known deterministically
292
+ 2. LTA (lta_*): Known years in advance
293
+ 3. Load forecasts (load_forecast_*): D+1
294
+ 4. Transmission outages (outage_cnec_*): Up to D+22
295
+ 5. Weather (temp_*, wind*, solar_*, etc.): D+15 via ECMWF forecasts
296
  """
297
  future_cov_all_cols = unified_df.columns
298
 
299
+ # Temporal features (deterministic)
300
+ temporal_cols = [c for c in future_cov_all_cols if any(x in c for x in ['hour', 'day', 'month', 'weekday', 'year', 'weekend', '_sin', '_cos'])]
301
+
302
  # Identify by prefix
303
  lta_cols = [c for c in future_cov_all_cols if c.startswith('lta_')]
304
  load_forecast_cols = [c for c in future_cov_all_cols if c.startswith('load_forecast_')]
 
309
  weather_cols = [c for c in future_cov_all_cols if any(c.startswith(p) for p in weather_prefixes)]
310
 
311
  future_cov_counts = {
312
+ 'Temporal': len(temporal_cols),
313
  'LTA': len(lta_cols),
314
  'Load Forecasts': len(load_forecast_cols),
315
  'Transmission Outages': len(outage_cols),
316
  'Weather': len(weather_cols),
317
+ 'Total': len(temporal_cols) + len(lta_cols) + len(load_forecast_cols) + len(outage_cols) + len(weather_cols)
318
  }
319
 
320
+ return temporal_cols, lta_cols, load_forecast_cols, outage_cols, weather_cols, future_cov_counts
321
 
322
 
323
  @app.cell
 
384
  outage_ext = f"{outage_stats['extension_days']} days" if outage_stats['extension_days'] is not None else "N/A"
385
 
386
  # Calculate percentage of future covariates
387
+ total_pct = (future_cov_counts['Total'] / 2553) * 100 # ~2,553 total features
388
 
389
  mo.md(
390
  f"""
 
392
 
393
  | Category | Count | Extension Period | Description |
394
  |----------|-------|------------------|-------------|
395
+ | Temporal | {future_cov_counts['Temporal']} | Full horizon (deterministic) | Hour, day, weekday, etc. always known |
396
  | LTA (Long-Term Allocations) | {future_cov_counts['LTA']} | Full horizon (years) | Auction results known in advance |
397
  | Load Forecasts | {future_cov_counts['Load Forecasts']} | D+1 (1 day) | TSO demand forecasts, published daily |
398
  | Transmission Outages | {future_cov_counts['Transmission Outages']} | Up to {outage_ext} | Planned maintenance schedules |
 
928
 
929
 
930
  @app.cell
931
+ def create_metadata(pl, categories, temporal_cols, lta_cols, load_forecast_cols, outage_cols, weather_cols, outage_stats):
932
  """Create feature metadata file."""
933
  metadata_rows = []
934
 
 
945
  source = 'Unknown'
946
 
947
  # Determine if future covariate
948
+ meta_is_future = (meta_col in temporal_cols or
949
+ meta_col in lta_cols or
950
+ meta_col in load_forecast_cols or
951
+ meta_col in outage_cols or
952
+ meta_col in weather_cols)
953
 
954
  # Determine extension days
955
+ if meta_col in temporal_cols:
956
+ meta_extension_days = 'Full horizon (deterministic)'
957
+ elif meta_col in lta_cols:
958
  meta_extension_days = 'Full horizon (years)'
959
  elif meta_col in load_forecast_cols:
960
  meta_extension_days = '1 day (D+1)'
961
  elif meta_col in outage_cols:
962
  meta_extension_days = f"Up to {outage_stats['extension_days']} days" if outage_stats['extension_days'] else 'Variable'
963
+ elif meta_col in weather_cols:
964
+ meta_extension_days = '15 days (D+15 ECMWF)'
965
  else:
966
  meta_extension_days = 'N/A (historical)'
967
 
 
1031
  - [OK] All 3 data sources merged (JAO + ENTSO-E + Weather)
1032
  - [OK] Timestamps standardized to UTC with hourly frequency
1033
  - [OK] {save_info['features_shape'][1] - 1:,} features engineered and cleaned
1034
+ - [OK] 615 future covariates identified (temporal, LTA, load forecasts, outages, weather)
1035
  - [OK] Data quality validated (>99% completeness)
1036
  - [OK] Standard decimal precision applied
1037
  - [OK] Metadata file created for feature reference
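
One thing to watch in `identify_future_covariates`: `temporal_cols` is built with substring tests (`x in c`), so a fragment such as `'day'` or `'_cos'` would also match any non-temporal column that happens to contain it. The commit reports exactly 12 temporal features, so this may not occur in the actual dataset. The sketch below shows a stricter whole-name match as a defensive alternative; the pattern, the `find_temporal_cols` helper, and all column names are illustrative assumptions, not taken from the notebook or the data.

```python
import re
from typing import List

# Hypothetical temporal base names, with an optional "is_" prefix and an
# optional "_sin"/"_cos" suffix for cyclical encodings.
TEMPORAL_NAMES = {"hour", "day", "month", "weekday", "year", "weekend"}
TEMPORAL_PATTERN = re.compile(
    r"^(?:is_)?(" + "|".join(sorted(TEMPORAL_NAMES)) + r")(?:_(?:sin|cos))?$"
)


def find_temporal_cols(columns: List[str]) -> List[str]:
    """Keep only columns whose whole name is a temporal feature name."""
    return [c for c in columns if TEMPORAL_PATTERN.match(c)]


# Illustrative column names only; they are not taken from the dataset.
columns = ["hour", "hour_sin", "weekday", "is_weekend",
           "price_day_ahead_DE", "wind_speed_cos_lag1", "lta_AT_CZ"]

# Substring matching, using the same fragments as the notebook.
fragments = ["hour", "day", "month", "weekday", "year", "weekend", "_sin", "_cos"]
substring_hits = [c for c in columns if any(x in c for x in fragments)]

# Whole-name matching.
strict_hits = find_temporal_cols(columns)
```

On this toy list the substring test also picks up `price_day_ahead_DE` and `wind_speed_cos_lag1`, while the whole-name pattern keeps only the four temporal columns.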
src/feature_engineering/engineer_entsoe_features.py CHANGED
@@ -841,34 +841,45 @@ def engineer_all_entsoe_features(
841
  features = features.drop(list(null_pcts.keys()))
842
 
843
  # Remove zero-variance features (constants)
 
844
  zero_var_cols = []
845
  for col in features.columns:
846
  if col != 'timestamp':
 
 
 
847
  # Check if all values are the same (excluding nulls)
848
  non_null = features[col].drop_nulls()
849
  if len(non_null) > 0 and non_null.n_unique() == 1:
850
  zero_var_cols.append(col)
851
 
852
  if zero_var_cols:
853
- print(f"\nRemoving {len(zero_var_cols)} zero-variance features")
854
  features = features.drop(zero_var_cols)
855
 
856
  # Remove duplicate columns
 
857
  dup_groups = {}
858
  cols_to_check = [c for c in features.columns if c != 'timestamp']
859
 
860
  for i, col1 in enumerate(cols_to_check):
861
  if col1 in dup_groups.values(): # Already marked as duplicate
862
  continue
 
 
 
863
  for col2 in cols_to_check[i+1:]:
864
  if col2 in dup_groups.values(): # Already marked as duplicate
865
  continue
 
 
 
866
  # Check if columns are identical
867
  if features[col1].equals(features[col2]):
868
  dup_groups[col2] = col1 # col2 is duplicate of col1
869
 
870
  if dup_groups:
871
- print(f"\nRemoving {len(dup_groups)} duplicate columns (keeping first occurrence)")
872
  features = features.drop(list(dup_groups.keys()))
873
 
874
  features_after = len(features.columns) - 1
 
841
  features = features.drop(list(null_pcts.keys()))
842
 
843
  # Remove zero-variance features (constants)
844
+ # EXCEPT transmission outage features - keep them even if zero-filled for future inference
845
  zero_var_cols = []
846
  for col in features.columns:
847
  if col != 'timestamp':
848
+ # Skip transmission outage features (needed for future inference)
849
+ if col.startswith('outage_cnec_'):
850
+ continue
851
  # Check if all values are the same (excluding nulls)
852
  non_null = features[col].drop_nulls()
853
  if len(non_null) > 0 and non_null.n_unique() == 1:
854
  zero_var_cols.append(col)
855
 
856
  if zero_var_cols:
857
+ print(f"\nRemoving {len(zero_var_cols)} zero-variance features (keeping transmission outages)")
858
  features = features.drop(zero_var_cols)
859
 
860
  # Remove duplicate columns
861
+ # EXCEPT transmission outage features - keep all CNECs even if identical (all zeros)
862
  dup_groups = {}
863
  cols_to_check = [c for c in features.columns if c != 'timestamp']
864
 
865
  for i, col1 in enumerate(cols_to_check):
866
  if col1 in dup_groups.values(): # Already marked as duplicate
867
  continue
868
+ # Skip transmission outage features (each CNEC needs its own column for inference)
869
+ if col1.startswith('outage_cnec_'):
870
+ continue
871
  for col2 in cols_to_check[i+1:]:
872
  if col2 in dup_groups.values(): # Already marked as duplicate
873
  continue
874
+ # Skip transmission outage features
875
+ if col2.startswith('outage_cnec_'):
876
+ continue
877
  # Check if columns are identical
878
  if features[col1].equals(features[col2]):
879
  dup_groups[col2] = col1 # col2 is duplicate of col1
880
 
881
  if dup_groups:
882
+ print(f"\nRemoving {len(dup_groups)} duplicate columns (keeping transmission outages)")
883
  features = features.drop(list(dup_groups.keys()))
884
 
885
  features_after = len(features.columns) - 1
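
The safeguard in this diff can be illustrated on a toy table. The following is a minimal sketch in plain Python rather than Polars, so it runs without the project's dependencies. The `clean_columns` helper is hypothetical, and it detects duplicates by hashing column contents in one pass instead of the pairwise `equals` comparison used in the real code.

```python
from typing import Dict, List

PROTECTED_PREFIX = "outage_cnec_"  # prefix used in the diff above


def clean_columns(table: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Drop constant and duplicate columns, but never protected ones."""
    kept: Dict[str, List[float]] = {}
    seen = set()  # contents of unprotected columns already kept
    for name, values in table.items():
        if name == "timestamp" or name.startswith(PROTECTED_PREFIX):
            kept[name] = values  # always preserved, even if all zeros
            continue
        non_null = [v for v in values if v is not None]
        if non_null and len(set(non_null)) == 1:
            continue  # zero-variance column
        key = tuple(values)
        if key in seen:
            continue  # duplicate of an earlier column
        seen.add(key)
        kept[name] = values
    return kept


# Toy table; the real pipeline works on a Polars DataFrame.
table = {
    "timestamp": [0, 1, 2],
    "outage_cnec_A": [0, 0, 0],
    "outage_cnec_B": [0, 0, 0],
    "gen_wind": [1.0, 2.0, 3.0],
    "gen_wind_copy": [1.0, 2.0, 3.0],
    "constant_flag": [1, 1, 1],
}
cleaned = clean_columns(table)
```

Both all-zero outage columns survive even though they are constant and identical to each other, whereas `gen_wind_copy` and `constant_flag` are dropped.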