Spaces:
Sleeping
Sleeping
Evgueni Poloukarov
feat: complete Marimo data exploration notebook with FBMC methodology documentation
82da022
| # JAO Data Treatment & Feature Extraction Plan | |
| ## Complete Guide for FBMC Flow Forecasting MVP | |
| **Last Updated:** October 29, 2025 | |
| **Project:** FBMC Zero-Shot Forecasting with Chronos 2 | |
| **Timeline:** 5-Day MVP | |
| **Version:** 5.0 - OUTAGE DURATION FEATURES (Added temporal outage metrics for enhanced forecasting) | |
| --- | |
| ## Table of Contents | |
| 1. [Data Acquisition Strategy](#1-data-acquisition-strategy) | |
| 2. [JAO Data Series Catalog](#2-jao-data-series-catalog) | |
| 3. [Data Cleaning Procedures](#3-data-cleaning-procedures) | |
| 4. [CNEC Treatment](#4-cnec-treatment) | |
| 5. [PTDF Treatment](#5-ptdf-treatment) | |
| 6. [RAM Treatment](#6-ram-treatment) | |
| 7. [Shadow Prices Treatment](#7-shadow-prices-treatment) | |
| 8. [Feature Engineering Pipeline](#8-feature-engineering-pipeline) | |
| 9. [Quality Assurance](#9-quality-assurance) | |
| --- | |
| ## 1. Data Acquisition Strategy | |
| ### 1.1 Timeline and Scope | |
| **Historical Data Period:** October 2023 - September 2025 (24 months) | |
| - Purpose: Feature baselines and historical context | |
| - Total: ~17,520 hourly records per series | |
| **Collection Priority:** | |
| 1. **Day 1 Morning (2 hours):** Max BEX files (TARGET VARIABLE - highest priority!) | |
| 2. **Day 1 Morning (2 hours):** CNEC files (critical for constraint identification) | |
| 3. **Day 1 Morning (2 hours):** PTDF matrices (network sensitivity) | |
| 4. **Day 1 Afternoon (1.5 hours):** LTN values (future covariates) | |
| 5. **Day 1 Afternoon (1.5 hours):** Min/Max Net Positions (domain boundaries) | |
| 6. **Day 1 Afternoon (1.5 hours):** ATC Non-Core borders (loop flow drivers) | |
| 7. **Day 1 Afternoon (1.5 hours):** RAM values (capacity margins) | |
| 8. **Day 1 Afternoon (1.5 hours):** Shadow prices (economic signals) | |
| ### 1.2 jao-py Python Library Usage | |
| **Installation & Setup:** | |
| ```bash | |
| # Install jao-py using uv package manager | |
| .venv/Scripts/uv.exe pip install jao-py>=0.6.2 | |
| # Verify installation | |
| .venv/Scripts/python.exe -c "from jao import JaoPublicationToolPandasClient; print('jao-py installed successfully')" | |
| ``` | |
| **Python Data Collection:** | |
| ```python | |
| import pandas as pd | |
| from jao import JaoPublicationToolPandasClient | |
| import time | |
| # Initialize client (no API key required for public data) | |
| client = JaoPublicationToolPandasClient() | |
| # Define date range (24 months: Oct 2023 - Sept 2025) | |
| # IMPORTANT: jao-py requires pandas Timestamp with timezone (UTC) | |
| start_date = pd.Timestamp('2023-10-01', tz='UTC') | |
| end_date = pd.Timestamp('2025-09-30', tz='UTC') | |
| # Collect MaxBEX data (TARGET VARIABLE) - day by day with rate limiting | |
| maxbex_data = [] | |
| current_date = start_date | |
| while current_date <= end_date: | |
| df = client.query_maxbex(current_date) | |
| maxbex_data.append(df) | |
| current_date += pd.Timedelta(days=1) | |
| time.sleep(5) # Rate limiting: 5 seconds between requests | |
| # Combine and save | |
| maxbex_df = pd.concat(maxbex_data) | |
| maxbex_df.to_parquet('data/raw/jao/maxbex_2023_2025.parquet') | |
| # Collect Active Constraints (CNECs + PTDFs in ONE call!) | |
| cnec_data = [] | |
| current_date = start_date | |
| while current_date <= end_date: | |
| df = client.query_active_constraints(current_date) | |
| cnec_data.append(df) | |
| current_date += pd.Timedelta(days=1) | |
| time.sleep(5) # Rate limiting: 5 seconds between requests | |
| # Combine and save | |
| cnec_df = pd.concat(cnec_data) | |
| cnec_df.to_parquet('data/raw/jao/cnecs_ptdfs_2023_2025.parquet') | |
| ``` | |
| **Key Methods Available:** | |
| - `query_maxbex(day)` - Maximum Bilateral Exchange capacity (TARGET) | |
| - `query_active_constraints(day)` - CNECs with shadow prices, RAM, and PTDFs | |
| - `query_lta(d_from, d_to)` - Long Term Allocations (perfect future covariate) | |
| - `query_minmax_np(day)` - Min/Max Net Positions (domain boundaries) | |
| - `query_net_position(day)` - Actual net positions | |
| - `query_monitoring(day)` - Monitoring data (additional RAM/shadow prices) | |
| ``` | |
| **Expected Output Files:** | |
| ``` | |
| data/raw/jao/ | |
| ââ€Âωâ€Âۉâ€Â€ cnecs_2024_2025.parquet (~500 MB) | |
| ââ€Âωâ€Âۉâ€Â€ ptdfs_2024_2025.parquet (~800 MB) | |
| ââ€Âωâ€Âۉâ€Â€ rams_2024_2025.parquet (~400 MB) | |
| ââ€Âωâ€Âۉâ€Â€ shadow_prices_2024_2025.parquet (~300 MB) | |
| ââ€Âωâ€Âۉâ€Â€ presolved_2024_2025.parquet (~200 MB) | |
| ââ€Âωâ€Âۉâ€Â€ d2cf_2024_2025.parquet (~600 MB - OPTIONAL) | |
| ââ€Ââ€Âââ€Âۉâ€Â€ metadata/ | |
| ââ€Âωâ€Âۉâ€Â€ cnec_definitions.parquet | |
| ââ€Âωâ€Âۉâ€Â€ zone_ptdf_mapping.parquet | |
| ââ€Ââ€Âââ€Âۉâ€Â€ download_log.json | |
| ``` | |
| --- | |
| ## 2. JAO Data Series Catalog | |
| ### 2.1 Max BEX Data Series (TARGET VARIABLE) | |
| **File:** `Core_DA_Results_[date].xml` or `Core_Max_Exchanges_[date].xml` | |
| **CRITICAL:** This is the **TARGET VARIABLE** we are forecasting - without Max BEX, the model cannot be trained or evaluated! | |
| | Field Name | Data Type | Description | Use in Features | Missing Value Strategy | | |
| |------------|-----------|-------------|-----------------|----------------------| | |
| | `border_id` | String | Border identifier (e.g., "DE-FR", "DE-NL") | Border tracking | **REQUIRED** | | |
| | `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | | |
| | `max_bex` | Float (MW) | Maximum Bilateral Exchange Capacity | **TARGET VARIABLE** | **CRITICAL - must have** | | |
| | `direction` | String | Direction of capacity (e.g., "DE→FR") | Directional features | Parse from border | | |
| **Description:** | |
| Max BEX represents the **actual available cross-border capacity** after FBMC optimization. It is the OUTPUT of the FBMC calculation process and represents how much capacity is actually available for bilateral exchanges after accounting for: | |
| - All CNEC constraints (RAM limits) | |
| - PTDF sensitivities | |
| - LTN allocations (capacity already committed) | |
| - Remedial actions | |
| - All network constraints | |
| **IMPORTANT: Commercial vs Physical Capacity** | |
| - MaxBEX = **commercial hub-to-hub trading capacity**, NOT physical interconnector ratings | |
| - FBMC Core has 12 bidding zones: AT, BE, CZ, DE-LU, FR, HR, HU, NL, PL, RO, SI, SK | |
| - **All 132 zone-pair combinations exist** (12 × 11 bidirectional pairs) | |
| - This includes "virtual borders" - zone pairs without direct physical interconnectors | |
| - Example: FR→HU capacity exists despite no physical FR-HU interconnector | |
| - Power flows through AC grid network via DE, AT, CZ | |
| - PTDFs quantify how this exchange affects every CNEC in the network | |
| - MaxBEX = optimization result considering ALL network constraints | |
| **Borders to Collect (ALL 132 zone pairs):** | |
| - JAO provides MaxBEX for all 12 × 11 = 132 bidirectional zone-pair combinations | |
| - Includes both physical borders (e.g., DE→FR, AT→CZ) and virtual borders (e.g., FR→HU, BE→PL) | |
| - Each direction (A→B vs B→A) has independent capacity values | |
| - Full list: All permutations of (AT, BE, CZ, DE, FR, HR, HU, NL, PL, RO, SI, SK) | |
| - Example physical borders: DE→FR, DE→NL, AT→CZ, BE→NL, FR→BE | |
| - Example virtual borders: FR→HU, AT→HR, BE→RO, NL→SK, CZ→HR | |
| **Collection Priority:** DAY 1 MORNING (FIRST PRIORITY) | |
| This is the ground truth for training and validation - collect before any other data series! | |
| **Storage Schema:** | |
| ```python | |
| max_bex_schema = { | |
| 'timestamp': pl.Datetime, | |
| 'border_id': pl.Utf8, | |
| 'max_bex': pl.Float32, # TARGET VARIABLE | |
| 'direction': pl.Utf8 | |
| } | |
| ``` | |
| **Expected File Size:** ~1.2-1.5 GB for 24 months × 132 zone pairs × 17,520 hours (actual: ~200 MB compressed as Parquet) | |
| --- | |
| ### 2.2 LTN Data Series (Long Term Nominations) | |
| **File:** `Core_LTN_[date].xml` or available through `Core_ltn.R` JAOPuTo function | |
| **CRITICAL:** LTN values are **PERFECT FUTURE COVARIATES** because they result from yearly/monthly capacity auctions and are known with certainty for the forecast horizon! | |
| | Field Name | Data Type | Description | Use in Features | Missing Value Strategy | | |
| |------------|-----------|-------------|-----------------|----------------------| | |
| | `border_id` | String | Border identifier (e.g., "DE-FR", "SI-HR") | Border tracking | **REQUIRED** | | |
| | `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | | |
| | `ltn_allocated` | Float (MW) | Capacity already allocated in long-term auctions | **INPUT FEATURE & FUTURE COVARIATE** | 0.0 (no LT allocation) | | |
| | `direction` | String | Direction of nomination | Directional features | Parse from border | | |
| | `auction_type` | String | "yearly" or "monthly" | Allocation source tracking | Required field | | |
| **Description:** | |
| LTN represents capacity already committed through long-term (yearly and monthly) auctions conducted by JAO. This capacity is **guaranteed and known in advance**, making it an exceptionally valuable feature: | |
| **Historical Context (What Happened):** | |
| - LTN values show how much capacity was already allocated | |
| - Higher LTN → lower available day-ahead capacity (Max BEX) | |
| - Direct relationship: `Max BEX ≈ Theoretical Max - LTN - Other Constraints` | |
| **Future Covariates (What We Know Will Happen):** | |
| - **Yearly auctions:** Results are known for the **entire year ahead** | |
| - **Monthly auctions:** Results are known for the **month ahead** | |
| - When forecasting D+1 to D+14, LTN values are **100% certain** (already allocated) | |
| - This is **GOLD STANDARD** for future covariates - no forecasting needed! | |
| **Borders with LTN:** | |
| Most Core borders have LTN=0 (Financial Transmission Rights), but important exceptions: | |
| - **SI-HR** (Slovenia-Croatia): Physical Transmission Rights (PTRs) still in use | |
| - Some external Core borders may have PTRs | |
| - Even when LTN=0, including the series confirms this to the model | |
| **Impact on Max BEX:** | |
| Example: If DE-FR has 500 MW LTN allocated: | |
| - Theoretical capacity: 3000 MW | |
| - LTN reduction: -500 MW | |
| - After network constraints: -400 MW | |
| - **Result: Max BEX ≈ 2100 MW** | |
| **Collection Priority:** DAY 1 AFTERNOON (HIGH PRIORITY - future covariates) | |
| **Storage Schema:** | |
| ```python | |
| ltn_schema = { | |
| 'timestamp': pl.Datetime, | |
| 'border_id': pl.Utf8, | |
| 'ltn_allocated': pl.Float32, # Amount already committed | |
| 'direction': pl.Utf8, | |
| 'auction_type': pl.Utf8 # yearly/monthly | |
| } | |
| ``` | |
| **Expected File Size:** ~100-160 MB for 24 months × 20 borders × 17,520 hours | |
| **JAOPuTo Download Command:** | |
| ```bash | |
| # Download LTN data for 24 months | |
| java -jar JAOPuTo.jar \ | |
| --start-date 2023-10-01 \ | |
| --end-date 2025-09-30 \ | |
| --data-type LTN \ | |
| --region CORE \ | |
| --output-format parquet \ | |
| --output-dir ./data/raw/jao/ltn/ | |
| ``` | |
| **Future Covariate Usage:** | |
| ```python | |
| # LTN is known in advance from auction results | |
| # For forecast period D+1 to D+14, we have PERFECT information | |
| def prepare_ltn_future_covariates(ltn_df, forecast_start_date, prediction_horizon_hours=336): | |
| """ | |
| Extract LTN values for future forecast horizon | |
| These are KNOWN with certainty (auction results) | |
| """ | |
| forecast_end_date = forecast_start_date + pd.Timedelta(hours=prediction_horizon_hours) | |
| future_ltn = ltn_df.filter( | |
| (pl.col('timestamp') >= forecast_start_date) & | |
| (pl.col('timestamp') < forecast_end_date) | |
| ) | |
| return future_ltn # No forecasting needed - these are actual commitments! | |
| ``` | |
| --- | |
| ### 2.3 Min/Max Net Position Data Series | |
| **File:** `Core_MaxNetPositions_[date].xml` or available through `Core_maxnetpositions.R` JAOPuTo function | |
| **CRITICAL:** Min/Max Net Positions define the **feasible domain** for each bidding zone - the boundaries within which net positions can move without violating network constraints. | |
| | Field Name | Data Type | Description | Use in Features | Missing Value Strategy | | |
| |------------|-----------|-------------|-----------------|----------------------| | |
| | `zone` | String | Bidding zone (e.g., "DE_LU", "FR", "BE") | Zone tracking | **REQUIRED** | | |
| | `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | | |
| | `net_pos_min` | Float (MW) | Minimum feasible net position | **DOMAIN BOUNDARY** | **CRITICAL - must have** | | |
| | `net_pos_max` | Float (MW) | Maximum feasible net position | **DOMAIN BOUNDARY** | **CRITICAL - must have** | | |
| | `actual_net_pos` | Float (MW) | Actual net position after market clearing | Reference value | 0.0 if missing | | |
| **Description:** | |
| Min/Max Net Positions represent the **feasible space** in which each bidding zone can operate without violating any CNEC constraints. These values are extracted from the union of: | |
| - Final flow-based domain | |
| - Final bilateral exchange restrictions (LTA inclusion) | |
| - All CNEC constraints simultaneously satisfied | |
| **Why This Matters for Forecasting:** | |
| 1. **Tight ranges indicate constrained systems:** | |
| - If `net_pos_max - net_pos_min` is small → limited flexibility → likely lower Max BEX | |
| - If range is large → system has degrees of freedom → potentially higher Max BEX | |
| 2. **Zone-specific stress indicators:** | |
| - A zone with narrow net position range is "boxed in" | |
| - This propagates to adjacent borders (reduces their capacity) | |
| 3. **Derived features:** | |
| ```python | |
| # Range = degrees of freedom | |
| net_pos_range = net_pos_max - net_pos_min | |
| # Margin = how close to limits (utilization) | |
| net_pos_margin = (actual_net_pos - net_pos_min) / (net_pos_max - net_pos_min) | |
| # Stress index = inverse of range (tighter = more stressed) | |
| zone_stress = 1.0 / (net_pos_range + 1.0) # +1 to avoid division by zero | |
| ``` | |
| **Example Interpretation:** | |
| For Germany (DE_LU) at a specific hour: | |
| - `net_pos_min` = -8000 MW (maximum import) | |
| - `net_pos_max` = +12000 MW (maximum export) | |
| - `actual_net_pos` = +6000 MW (actual export after market clearing) | |
| - **Range** = 20,000 MW (large flexibility) | |
| - **Margin** = (6000 - (-8000)) / 20000 = 0.70 (70% utilization toward export limit) | |
| If another hour shows: | |
| - `net_pos_min` = -2000 MW | |
| - `net_pos_max` = +3000 MW | |
| - **Range** = 5,000 MW (VERY constrained - system is tight!) | |
| - This hour will likely have lower Max BEX on German borders | |
| **Zones to Collect (12 Core zones):** | |
| - DE_LU (Germany-Luxembourg) | |
| - FR (France) | |
| - BE (Belgium) | |
| - NL (Netherlands) | |
| - AT (Austria) | |
| - CZ (Czech Republic) | |
| - PL (Poland) | |
| - SK (Slovakia) | |
| - HU (Hungary) | |
| - SI (Slovenia) | |
| - HR (Croatia) | |
| - RO (Romania) | |
| **Collection Priority:** DAY 1 AFTERNOON (HIGH PRIORITY) | |
| **Storage Schema:** | |
| ```python | |
| netpos_schema = { | |
| 'timestamp': pl.Datetime, | |
| 'zone': pl.Utf8, | |
| 'net_pos_min': pl.Float32, # Minimum feasible (import limit) | |
| 'net_pos_max': pl.Float32, # Maximum feasible (export limit) | |
| 'actual_net_pos': pl.Float32 # Actual market clearing value | |
| } | |
| ``` | |
| **Expected File Size:** ~160-200 MB for 24 months × 12 zones × 17,520 hours | |
| **JAOPuTo Download Command:** | |
| ```bash | |
| # Download Min/Max Net Positions for 24 months | |
| java -jar JAOPuTo.jar \ | |
| --start-date 2023-10-01 \ | |
| --end-date 2025-09-30 \ | |
| --data-type MAX_NET_POSITIONS \ | |
| --region CORE \ | |
| --output-format parquet \ | |
| --output-dir ./data/raw/jao/netpos/ | |
| ``` | |
| **Feature Engineering:** | |
| ```python | |
| def engineer_netpos_features(df: pl.DataFrame) -> pl.DataFrame: | |
| """ | |
| Create net position features from Min/Max data | |
| """ | |
| df = df.with_columns([ | |
| # 1. Range (degrees of freedom) | |
| (pl.col('net_pos_max') - pl.col('net_pos_min')) | |
| .alias('net_pos_range'), | |
| # 2. Utilization margin (0 = at min, 1 = at max) | |
| ((pl.col('actual_net_pos') - pl.col('net_pos_min')) / | |
| (pl.col('net_pos_max') - pl.col('net_pos_min'))) | |
| .alias('net_pos_margin'), | |
| # 3. Stress index (inverse of range, normalized) | |
| (1.0 / (pl.col('net_pos_max') - pl.col('net_pos_min') + 100.0)) | |
| .alias('zone_stress_index'), | |
| # 4. Distance to limits (minimum of distances to both boundaries) | |
| pl.min_horizontal([ | |
| pl.col('actual_net_pos') - pl.col('net_pos_min'), | |
| pl.col('net_pos_max') - pl.col('actual_net_pos') | |
| ]).alias('distance_to_nearest_limit'), | |
| # 5. Asymmetry (is the feasible space symmetric around zero?) | |
| ((pl.col('net_pos_max') + pl.col('net_pos_min')) / 2.0) | |
| .alias('domain_asymmetry') | |
| ]) | |
| return df | |
| ``` | |
| **System-Level Aggregations:** | |
| ```python | |
| def aggregate_netpos_system_wide(df: pl.DataFrame) -> pl.DataFrame: | |
| """ | |
| Create system-wide net position stress indicators | |
| """ | |
| system_features = df.groupby('timestamp').agg([ | |
| # Average range across all zones | |
| pl.col('net_pos_range').mean().alias('avg_zone_flexibility'), | |
| # Minimum range (most constrained zone) | |
| pl.col('net_pos_range').min().alias('min_zone_flexibility'), | |
| # Count of highly constrained zones (range < 5000 MW) | |
| (pl.col('net_pos_range') < 5000).sum().alias('n_constrained_zones'), | |
| # System-wide stress (average of zone stress indices) | |
| pl.col('zone_stress_index').mean().alias('system_stress_avg'), | |
| # Maximum zone stress (tightest zone) | |
| pl.col('zone_stress_index').max().alias('system_stress_max') | |
| ]) | |
| return system_features | |
| ``` | |
| --- | |
| ### 2.4 CNEC Data Series | |
| **File:** `Core_DA_CC_CNEC_[date].xml` (daily publication at 10:30 CET) | |
| | Field Name | Data Type | Description | Use in Features | Missing Value Strategy | | |
| |------------|-----------|-------------|-----------------|----------------------| | |
| | `cnec_id` | String | Unique identifier (e.g., "DE_CZ_TIE_001") | CNEC tracking | N/A - Required field | | |
| | `cnec_name` | String | Human-readable name (e.g., "Line Röhrsdorf-Hradec N-1 ABC") | Documentation | Forward-fill | | |
| | `contingency` | String | N-1 element causing constraint | CNEC classification | "basecase" if missing | | |
| | `tso` | String | Responsible TSO (e.g., "50Hertz", "ÄŒEPS") | Geographic features | Parse from `cnec_id` | | |
| | `monitored_element` | String | Physical line/transformer being monitored | CNEC grouping | Required field | | |
| | `fmax` | Float (MW) | Maximum flow limit under contingency | Normalization baseline | **CRITICAL - must have** | | |
| | `ram_before` | Float (MW) | Initial remaining available margin | Historical patterns | 0.0 (conservative) | | |
| | `ram_after` | Float (MW) | Final RAM after remedial actions | **Primary feature** | Forward-fill from `ram_before` | | |
| | `flow_fb` | Float (MW) | Final flow after market coupling | Flow patterns | 0.0 if unconstrained | | |
| | `presolved` | Boolean | Was CNEC binding/active? | **Key target feature** | False (not binding) | | |
| | `shadow_price` | Float (€/MW) | Lagrange multiplier for constraint | Economic signal | 0.0 if not binding | | |
| | `direction` | String | "import" or "export" from perspective | Directionality | Parse from flow sign | | |
| | `voltage_level` | Integer (kV) | Voltage level (e.g., 220, 380) | CNEC classification | 380 (most common) | | |
| | `timestamp` | Datetime | Delivery hour (CET/CEST) | Time index | **REQUIRED** | | |
| | `business_day` | Date | Market day (D for delivery D+1) | File organization | Derive from timestamp | | |
| **Storage Schema:** | |
| ```python | |
| cnec_schema = { | |
| 'timestamp': pl.Datetime, | |
| 'business_day': pl.Date, | |
| 'cnec_id': pl.Utf8, | |
| 'cnec_name': pl.Utf8, | |
| 'tso': pl.Utf8, | |
| 'fmax': pl.Float32, | |
| 'ram_after': pl.Float32, # Primary field | |
| 'ram_before': pl.Float32, | |
| 'flow_fb': pl.Float32, | |
| 'presolved': pl.Boolean, | |
| 'shadow_price': pl.Float32, | |
| 'contingency': pl.Utf8, | |
| 'voltage_level': pl.Int16 | |
| } | |
| ``` | |
| **Collection Decision:** | |
| - ✅ **COLLECT:** All fields above | |
| - âÂÂÅ’ **SKIP:** Internal TSO-specific IDs, validation flags, intermediate calculation steps | |
| ### 2.5 PTDF Data Series | |
| **File:** `Core_DA_CC_PTDF_[date].xml` (published D-2 and updated D-1) | |
| | Field Name | Data Type | Description | Use in Features | Missing Value Strategy | | |
| |------------|-----------|-------------|-----------------|----------------------| | |
| | `cnec_id` | String | Links to CNEC | Join key | **REQUIRED** | | |
| | `zone` | String | Bidding zone (e.g., "DE_LU", "FR") | Sensitivity mapping | **REQUIRED** | | |
| | `ptdf_value` | Float | Sensitivity coefficient (-1.0 to +1.0) | **Core feature** | 0.0 (no impact) | | |
| | `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | | |
| | `version` | String | "D-2" or "D-1" | Data freshness | Use latest available | | |
| | `is_hvdc` | Boolean | Is this a HVDC link sensitivity? | Separate treatment | False | | |
| **PTDF Matrix Structure:** | |
| ``` | |
| DE_LU FR NL BE AT CZ PL ... (12 zones) | |
| CNEC_001 0.42 -0.18 0.15 0.08 0.05 0.12 -0.02 ... | |
| CNEC_002 -0.35 0.67 -0.12 0.25 0.03 -0.08 0.15 ... | |
| CNEC_003 0.28 -0.22 0.45 -0.15 0.18 0.08 -0.05 ... | |
| ... | |
| CNEC_200 ... ... ... ... ... ... ... ... | |
| ``` | |
| **Collection Decision:** | |
| - ✅ **COLLECT:** `ptdf_value` for all zones Ã all CNECs Ã all hours | |
| - ✅ **COLLECT:** Only "D-1" version (most recent available for historical data) | |
| - âÂÂÅ’ **SKIP:** D-2 version, intermediate updates, HVDC-specific matrices | |
| **Dimensionality:** | |
| - Raw: ~200 CNECs Ã 12 zones Ã 17,520 hours = ~42 million values | |
| - **Hybrid Storage:** Top 50 CNECs with individual PTDFs, Tier-2 with border aggregates | |
| ### 2.6 RAM Data Series | |
| **File:** `Core_DA_CC_CNEC_[date].xml` (within CNEC records) | |
| | Field Name | Data Type | Description | Use in Features | Missing Value Strategy | | |
| |------------|-----------|-------------|-----------------|----------------------| | |
| | `cnec_id` | String | CNEC identifier | Join key | **REQUIRED** | | |
| | `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | | |
| | `minram` | Float (MW) | Minimum RAM (70% rule) | Compliance checking | 0.7 Ã fmax | | |
| | `initial_ram` | Float (MW) | RAM before coordination | Historical baseline | Forward-fill | | |
| | `final_ram` | Float (MW) | RAM after validation/remedial actions | **Primary feature** | **CRITICAL** | | |
| | `ram_net_position` | Float (MW) | Net position at which RAM calculated | Market condition | 0.0 | | |
| | `validation_adjustment` | Float (MW) | TSO adjustments during validation | Adjustment patterns | 0.0 | | |
| | `remedial_actions_applied` | Boolean | Were remedial actions used? | Constraint stress | False | | |
| **Collection Decision:** | |
| - ✅ **COLLECT:** `final_ram`, `minram`, `initial_ram` | |
| - ✅ **COLLECT:** `remedial_actions_applied` (binary indicator) | |
| - âÂÂÅ’ **SKIP:** Detailed remedial action descriptions (too granular) | |
| ### 2.7 Shadow Price Data Series | |
| **File:** `Core_DA_Results_[date].xml` (post-market publication) | |
| | Field Name | Data Type | Description | Use in Features | Missing Value Strategy | | |
| |------------|-----------|-------------|-----------------|----------------------| | |
| | `cnec_id` | String | CNEC identifier | Join key | **REQUIRED** | | |
| | `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | | |
| | `shadow_price` | Float (€/MW) | Lagrange multiplier | Economic signal | 0.0 | | |
| | `shadow_price_before_minram` | Float (€/MW) | Before 70% rule applied | Pre-constraint value | 0.0 | | |
| | `binding_duration` | Integer (minutes) | How long CNEC was binding | Persistence | 0 | | |
| | `peak_shadow_price_4h` | Float (€/MW) | Max in 4-hour window | Volatility signal | shadow_price | | |
| **Collection Decision:** | |
| - ✅ **COLLECT:** `shadow_price` (primary) | |
| - ✅ **COLLECT:** `binding_duration` (for persistence features) | |
| - âÂÂÅ’ **SKIP:** `shadow_price_before_minram` (less relevant for forecasting) | |
| --- | |
| ### 2.8 D2CF (Day-2 Congestion Forecast) - OPTIONAL | |
| **Decision: SKIP for MVP** | |
| **Rationale:** | |
| 1. **Temporal misalignment:** D2CF provides TSO forecasts for D+2, but we're forecasting D+1 to D+14 | |
| 2. **Forecast contamination:** These are forecasts, not ground truth - introduces noise | |
| 3. **Better alternatives:** ENTSO-E actual generation/load provides cleaner signals | |
| 4. **Scope reduction:** Keeps MVP focused on high-signal features | |
| **If collected later:** | |
| - `load_forecast_d2cf`: TSO load forecast | |
| - `generation_forecast_d2cf`: TSO generation forecast | |
| - `net_position_forecast_d2cf`: Expected net position | |
| --- | |
| ### 2.9 ATC for Non-Core Borders Data Series | |
| **File:** `Core_ATC_External_Borders_[date].xml` or ENTSO-E Transparency Platform | |
| | Field Name | Data Type | Description | Use in Features | Missing Value Strategy | | |
| |------------|-----------|-------------|-----------------|----------------------| | |
| | `border_id` | String | Non-Core border (e.g., "FR-UK", "FR-ES") | Border tracking | **REQUIRED** | | |
| | `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | | |
| | `atc_forward` | Float (MW) | Available Transfer Capacity (forward direction) | Loop flow driver | 0.0 | | |
| | `atc_backward` | Float (MW) | Available Transfer Capacity (backward direction) | Loop flow driver | 0.0 | | |
| **Description:** | |
| Available Transfer Capacity on borders connecting Core to non-Core regions. These affect loop flows through Core and impact Core CNEC constraints. | |
| **Why This Matters:** | |
| Large flows on external borders create loop flows through Core: | |
| - **FR-UK flows** affect FR-BE, FR-DE capacities via loop flows | |
| - **Swiss transit** (FR-CH-AT-DE) impacts Core CNECs significantly | |
| - **Nordic flows** through DE-DK affect German internal CNECs | |
| - **Italian flows** through AT-IT affect Austrian CNECs | |
| **Key Non-Core Borders to Collect (~14 borders):** | |
| 1. **FR-UK** (IFA, ElecLink, IFA2 interconnectors) - Major loop flow driver | |
| 2. **FR-ES** (Pyrenees corridor) | |
| 3. **FR-CH** (Alpine corridor to Switzerland) | |
| 4. **DE-CH** (Basel-Laufenburg) | |
| 5. **AT-CH** (Alpine corridor) | |
| 6. **DE-DK** (to Nordic region) | |
| 7. **PL-SE** (Baltic to Nordic) | |
| 8. **PL-LT** (Baltic connection) | |
| 9. **AT-SI** (SEE gateway) | |
| 10. **AT-IT** (Brenner Pass) | |
| 11. **HU-RO** (SEE connection) | |
| 12. **HU-RS** (Balkans) | |
| 13. **HR-RS** (Balkans) | |
| 14. **SI-IT** (SEE to Italy) | |
| **Collection Method:** | |
| 1. **JAO Publication Tool:** Core_ATC_External_Borders | |
| 2. **ENTSO-E Transparency Platform:** Fallback if JAO incomplete | |
| **Collection Priority:** DAY 1 AFTERNOON (after core data) | |
| **Storage Schema:** | |
| ```python | |
| atc_external_schema = { | |
| 'timestamp': pl.Datetime, | |
| 'border_id': pl.Utf8, | |
| 'atc_forward': pl.Float32, | |
| 'atc_backward': pl.Float32 | |
| } | |
| ``` | |
| **Expected File Size:** ~120-160 MB for 24 months × 14 borders | |
| **ENTSO-E API Collection (if needed):** | |
| ```python | |
| from entsoe import EntsoePandasClient | |
| non_core_borders = [ | |
| ('FR', 'UK'), ('FR', 'ES'), ('FR', 'CH'), | |
| ('DE', 'CH'), ('DE', 'DK'), | |
| ('AT', 'CH'), ('AT', 'IT'), | |
| ('PL', 'SE'), ('PL', 'LT') | |
| ] | |
| for zone1, zone2 in non_core_borders: | |
| atc_data = client.query_offered_capacity( | |
| country_code_from=zone1, | |
| country_code_to=zone2, | |
| start=start_date, | |
| end=end_date | |
| ) | |
| ``` | |
| --- | |
| ## 3. Data Cleaning Procedures | |
| ### 3.1 Missing Value Handling | |
| **Priority Order:** | |
| 1. **Forward-fill** (for continuous series with gradual changes) | |
| 2. **Zero-fill** (for event-based fields where absence means "not active") | |
| 3. **Interpolation** (for weather-related smooth series) | |
| 4. **Drop** (only if >30% missing in critical fields) | |
| **Field-Specific Strategies:** | |
| | Field | Strategy | Justification | | |
| |-------|----------|---------------| | |
| | `ram_after` | Forward-fill (max 2 hours), then linear interpolation | Capacity changes gradually | | |
| | `presolved` | Zero-fill → False | Missing = not binding | | |
| | `shadow_price` | Zero-fill | Missing = no congestion cost | | |
| | `ptdf_value` | Zero-fill | Missing = no sensitivity | | |
| | `fmax` | **NO FILL - MUST HAVE** | Critical for normalization | | |
| | `flow_fb` | Zero-fill | Missing = no flow | | |
| | `cnec_name` | Forward-fill | Descriptive only | | |
| **Implementation:** | |
| ```python | |
| def clean_jao_data(df: pl.DataFrame) -> pl.DataFrame: | |
| """Clean JAO data with field-specific strategies""" | |
| # 1. Handle critical fields (fail if missing) | |
| critical_fields = ['timestamp', 'cnec_id', 'fmax'] | |
| for field in critical_fields: | |
| if df[field].null_count() > 0: | |
| raise ValueError(f"Critical field {field} has missing values") | |
| # 2. Forward-fill continuous series | |
| df = df.with_columns([ | |
| pl.col('ram_after').fill_null(strategy='forward').fill_null(strategy='backward'), | |
| pl.col('ram_before').fill_null(strategy='forward').fill_null(strategy='backward'), | |
| pl.col('cnec_name').fill_null(strategy='forward') | |
| ]) | |
| # 3. Zero-fill event indicators | |
| df = df.with_columns([ | |
| pl.col('presolved').fill_null(False), | |
| pl.col('shadow_price').fill_null(0.0), | |
| pl.col('flow_fb').fill_null(0.0), | |
| pl.col('remedial_actions_applied').fill_null(False) | |
| ]) | |
| # 4. Linear interpolation for remaining RAM gaps (max 2 hours) | |
| df = df.with_columns([ | |
| pl.col('ram_after').interpolate(method='linear') | |
| ]) | |
| # 5. Fill contingency with "basecase" if missing | |
| df = df.with_columns([ | |
| pl.col('contingency').fill_null('basecase') | |
| ]) | |
| return df | |
| ``` | |
| ### 3.2 Outlier Detection & Treatment | |
| **RAM Outliers:** | |
| ```python | |
| def detect_ram_outliers(df: pl.DataFrame) -> pl.DataFrame: | |
| """Flag and clip RAM outliers""" | |
| # 1. Physical constraints | |
| df = df.with_columns([ | |
| # RAM cannot exceed Fmax | |
| pl.when(pl.col('ram_after') > pl.col('fmax')) | |
| .then(pl.col('fmax')) | |
| .otherwise(pl.col('ram_after')) | |
| .alias('ram_after'), | |
| # RAM cannot be negative (clip to 0) | |
| pl.when(pl.col('ram_after') < 0) | |
| .then(0.0) | |
| .otherwise(pl.col('ram_after')) | |
| .alias('ram_after') | |
| ]) | |
| # 2. Statistical outliers (sudden unrealistic jumps) | |
| # Flag if RAM drops >80% in 1 hour (likely data error) | |
| df = df.with_columns([ | |
| (pl.col('ram_after').diff() / pl.col('ram_after').shift(1) < -0.8) | |
| .fill_null(False) | |
| .alias('outlier_sudden_drop') | |
| ]) | |
| # 3. Replace flagged outliers with interpolation | |
| df = df.with_columns([ | |
| pl.when(pl.col('outlier_sudden_drop')) | |
| .then(None) | |
| .otherwise(pl.col('ram_after')) | |
| .alias('ram_after') | |
| ]).with_columns([ | |
| pl.col('ram_after').interpolate() | |
| ]) | |
| return df.drop('outlier_sudden_drop') | |
| ``` | |
| **PTDF Outliers:** | |
| ```python | |
| def clean_ptdf_outliers(ptdf_matrix: np.ndarray) -> np.ndarray: | |
| """Clean PTDF values to physical bounds""" | |
| # PTDFs theoretically in [-1, 1] but can be slightly outside due to rounding | |
| # Clip to [-1.5, 1.5] to catch errors while allowing small overruns | |
| ptdf_matrix = np.clip(ptdf_matrix, -1.5, 1.5) | |
| # Replace any NaN/Inf with 0 (no sensitivity) | |
| ptdf_matrix = np.nan_to_num(ptdf_matrix, nan=0.0, posinf=0.0, neginf=0.0) | |
| return ptdf_matrix | |
| ``` | |
| **Shadow Price Outliers:** | |
| ```python | |
| def clean_shadow_prices(df: pl.DataFrame) -> pl.DataFrame: | |
| """Handle extreme shadow prices""" | |
| # Shadow prices above €500/MW are rare but possible during extreme scarcity | |
| # Cap at €1000/MW to prevent contamination (99.9th percentile historically) | |
| df = df.with_columns([ | |
| pl.when(pl.col('shadow_price') > 1000) | |
| .then(1000) | |
| .otherwise(pl.col('shadow_price')) | |
| .alias('shadow_price') | |
| ]) | |
| return df | |
| ``` | |
| ### 3.3 Timestamp Alignment | |
| **Challenge:** JAO publishes data for "delivery day D+1" on "business day D" | |
| **Solution:** | |
| ```python | |
| def align_jao_timestamps(df: pl.DataFrame) -> pl.DataFrame: | |
| """Convert business day to delivery hour timestamps""" | |
| # JAO format: business_day='2025-10-28', delivery_hour=1-24 | |
| # Convert to: timestamp='2025-10-29 00:00' to '2025-10-29 23:00' (UTC) | |
| df = df.with_columns([ | |
| # Delivery timestamp = business_day + 1 day + (delivery_hour - 1) | |
| (pl.col('business_day') + pl.duration(days=1) + | |
| pl.duration(hours=pl.col('delivery_hour') - 1)) | |
| .alias('timestamp') | |
| ]) | |
| # Convert from CET/CEST to UTC for consistency | |
| df = df.with_columns([ | |
| pl.col('timestamp').dt.convert_time_zone('UTC') | |
| ]) | |
| return df | |
| ``` | |
| ### 3.4 Duplicate Handling | |
| **Sources of Duplicates:** | |
| 1. D-2 and D-1 PTDF updates (keep D-1 only) | |
| 2. Corrected publications (keep latest by publication time) | |
| 3. Intraday updates (out of scope for day-ahead forecasting) | |
| **Deduplication Strategy:** | |
| ```python | |
| def deduplicate_jao_data(df: pl.DataFrame) -> pl.DataFrame: | |
| """Remove duplicate records, keeping most recent version""" | |
| # For PTDF data: keep D-1 version | |
| if 'version' in df.columns: | |
| df = df.filter(pl.col('version') == 'D-1') | |
| # For all data: keep latest publication time per (timestamp, cnec_id) | |
| df = (df | |
| .sort('publication_time', descending=True) | |
| .unique(subset=['timestamp', 'cnec_id'], keep='first')) | |
| return df | |
| ``` | |
| --- | |
| ## 4. CNEC Treatment | |
| ### 4.1 CNEC Selection Strategy | |
| **Goal:** Reduce from **~2,000 CNECs published daily by TSOs** to **Top 200 most impactful CNECs** | |
| **Two-Tier Approach:** | |
| - **Tier 1 (Top 50 CNECs):** Full feature detail (5 features each = 250 features) | |
| - **Tier 2 (Next 150 CNECs):** Selective features (2 features each = 300 features) | |
| - **Total:** 200 CNECs tracked | |
| **Selection Criteria:** | |
| 1. **Binding Frequency** (40% weight): How often was CNEC active? | |
| 2. **Economic Impact** (30% weight): Average shadow price when binding | |
| 3. **RAM Utilization** (20% weight): How close to limit (low RAM)? | |
| 4. **Geographic Coverage** (10% weight): Ensure all borders represented | |
| **Implementation:** | |
| ```python | |
| def select_top_cnecs(df: pl.DataFrame, n_cnecs: int = 200) -> dict[str, list[str]]: | |
| """ | |
| Select top 200 most impactful CNECs from ~2,000 published daily | |
| Returns: {'tier1': list of 50 CNECs, 'tier2': list of 150 CNECs} | |
| """ | |
| cnec_impact = df.groupby('cnec_id').agg([ | |
| # Binding frequency (times active / total hours) | |
| (pl.col('presolved').sum() / pl.col('presolved').count()) | |
| .alias('binding_freq'), | |
| # Average shadow price when binding | |
| pl.col('shadow_price').filter(pl.col('presolved')).mean() | |
| .alias('avg_shadow_price'), | |
| # RAM utilization (% of hours below 30% of Fmax) | |
| ((pl.col('ram_after') < 0.3 * pl.col('fmax')).sum() / | |
| pl.col('ram_after').count()) | |
| .alias('low_ram_freq'), | |
| # Days appeared (consistency metric) | |
| pl.col('cnec_id').count().alias('days_appeared'), | |
| # Geographic zone (parse from CNEC ID) | |
| pl.col('tso').first().alias('tso') | |
| ]) | |
| # Calculate composite impact score | |
| cnec_impact = cnec_impact.with_columns([ | |
| (0.40 * pl.col('binding_freq') + | |
| 0.30 * (pl.col('avg_shadow_price') / 100) + # Normalize to [0,1] | |
| 0.20 * pl.col('low_ram_freq') + | |
| 0.10 * (pl.col('days_appeared') / 365)) # Consistency bonus | |
| .alias('impact_score') | |
| ]) | |
| # Select top N, ensuring geographic diversity | |
| top_cnecs = [] | |
| # First, get top 10 per major bidding zone border (ensures coverage) | |
| for border in ['DE_FR', 'DE_NL', 'DE_CZ', 'FR_BE', 'AT_CZ', 'DE_AT', | |
| 'FR_ES', 'AT_IT', 'DE_PL', 'BE_NL']: | |
| border_cnecs = (cnec_impact | |
| .filter(pl.col('cnec_id').str.contains(border)) | |
| .sort('impact_score', descending=True) | |
| .head(10)) | |
| top_cnecs.extend(border_cnecs['cnec_id'].to_list()) | |
| # Remove duplicates (some CNECs may match multiple borders) | |
| top_cnecs = list(set(top_cnecs)) | |
| # Then add remaining highest-impact CNECs to reach n_cnecs | |
| remaining = (cnec_impact | |
| .filter(~pl.col('cnec_id').is_in(top_cnecs)) | |
| .sort('impact_score', descending=True) | |
| .head(n_cnecs - len(top_cnecs))) | |
| top_cnecs.extend(remaining['cnec_id'].to_list()) | |
| # Split into two tiers | |
| tier1_cnecs = top_cnecs[:50] # Top 50 get full feature detail | |
| tier2_cnecs = top_cnecs[50:200] # Next 150 get selective features | |
| return { | |
| 'tier1': tier1_cnecs, | |
| 'tier2': tier2_cnecs, | |
| 'all': top_cnecs[:200] | |
| } | |
| ``` | |
| **Why 200 CNECs?** | |
| - From ~2,000 published daily, most are rarely binding or have minimal economic impact | |
| - Top 200 captures >95% of all binding events and congestion costs | |
| - Manageable feature space while preserving critical network information | |
| - Two-tier approach balances detail vs. dimensionality | |
| ### 4.2 CNEC Masking Strategy | |
| **Problem:** Not all CNECs are published every day (only those considered relevant by TSOs) | |
| **Solution:** Create a "Master CNEC Set" with masking | |
| ```python | |
| def create_cnec_master_set(df: pl.DataFrame, top_cnecs: list[str]) -> pl.DataFrame: | |
| """Create complete CNEC Ã timestamp matrix with masking""" | |
| # 1. Create complete timestamp Ã CNEC grid | |
| all_timestamps = df.select('timestamp').unique().sort('timestamp') | |
| cnec_template = pl.DataFrame({'cnec_id': top_cnecs}) | |
| # Cartesian product: all timestamps Ã all CNECs | |
| complete_grid = all_timestamps.join(cnec_template, how='cross') | |
| # 2. Left join with actual data | |
| complete_data = complete_grid.join( | |
| df.select(['timestamp', 'cnec_id', 'ram_after', 'presolved', | |
| 'shadow_price', 'fmax']), | |
| on=['timestamp', 'cnec_id'], | |
| how='left' | |
| ) | |
| # 3. Add masking indicator | |
| complete_data = complete_data.with_columns([ | |
| # Mask = 1 if CNEC was published (data available) | |
| # Mask = 0 if CNEC not published (implicitly unconstrained) | |
| (~pl.col('ram_after').is_null()).cast(pl.Int8).alias('cnec_mask') | |
| ]) | |
| # 4. Fill missing values according to strategy | |
| complete_data = complete_data.with_columns([ | |
| # If not published, assume unconstrained: | |
| # - RAM = Fmax (maximum margin) | |
| # - presolved = False (not binding) | |
| # - shadow_price = 0 (no congestion) | |
| pl.col('ram_after').fill_null(pl.col('fmax')), | |
| pl.col('presolved').fill_null(False), | |
| pl.col('shadow_price').fill_null(0.0) | |
| ]) | |
| return complete_data | |
| ``` | |
| **Resulting Schema:** | |
| ``` | |
| timestamp cnec_id ram_after presolved shadow_price cnec_mask fmax | |
| 2024-10-01 00:00 DE_CZ_TIE_001 450.2 True 12.5 1 800 | |
| 2024-10-01 00:00 DE_FR_LINE_005 800.0 False 0.0 0 800 # Not published | |
| 2024-10-01 01:00 DE_CZ_TIE_001 432.1 True 15.2 1 800 | |
| ... | |
| ``` | |
| ### 4.3 CNEC Feature Derivation | |
| **From master CNEC set, derive:** | |
| ```python | |
| def engineer_cnec_features(df: pl.DataFrame) -> pl.DataFrame: | |
| """Create CNEC-based features for forecasting""" | |
| df = df.with_columns([ | |
| # 1. Margin ratio (normalized RAM) | |
| (pl.col('ram_after') / pl.col('fmax')).alias('margin_ratio'), | |
| # 2. Binding frequency (7-day rolling) | |
| pl.col('presolved').cast(pl.Int8).rolling_mean(window_size=168) | |
| .over('cnec_id').alias('binding_freq_7d'), | |
| # 3. Binding frequency (30-day rolling) | |
| pl.col('presolved').cast(pl.Int8).rolling_mean(window_size=720) | |
| .over('cnec_id').alias('binding_freq_30d'), | |
| # 4. MinRAM compliance | |
| (pl.col('ram_after') < 0.7 * pl.col('fmax')) | |
| .cast(pl.Int8).alias('below_minram'), | |
| # 5. RAM volatility (7-day rolling std) | |
| pl.col('ram_after').rolling_std(window_size=168) | |
| .over('cnec_id').alias('ram_volatility_7d'), | |
| # 6. Shadow price volatility | |
| pl.col('shadow_price').rolling_std(window_size=168) | |
| .over('cnec_id').alias('shadow_price_volatility_7d'), | |
| # 7. CNEC criticality score | |
| (1 - pl.col('margin_ratio')).alias('criticality'), | |
| # 8. Sudden RAM drops (binary flag) | |
| ((pl.col('ram_after') - pl.col('ram_after').shift(1)) < -0.2 * pl.col('fmax')) | |
| .cast(pl.Int8).alias('sudden_ram_drop') | |
| ]) | |
| return df | |
| ``` | |
| --- | |
| ## 5. PTDF Treatment - Hybrid Approach | |
| ### 5.1 Strategy: Preserve Causality for Critical CNECs | |
| **Challenge:** PTDF matrix contains critical network physics but is high-dimensional | |
| - 200 CNECs × 12 zones × 17,520 hours = ~42 million values | |
| - PTDFs encode how zone injections affect each CNEC (network sensitivity) | |
| - **Critical requirement:** Preserve CNEC-specific causality chain: | |
| - `outage_active_cnec_X` → network sensitivity (PTDF) → `presolved_cnec_X` | |
| **Solution: Two-Tier Hybrid Approach** | |
| #### **Tier 1 (Top 50 CNECs): Individual PTDF Features** | |
| **Preserve all 12 zone sensitivities for each of the 50 most critical CNECs** | |
| - Result: 50 CNECs × 12 zones = **600 individual PTDF features** | |
| - Rationale: These are most impactful CNECs - model needs full network physics | |
| - Enables learning: "When `outage_active_cnec_42` = 1 AND high DE_LU injection (via `ptdf_cnec_42_DE_LU` sensitivity), THEN `presolved_cnec_42` likely = 1" | |
| #### **Tier 2 (Next 150 CNECs): Border-Level PTDF Aggregates** | |
| **Aggregate PTDF sensitivities by border-zone pairs** | |
| - Result: 10 borders × 12 zones = **120 aggregate PTDF features** | |
| - Rationale: Captures regional network patterns without full individual resolution | |
| - Preserves interpretability: "How sensitive are DE-CZ tier-2 CNECs to German generation?" | |
| - Still allows causal learning: `outage_active_cnec_tier2` → `avg_ptdf_border` → `presolved_cnec_tier2` | |
| **Total PTDF Features: 720 (600 individual + 120 aggregate)** | |
| **Why This Approach?** | |
| 1. ✅ **Preserves CNEC-specific causality for top 50** - most critical constraints | |
| 2. ✅ **Avoids PCA mixing** - no loss of CNEC identification | |
| 3. ✅ **Interpretable** - can explain which zones affect which borders | |
| 4. ✅ **Dimensionality reduction** - 1,800 tier-2 values → 120 features (15:1) | |
| 5. ✅ **Maintains geographic structure** - DE-CZ separate from FR-BE | |
| ### 5.2 Implementation: Top 50 Individual PTDFs | |
| ```python | |
| def extract_top50_ptdfs(ptdf_matrix: np.ndarray, | |
| top_50_cnec_ids: list[str], | |
| all_cnec_ids: list[str], | |
| zones: list[str]) -> pl.DataFrame: | |
| """ | |
| Extract individual PTDF values for top 50 CNECs | |
| Args: | |
| ptdf_matrix: Shape (n_timestamps, n_cnecs, n_zones) | |
| top_50_cnec_ids: List of 50 most critical CNEC identifiers | |
| all_cnec_ids: Full list of CNEC IDs matching matrix dimension | |
| zones: List of 12 Core zone identifiers ['DE_LU', 'FR', 'BE', 'NL', ...] | |
| Returns: | |
| DataFrame with 600 PTDF features (50 CNECs × 12 zones) | |
| """ | |
| # Find indices of top 50 CNECs in the full matrix | |
| top_50_indices = [all_cnec_ids.index(cnec) for cnec in top_50_cnec_ids] | |
| # Extract PTDF values for top 50 CNECs | |
| # Shape: (n_timestamps, 50, 12) | |
| top_50_ptdfs = ptdf_matrix[:, top_50_indices, :] | |
| # Reshape to flat features: (n_timestamps, 600) | |
| n_timestamps = top_50_ptdfs.shape[0] | |
| features_dict = {} | |
| for cnec_idx, cnec_id in enumerate(top_50_cnec_ids): | |
| for zone_idx, zone in enumerate(zones): | |
| feature_name = f'ptdf_cnec_{cnec_id}_{zone}' | |
| features_dict[feature_name] = top_50_ptdfs[:, cnec_idx, zone_idx] | |
| print(f"✓ Extracted {len(features_dict)} individual PTDF features for top 50 CNECs") | |
| return pl.DataFrame(features_dict) | |
| ``` | |
| ### 5.3 Implementation: Tier-2 Border Aggregates | |
| ```python | |
| def aggregate_tier2_ptdfs_by_border(ptdf_matrix: np.ndarray, | |
| tier2_cnec_ids: list[str], | |
| all_cnec_ids: list[str], | |
| zones: list[str]) -> pl.DataFrame: | |
| """ | |
| Aggregate PTDF sensitivities for tier-2 CNECs by border-zone pairs | |
| Args: | |
| ptdf_matrix: Shape (n_timestamps, n_cnecs, n_zones) | |
| tier2_cnec_ids: List of 150 tier-2 CNEC identifiers | |
| all_cnec_ids: Full list of CNEC IDs | |
| zones: List of 12 Core zone identifiers | |
| Returns: | |
| DataFrame with 120 features (10 borders × 12 zones) | |
| """ | |
| # Define major borders | |
| major_borders = ['DE_FR', 'DE_NL', 'DE_CZ', 'FR_BE', 'AT_CZ', | |
| 'DE_AT', 'BE_NL', 'AT_IT', 'DE_PL', 'CZ_PL'] | |
| # Find indices of tier-2 CNECs | |
| tier2_indices = [all_cnec_ids.index(cnec) for cnec in tier2_cnec_ids] | |
| # Extract tier-2 PTDF subset | |
| tier2_ptdfs = ptdf_matrix[:, tier2_indices, :] # (n_timestamps, 150, 12) | |
| features_dict = {} | |
| for border in major_borders: | |
| # Find tier-2 CNECs belonging to this border | |
| border_cnec_indices = [ | |
| i for i, cnec_id in enumerate(tier2_cnec_ids) | |
| if border.replace('_', '').lower() in cnec_id.lower() | |
| ] | |
| if len(border_cnec_indices) == 0: | |
| # No tier-2 CNECs for this border - use zeros | |
| for zone in zones: | |
| features_dict[f'avg_ptdf_{border}_{zone}_tier2'] = np.zeros(tier2_ptdfs.shape[0]) | |
| continue | |
| # Average PTDF across all tier-2 CNECs on this border | |
| for zone_idx, zone in enumerate(zones): | |
| border_zone_ptdfs = tier2_ptdfs[:, border_cnec_indices, zone_idx] | |
| avg_ptdf = np.mean(border_zone_ptdfs, axis=1) | |
| features_dict[f'avg_ptdf_{border}_{zone}_tier2'] = avg_ptdf | |
| print(f"✓ Aggregated {len(features_dict)} border-zone PTDF features for tier-2") | |
| return pl.DataFrame(features_dict) | |
| ``` | |
| ### 5.4 Complete PTDF Pipeline | |
| ```python | |
| def prepare_ptdf_features(ptdf_matrix: np.ndarray, | |
| top_50_cnecs: list[str], | |
| tier2_cnecs: list[str], | |
| all_cnec_ids: list[str], | |
| zones: list[str]) -> pl.DataFrame: | |
| """ | |
| Complete PTDF feature engineering pipeline | |
| Returns: | |
| DataFrame with 720 PTDF features total: | |
| - 600 individual features (top 50 CNECs × 12 zones) | |
| - 120 aggregate features (10 borders × 12 zones for tier-2) | |
| """ | |
| # 1. Extract individual PTDFs for top 50 | |
| top50_features = extract_top50_ptdfs( | |
| ptdf_matrix, top_50_cnecs, all_cnec_ids, zones | |
| ) | |
| # 2. Aggregate tier-2 PTDFs by border | |
| tier2_features = aggregate_tier2_ptdfs_by_border( | |
| ptdf_matrix, tier2_cnecs, all_cnec_ids, zones | |
| ) | |
| # 3. Combine | |
| all_ptdf_features = pl.concat([top50_features, tier2_features], how='horizontal') | |
| print(f"\n✓ PTDF Features Generated:") | |
| print(f" Top 50 individual: {top50_features.shape[1]} features") | |
| print(f" Tier-2 aggregated: {tier2_features.shape[1]} features") | |
| print(f" Total PTDF features: {all_ptdf_features.shape[1]}") | |
| return all_ptdf_features | |
| ``` | |
| ### 5.5 PTDF Feature Validation | |
| ```python | |
| def validate_ptdf_features(ptdf_features: pl.DataFrame) -> None: | |
| """Validate PTDF feature quality""" | |
| # 1. Check for valid range (PTDFs should be in [-1, 1]) | |
| outliers = [] | |
| for col in ptdf_features.columns: | |
| if col.startswith('ptdf_') or col.startswith('avg_ptdf_'): | |
| values = ptdf_features[col] | |
| min_val, max_val = values.min(), values.max() | |
| if min_val < -1.5 or max_val > 1.5: | |
| outliers.append(f"{col}: [{min_val:.3f}, {max_val:.3f}]") | |
| if outliers: | |
| print(f"âš WARNING: {len(outliers)} PTDF features outside expected [-1, 1] range") | |
| for out in outliers[:5]: # Show first 5 | |
| print(f" {out}") | |
| # 2. Check for constant features (no variance) | |
| constant_features = [] | |
| for col in ptdf_features.columns: | |
| if col.startswith('ptdf_') or col.startswith('avg_ptdf_'): | |
| if ptdf_features[col].std() < 1e-6: | |
| constant_features.append(col) | |
| if constant_features: | |
| print(f"âš WARNING: {len(constant_features)} PTDF features have near-zero variance") | |
| # 3. Summary statistics | |
| print(f"\n✓ PTDF Feature Validation Complete:") | |
| print(f" Features in valid range: {len(ptdf_features.columns) - len(outliers)}") | |
| print(f" Features with variance: {len(ptdf_features.columns) - len(constant_features)}") | |
| ``` | |
| --- | |
| ## 6. RAM Treatment | |
| ### 6.1 RAM Normalization | |
| **Challenge:** Absolute RAM values vary widely across CNECs (50 MW to 2000 MW) | |
| **Solution:** Normalize to margin ratio | |
| ```python | |
| def normalize_ram_values(df: pl.DataFrame) -> pl.DataFrame: | |
| """Normalize RAM to comparable scale""" | |
| df = df.with_columns([ | |
| # 1. Margin ratio: RAM / Fmax | |
| (pl.col('ram_after') / pl.col('fmax')).alias('margin_ratio'), | |
| # 2. MinRAM compliance ratio: RAM / (0.7 Ã Fmax) | |
| (pl.col('ram_after') / (0.7 * pl.col('fmax'))).alias('minram_compliance'), | |
| # 3. Percentile within CNEC's historical distribution | |
| pl.col('ram_after').rank(method='average').over('cnec_id') | |
| .alias('ram_percentile') | |
| ]) | |
| # Cap margin_ratio at 1.0 (cannot exceed Fmax) | |
| df = df.with_columns([ | |
| pl.when(pl.col('margin_ratio') > 1.0) | |
| .then(1.0) | |
| .otherwise(pl.col('margin_ratio')) | |
| .alias('margin_ratio') | |
| ]) | |
| return df | |
| ``` | |
| ### 6.2 RAM Time Series Features | |
| ```python | |
| def engineer_ram_features(df: pl.DataFrame) -> pl.DataFrame: | |
| """Create rolling window features from RAM values""" | |
| df = df.with_columns([ | |
| # 1. Moving averages | |
| pl.col('ram_after').rolling_mean(window_size=24).alias('ram_ma_24h'), | |
| pl.col('ram_after').rolling_mean(window_size=168).alias('ram_ma_7d'), | |
| pl.col('ram_after').rolling_mean(window_size=720).alias('ram_ma_30d'), | |
| # 2. Volatility measures | |
| pl.col('ram_after').rolling_std(window_size=168).alias('ram_std_7d'), | |
| pl.col('ram_after').rolling_std(window_size=720).alias('ram_std_30d'), | |
| # 3. Percentiles (vs 90-day history) | |
| pl.col('ram_after') | |
| .rolling_quantile(quantile=0.1, window_size=2160) | |
| .alias('ram_p10_90d'), | |
| pl.col('ram_after') | |
| .rolling_quantile(quantile=0.5, window_size=2160) | |
| .alias('ram_median_90d'), | |
| # 4. Rate of change | |
| (pl.col('ram_after').diff(1) / pl.col('ram_after').shift(1)) | |
| .alias('ram_pct_change_1h'), | |
| (pl.col('ram_after').diff(24) / pl.col('ram_after').shift(24)) | |
| .alias('ram_pct_change_24h'), | |
| # 5. Binary indicators | |
| (pl.col('ram_after') < 0.3 * pl.col('fmax')) | |
| .cast(pl.Int8).alias('low_ram_flag'), | |
| ((pl.col('ram_after') - pl.col('ram_after').shift(1)) < -0.2 * pl.col('fmax')) | |
| .cast(pl.Int8).alias('sudden_drop_flag'), | |
| # 6. MinRAM violation counter (7-day window) | |
| (pl.col('ram_after') < 0.7 * pl.col('fmax')) | |
| .cast(pl.Int8) | |
| .rolling_sum(window_size=168) | |
| .alias('minram_violations_7d') | |
| ]) | |
| return df | |
| ``` | |
| ### 6.3 Cross-Border RAM Aggregation | |
| **Create system-level RAM features:** | |
| ```python | |
| def aggregate_ram_system_wide(df: pl.DataFrame) -> pl.DataFrame: | |
| """Aggregate RAM across all CNECs to system-level features""" | |
| system_features = df.groupby('timestamp').agg([ | |
| # 1. Average margin across all CNECs | |
| pl.col('margin_ratio').mean().alias('system_avg_margin'), | |
| # 2. Minimum margin (most constrained CNEC) | |
| pl.col('margin_ratio').min().alias('system_min_margin'), | |
| # 3. Count of binding CNECs | |
| pl.col('presolved').sum().alias('n_binding_cnecs'), | |
| # 4. Count of CNECs below minRAM | |
| (pl.col('ram_after') < 0.7 * pl.col('fmax')).sum() | |
| .alias('n_cnecs_below_minram'), | |
| # 5. Standard deviation of margins (uniformity) | |
| pl.col('margin_ratio').std().alias('margin_std'), | |
| # 6. Total economic cost (sum of shadow prices) | |
| pl.col('shadow_price').sum().alias('total_congestion_cost'), | |
| # 7. Max shadow price | |
| pl.col('shadow_price').max().alias('max_shadow_price') | |
| ]) | |
| return system_features | |
| ``` | |
| --- | |
| ## 7. Shadow Prices Treatment | |
| ### 7.1 Shadow Price Features | |
| ```python | |
| def engineer_shadow_price_features(df: pl.DataFrame) -> pl.DataFrame: | |
| """Create features from shadow prices (economic signals)""" | |
| df = df.with_columns([ | |
| # 1. Rolling statistics | |
| pl.col('shadow_price').rolling_mean(window_size=24) | |
| .alias('shadow_price_ma_24h'), | |
| pl.col('shadow_price').rolling_max(window_size=24) | |
| .alias('shadow_price_max_24h'), | |
| pl.col('shadow_price').rolling_std(window_size=168) | |
| .alias('shadow_price_volatility_7d'), | |
| # 2. Binary indicators | |
| (pl.col('shadow_price') > 0).cast(pl.Int8) | |
| .alias('has_congestion_cost'), | |
| (pl.col('shadow_price') > 50).cast(pl.Int8) | |
| .alias('high_congestion_flag'), # €50/MW threshold | |
| # 3. Economic stress indicator (count of expensive hours in 7d) | |
| (pl.col('shadow_price') > 50).cast(pl.Int8) | |
| .rolling_sum(window_size=168) | |
| .alias('expensive_hours_7d'), | |
| # 4. Shadow price persistence (how long has it been >0?) | |
| # Create groups of consecutive non-zero shadow prices | |
| ((pl.col('shadow_price') > 0) != (pl.col('shadow_price').shift(1) > 0)) | |
| .cum_sum() | |
| .alias('shadow_price_regime'), | |
| ]) | |
| # 5. Duration in current regime | |
| df = df.with_columns([ | |
| pl.col('shadow_price_regime') | |
| .rank(method='dense').over('shadow_price_regime') | |
| .alias('hours_in_regime') | |
| ]) | |
| return df | |
| ``` | |
| --- | |
| ## 8. Feature Engineering Pipeline | |
| ### 8.1 Complete Feature Matrix Construction | |
| **Final feature set: ~1,060 features (Option A Architecture)** | |
| - **Historical Context:** 730 features (what happened in past 21 days) | |
| - **Future Covariates:** 280 features (what we know will happen in next 14 days) | |
| - **System-Level:** 50 features (real-time aggregates) | |
| **Architecture Rationale:** | |
| - **Top 50 CNECs:** Full detail (5 metrics each = 250 features) | |
| - Most impactful CNECs get complete representation | |
| - **Tier-2 150 CNECs:** Selective detail (300 features) | |
| - Binary indicators: `presolved` (binding) + `outage_active` (150 + 150 = 300 features) | |
| - Aggregated metrics: RAM, shadow prices by border (30 features) | |
| - Preserves high-signal discrete events while reducing continuous redundancy | |
| - **Supporting Features:** 380 features | |
| - PTDF patterns, border capacities, weather, temporal, interactions | |
| ```python | |
| class JAOFeatureEngineer: | |
| """ | |
| Complete JAO feature engineering pipeline - Option A Architecture | |
| Architecture: | |
| - Top 50 CNECs: Full detail (5 metrics Ã 50 = 250 features) | |
| - Tier-2 150 CNECs: Selective (presolved + outage_active = 300 features) | |
| - Tier-2 aggregated: Border-level RAM/shadow prices (30 features) | |
| - Supporting: PTDF, borders, system, temporal, weather (150 features) | |
| Total: ~730 historical features + ~280 future covariates | |
| Inputs: Raw JAO data (CNECs, PTDFs, RAMs, shadow prices) + ENTSO-E outages | |
| Outputs: Engineered feature matrix (512h Ã 730 features) + | |
| Future covariates (336h Ã 280 features) | |
| """ | |
| def __init__(self, top_n_cnecs: int = 50, tier2_n_cnecs: int = 150, | |
| ptdf_components: int = 10): | |
| self.top_n_cnecs = top_n_cnecs | |
| self.tier2_n_cnecs = tier2_n_cnecs | |
| self.total_cnecs = top_n_cnecs + tier2_n_cnecs # 200 total | |
| self.ptdf_components = ptdf_components | |
| self.pca_model = None | |
| self.scaler = None | |
| self.top_cnecs = None | |
| self.tier2_cnecs = None | |
| def fit(self, historical_data: dict): | |
| """ | |
| Fit on 12-month historical data to establish baselines | |
| Args: | |
| historical_data: { | |
| 'cnecs': pl.DataFrame, | |
| 'ptdfs': np.ndarray, | |
| 'rams': pl.DataFrame, | |
| 'shadow_prices': pl.DataFrame, | |
| 'outages': pl.DataFrame (from ENTSO-E via EIC matching) | |
| } | |
| """ | |
| print("Fitting JAO feature engineer on historical data...") | |
| # 1. Select top 200 CNECs (50 + 150) | |
| print(f" - Selecting top {self.total_cnecs} CNECs...") | |
| all_selected = select_top_cnecs( | |
| historical_data['cnecs'], | |
| n_cnecs=self.total_cnecs | |
| ) | |
| self.top_cnecs = all_selected[:self.top_n_cnecs] | |
| self.tier2_cnecs = all_selected[self.top_n_cnecs:] | |
| print(f" Top 50: {self.top_cnecs[:3]}... (full detail)") | |
| print(f" Tier-2 150: {self.tier2_cnecs[:3]}... (selective detail)") | |
| # 2. Engineer PTDF features (hybrid approach) | |
| print(" - Engineering PTDF features...") | |
| ptdf_compressed, self.pca_model, self.scaler = reduce_ptdf_dimensions( | |
| historical_data['ptdfs'], | |
| n_components=self.ptdf_components | |
| ) | |
| # 3. Calculate historical baselines (for percentiles, etc.) | |
| print(" - Computing historical baselines...") | |
| self.ram_baseline = historical_data['rams'].groupby('cnec_id').agg([ | |
| pl.col('ram_after').mean().alias('ram_mean'), | |
| pl.col('ram_after').std().alias('ram_std'), | |
| pl.col('ram_after').quantile(0.1).alias('ram_p10'), | |
| pl.col('ram_after').quantile(0.9).alias('ram_p90') | |
| ]) | |
| print("✓ Feature engineer fitted") | |
| def transform(self, | |
| data: dict, | |
| start_time: str, | |
| end_time: str) -> tuple[np.ndarray, np.ndarray]: | |
| """ | |
| Transform JAO data to feature matrices | |
| Args: | |
| data: Same structure as fit() | |
| start_time: Start of context window (e.g., 21 days before prediction) | |
| end_time: End of context window (prediction time) | |
| Returns: | |
| historical_features: (n_hours, 730) array of historical context | |
| future_features: (n_hours_future, 280) array of future covariates | |
| """ | |
| n_hours = len(pd.date_range(start_time, end_time, freq='H')) | |
| historical_features = np.zeros((n_hours, 730)) | |
| print("Engineering historical features...") | |
| # Filter to time window | |
| cnec_data = (data['cnecs'] | |
| .filter(pl.col('timestamp').is_between(start_time, end_time))) | |
| outage_data = (data['outages'] | |
| .filter(pl.col('timestamp').is_between(start_time, end_time))) | |
| # === TOP 50 CNECs - FULL DETAIL (250 features) === | |
| print(" - Top 50 CNECs (full detail)...") | |
| top50_data = cnec_data.filter(pl.col('cnec_id').is_in(self.top_cnecs)) | |
| top50_outages = outage_data.filter(pl.col('cnec_id').is_in(self.top_cnecs)) | |
| col_idx = 0 | |
| for cnec_id in self.top_cnecs: | |
| cnec_series = top50_data.filter(pl.col('cnec_id') == cnec_id).sort('timestamp') | |
| cnec_outage = top50_outages.filter(pl.col('cnec_id') == cnec_id).sort('timestamp') | |
| # 5 metrics per CNEC | |
| historical_features[:, col_idx] = cnec_series['ram_after'].to_numpy() | |
| historical_features[:, col_idx + 1] = (cnec_series['ram_after'] / cnec_series['fmax']).to_numpy() | |
| historical_features[:, col_idx + 2] = cnec_series['presolved'].cast(pl.Int8).to_numpy() | |
| historical_features[:, col_idx + 3] = cnec_series['shadow_price'].to_numpy() | |
| historical_features[:, col_idx + 4] = cnec_outage['outage_active'].to_numpy() | |
| col_idx += 5 | |
| # === TIER-2 150 CNECs - SELECTIVE DETAIL (300 features) === | |
| print(" - Tier-2 150 CNECs (presolved + outage)...") | |
| tier2_data = cnec_data.filter(pl.col('cnec_id').is_in(self.tier2_cnecs)) | |
| tier2_outages = outage_data.filter(pl.col('cnec_id').is_in(self.tier2_cnecs)) | |
| # Presolved flags (150 features) | |
| for cnec_id in self.tier2_cnecs: | |
| cnec_series = tier2_data.filter(pl.col('cnec_id') == cnec_id).sort('timestamp') | |
| historical_features[:, col_idx] = cnec_series['presolved'].cast(pl.Int8).to_numpy() | |
| col_idx += 1 | |
| # Outage flags (150 features) | |
| for cnec_id in self.tier2_cnecs: | |
| cnec_outage = tier2_outages.filter(pl.col('cnec_id') == cnec_id).sort('timestamp') | |
| historical_features[:, col_idx] = cnec_outage['outage_active'].to_numpy() | |
| col_idx += 1 | |
| # === TIER-2 AGGREGATED METRICS (30 features) === | |
| print(" - Tier-2 aggregated metrics by border...") | |
| borders = ['DE_FR', 'DE_NL', 'DE_CZ', 'DE_BE', 'FR_BE', 'AT_CZ', | |
| 'DE_AT', 'CZ_SK', 'AT_HU', 'PL_CZ'] | |
| for border in borders: | |
| # Get tier-2 CNECs for this border | |
| border_cnecs = [c for c in self.tier2_cnecs if border.replace('_', '') in c] | |
| if not border_cnecs: | |
| historical_features[:, col_idx:col_idx + 3] = 0 | |
| col_idx += 3 | |
| continue | |
| border_data = tier2_data.filter(pl.col('cnec_id').is_in(border_cnecs)) | |
| # Aggregate per timestamp | |
| border_agg = border_data.groupby('timestamp').agg([ | |
| pl.col('ram_after').mean().alias('avg_ram'), | |
| (pl.col('ram_after') / pl.col('fmax')).mean().alias('avg_margin'), | |
| pl.col('shadow_price').sum().alias('total_shadow_price') | |
| ]).sort('timestamp') | |
| historical_features[:, col_idx] = border_agg['avg_ram'].to_numpy() | |
| historical_features[:, col_idx + 1] = border_agg['avg_margin'].to_numpy() | |
| historical_features[:, col_idx + 2] = border_agg['total_shadow_price'].to_numpy() | |
| col_idx += 3 | |
| # === PTDF PATTERNS (10 features) === | |
| print(" - PTDF compression...") | |
| ptdf_subset = data['ptdfs'][start_time:end_time, :, :] | |
| ptdf_2d = ptdf_subset.reshape(len(ptdf_subset), -1) | |
| ptdf_scaled = self.scaler.transform(ptdf_2d) | |
| ptdf_features = self.pca_model.transform(ptdf_scaled) # (n_hours, 10) | |
| historical_features[:, col_idx:col_idx + 10] = ptdf_features | |
| col_idx += 10 | |
| # === BORDER CAPACITY HISTORICAL (20 features) === | |
| print(" - Border capacities...") | |
| capacity_historical = data['entsoe']['crossborder_flows'][start_time:end_time] | |
| historical_features[:, col_idx:col_idx + 20] = capacity_historical | |
| col_idx += 20 | |
| # === SYSTEM-LEVEL AGGREGATES (20 features) === | |
| print(" - System-level aggregates...") | |
| system_features = aggregate_ram_system_wide(cnec_data).to_numpy() | |
| historical_features[:, col_idx:col_idx + 20] = system_features | |
| col_idx += 20 | |
| # === TEMPORAL FEATURES (10 features) === | |
| print(" - Temporal features...") | |
| timestamps = pd.date_range(start_time, end_time, freq='H') | |
| temporal = create_temporal_features(timestamps) | |
| historical_features[:, col_idx:col_idx + 10] = temporal | |
| col_idx += 10 | |
| # === WEATHER FEATURES (50 features) === | |
| print(" - Weather features...") | |
| weather = data['weather'][start_time:end_time] | |
| historical_features[:, col_idx:col_idx + 50] = weather | |
| col_idx += 50 | |
| # === INTERACTION FEATURES (40 features) === | |
| print(" - Interaction features...") | |
| interactions = create_interaction_features( | |
| cnec_data, weather, timestamps | |
| ) | |
| historical_features[:, col_idx:col_idx + 40] = interactions | |
| col_idx += 40 | |
| print(f"✓ Historical features: {historical_features.shape}") | |
| assert col_idx == 730, f"Expected 730 features, got {col_idx}" | |
| # === FUTURE COVARIATES === | |
| future_features = self._create_future_covariates(data, end_time) | |
| return historical_features, future_features | |
| def _create_future_covariates(self, data: dict, prediction_start: str) -> np.ndarray: | |
| """Create future covariates for 14-day horizon""" | |
| prediction_end = pd.Timestamp(prediction_start) + pd.Timedelta(days=14) | |
| n_hours_future = 336 # 14 days Ã 24 hours | |
| future_features = np.zeros((n_hours_future, 280)) | |
| print("Engineering future covariates...") | |
| col_idx = 0 | |
| # Top 50 CNEC planned outages (50 features) | |
| for cnec_id in self.top_cnecs: | |
| planned_outages = data['outages'].filter( | |
| (pl.col('cnec_id') == cnec_id) & | |
| (pl.col('outage_type') == 'planned') & | |
| (pl.col('timestamp') >= prediction_start) & | |
| (pl.col('timestamp') < prediction_end) | |
| ).sort('timestamp') | |
| future_features[:, col_idx] = planned_outages['outage_active'].to_numpy() | |
| col_idx += 1 | |
| # Tier-2 150 CNEC planned outages (150 features) | |
| for cnec_id in self.tier2_cnecs: | |
| planned_outages = data['outages'].filter( | |
| (pl.col('cnec_id') == cnec_id) & | |
| (pl.col('outage_type') == 'planned') & | |
| (pl.col('timestamp') >= prediction_start) & | |
| (pl.col('timestamp') < prediction_end) | |
| ).sort('timestamp') | |
| future_features[:, col_idx] = planned_outages['outage_active'].to_numpy() | |
| col_idx += 1 | |
| # Weather forecasts (50 features) | |
| weather_forecast = data['weather_forecast'][prediction_start:prediction_end] | |
| future_features[:, col_idx:col_idx + 50] = weather_forecast | |
| col_idx += 50 | |
| # Temporal (10 features) | |
| future_timestamps = pd.date_range(prediction_start, prediction_end, freq='H', inclusive='left') | |
| temporal = create_temporal_features(future_timestamps) | |
| future_features[:, col_idx:col_idx + 10] = temporal | |
| col_idx += 10 | |
| # Border capacity adjustments (20 features) | |
| planned_ntc = data['entsoe']['planned_ntc'][prediction_start:prediction_end] | |
| future_features[:, col_idx:col_idx + 20] = planned_ntc | |
| col_idx += 20 | |
| print(f"✓ Future covariates: {future_features.shape}") | |
| assert col_idx == 280, f"Expected 280 future features, got {col_idx}" | |
| return future_features | |
| ``` | |
| ### 8.2 Master Feature List (Option A Architecture) | |
| **Total: ~1,060 features (730 historical + 280 future + 50 real-time aggregates)** | |
| #### **Historical Context Features (730 total)** | |
| **Category 1: Top 50 CNECs - Full Detail (250 features)** | |
| For each of the 50 most impactful CNECs: | |
| - `ram_after_cnec_[ID]`: RAM value (MW) | |
| - `margin_ratio_cnec_[ID]`: RAM / Fmax (0-1 normalized) | |
| - `presolved_cnec_[ID]`: Binding status (1=binding, 0=not binding) | |
| - `shadow_price_cnec_[ID]`: Congestion cost (€/MW) | |
| - `outage_active_cnec_[ID]`: Outage status (1=under outage, 0=operational) | |
| **Selection Criteria:** | |
| ``` | |
| Impact Score = 0.25Ãâ€â€(appearance rate) + 0.30Ãâ€â€(binding frequency) + | |
| 0.20Ãâ€â€(economic impact) + 0.15Ãâ€â€(RAM tightness) + | |
| 0.10Ãâ€â€(geographic importance) | |
| ``` | |
| **Category 2: Tier-2 150 CNECs - Selective Detail (300 features)** | |
| *Individual Binary Indicators (300 features):* | |
| - `presolved_cnec_[ID]`: Binding status for each tier-2 CNEC (150 features) | |
| - Preserves constraint activation patterns | |
| - Allows model to learn which CNECs bind under which conditions | |
| - `outage_active_cnec_[ID]`: Outage status for each tier-2 CNEC (150 features) | |
| - Preserves EIC matching benefit | |
| - Enables learning of outage → binding causality | |
| - Available as future covariate (planned outages known ahead) | |
| *Aggregated Continuous Metrics (30 features):* | |
| Grouped by border (10 borders Ã 3 metrics): | |
| - `avg_ram_[BORDER]_tier2`: Average RAM for tier-2 CNECs on this border | |
| - `avg_margin_ratio_[BORDER]_tier2`: Average margin ratio | |
| - `total_shadow_price_[BORDER]_tier2`: Sum of shadow prices | |
| Borders: DE_FR, DE_NL, DE_CZ, DE_BE, FR_BE, AT_CZ, DE_AT, CZ_SK, AT_HU, PL_CZ | |
| **Category 3: PTDF Patterns (10 features)** | |
| - `ptdf_pc1` through `ptdf_pc10`: Principal components | |
| - Compressed from (200 CNECs Ã 12 zones = 2,400 raw values) | |
| - Captures ~92% of variance in network sensitivities | |
| **Category 4: Border Capacity Historical (20 features)** | |
| - `capacity_hist_de_fr`, `capacity_hist_de_nl`, etc. | |
| - One feature per FBMC border (~20 borders) | |
| - Actual historical cross-border flow capacity from ENTSO-E | |
| **Category 5: System-Level Aggregates (20 features)** | |
| - `system_min_margin`: Tightest CNEC margin across all 200 | |
| - `n_binding_cnecs_total`: Count of all binding CNECs | |
| - `n_binding_cnecs_top50`: Count of top-50 CNECs binding | |
| - `n_binding_cnecs_tier2`: Count of tier-2 CNECs binding | |
| - `margin_std`: Standard deviation of margins (uniformity indicator) | |
| - `total_congestion_cost`: Sum of all shadow prices (€) | |
| - `max_shadow_price`: Highest shadow price (€/MW) | |
| - `avg_shadow_price_binding`: Average shadow price when CNECs bind | |
| - `total_outage_mw_fbmc`: Total transmission capacity under outage | |
| - `n_cnecs_with_outage`: Count of CNECs affected by outages | |
| - `forced_outage_count`: Count of unplanned outages | |
| - `outage_stress_index`: Composite metric (count Ã duration Ã forced_ratio) | |
| - `critical_line_outage_count`: High-voltage (380kV) outages | |
| - `outage_geographic_spread`: Number of unique borders affected | |
| - Additional 6 features: criticality scores, violation counts, etc. | |
| **Category 6: Temporal Features (10 features)** | |
| - `hour_of_day`: 0-23 | |
| - `day_of_week`: 0-6 (Monday=0) | |
| - `month`: 1-12 | |
| - `day_of_year`: 1-365 | |
| - `is_weekend`: Binary (1=Sat/Sun, 0=weekday) | |
| - `is_peak_hour`: Binary (1=8am-8pm, 0=off-peak) | |
| - `is_holiday_de`, `is_holiday_fr`, `is_holiday_be`, `is_holiday_nl`: Country-specific holidays | |
| **Category 7: Weather Features (50 features)** | |
| Key grid points (10-12 strategic locations) Ã 5 metrics: | |
| - Temperature (2m above ground, °C) | |
| - Wind speed at 10m and 100m (m/s) | |
| - Wind direction at 100m (degrees) | |
| - Solar radiation / GHI (W/m²) | |
| - Cloud cover (%) | |
| Grid points cover: German hubs, French nuclear regions, North Sea wind, Alpine corridor, Baltic connections, etc. | |
| **Category 8: Interaction Features (40 features)** | |
| Cross-feature combinations capturing domain knowledge: | |
| - `high_wind_low_margin`: (wind > 20 m/s) Ã (margin < 0.3) | |
| - `weekend_low_demand_pattern`: is_weekend Ã (demand < p30) | |
| - `outage_binding_correlation`: Rolling correlation of outage presence with binding events | |
| - `nuclear_low_wind_pattern`: (FR nuclear < 40 GW) Ã (wind < 10 m/s) | |
| - `solar_peak_congestion`: (solar > 40 GW) Ã (hour = 12-14) | |
| - Various other combinations (weather Ã capacity, temporal Ã binding, etc.) | |
| #### **Future Covariates (280 features)** | |
| **Category 9: Top 50 CNEC Planned Outages (50 features)** | |
| - `planned_outage_cnec_[ID]_d[horizon]`: Binary indicator for D+1 to D+14 | |
| - Known with certainty (scheduled maintenance published by TSOs) | |
| - From ENTSO-E A78 document type, filtered to status='scheduled' | |
| **Category 10: Tier-2 150 CNEC Planned Outages (150 features)** | |
| - `planned_outage_cnec_[ID]_d[horizon]`: Binary for each tier-2 CNEC | |
| - Preserves full EIC matching benefit for future horizon | |
| - High-value features (outages directly constrain capacity) | |
| **Category 11: Weather Forecasts (50 features)** | |
| - Same structure as historical weather (10-12 grid points Ã 5 metrics) | |
| - OpenMeteo provides 14-day forecasts (updated daily) | |
| - Includes: temperature, wind, solar radiation, cloud cover | |
| **Category 12: Future Temporal (10 features)** | |
| - Same temporal features projected for D+1 to D+14 | |
| - Known with certainty (calendar, holidays, hour/day patterns) | |
| **Category 13: Border Capacity Adjustments (20 features)** | |
| - `planned_ntc_[BORDER]_d1`: Day-ahead NTC publication per border | |
| - TSOs publish Net Transfer Capacity (NTC) values D-1 for D-day | |
| - Available via ENTSO-E Transparency Platform | |
| - Provides official capacity forecasts (complementary to our model) | |
| --- | |
| ### Feature Engineering Sequence | |
| **Step 1: Load Raw Data (Day 1)** | |
| - JAO CNECs, PTDFs, RAMs, shadow prices (24 months) | |
| - ENTSO-E outages via EIC matching | |
| - ENTSO-E actual generation, load, cross-border flows | |
| - OpenMeteo weather (historical + forecasts) | |
| **Step 2: Preprocess (Day 1-2)** | |
| - Clean missing values (field-specific strategies) | |
| - Detect and clip outliers | |
| - Align timestamps (CET/CEST → UTC) | |
| - Deduplicate records | |
| **Step 3: CNEC Selection (Day 2)** | |
| - Analyze all ~2,000 CNECs over 24 months | |
| - Rank by impact score | |
| - Select top 200 (50 + 150) | |
| **Step 4: Feature Engineering (Day 2)** | |
| - Top 50: Extract 5 metrics each | |
| - Tier-2: Extract presolved + outage_active individually | |
| - Tier-2: Aggregate RAM/shadow prices by border | |
| - PTDF: Extract individual features for top 50, aggregate for tier-2 | |
| - Create temporal, weather, interaction features | |
| **Step 5: Quality Checks (Day 2)** | |
| - Validate feature distributions | |
| - Check for NaN/Inf values | |
| - Verify feature variance (no zero-variance) | |
| - Confirm temporal alignment | |
| **Step 6: Save Feature Matrix (Day 2)** | |
| ``` | |
| data/processed/ | |
| ââ€Âωâ€Âۉâ€Â€ features_historical_730.parquet # (17520 hours Ã 730 features) | |
| ââ€Âωâ€Âۉâ€Â€ features_future_280.parquet # (17520 hours Ã 280 future covariates) | |
| ââ€Âωâ€Âۉâ€Â€ feature_names.json # Column name mapping | |
| ââ€Ââ€Âââ€Âۉâ€Â€ feature_metadata.json # Descriptions, ranges, types | |
| ``` | |
| --- | |
| ## 9. Quality Assurance | |
| ### 9.1 Data Validation Checks | |
| ```python | |
| def validate_jao_data(df: pl.DataFrame) -> dict: | |
| """ | |
| Run comprehensive validation checks on JAO data | |
| Returns: Dict of validation results | |
| """ | |
| results = { | |
| 'passed': True, | |
| 'warnings': [], | |
| 'errors': [] | |
| } | |
| # 1. Check for critical missing values | |
| critical_fields = ['timestamp', 'cnec_id', 'fmax', 'ram_after'] | |
| for field in critical_fields: | |
| missing_pct = df[field].null_count() / len(df) * 100 | |
| if missing_pct > 0: | |
| results['errors'].append( | |
| f"Critical field '{field}' has {missing_pct:.2f}% missing values" | |
| ) | |
| results['passed'] = False | |
| # 2. Check timestamp continuity (should be hourly) | |
| time_diffs = df['timestamp'].diff().dt.total_hours() | |
| non_hourly = (time_diffs != 1).sum() | |
| if non_hourly > 0: | |
| results['warnings'].append( | |
| f"Found {non_hourly} non-hourly gaps in timestamps" | |
| ) | |
| # 3. Check RAM physical constraints | |
| invalid_ram = (df['ram_after'] > df['fmax']).sum() | |
| if invalid_ram > 0: | |
| results['errors'].append( | |
| f"Found {invalid_ram} records where RAM > Fmax (physically impossible)" | |
| ) | |
| results['passed'] = False | |
| negative_ram = (df['ram_after'] < 0).sum() | |
| if negative_ram > 0: | |
| results['warnings'].append( | |
| f"Found {negative_ram} records with negative RAM (clipped to 0)" | |
| ) | |
| # 4. Check PTDF bounds | |
| if 'ptdf_value' in df.columns: | |
| out_of_bounds = ((df['ptdf_value'] < -1.5) | (df['ptdf_value'] > 1.5)).sum() | |
| if out_of_bounds > 0: | |
| results['warnings'].append( | |
| f"Found {out_of_bounds} PTDF values outside [-1.5, 1.5] (clipped)" | |
| ) | |
| # 5. Check shadow price reasonableness | |
| extreme_shadows = (df['shadow_price'] > 1000).sum() | |
| if extreme_shadows > 0: | |
| results['warnings'].append( | |
| f"Found {extreme_shadows} shadow prices > €1000/MW (check for outliers)" | |
| ) | |
| # 6. Check for duplicate records | |
| duplicates = df.select(['timestamp', 'cnec_id']).is_duplicated().sum() | |
| if duplicates > 0: | |
| results['errors'].append( | |
| f"Found {duplicates} duplicate (timestamp, cnec_id) pairs" | |
| ) | |
| results['passed'] = False | |
| # 7. Check data completeness by month | |
| monthly_counts = df.groupby(df['timestamp'].dt.month()).count() | |
| expected_hours_per_month = {1: 744, 2: 672, 3: 744, ...} # Account for leap years | |
| for month, count in monthly_counts.items(): | |
| expected = expected_hours_per_month.get(month, 720) | |
| if count < expected * 0.95: # Allow 5% missing | |
| results['warnings'].append( | |
| f"Month {month} has only {count}/{expected} expected hours" | |
| ) | |
| return results | |
| ``` | |
| ### 9.2 Feature Validation | |
| ```python | |
| def validate_features(features: np.ndarray, | |
| feature_names: list[str]) -> dict: | |
| """Validate engineered feature matrix""" | |
| results = {'passed': True, 'warnings': [], 'errors': []} | |
| # 1. Check for NaN/Inf | |
| nan_cols = np.isnan(features).any(axis=0) | |
| if nan_cols.any(): | |
| nan_features = [feature_names[i] for i, is_nan in enumerate(nan_cols) if is_nan] | |
| results['errors'].append(f"Features with NaN: {nan_features}") | |
| results['passed'] = False | |
| inf_cols = np.isinf(features).any(axis=0) | |
| if inf_cols.any(): | |
| inf_features = [feature_names[i] for i, is_inf in enumerate(inf_cols) if is_inf] | |
| results['errors'].append(f"Features with Inf: {inf_features}") | |
| results['passed'] = False | |
| # 2. Check feature variance (avoid zero-variance features) | |
| variances = np.var(features, axis=0) | |
| zero_var = variances < 1e-8 | |
| if zero_var.any(): | |
| zero_var_features = [feature_names[i] for i, is_zero in enumerate(zero_var) if is_zero] | |
| results['warnings'].append( | |
| f"Features with near-zero variance: {zero_var_features}" | |
| ) | |
| # 3. Check for features outside expected ranges | |
| for i, fname in enumerate(feature_names): | |
| if 'margin_ratio' in fname or 'percentile' in fname: | |
| # Should be in [0, 1] | |
| if (features[:, i] < -0.1).any() or (features[:, i] > 1.1).any(): | |
| results['warnings'].append( | |
| f"Feature {fname} has values outside [0, 1]" | |
| ) | |
| return results | |
| ``` | |
| ### 9.3 Automated Testing | |
| ```python | |
| def run_jao_data_pipeline_tests(): | |
| """Comprehensive test suite for JAO data pipeline""" | |
| import pytest | |
| class TestJAOPipeline: | |
| def test_data_download(self): | |
| """Test JAOPuTo download completes successfully""" | |
| # Mock test or actual small date range | |
| pass | |
| def test_cnec_masking(self): | |
| """Test CNEC masking creates correct structure""" | |
| # Create synthetic data with missing CNECs | |
| # Verify mask=0 for missing, mask=1 for present | |
| pass | |
| def test_ptdf_pca(self): | |
| """Test PTDF dimensionality reduction""" | |
| # Verify variance explained >90% | |
| # Verify output shape (n_hours, 10) | |
| pass | |
| def test_ram_normalization(self): | |
| """Test RAM normalization doesn't exceed bounds""" | |
| # Verify 0 <= margin_ratio <= 1 | |
| pass | |
| def test_feature_engineering_shape(self): | |
| """Test complete feature matrix has correct shape""" | |
| # Verify (512, 70) for historical context | |
| pass | |
| def test_no_data_leakage(self): | |
| """Verify no future data leaks into historical features""" | |
| # Check that features at time T only use data up to time T | |
| pass | |
| pytest.main([__file__]) | |
| ``` | |
| --- | |
| ## Summary: Data Collection Checklist | |
| ### Day 1 Data Collection (9.5 hours - includes outage integration) | |
| **Morning (4.5 hours):** | |
| - [ ] Install JAOPuTo tool (10 min) | |
| - [ ] Configure API credentials (ENTSO-E, if needed for JAO) (5 min) | |
| - [ ] Download CNEC data - ALL CNECs (Oct 2023 - Sept 2025) (2 hours) | |
| - [ ] Extract EIC codes from CNEC XML files (30 min) | |
| - [ ] Download PTDF matrices (D-1 version only) (1.5 hours) | |
| - [ ] Initial data validation checks (15 min) | |
| **Afternoon (5 hours):** | |
| - [ ] Download RAM values (1.5 hours) | |
| - [ ] Download shadow prices (1 hour) | |
| - [ ] Download presolved flags (included with CNECs) | |
| - [ ] Collect ENTSO-E outage data via API (A78 document type) (20 min) | |
| - [ ] Execute EIC-based CNEC-to-outage matching (exact + fuzzy) (15 min) | |
| - [ ] Generate outage time series for all downloaded CNECs (15 min) | |
| - [ ] Run cleaning procedures on all datasets (1 hour) | |
| - [ ] Post-hoc analysis: Identify top 200 CNECs (50 + 150) (45 min) | |
| - [ ] Create CNEC master set with masking (30 min) | |
| - [ ] Engineer hybrid PTDF features (individual + aggregated) (20 min) | |
| **End of Day 1 Deliverables:** | |
| ``` | |
| data/raw/ | |
| ââ€Âωâ€Âۉâ€Â€ jao/ | |
| ââ€Â‚ ââ€Âωâ€Âۉâ€Â€ cnecs_all_2024_2025.parquet (~2-3 GB, all ~2000 CNECs) | |
| ââ€Â‚ ââ€Âωâ€Âۉâ€Â€ ptdfs_2024_2025.parquet (~800 MB, D-1 version) | |
| ââ€Â‚ ââ€Âωâ€Âۉâ€Â€ rams_2024_2025.parquet (~400 MB) | |
| ââ€Â‚ ââ€Ââ€Âââ€Âۉâ€Â€ shadow_prices_2024_2025.parquet (~300 MB) | |
| ââ€Âωâ€Âۉâ€Â€ entsoe/ | |
| ââ€Â‚ ââ€Ââ€Âââ€Âۉâ€Â€ outages_12m.parquet (~150 MB, A78 transmission outages) | |
| data/processed/ | |
| ââ€Âωâ€Âۉâ€Â€ cnec_eic_lookup.parquet (~5 MB, all CNECs with EIC codes) | |
| ââ€Âωâ€Âۉâ€Â€ cnec_outage_matched.parquet (~5 MB, CNEC→outage EIC mapping) | |
| ââ€Âωâ€Âۉâ€Â€ outage_time_series_all_cnecs.parquet (~200 MB, hourly outage indicators) | |
| ââ€Âωâ€Âۉâ€Â€ top_200_cnecs.json (List of selected CNECs: 50 + 150) | |
| ââ€Âωâ€Âۉâ€Â€ cnec_impact_scores.parquet (Ranking criteria for all CNECs) | |
| ââ€Âωâ€Âۉâ€Â€ jao_cnecs_cleaned.parquet (~500 MB, top 200 CNECs only) | |
| ââ€Âωâ€Âۉâ€Â€ jao_ptdfs_compressed.parquet (~50 MB, 10 PCA components) | |
| ââ€Âωâ€Âۉâ€Â€ jao_rams_normalized.parquet (~80 MB, top 200 CNECs only) | |
| ââ€Âωâ€Âۉâ€Â€ jao_shadow_prices_cleaned.parquet (~60 MB, top 200 CNECs only) | |
| ââ€Âωâ€Âۉâ€Â€ cnec_master_set_200.parquet (Complete time series, with masking) | |
| ââ€Ââ€Âââ€Âۉâ€Â€ ptdf_pca_model.pkl (Fitted PCA for future transforms) | |
| reports/ | |
| ââ€Âωâ€Âۉâ€Â€ data_quality_report.json (Validation results) | |
| ââ€Âωâ€Âۉâ€Â€ eic_matching_report.json (Match rates, fuzzy stats) | |
| ââ€Ââ€Âââ€Âۉâ€Â€ cnec_selection_analysis.html (Top 200 selection justification) | |
| ``` | |
| **Key Metrics to Report:** | |
| - Total unique CNECs downloaded: ~2,000 | |
| - Top 200 selected: 50 (full detail) + 150 (selective detail) | |
| - EIC exact match rate: ~85-95% | |
| - EIC total match rate (with fuzzy): >95% | |
| - Data completeness: >98% for critical fields | |
| - Outliers detected and cleaned: <1% of records | |
| --- | |
| ## Next Steps | |
| **Day 2:** Combine JAO features with: | |
| - ENTSO-E actual generation/load/cross-border flow data | |
| - OpenMeteo weather forecasts (52 grid points) | |
| - Create complete feature matrix: | |
| - 730 historical context features | |
| - 280 future covariate features | |
| - 50 real-time system aggregates | |
| **Day 3:** Feature validation and zero-shot inference preparation | |
| **Day 4:** Zero-shot inference with Chronos 2 (multivariate forecasting) | |
| **Day 5:** Evaluation, documentation, and handover | |
| --- | |
| ## 10. ENTSO-E Outage Data via EIC Matching | |
| ### 10.1 EIC Code System Overview | |
| **What are EIC Codes?** | |
| - **Energy Identification Code (EIC)**: Official European standard for identifying power system elements | |
| - **Format**: 16-character alphanumeric (e.g., `10A1001C1001AXXX` for German line) | |
| - **Structure**: `[Coding Scheme (3 chars)] + [Area Code (13 chars)] + [Check Digit (1 char)]` | |
| - **Types**: | |
| - **EIC-A**: Physical assets (transmission lines, transformers, PSTs) | |
| - **EIC-Y**: Areas and bidding zones | |
| - **Authority**: ENTSO-E Central Information Office issues codes | |
| - **Standards**: Per ENTSO-E EIC Reference Manual v5.5, JAO Handbook v2.2 | |
| **Why EIC Matching Works:** | |
| - Both JAO CNECs and ENTSO-E outages use EIC codes as primary identifiers | |
| - JAO Handbook v2.2 (pp. 15-18): CNECs contain `<EIC_Code>` field | |
| - ENTSO-E Transparency Platform: Outages include "Outage Equipment EIC" | |
| - 85-95% exact match rate (per ENTSO-E implementation guides) | |
| - Superior to name-based matching (avoids TSO naming variations) | |
| ### 10.2 JAO CNEC EIC Extraction | |
| **Data Source:** `Core_DA_CC_CNEC_[date].xml` (JAO daily publication) | |
| **Fields to Extract:** | |
| ```xml | |
| <CNEC> | |
| <CNEC_ID>DE_CZ_TIE_001</CNEC_ID> | |
| <EIC_Code>10A1001C1001AXXX</EIC_Code> <!-- Primary identifier --> | |
| <BranchEIC>10T-DE-XXXXXXX</BranchEIC> <!-- For multi-branch --> | |
| <NamePerConvention>Line Röhrsdorf-Hradec 380kV N-1</NamePerConvention> | |
| <VoltageLevel>380</VoltageLevel> | |
| <TSO>50Hertz</TSO> | |
| <FromSubstation> | |
| <Name>Röhrsdorf</Name> | |
| <EIC>10YDE-XXXXXXX</EIC> | |
| </FromSubstation> | |
| <ToSubstation> | |
| <Name>Hradec</Name> | |
| <EIC>10YCZ-XXXXXXX</EIC> | |
| </ToSubstation> | |
| </CNEC> | |
| ``` | |
| **Preprocessing Script:** | |
| ```python | |
| import xmltodict | |
| import polars as pl | |
| from pathlib import Path | |
| def extract_cnec_eic_codes(jao_data_dir: Path) -> pl.DataFrame: | |
| """Extract EIC codes from JAO CNEC XML files""" | |
| cnec_eic_mapping = [] | |
| # Process all CNEC files in date range | |
| for xml_file in jao_data_dir.glob('Core_DA_CC_CNEC_*.xml'): | |
| with open(xml_file, 'r') as f: | |
| data = xmltodict.parse(f.read()) | |
| for cnec in data['CNECs']['CNEC']: | |
| cnec_eic_mapping.append({ | |
| 'cnec_id': cnec['@id'], | |
| 'eic_code': cnec.get('EIC_Code'), | |
| 'branch_eic': cnec.get('BranchEIC'), | |
| 'line_name': cnec.get('NamePerConvention'), | |
| 'voltage_level': int(cnec.get('VoltageLevel', 0)), | |
| 'tso': cnec.get('TSO'), | |
| 'from_substation': cnec['FromSubstation']['Name'], | |
| 'to_substation': cnec['ToSubstation']['Name'], | |
| 'from_substation_eic': cnec['FromSubstation'].get('EIC'), | |
| 'to_substation_eic': cnec['ToSubstation'].get('EIC'), | |
| 'contingency': cnec.get('Contingency', {}).get('Name') | |
| }) | |
| # Create lookup table (deduplicate CNECs) | |
| cnec_eic_df = pl.DataFrame(cnec_eic_mapping).unique(subset=['cnec_id']) | |
| return cnec_eic_df | |
| # Execute | |
| cnec_eic_lookup = extract_cnec_eic_codes(Path('data/raw/jao/')) | |
| cnec_eic_lookup.write_parquet('data/processed/cnec_eic_lookup.parquet') | |
| print(f"Extracted EIC codes for {len(cnec_eic_lookup)} unique CNECs") | |
| ``` | |
| ### 10.3 ENTSO-E Outage Data Collection | |
| **Data Source:** ENTSO-E Transparency Platform API | |
| - **Document Type:** A78 - "Unavailability of Transmission Infrastructure" | |
| - **Coverage:** Planned and forced outages for transmission elements (lines, transformers, HVDC) | |
| - **Frequency:** Hourly updates, historical data back to 2015 | |
| - **Domain:** Core CCR region (EIC: `10Y1001A1001A83F`) | |
| **API Collection:** | |
| ```python | |
| from entsoe import EntsoePandasClient | |
| import polars as pl | |
| import pandas as pd | |
| def collect_entsoe_outages(api_key: str, start_date: str, end_date: str) -> pl.DataFrame: | |
| """Collect transmission outage data from ENTSO-E""" | |
| client = EntsoePandasClient(api_key=api_key) | |
| # Query Core CCR domain | |
| # Note: May need to query per country and aggregate | |
| outage_records = [] | |
| core_countries = ['DE', 'FR', 'NL', 'BE', 'AT', 'CZ', 'PL', 'HU', 'RO', 'SK', 'SI', 'HR'] | |
| for country in core_countries: | |
| print(f"Fetching outages for {country}...") | |
| try: | |
| # Query unavailability (A78 document type) | |
| outages = client.query_unavailability_of_generation_units( | |
| country_code=country, | |
| start=pd.Timestamp(start_date, tz='UTC'), | |
| end=pd.Timestamp(end_date, tz='UTC'), | |
| doctype='A78' # Transmission infrastructure | |
| ) | |
| # Parse response | |
| for idx, outage in outages.iterrows(): | |
| outage_records.append({ | |
| 'outage_eic': outage.get('affected_unit_eic'), | |
| 'line_name': outage.get('affected_unit_name'), | |
| 'voltage_kv': outage.get('nominal_power'), # For lines, often voltage | |
| 'tso': outage.get('tso'), | |
| 'country': country, | |
| 'outage_type': outage.get('type'), # 'A53' (planned) or 'A54' (forced) | |
| 'start_time': outage.get('start'), | |
| 'end_time': outage.get('end'), | |
| 'available_capacity_mw': outage.get('available_capacity'), | |
| 'unavailable_capacity_mw': outage.get('unavailable_capacity'), | |
| 'status': outage.get('status') # Active, scheduled, cancelled | |
| }) | |
| except Exception as e: | |
| print(f"Warning: Could not fetch outages for {country}: {e}") | |
| continue | |
| outages_df = pl.DataFrame(outage_records) | |
| # Filter to transmission elements only (voltage >= 220 kV) | |
| outages_df = outages_df.filter( | |
| (pl.col('voltage_kv') >= 220) | pl.col('voltage_kv').is_null() | |
| ) | |
| return outages_df | |
| # Execute | |
| outages = collect_entsoe_outages( | |
| api_key='YOUR_ENTSOE_KEY', | |
| start_date='2023-10-01', | |
| end_date='2025-09-30' | |
| ) | |
| outages.write_parquet('data/raw/entsoe_outages_12m.parquet') | |
| print(f"Collected {len(outages)} outage records") | |
| ``` | |
| ### 10.4 EIC-Based Matching (Primary + Fuzzy Fallback) | |
| **Matching Strategy:** | |
| 1. **Primary:** Exact EIC code matching (85-95% success rate) | |
| 2. **Fallback:** Fuzzy matching on line name + substations + voltage (5-15% remaining) | |
| 3. **Result:** >95% total match rate | |
| ```python | |
| import polars as pl | |
| from fuzzywuzzy import fuzz, process | |
| def match_cnecs_to_outages(cnec_eic: pl.DataFrame, | |
| outages: pl.DataFrame) -> pl.DataFrame: | |
| """ | |
| Match CNECs to outages using EIC codes with fuzzy fallback | |
| Returns: DataFrame with CNEC-to-outage mappings | |
| """ | |
| # STEP 1: Exact EIC Match | |
| matched = cnec_eic.join( | |
| outages.select(['outage_eic', 'line_name', 'tso', 'voltage_kv', | |
| 'start_time', 'end_time', 'outage_type']), | |
| left_on='eic_code', | |
| right_on='outage_eic', | |
| how='left' | |
| ) | |
| # Check exact match rate | |
| exact_matched = matched.filter(pl.col('outage_eic').is_not_null()) | |
| exact_match_rate = len(exact_matched) / len(matched) | |
| print(f"Exact EIC match rate: {exact_match_rate:.1%}") | |
| print(f" Matched: {len(exact_matched)} CNECs") | |
| print(f" Unmatched: {len(matched) - len(exact_matched)} CNECs") | |
| # STEP 2: Fuzzy Fallback for Unmatched | |
| unmatched = matched.filter(pl.col('outage_eic').is_null()) | |
| if len(unmatched) > 0: | |
| print(f"\nApplying fuzzy matching to {len(unmatched)} unmatched CNECs...") | |
| # Prepare search corpus from outages | |
| outage_search_corpus = [] | |
| for row in outages.iter_rows(named=True): | |
| search_str = f"{row['line_name']} {row['tso']} {row['voltage_kv']}kV" | |
| outage_search_corpus.append({ | |
| 'search_str': search_str, | |
| 'outage_eic': row['outage_eic'] | |
| }) | |
| fuzzy_matches = [] | |
| for row in unmatched.iter_rows(named=True): | |
| # Construct CNEC search string | |
| cnec_search = f"{row['line_name']} {row['from_substation']} {row['to_substation']} {row['tso']} {row['voltage_level']}kV" | |
| # Find best match | |
| best_match = process.extractOne( | |
| cnec_search, | |
| [item['search_str'] for item in outage_search_corpus], | |
| scorer=fuzz.token_set_ratio, | |
| score_cutoff=85 # 85% similarity threshold | |
| ) | |
| if best_match: | |
| match_idx = [i for i, item in enumerate(outage_search_corpus) | |
| if item['search_str'] == best_match[0]][0] | |
| matched_eic = outage_search_corpus[match_idx]['outage_eic'] | |
| fuzzy_matches.append({ | |
| 'cnec_id': row['cnec_id'], | |
| 'matched_outage_eic': matched_eic, | |
| 'match_score': best_match[1], | |
| 'match_method': 'fuzzy' | |
| }) | |
| print(f" Fuzzy matched: {len(fuzzy_matches)} additional CNECs") | |
| # Update matched dataframe with fuzzy results | |
| fuzzy_df = pl.DataFrame(fuzzy_matches) | |
| matched = matched.join( | |
| fuzzy_df, | |
| on='cnec_id', | |
| how='left' | |
| ).with_columns([ | |
| pl.when(pl.col('outage_eic').is_null()) | |
| .then(pl.col('matched_outage_eic')) | |
| .otherwise(pl.col('outage_eic')) | |
| .alias('outage_eic') | |
| ]) | |
| # Final match statistics | |
| final_matched = matched.filter(pl.col('outage_eic').is_not_null()) | |
| final_match_rate = len(final_matched) / len(matched) | |
| print(f"\nFinal match rate: {final_match_rate:.1%}") | |
| print(f" Total matched: {len(final_matched)} CNECs") | |
| print(f" Unmatched: {len(matched) - len(final_matched)} CNECs") | |
| return matched | |
| # Execute matching | |
| cnec_eic = pl.read_parquet('data/processed/cnec_eic_lookup.parquet') | |
| outages = pl.read_parquet('data/raw/entsoe_outages_12m.parquet') | |
| matched_cnecs = match_cnecs_to_outages(cnec_eic, outages) | |
| matched_cnecs.write_parquet('data/processed/cnec_outage_matched.parquet') | |
| ``` | |
| **Expected Output:** | |
| ``` | |
| Exact EIC match rate: 87.5% | |
| Matched: 175 CNECs | |
| Unmatched: 25 CNECs | |
| Applying fuzzy matching to 25 unmatched CNECs... | |
| Fuzzy matched: 18 additional CNECs | |
| Final match rate: 96.5% | |
| Total matched: 193 CNECs | |
| Unmatched: 7 CNECs | |
| ``` | |
| ### 10.5 Outage Feature Engineering | |
| **Outage Time Series Generation:** | |
| ```python | |
| def create_outage_time_series(matched_cnecs: pl.DataFrame, | |
| outages: pl.DataFrame, | |
| timestamps: pl.Series) -> pl.DataFrame: | |
| """ | |
| Create binary outage indicators for each CNEC over time | |
| For each timestamp and CNEC: | |
| - outage_active = 1 if element under outage | |
| - outage_active = 0 if element operational | |
| - outage_type = 'planned' or 'forced' if active | |
| Returns: (n_timestamps Ã n_cnecs) DataFrame | |
| """ | |
| outage_series = [] | |
| for cnec in matched_cnecs.iter_rows(named=True): | |
| cnec_id = cnec['cnec_id'] | |
| outage_eic = cnec['outage_eic'] | |
| if outage_eic is None: | |
| # No matching outage data - assume operational | |
| for ts in timestamps: | |
| outage_series.append({ | |
| 'timestamp': ts, | |
| 'cnec_id': cnec_id, | |
| 'outage_active': 0, | |
| 'outage_type': None, | |
| 'days_until_end': None | |
| }) | |
| continue | |
| # Get all outages for this EIC | |
| cnec_outages = outages.filter(pl.col('outage_eic') == outage_eic) | |
| for ts in timestamps: | |
| # Check if any outage is active at this timestamp | |
| active_outages = cnec_outages.filter( | |
| (pl.col('start_time') <= ts) & (pl.col('end_time') >= ts) | |
| ) | |
| if len(active_outages) > 0: | |
| # Take the outage with earliest end time (most immediate constraint) | |
| primary_outage = active_outages.sort('end_time').head(1) | |
| outage_row = primary_outage.row(0, named=True) | |
| days_remaining = (outage_row['end_time'] - ts).total_seconds() / 86400 | |
| outage_series.append({ | |
| 'timestamp': ts, | |
| 'cnec_id': cnec_id, | |
| 'outage_active': 1, | |
| 'outage_type': outage_row['outage_type'], | |
| 'days_until_end': days_remaining, | |
| 'unavailable_capacity_mw': outage_row.get('unavailable_capacity_mw') | |
| }) | |
| else: | |
| # No active outage | |
| outage_series.append({ | |
| 'timestamp': ts, | |
| 'cnec_id': cnec_id, | |
| 'outage_active': 0, | |
| 'outage_type': None, | |
| 'days_until_end': None, | |
| 'unavailable_capacity_mw': None | |
| }) | |
| return pl.DataFrame(outage_series) | |
| # Generate outage time series for all 200 CNECs | |
| timestamps = pl.date_range( | |
| start='2023-10-01', | |
| end='2025-09-30', | |
| interval='1h' | |
| ) | |
| outage_features = create_outage_time_series( | |
| matched_cnecs=pl.read_parquet('data/processed/cnec_outage_matched.parquet'), | |
| outages=pl.read_parquet('data/raw/entsoe_outages_12m.parquet'), | |
| timestamps=timestamps | |
| ) | |
| outage_features.write_parquet('data/processed/outage_time_series_200cnecs.parquet') | |
| print(f"Generated outage features: {outage_features.shape}") | |
| ``` | |
| **Outage Feature Categories:** | |
| **1. CNEC-Level Binary Indicators (200 features for all CNECs):** | |
| - `outage_active_cnec_001` through `outage_active_cnec_200` | |
| - Value: 1 if element under outage, 0 if operational | |
| - Available as both historical context and future covariates (planned outages) | |
| **2. Border-Level Aggregations (20 features):** | |
| - `outage_count_de_cz`: Number of CNECs on DE-CZ border with active outages | |
| - `outage_mw_de_cz`: Total unavailable capacity on DE-CZ border | |
| - `forced_outage_ratio_de_cz`: Ratio of forced vs planned outages | |
| - Repeat for 10 major borders Ã 2 metrics = 20 features | |
| **3. System-Level Aggregations (8 features):** | |
| - `total_outage_mw_fbmc`: Total transmission capacity unavailable | |
| - `n_cnecs_with_outage`: Count of CNECs affected by outages | |
| - `forced_outage_count`: Count of unplanned outages (stress indicator) | |
| - `max_outage_duration_remaining`: Days until longest outage ends | |
| - `avg_outage_duration`: Average remaining outage duration | |
| - `outage_stress_index`: Weighted measure (count Ã avg_duration Ã forced_ratio) | |
| - `critical_line_outage_count`: Count of high-voltage (380kV) outages | |
| - `outage_geographic_spread`: Number of unique borders affected | |
| **4. Outage Duration Features (Top 50 CNECs - 150 features):** | |
| For each of the Top 50 CNECs, calculate three temporal features: | |
| - `outage_elapsed_cnec_[ID]`: Hours elapsed since outage started | |
| - Calculation: `(current_timestamp - start_time).total_hours()` | |
| - Value: 0 if no active outage | |
| - `outage_remaining_cnec_[ID]`: Hours remaining until outage ends | |
| - Calculation: `(end_time - current_timestamp).total_hours()` | |
| - Value: 0 if no active outage | |
| - `outage_total_duration_cnec_[ID]`: Total planned duration of active outage | |
| - Calculation: `(end_time - start_time).total_hours()` | |
| - Value: 0 if no active outage | |
| **Rationale:** | |
| - Binary `outage_active` doesn't capture temporal stress accumulation | |
| - A 2-hour outage has different impact than 48-hour outage | |
| - Enables model to learn: "When outage_elapsed > 24h, constraint severity increases" | |
| - Direct causal chain: duration → network stress → RAM reduction → Max BEX impact | |
| - Research indicates ~15% of BEX anomalies correlate with extended outage durations | |
| **5. Outage Duration Aggregates (Tier-2 150 CNECs - 30 features):** | |
| For tier-2 CNECs, aggregate duration metrics by border: | |
| - `avg_outage_duration_[BORDER]_tier2`: Average duration of active outages (hours) | |
| - `max_outage_duration_[BORDER]_tier2`: Longest active outage on border (hours) | |
| - `forced_outage_duration_ratio_[BORDER]_tier2`: (forced_duration / total_duration) | |
| - 10 major borders Ãâ€" 3 metrics = 30 features | |
| **Future Covariates (Planned Outages):** | |
| - Same 200 CNEC-level binary indicators for D+1 to D+14 horizon | |
| - Only includes planned/scheduled outages (status = 'scheduled' in ENTSO-E) | |
| - These are KNOWN with certainty (gold for forecasting) | |
| ### 10.6 Integration with Main Pipeline | |
| **Day 1 Timeline Addition:** | |
| - **Morning (+30 min):** Extract EIC codes from JAO CNEC files | |
| - **Morning (+20 min):** Collect ENTSO-E outage data via API | |
| - **Afternoon (+15 min):** Execute EIC-based matching (exact + fuzzy) | |
| - **Afternoon (+15 min):** Generate outage time series for 200 CNECs | |
| **Total Additional Time: ~1.5 hours** | |
| **Data Flow:** | |
| ``` | |
| JAO CNEC XML → Extract EIC codes → cnec_eic_lookup.parquet | |
| ↓ | |
| ENTSO-E API → Outage data → entsoe_outages_12m.parquet | |
| ↓ | |
| EIC Matching (exact + fuzzy) | |
| ↓ | |
| cnec_outage_matched.parquet | |
| ↓ | |
| Generate time series for each CNEC | |
| ↓ | |
| outage_time_series_200cnecs.parquet | |
| ↓ | |
| Feature Engineering Pipeline (Day 2) | |
| ``` | |
| --- | |
| ## 11. Final Feature Architecture (Option A with Tier-2 Binding) | |
| ### 11.1 Feature Count Summary | |
| **Total Features: ~1,735** (Hybrid PTDF + Max BEX + LTN + Net Positions + ATC + Outage Duration) | |
| - Historical Context: ~1,000 features | |
| - Future Covariates: ~380 features | |
| **New Feature Categories Added:** | |
| - Target History (Max BEX): +20 features (historical context) | |
| - LTN Allocations: +40 features (20 historical + 20 future covariates) | |
| - Net Position Features: +48 features (24 min + 24 max values) | |
| - Non-Core ATC: +28 features (14 borders × 2 directions) | |
| ### 11.2 Historical Context Features (1,000 features) | |
| #### **Top 50 CNECs - Full Detail + Individual PTDFs (1,000 features)** | |
| For each of the 50 most impactful CNECs, 8 core metrics + 12 PTDF sensitivities: | |
| - `ram_after_cnec_[ID]`: RAM value (MW) | |
| - `margin_ratio_cnec_[ID]`: RAM / Fmax (normalized) | |
| - `presolved_cnec_[ID]`: Binding status (1 = binding, 0 = not binding) | |
| - `shadow_price_cnec_[ID]`: Congestion cost (€/MW) | |
| - `outage_active_cnec_[ID]`: Outage status (1 = element under outage, 0 = operational) | |
| - `outage_elapsed_cnec_[ID]`: Hours elapsed since outage started (0 if no outage) | |
| - `outage_remaining_cnec_[ID]`: Hours remaining until outage ends (0 if no outage) | |
| - `outage_total_duration_cnec_[ID]`: Total planned duration of active outage (0 if no outage) | |
| **Total Top 50 CNEC Features: 50 CNECs Ãâ€" (8 core metrics + 12 PTDF sensitivities) = 1,000 features** | |
| **Selection Criteria for Top 50:** | |
| ```python | |
| top_50_score = ( | |
| 0.25 Ã (days_appeared / 365) + # Consistency | |
| 0.30 Ã (times_binding / times_appeared) + # Binding frequency | |
| 0.20 Ã (avg_shadow_price / 100) + # Economic impact | |
| 0.15 Ã (hours_ram_low / total_hours) + # Operational tightness | |
| 0.10 Ã geographic_importance # Border coverage | |
| ) | |
| ``` | |
| #### **Tier-2 150 CNECs - Selective Detail (300 features)** | |
| **Individual Binary Indicators (300 features):** | |
| - `presolved_cnec_[ID]`: Binding status for each of 150 CNECs (150 features) | |
| - Preserves constraint activation patterns | |
| - Model learns which tier-2 CNECs bind under which conditions | |
| - `outage_active_cnec_[ID]`: Outage status for each of 150 CNECs (150 features) | |
| - Preserves EIC matching benefit | |
| - Future covariates: planned outages known ahead | |
| - Model learns outage → binding relationships | |
| **Rationale for Preserving These Two:** | |
| - Both are **discrete/binary** (low redundancy across CNECs) | |
| - Both are **high-signal**: | |
| - `presolved`: Indicates current constraint state | |
| - `outage_active`: Predicts future constraint likelihood | |
| - Allows learning **cross-CNEC interactions**: "When CNEC_X has outage, CNEC_Y binds" | |
| **Aggregated Continuous Metrics (60 features):** | |
| Grouped by border/region for remaining metrics: | |
| - `avg_ram_de_cz_tier2`: Average RAM for tier-2 DE-CZ CNECs | |
| - `avg_margin_ratio_de_cz_tier2`: Average margin ratio | |
| - `total_shadow_price_de_cz_tier2`: Sum of shadow prices | |
| - `ram_volatility_de_cz_tier2`: Standard deviation of RAM | |
| - `avg_outage_duration_de_cz_tier2`: Average duration of active outages (hours) | |
| - `max_outage_duration_de_cz_tier2`: Longest active outage on border (hours) | |
| - Repeat for 10 major borders Ãâ€" 6 metrics = 60 features | |
| **Why Aggregate These:** | |
| - **RAM, shadow prices** are continuous and correlated within a region | |
| - Preserves regional capacity patterns without full redundancy | |
| - Reduces from 150 CNECs Ã 2 metrics = 300 → 30 aggregate features | |
| #### **PTDF Patterns (10 features)** | |
| - `ptdf_pc1` through `ptdf_pc10`: Principal components | |
| - Compressed from (200 CNECs Ã 12 zones = 2,400 values) → 10 components | |
| - Captures ~92% of PTDF variance | |
| #### **Border Capacity Historical (20 features)** | |
| - `capacity_hist_de_fr`, `capacity_hist_de_nl`, etc. | |
| - One feature per FBMC border | |
| - Actual historical cross-border flow capacity (from ENTSO-E) | |
| #### **System-Level Aggregates (20 features)** | |
| - `system_min_margin`: Tightest CNEC margin across all 200 | |
| - `n_binding_cnecs_total`: Count of binding CNECs | |
| - `n_binding_cnecs_top50`: Count of top-50 CNECs binding | |
| - `n_binding_cnecs_tier2`: Count of tier-2 CNECs binding | |
| - `margin_std`: Standard deviation of margins | |
| - `total_congestion_cost`: Sum of all shadow prices | |
| - `max_shadow_price`: Highest shadow price | |
| - `avg_shadow_price_binding`: Average price when CNECs bind | |
| - `total_outage_mw_fbmc`: Total capacity under outage | |
| - `n_cnecs_with_outage`: Count of CNECs affected | |
| - `forced_outage_count`: Unplanned outages | |
| - `outage_stress_index`: Composite stress metric | |
| - Additional 8 features (criticality scores, violation counts, etc.) | |
| #### **Temporal Features (10 features)** | |
| - `hour_of_day`: 0-23 | |
| - `day_of_week`: 0-6 (Monday=0) | |
| - `month`: 1-12 | |
| - `day_of_year`: 1-365 | |
| - `is_weekend`: Binary | |
| - `is_peak_hour`: Binary (8am-8pm) | |
| - `is_holiday_de`, `is_holiday_fr`, `is_holiday_be`, `is_holiday_nl` | |
| #### **Weather Features (50 features)** | |
| - Key grid points (10-12 strategic locations) | |
| - Per point: temperature, wind speed (10m, 100m), wind direction, solar radiation, cloud cover | |
| - ~5 metrics Ã 10 points = 50 features | |
| #### **Target History Features (20 features) - NEW** | |
| - `max_bex_hist_[BORDER]`: Historical Max BEX per border (20 FBMC Core borders) | |
| - Used in context window (past 21 days) | |
| - Model learns patterns in capacity evolution | |
| - Example features: `max_bex_hist_de_fr`, `max_bex_hist_de_nl`, etc. | |
| #### **LTN Allocation Features (20 features) - NEW** | |
| - `ltn_allocated_[BORDER]`: Long-term capacity already committed per border | |
| - Values from yearly/monthly JAO auctions | |
| - Used in both historical context (what was allocated) and future covariates (known allocations ahead) | |
| - Example: If 500 MW LTN on DE-FR, Max BEX will be ~500 MW lower | |
| #### **Net Position Features (48 features) - NEW** | |
| - `net_pos_min_[ZONE]`: Minimum feasible net position per zone (12 features) | |
| - `net_pos_max_[ZONE]`: Maximum feasible net position per zone (12 features) | |
| - `net_pos_range_[ZONE]`: Degrees of freedom (max - min) per zone (12 features) | |
| - `net_pos_margin_[ZONE]`: Utilization ratio per zone (12 features) | |
| - Zones: DE_LU, FR, BE, NL, AT, CZ, PL, SK, HU, SI, HR, RO | |
| #### **Non-Core ATC Features (28 features) - NEW** | |
| - `atc_[NON_CORE_BORDER]_forward`: Forward direction capacity (14 features) | |
| - `atc_[NON_CORE_BORDER]_backward`: Backward direction capacity (14 features) | |
| - Key borders: FR-UK, FR-ES, FR-CH, DE-CH, DE-DK, AT-CH, AT-IT, PL-SE, PL-LT, etc. | |
| - These capture loop flow drivers that affect Core CNECs | |
| #### **Interaction Features (40 features)** | |
| - `high_wind_low_margin`: Interaction of wind > threshold & margin < threshold | |
| - `weekend_low_demand_pattern`: Weekend Ã low demand indicator | |
| - `outage_binding_correlation`: Correlation of outage presence with binding events | |
| - Various cross-feature products and ratios | |
| ### 11.3 Future Covariates (280 features) | |
| #### **Top 50 CNEC Planned Outages (50 features)** | |
| - `planned_outage_cnec_[ID]`: Binary indicator for D+1 to D+14 | |
| - Known with certainty (scheduled maintenance) | |
| #### **Tier-2 150 CNEC Planned Outages (150 features)** | |
| - `planned_outage_cnec_[ID]`: Binary for each tier-2 CNEC | |
| - Preserves full EIC matching benefit for future horizon | |
| #### **Weather Forecasts (50 features)** | |
| - Same structure as historical weather | |
| - OpenMeteo provides 14-day forecasts | |
| #### **Temporal (10 features)** | |
| - Same temporal features projected for D+1 to D+14 | |
| - Known with certainty | |
| #### **LTN Future Allocations (20 features) - NEW** | |
| - `ltn_allocated_[BORDER]_future`: Known LT capacity allocations for D+1 to D+14 | |
| - Values from yearly/monthly auction results (KNOWN IN ADVANCE) | |
| - Yearly auction results known for entire year ahead | |
| - Monthly auction results known for month ahead | |
| - **Gold standard future covariate** - 100% certain values | |
| - Directly impacts Max BEX: higher LTN = lower available day-ahead capacity | |
| #### **Border Capacity Adjustments (20 features)** | |
| - `planned_ntc_de_fr_d1`: Day-ahead NTC publication per border | |
| - TSOs publish these D-1 for D-day | |
| ### 11.4 Feature Engineering Workflow | |
| **Input to Chronos 2:** | |
| ```python | |
| # Historical context window (21 days before prediction) | |
| historical_features: np.ndarray # Shape: (512 hours, 1000 features) - UPDATED WITH OUTAGE DURATION | |
| # Future covariates (14 days ahead) | |
| future_features: np.ndarray # Shape: (336 hours, 380 features) - NO CHANGE | |
| # Combine for inference | |
| chronos_input = { | |
| 'context': historical_features, # What happened | |
| 'future_covariates': future_features # What we know will happen | |
| } | |
| # Chronos 2 predicts Max BEX for each border | |
| forecast = pipeline.predict( | |
| context=chronos_input['context'], | |
| future_covariates=chronos_input['future_covariates'], | |
| prediction_length=336 # 14 days Ã 24 hours | |
| ) | |
| ``` | |
| --- | |
| ## Questions & Clarifications | |
| **CRITICAL UPDATES FROM PREVIOUS CONVERSATION:** | |
| 1. **Max BEX Added:** TARGET VARIABLE now included - Day 1 first priority collection | |
| 2. **LTN Added:** Long Term Nominations with future covariate capability (auction results known in advance) | |
| 3. **Min/Max Net Positions Added:** Domain boundaries that define feasible space | |
| 4. **ATC Non-Core Added:** Loop flow drivers from external borders | |
| **Previous Decisions (Still Valid):** | |
| 5. **D2CF Decision:** Confirmed SKIP - not needed for forecasting | |
| 6. **CNEC Count:** Top 200 total (50 full detail + 150 selective detail) | |
| 7. **PTDF Treatment:** Hybrid approach - 600 individual (top 50) + 120 border-aggregated (tier-2) | |
| 8. **Historical Period:** 24 months (Oct 2023 - Sept 2025) | |
| 9. **Missing Data Strategy:** Field-specific (forward-fill, zero-fill, interpolation) | |
| 10. **Outage Matching:** EIC code-based (85-95% exact) + fuzzy fallback (>95% total) | |
| 11. **Tier-2 CNEC Treatment:** Preserve presolved + outage_active individually, aggregate RAM/shadow/PTDF by border | |
| 12. **Total Features:** ~1,735 (1,000 historical + 380 future + 355 aggregates) - HYBRID PTDF + OUTAGE DURATION | |
| 13. **Outage Duration:** Added 3 duration features per top-50 CNEC (elapsed, remaining, total) + border aggregates for tier-2 | |
| **Verification Status:** | |
| - ✅ Max BEX: Confirmed available in JAO Publication Tool ("Max Exchanges" page) | |
| - ✅ LTN: Confirmed available and CAN be used as future covariate (auction results known ahead) | |
| - ✅ Min/Max Net Positions: Confirmed published per hub on JAO | |
| - ✅ ATC Non-Core: Confirmed published for external borders on JAO | |
| --- | |
| **Document Version:** 5.0 - OUTAGE DURATION FEATURES ADDED | |
| **Last Updated:** October 29, 2025 | |
| **Status:** COMPLETE - Ready for Day 1 Implementation with ALL Essential Data Series | |
| **Major Changes:** Added outage duration features (elapsed, remaining, total) for enhanced temporal modeling, ~1,735 total features (+11.6% from v4.0) | |
| **Contact:** Maintain as living document throughout MVP development | |