# Reinforcement Learning

## What is Reinforcement Learning?

**Reinforcement Learning (RL)** is a learning method in which an **agent** learns to make optimal decisions through *trial and error*. The agent tries out various actions in its environment, receives a **reward** in return, and adapts its strategy until it achieves the highest long-run total reward.

In the context of building automation, this means: an RL system independently tests control commands for HVAC, lighting, storage, and more, continuously evaluates their effects on energy consumption and comfort—and improves its control step by step without any manual intervention.
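
To make the loop concrete, here is a minimal sketch of the agent–environment cycle, assuming a `gymnasium`-style environment as a stand-in for a real building; the random action choice is a placeholder for a learned policy:

```python
import gymnasium as gym

# Minimal RL interaction loop: act, observe, collect reward, adapt.
env = gym.make("CartPole-v1")  # stand-in for a building environment

obs, info = env.reset(seed=0)
total_reward = 0.0

for step in range(1_000):
    # A trained agent would choose actions from its policy;
    # here we sample randomly to keep the sketch self-contained.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # the signal the agent learns to maximize
    if terminated or truncated:
        obs, info = env.reset()

env.close()
print(f"Total reward collected: {total_reward:.1f}")
```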

***

## Typical application areas

<table data-full-width="true"><thead><tr><th width="198.60003662109375">Domain</th><th>Goals</th><th>Examples</th></tr></thead><tbody><tr><td><strong>Building automation</strong></td><td>Reduce energy and CO₂ costs, maintain comfort range</td><td>HVAC schedules, peak load smoothing</td></tr><tr><td><strong>Energy &#x26; Smart Grid</strong></td><td>Shift load, control storage</td><td>PV storage dispatch, tariff adjustment</td></tr><tr><td><strong>Robotics</strong></td><td>Dexterous movements</td><td>Grasping, navigation, drone flight</td></tr><tr><td><strong>Industry 4.0</strong></td><td>Increase throughput, reduce scrap</td><td>Dynamic pacing of production lines</td></tr><tr><td><strong>Games &#x26; Simulation</strong></td><td>Strategy finding</td><td>AlphaGo, complex 3-D games</td></tr></tbody></table>

***

## Current challenges of Reinforcement Learning

The much-cited blog post [“**Deep Reinforcement Learning Doesn’t Work Yet**”](https://www.alexirpan.com/2018/02/14/rl-hard.html) (February 2018) already showed where RL fails in practice—and despite all progress, these hurdles remain largely unresolved to this day:

1. **High data requirements**\
   Many RL algorithms need millions of interactions—difficult to implement in real systems.
2. **Sensitive hyperparameter tuning**\
   Learning rate, network architecture, and similar choices often have to be found experimentally—small changes decide success or failure.
3. **Reward design & “reward hacking”**\
   Misleading rewards can lead to completely undesirable behavior.
4. **Exploration vs. exploitation & local optima**\
   Too little exploration ends in suboptimal but easily reachable solutions.
5. **Instability and reproducibility**\
   Same code, different random seed → sometimes completely different results.
6. **Weak generalization**\
   Models are often trained on *one* environment; even small changes cause performance to collapse.
7. **Safety and compliance issues**\
   Autonomous agents must be limited and auditable to avoid risks in critical infrastructure.

***

### Data scarcity – the biggest hurdle in smart buildings

As if hyperparameter tuning, reward design, and instability weren't challenging enough, reinforcement learning in building operations faces an additional core problem: **too little raw data**.\
Typical BMS or heat pump systems provide measurements **at 5-minute intervals**. At one step every 5 minutes, that adds up to only around **100,000 timestamps** in a year—orders of magnitude short of the **millions of interactions** that classic RL algorithms require for robust policies.
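
The back-of-the-envelope calculation behind that number:

```python
# One measurement every 5 minutes, around the clock:
samples_per_hour = 60 // 5                       # 12
samples_per_year = samples_per_hour * 24 * 365   # 105,120
print(samples_per_year)  # roughly 100,000 timestamps per year
```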

***

### **Approaches to fill the data gap**

<table data-full-width="true"><thead><tr><th width="221">Approach</th><th>Idea</th><th>Pros and cons</th></tr></thead><tbody><tr><td><strong>High-fidelity digital twin</strong></td><td>A complete simulation model of the building (e.g. EnergyPlus) that delivers exact thermal responses down to wall assemblies, window glazing, occupancy, and weather. RL gathers its millions of steps <strong>in simulation</strong>.</td><td>+ Physically grounded<br>+ No live risks<br>– BIM creation &#x26; calibration are labor-intensive<br>– Computational load for long rollouts</td></tr><tr><td><strong>Model-based RL / world models</strong></td><td>Instead of heavy physics, one uses a <strong>learning-based world model</strong> that extracts a differentiable mini-world from the ~100,000 log samples. There, the agent can experience millions of “dreamed” steps per GPU minute.</td><td><p>+ Extremely data-efficient (often &#x3C; 10,000 real steps to benefit)<br>+ Continuous online fine-tuning</p><p>+ Ready to use immediately if pretrained on data<br>– Model only knows the regions of behavior covered by the data</p></td></tr><tr><td><strong>Offline RL &#x26; imitation learning</strong></td><td>RL learns <strong>exclusively</strong> from the available logs (BCQ, CQL, …): it first imitates the logged behavior and then improves on it, but never pushes the policy toward actions that do not appear in the data.</td><td>+ No twin needed<br>+ Ready to use immediately<br>– Quality depends directly on log diversity</td></tr></tbody></table>
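
To illustrate the world-model idea from the table, the following sketch shows how an agent can gather “dreamed” experience from a learned dynamics model instead of the real plant. All names (`WorldModel`, `policy`, the network size) are illustrative placeholders, not a specific library’s API:

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Toy learned dynamics model: predicts next state and reward.

    Real world models (e.g. recurrent state-space models) are far
    richer; this only illustrates the rollout mechanics.
    """
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, state_dim + 1),  # next state + scalar reward
        )

    def forward(self, state, action):
        out = self.net(torch.cat([state, action], dim=-1))
        return out[..., :-1], out[..., -1]  # next_state, reward

def dream_rollout(model, policy, start_state, horizon=50):
    """Generate imagined transitions without touching the real building."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        state, reward = model(state, action)
        trajectory.append((state, action, reward))
    return trajectory  # training data that cost zero real interactions
```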

***

## How Eliona overcomes all RL hurdles: world model + simulation steps

At Eliona, we pursue a **world-based RL approach**: A neural world model learns the building’s dynamics from historical and live data as needed—including weather, occupancy, or window openings. An RL agent then trains its control strategy via simulation (“dream steps”), tests thousands of actions per real step, and thus learns in a data-efficient, robust, and safe way.

<figure><img src="https://content.gitbook.com/content/Nyvwhz1kEMXcHf4HLuZ8/blobs/zjhnnQgWHR5c8v39f1s8/image.png" alt=""><figcaption></figcaption></figure>

### **High data requirements**

**Problem:** Classic RL requires millions of real interactions—unattainable with 5-minute intervals and \~100,000 timestamps per year.\
**Solution:**

* A pretrained world model absorbs the 100,000 historical samples and turns them into a mini-world in which the agent can simulate indefinitely.
* Tests show clear savings after just ≈ 2,000 real steps.
* Offline pretraining + millions of simulation steps make it possible to start with a directly usable model—without any live training phase.

### **Sensitive hyperparameter tuning**

**Problem:** Learning rates, network architectures, and regularizers otherwise require tedious grid search and expert knowledge.\
**Solution:**

* Our world-based system has been calibrated on dozens of RL problems.
* A robust default parameter set delivers immediately reproducible performance—without any additional tuning.

### **Reward design & “reward hacking”**

**Problem:** Poorly defined rewards lead to undesirable strategies or exploit behavior.\
**Solution:**

* Users simply define target ranges via the GUI (e.g. 21–23 °C) and metrics to be minimized or maximized (costs, CO₂, peak load).
* In the background, Eliona generates a proven reward function tailored to the system structure.
* This keeps the reward understandable, safe, and free of perverse incentives.
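
In spirit, a generated reward of this kind might look like the following sketch (the function Eliona actually builds is internal and tailored to the system; the weights here are made up):

```python
def reward(temperature_c: float, energy_kwh: float,
           low: float = 21.0, high: float = 23.0,
           energy_weight: float = 0.1) -> float:
    """Comfort-band reward with an energy penalty.

    Inside [low, high] the comfort penalty is zero; outside it grows
    with the distance to the nearest bound, so the agent cannot "hack"
    the reward by overshooting in either direction.
    """
    comfort_penalty = max(0.0, low - temperature_c, temperature_c - high)
    return -comfort_penalty - energy_weight * energy_kwh
```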

### **Exploration vs. exploitation & local optima**

**Problem:** Exploiting known actions too early blocks the discovery of global optima.\
**Solution:**

* Broad exploration of all strategies takes place risk-free in the mini-world.
* In the real world, only the currently best strategy is applied.
* Long-term tests consistently show convergence to optimal operating modes rather than suboptimal plateaus.
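
The division of labor between mini-world and real building can be sketched like this (illustrative pseudologic; `best_action` and the epsilon value are placeholders, not production code):

```python
import random

def simulated_action(policy, state, action_space, epsilon=0.3):
    """In the mini-world: explore aggressively, since mistakes cost nothing."""
    if random.random() < epsilon:
        return random.choice(action_space)  # try something new
    return policy.best_action(state)

def real_action(policy, state):
    """On the real building: always apply the current best strategy."""
    return policy.best_action(state)
```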

### **Instability and reproducibility**

**Problem:** Models that vary greatly depending on the random seed or training run are unreliable in real operations.\
**Solution:**

* HVAC benchmarks document a tightly bounded learning horizon of **1,000–2,000 real steps**—regardless of seed.
* Results are predictable and ROI estimates reliable.

### **Weak generalization**

**Problem:** Models trained only on summer or test data fail when real operating conditions change.\
**Solution:**

* Continuous online fine-tuning: Newly arriving 5-minute data update the world model and thus the control strategy.
* The system adapts within **a few days** to new seasons, renovations, or tariff changes.
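
A minimal sketch of this continuous loop, assuming hypothetical `fit_step` and `train_in_dream` methods (not a real API):

```python
from collections import deque

# Keep roughly two years of 5-minute samples for model updates.
replay_buffer = deque(maxlen=200_000)

def on_new_sample(world_model, agent, transition):
    """Called every 5 minutes when a fresh measurement arrives."""
    replay_buffer.append(transition)        # (state, action, next_state, reward)
    world_model.fit_step(replay_buffer)     # refresh the learned dynamics
    agent.train_in_dream(world_model, steps=1_000)  # re-optimize in simulation
```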

### **Safety and compliance issues**

**Problem:** Autonomous actions without control mechanisms can exceed comfort or safety limits.\
**Solution:**

* Through the integrated rule-chain engine, **hard comfort and safety thresholds** can be defined.
* If limit values are violated or unexpected actions occur, Eliona automatically switches to a proven fallback controller and triggers alarm escalation.
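
In pseudocode, such a guard layer could look like this (purely illustrative; in Eliona the rule chain is configured in the GUI, and `raise_alarm` is a hypothetical hook):

```python
def guarded_action(agent, fallback_controller, state,
                   min_supply_temp=19.0, max_supply_temp=26.0):
    """Enforce hard limits; fall back and escalate on violation."""
    action = agent.act(state)
    if not (min_supply_temp <= action.supply_temperature <= max_supply_temp):
        raise_alarm("Setpoint outside safety band")   # hypothetical alarm hook
        return fallback_controller.act(state)         # proven rule-based control
    return action
```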

Thanks to this approach, Eliona achieves **double-digit energy and cost savings** with RL, while comfort and safety meet the highest standards—all without years of data collection or expert fine-tuning.

***

## **Simple commissioning via the RL app**

Configuration in Eliona is intentionally designed for minimal effort—without any code:

1. **Select assets**\
   Choose the building parts, systems, or sensor groups in the RL app interface that should provide feedback (e.g. room air conditioners, heat pump, window contacts).
2. **Define controllable attributes**\
   Mark which actuators the agent is allowed to control (e.g. supply temperature, fan speed, throttle valve).
3. **Simple goal definition**\
   For each attribute, specify whether it should be kept within a range (e.g. 21–23 °C), minimized (costs, CO₂), or maximized (COP, self-consumption share)—or whether a dynamic or exact setpoint should apply.
4. **Start and observe**\
   The system automatically retrieves the latest historical data, builds the world model, and immediately begins offline training. After that, the agent can be switched live at any time—with a click.

From that point on, it learns fully automatically: first offline from history, then on the fly with every incoming data point.
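
Conceptually, the four steps boil down to a small configuration. The schema below is purely illustrative (in practice everything is set up in the GUI, and these field names are hypothetical):

```python
rl_app_config = {
    "assets": ["heat_pump_01", "room_ac_zone_3", "window_contacts_floor_2"],
    "controllable_attributes": ["supply_temperature", "fan_speed"],
    "goals": [
        {"attribute": "room_temperature", "type": "range", "min": 21.0, "max": 23.0},
        {"attribute": "energy_cost", "type": "minimize"},
        {"attribute": "cop", "type": "maximize"},
    ],
}
```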

***

## Three proven RL applications in building operations

Recent studies and field trials show that reinforcement learning already makes double-digit energy savings and comfort gains possible today—without years of data collection. Below are three solid examples in which RL systems were applied to real buildings, including configuration overview, achieved results, and source references.

### 1. Office building HVAC: 12% energy savings and 28% fewer comfort violations

**Scenario & objective:**\
A multi-story office building uses deep RL to optimize heating, ventilation, and air-conditioning operation. The goal is to reduce energy consumption by double-digit percentages and minimize comfort violations (temperature fluctuations outside the target range).

**RL app configuration (example):**

* **Data inputs:** Indoor temperature, outdoor temperature, occupancy levels, window contacts, historical HVAC actuator setpoints
* **Controllable actuators:** Supply temperature, fan levels
* **Reward definition:** Keep temperature within \[21 °C, 23 °C] *and* minimize energy consumption.

**Result:**\
In simulation and in the subsequent field test, the RL system achieved **12% lower energy consumption** and **28% fewer comfort violations** compared with PID and schedule-based control [ScienceDirect](https://www.sciencedirect.com/science/article/abs/pii/S0360544224001154?utm_source=chatgpt.com).

### 2. Manhattan high-rise: 15.8% less HVAC energy

**Scenario & objective:**\
A 32-story office tower in New York City uses a commercial, AI-powered RL system to reduce heating and cooling costs.

**RL app configuration (example):**

* **Data inputs:** Building and zone temperatures, outdoor temperature, occupancy data, real-time tariffs
* **Controllable actuators:** Heating/cooling circuits, fan control
* **Reward definition:** Minimization of total energy draw, compliance with comfort bands

**Result:**\
The AI reduced HVAC energy consumption by **15.8%**, saved around 37 t of CO₂ and $42,000 annually—all fully automated and without intervention from building management [TIME](https://time.com/7201501/ai-buildings-energy-efficiency/?utm_source=chatgpt.com).

### 3. DFAB House (Empa): Up to 30% energy savings

**Scenario & objective:**\
At the DFAB House research building (Empa, Switzerland), an RL agent was trained to jointly optimize room temperature and bidirectional EV charging.

**RL app configuration (example):**

* **Data inputs:** Room temperature, outdoor weather data, PV generation, EV SoC, electricity tariff
* **Controllable actuators:** Radiator setpoint, charging station power
* **Reward definition:** Maximize comfort score + PV self-consumption, minimize grid import costs

**Result:**\
In the real three-week field test during the heating season, the RL system achieved **up to 30% energy savings** compared with conventional control strategies, while maintaining the same comfort level [arXiv](https://arxiv.org/abs/2103.01886?utm_source=chatgpt.com).

### **Conclusion**

These scenarios show that RL-based applications are ready for use today in a wide range of building types and operating modes. With simple configuration steps in the Eliona RL app, similar results can be achieved in just a few clicks—from office complexes and high-rises to intelligent research buildings.
