In this post we consider portfolio management using reinforcement learning (RL). Reinforcement learning is a branch of machine learning in which the system being trained (the agent) learns by acting on and interacting with its environment.

From the perspective of portfolio management, a trading agent is given raw financial data. The agent's task is to learn the strategy that maximises expected reward (profit). In direct contrast with supervised learning, the algorithm is not given labelled decisions to imitate. Instead, it receives a signal from the environment indicating whether its current actions are good or bad.

## The key features of reinforcement learning:

- Reinforcement learning balances, through trial and error, the exploration of unknown areas against the exploitation of existing knowledge.
- RL explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment.
- All RL agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments.
- Moreover, the agent typically has to operate in spite of significant uncertainty about the environment it faces.

A variety of different problems can be solved using Reinforcement Learning. Since RL agents can learn without direct (supervised) guidance, we can use them to solve problems that have no obvious or easily programmable solution but that do have a general concept of cost/reward. One can think of this as training a dog: the agent (in this case the dog) has to learn what it is doing correctly and incorrectly from the environment, its actions and the feedback it receives (e.g. "good dog!", "bad dog..."). This differs from supervised algorithms in that no *direct* information about the correct action is available.

A common reward/value function in modern portfolio theory is the Sharpe ratio, which is a measure of risk-adjusted return. The algorithmic agent learns to maximise this value function in its uncertain and changing environment, learning by itself which actions to take - whether exploring new strategies or exploiting its current knowledge - to accomplish this.
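
As a rough illustration of the Sharpe ratio as a reward signal, the sketch below computes an annualised Sharpe ratio from a list of periodic returns. The function name `sharpe_ratio`, the 252-periods-per-year convention and the sample returns are assumptions made for this example, not part of any particular trading system.

```python
import math
import statistics

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualised Sharpe ratio of a sequence of periodic (e.g. daily) returns.

    risk_free_rate is an annual rate; periods_per_year=252 assumes daily data.
    Both defaults are illustrative assumptions.
    """
    per_period_rf = risk_free_rate / periods_per_year
    excess = [r - per_period_rf for r in returns]
    volatility = statistics.stdev(excess)  # sample standard deviation
    if volatility == 0:
        return 0.0  # no risk taken, so the ratio is undefined; report 0
    return math.sqrt(periods_per_year) * statistics.mean(excess) / volatility

# Hypothetical daily returns, used only to exercise the function.
daily_returns = [0.01, -0.005, 0.02, 0.0, 0.003]
print(sharpe_ratio(daily_returns))
```

An agent rewarded with this quantity is pushed towards strategies with high average return relative to their volatility, rather than towards high return alone.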

## Two of the main problems solved with the help of RL:

- **Game playing**: Determining the best moves to make in a potentially huge game/environment. Covering this many states with a standard rule-based approach would mean specifying an equally large number of hard-coded rules. RL removes the need to specify rules manually; agents learn simply by playing the game.
- **Control problems**: Typically simpler problems such as elevator or traffic-light scheduling, which can be modelled as finite state machines. Again, it is not immediately obvious which strategies provide the best, most timely elevator service for the most people. For control problems such as this, RL agents can be left to learn in a simulated environment, where they can arrive at effective control policies.

Some advantages of using RL for control problems are that an agent can be retrained easily to adapt to changes in the environment, and that it can be trained continuously while the system is online, improving performance with live data.

In the simplest case, states and actions are discrete, and you can keep a table of counts for each state and action. Other methods that are used in reinforcement learning:

- Adaptive Heuristic Critic,
- SARSA,
- Q-learning.
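
Of the methods listed, tabular Q-learning is perhaps the easiest to sketch. The example below is a minimal illustration on a made-up five-state chain in which the agent earns a reward only on reaching the right-most state. The environment, the hyperparameters and every name in the code are hypothetical choices for this example; a trading environment would replace `step` with market interaction.

```python
import random
from collections import defaultdict

N_STATES = 5      # hypothetical toy chain: states 0..4, goal at state 4
ACTIONS = (0, 1)  # 0 = step left, 1 = step right

def step(state, action):
    """Toy environment: reward 1.0 only when the right-most state is reached."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train_q_table(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # the table: one value per (state, action) pair
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                # Break ties at random so the untrained agent still moves around.
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best value of the next state.
            future = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * future
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q

q = train_q_table()
# Greedy action in each non-terminal state after training.
greedy_policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

The epsilon-greedy step is where the trade-off between exploring and using current knowledge shows up in practice: most of the time the agent follows its table, but occasionally it tries something else.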

Let’s take a look at one recently developed approach to trading systems - Recurrent Reinforcement Learning (RRL). Using a set of input data, we develop a trading system that learns from its profits and losses through interactions with the market over time.

As with typical Reinforcement Learning, RRL can be divided into policy search and value search algorithms. In both cases, we compute the discounted future reward of a policy and, using Bellman’s equation, iteratively solve for the optimal policy and value.
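
To make "discounted future reward" concrete, here is a small sketch that computes it for a finite list of rewards. The function name and the discount factor `gamma` are assumptions for illustration. The backward recursion it uses (return now = reward now + gamma × return from the next step) has the same shape as the recursion in Bellman’s equation.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

A smaller `gamma` makes the agent more short-sighted, while a value close to 1 makes distant profits count almost as much as immediate ones.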

On the portfolio theory side, the Capital Asset Pricing Model (CAPM) was established during the 1960s. It concluded that there is only one optimal portfolio that can achieve the lowest level of risk for any level of return. This “market portfolio” is the weighted sum of all risky assets within the financial market, fully diversified for risk.

RRL algorithms are therefore geared towards learning how to weight your assets at each time step so as to arrive at an optimal portfolio of risky and riskless (e.g. cash) assets.
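
The sketch below shows one way an agent's raw outputs could be turned into portfolio weights at a single time step, and how those weights translate into a portfolio return. Using a softmax to obtain long-only weights that sum to one is an assumption made for this example, not the method of any specific RRL paper, and the scores and asset returns are invented numbers.

```python
import math

def softmax_weights(scores):
    """Map raw agent scores to long-only weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def portfolio_return(weights, asset_returns):
    """Return of the portfolio over one time step."""
    return sum(w * r for w, r in zip(weights, asset_returns))

# Hypothetical scores for [cash, asset A, asset B] at one time step.
scores = [0.0, 1.0, -0.5]
# Hypothetical returns over the same step; cash is treated as riskless with zero return.
asset_returns = [0.0, 0.012, -0.004]

weights = softmax_weights(scores)
step_return = portfolio_return(weights, asset_returns)
print(weights, step_return)
```

Repeating this at every time step yields a series of portfolio returns, which is the kind of series a risk-adjusted reward would be computed from.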

## Final Thoughts

We introduced the concept of Reinforcement Learning and discussed how such an algorithm can be used to solve portfolio optimization problems and extract the most value from raw market data.

Over time, such algorithms learn which strategies to deploy to generate the maximum value (typically as a function of the Sharpe ratio) and consequently the maximum profit.