Reinforcement Learning: Bush Pilots in the Outback

Reinforcement Learning: Perhaps you’ve heard of it? You probably have, reinforcement learning has made a lot of headlines in recent years by mastering Atari games, learning to walk, or even learning to best the top tier chess program in a scant 4 hours.

Consider the shipping industry. There are many unimaginative routes that repeat every day, but in many ways, the shipping problem is stochastic: there are unpredictable shipments that can appear with little notice. Storms can close roads. Trucks break down. Clients cancel contracts. It’s a mad, chaotic world for an industry that employs 2.6 million Americans.

If we can make a real world problem look like a game, reinforcement learning becomes a powerful decision making tool. In this blog, we will have fun using reinforcement learning to schedule bush pilots. Along the way, we will discover that developing a simulation of the environment is equally as rewarding.

Problem Introduction

Consider rugged areas such as Alaska or perhaps the great Australian outback. “Bush pilots” are often employed for moving people, supplies and other necessities from place to place. It's a fun environment to simulate, but we could apply our learnings here to the shipping industry where companies live or die based on how efficiently they commit to and deliver freight.

Due to the stochastic nature of the problem, it becomes a prime candidate for reinforcement learning. Modeling the problem as a Markov process, we apply a standard decision policy:

1) Observe the state of the environment (available planes, freights, weather, etc.)

2) Consider available actions (e.g. accept contract X, go home and rest, wait for a better opportunity, etc.)

3) Estimate the value of each action and chose the most profitable

4) Observe the reward from the environment

In order to estimate the value of each action, we use a neural net with hidden layers, trained via reinforcement learning where the input layer encodes the environment and the output describes values of the possible actions.

For this blog, we deliberately simplify the problem by omitting some factors which we will address in future installments: Let’s have our pilots fly forever. No need for our virtual pilots to spend time with their virtual wives and virtual babies. And let's leave out the weather for now, storms will come in another posting. All this to focus on the core scheduling problem.

Baseline Comparison - a Greedy Scheduler

As a baseline, consider an extremely naïve scheduler: Planes in the fleet always takes the next available job. We ignore any opportunities to improve our delivery rate through assignment swapping. Therefore the scheduler is greedy based on when the shipment became available on pickup, but disregards geospatial information.

In the trivial situation where we only have 1 plane and 1 package, this scheduler is ideal. What could be better? In the real world where we have more planes and more shipping jobs, we expect our agents to outperform this scheduler by leaps and bounds.

Here we see a simulation using our greedy scheduler. At each time step in the simulation, a plane moves towards either a pickup or drop off location - pickup and drop off times are instantaneous. Green lines indicate a shipment waiting to begin, red lines indicate a shipment in progress. When a shipment is delivered, a new shipment is generated and added to the environment.

Simulated over 10,000 time steps, our greedy scheduler tends to deliver 1634 packages with four planes.

Competitive Agents with Reinforcement Learning

Our goal is to improve our scheduling by allowing planes to re-assign pickup and delivery in order to make more deliveries in the same amount of time:

A design point to consider: Traditionally a schedule optimizer will take in the full environment and attempt to create complete schedule in a single pass. This approach is computationally intensive; let’s try something different.

For a more efficient, scalable solution lets consider a model which schedules a single pilot, then use this model to schedule the fleet by running many different forward passes in parallel.

Perhaps with a single pilot model we may lose out on some opportunities where altruism and cooperation will improve our bottom line - but I’m not a believer. Instead, we can fully embrace capitalism - with every pilot making the best possible decision for themselves as individuals, the overall collection of pilots should benefit.

In fact, we should approach something of a Nash Equilibrium - Each pilot takes the best possible strategy for themselves which includes knowledge of the other pilot’s strategy. Any divergence from this strategy decreased the pilot’s gains in the game. If no pilot can make a better individual choice, how far from an optimal solution are we? I’ll not make the claim that we have found the optimal solution, I’ll simply say that the solution is “probably good”.

For training, we focus on minimizing the time to pickup a package - a simple metric for a simple simulation. After training our model in our environment, we can see our new policy in action:

Comparison of Schedulers

Recall that the “greedy” scheduler delivered 1634 packages in 10,000 time steps.

Our competitive agents delivered 1772.

That’s an 8.3% improvement - In the shipping industry, companies would die for an 8.3% improvement, but this is admittedly a simple example. While we will see far more complex examples in future installments, it’s worth seeing what these agents are really capable of. The’ve learned far more than just a little efficiency gain and we didn’t have to explicitly teach them how to do it.

Learned Behavior

No machine learning blog is complete without a few observations of learned behavior. Our agents learned a few tricks in the process which weren’t necessarily expected.

For example, we did not make “wait” an available option, and yet we can clearly see this behavior in action. In situations where a pilot cannot compete for available freights, it may wait in place for a better opportunity to present itself. Consider this moment, 89 steps into the simulation. The indicated plane notices that 1) it has no chance to reach any of the three nearby available packages before planes A or B or C. 2) It also recognizes that plane B will be in a superior position to pick up the only available package even after making a delivery. 3) The plane’s current position is strong strategically. There is a good possibility our plane will be in prime position to move to the next randomly spawned pickup location. It’s worth noting that the network learned to wait by exploiting our coding behavior - when a plane is assigned to a package currently being delivered, the plane stays in place.

Furthermore, we did not encode a “reposition” option, yet here again our pilots found a way to exploit our pathing implementation to move towards contracts - contracts which the pilot has absolutely no chance of fulfilling - in order to move into a better position on the board (closer to the center but further from competing pilots). We see an example of this 284 steps into the simulation. Here, our plane doesn’t really have a chance of reaching a pickup location before any other plane, but perhaps thinks it has better position by moving towards the center of the map. Or perhaps perhaps a new pickup will tempt the other competing plane away. Either way, our best move is to move towards an unobtainable pickup.

Stay Tuned

We have developed an agent for predicting the best possible move in a given simulation. This gives us one component.

We could, instead of randomly spawning pickup/dropoff locations, seed the simulation with real world data, then use the result of the simulation to post a week long schedule.

Any modern chess AI (including AlphaZero) will look beyond the next move to make a more informed decision, typically using A/B pruning to narrow down the search space. With more predictable shipping windows known to the environment, our scheduling would also benefit from looking further into the future.

We could then use our simulation in real time to handle unexpected events such as weather, breakdowns, sour clients or high value urgent requests to dynamically re-route a posted schedule.

The simulation itself is worthy of further discussion. In a future blog, we will demonstrate how we may add further parameters to make our simulation more realistic, and how a separate machine learning paradigm could learn from historical data to make our simulation even more accurate.

We’ve shown that a set of competitive agents trained with machine learning can yield insightful solutions to a difficult, unpredictable problem.

What difficult, unpredictable problems do you have? Send Ryan an email!