Dynamic Pricing with Multi-Armed Bandit: Studying by Doing | by Massimiliano Costacurta | Aug, 2023

Making use of Reinforcement Studying methods to real-world use instances, particularly in dynamic pricing, can reveal many surprises

Within the huge world of decision-making issues, one dilemma is especially owned by Reinforcement Studying methods: exploration versus exploitation. Think about strolling right into a on line casino with rows of slot machines (often known as “one-armed bandits”) the place every machine pays out a unique, unknown reward. Do you discover and play every machine to find which one has the best payout, or do you stick to at least one machine, hoping it’s the jackpot? This metaphorical situation underpins the idea of the Multi-armed Bandit (MAB) downside. The target is to discover a technique that maximizes the rewards over a sequence of performs. Whereas exploration affords new insights, exploitation leverages the data you already possess.

Now, transpose this precept to dynamic pricing in a retail situation. Suppose you might be an e-commerce retailer proprietor with a brand new product. You aren’t sure about its optimum promoting worth. How do you set a worth that maximizes your income? Must you discover completely different costs to know buyer willingness to pay, or must you exploit a worth that has been performing nicely traditionally? Dynamic pricing is basically a MAB downside in disguise. At every time step, each candidate worth level might be seen as an “arm” of a slot machine and the income generated from that worth is its “reward.” One other solution to see that is that the target of dynamic pricing is to swiftly and precisely measure how a buyer base’s demand reacts to various worth factors. In less complicated phrases, the intention is to pinpoint the demand curve that finest mirrors buyer conduct.

On this article, we’ll discover 4 Multi-armed Bandit algorithms to guage their efficacy in opposition to a well-defined (although not easy) demand curve. We’ll then dissect the first strengths and limitations of every algorithm and delve into the important thing metrics which are instrumental in gauging their efficiency.

Historically, demand curves in economics describe the connection between the worth of a product and the amount of the product that buyers are prepared to purchase. They typically slope downwards, representing the frequent statement that as worth rises, demand usually falls, and vice-versa. Consider widespread merchandise corresponding to smartphones or live performance tickets. If costs are lowered, extra folks have a tendency to purchase, but when costs skyrocket, even the ardent followers may assume twice.

But in our context, we’ll mannequin the demand curve barely in another way: we’re placing worth in opposition to likelihood. Why? As a result of in dynamic pricing situations, particularly digital items or providers, it’s typically extra significant to assume when it comes to the probability of a sale at a given worth than to invest on actual portions. In such environments, every pricing try might be seen as an exploration of the probability of success (or buy), which might be simply modeled as a Bernoulli random variable with a likelihood p relying on a given take a look at worth.

Right here’s the place it will get significantly attention-grabbing: whereas intuitively one may assume the duty of our Multi-armed Bandit algorithms is to unearth that ideally suited worth the place the likelihood of buy is highest, it’s not fairly so easy. In truth, our final objective is to maximise the income (or the margin). This implies we’re not looking for the worth that will get the most individuals to click on ‘purchase’ — we’re looking for the worth that, when multiplied by its related buy likelihood, offers the best anticipated return. Think about setting a excessive worth that fewer folks purchase, however every sale generates important income. On the flip facet, a really low worth may appeal to extra patrons, however the whole income may nonetheless be decrease than the excessive worth situation. So, in our context, speaking in regards to the ‘demand curve’ is considerably unconventional, as our goal curve will primarily signify the likelihood of buy somewhat than the demand immediately.

Now, attending to the mathematics, let’s begin by saying that shopper conduct, particularly when coping with worth sensitivity, isn’t all the time linear. A linear mannequin may recommend that for each incremental enhance in worth, there’s a continuing decrement in demand. In actuality, this relationship is usually extra complicated and nonlinear. One solution to mannequin this conduct is through the use of logistic features, which may seize this nuanced relationship extra successfully. Our chosen mannequin for the demand curve is then:

Right here, a denotes the utmost achievable likelihood of buy, whereas b modulates the sensitivity of the demand curve in opposition to worth adjustments. The next worth of b means a steeper curve, approaching extra quickly to decrease buy chances as the worth will increase.

4 examples of demand curves with completely different mixtures of parameters a and b

For any given worth level, we’ll be then capable of get hold of an related buy likelihood, p. We are able to then enter p right into a Bernoulli random variable generator to simulate the response of a buyer to a specific worth proposal. In different phrases, given a worth, we will simply emulate our reward operate.

Subsequent, we will multiply this operate by the worth with the intention to get the anticipated income for a given worth level:

Unsurprisingly, this operate doesn’t attain its most in correspondence with the best likelihood. Additionally, the worth related to the utmost doesn’t rely on the worth of the parameter a, whereas the utmost anticipated return does.

Anticipated income curves with associated maxima

With some recollection from calculus, we will additionally derive the formulation for the by-product (you’ll want to make use of a mix of each the product and the chain rule). It’s not precisely a soothing train, however it’s nothing too difficult. Right here is the analytical expression of the by-product of the anticipated income:

This by-product permits us to search out the precise worth that maximizes our anticipated income curve. In different phrases, through the use of this particular formulation in tandem with some numerical algorithms, we will simply decide the worth that units it to 0. This, in flip, is the worth that maximizes the anticipated income.

And that is precisely what we want, since by fixing the values of a and b, we’ll instantly know the goal worth that our bandits must discover. Coding this in Python is a matter of some strains of code: