Week 35, 2025
“Yun toh humne kya kiya lekin …
ye kiya hai ke din guzaare hai” – Jaun Elia
(“What have I really done? Only this much: I have passed the days.”)
This week hasn’t been very productive, but it wasn’t wasteful either. All my priority work items below got only distracted attention from me. I guess this is just the part of the process between start and finish when it’s all ungratifying but you keep moving on.
Work ★ ★ ★ ★ ☆
I went into the Manhattan office once this week. It was a very productive Monday, and that gave me the momentum I needed to keep going for the next few days. I also started gathering data to pitch the Reinforcement Learning project to my manager. It seems promising. I have seen a lot of “exploitation” (in the RL sense) of what we already know about offers and customer behavior, and much less “exploration”. As soon as I saw these results I got giddy and almost sent an email to my manager, but I held back. I want to strengthen my case further with a possible PRD (Product Requirements Document). I have some hope that this may just work.
Projects ★ ★ ★ ☆ ☆
Reinforcement Learning
I built a Multi-Armed Bandit (context-free) recommender system on the MovieLens dataset. Here’s the code for what I did. The dataset has 100K samples/rows. The idea in a Multi-Armed Bandit setup is that each “arm” is a movie (I used the top 50), and for every row (user interaction) of the dataset the policy recommends a movie using some statistical rule. We then compare the recommended movie against the movie the user actually rated, and convert the rating into a binary reward: if the movie recommended by the policy is the one the customer rated, the reward is 0 or 1 based on the rating (reward = 1 if the rating > 4). The cumulative reward gets updated every time there’s a match between the recommended movie and the rated movie in the dataset.
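To make that loop concrete, here is a minimal sketch of the reward setup as described above. It is not the exact code from the linked repo; the column names (movie_id, rating) and the helper names are my own assumptions for illustration.

```python
import pandas as pd

def binary_reward(rating: float, threshold: float = 4.0) -> int:
    """Reward is 1 only when the logged rating is above the threshold."""
    return int(rating > threshold)

def replay_step(logged_row: pd.Series, recommended_movie: int):
    """One round of the offline loop: return (matched, reward).

    We only learn something when the policy happens to recommend the same
    movie the user actually rated in the log; otherwise the round tells us
    nothing about the recommendation.
    """
    if logged_row["movie_id"] != recommended_movie:
        return False, 0
    return True, binary_reward(logged_row["rating"])
```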
Now the statistical rule for recommending a movie (in my example) comes from a Thompson Sampling (TS) policy. TS maintains a Beta distribution per movie (success/failure counts). TS starts out essentially random, then updates each movie’s success/failure counts as we go through the rounds (rows of the dataset). I also used a Random Sampling policy as a baseline to compare the TS results against. I was planning on building a supervised ML recommendation system to compare performance, but when I saw that the RL approach performed even worse than the random baseline, I dropped the idea.
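Here is a sketch of the two policies as I understand them from my own description; the Beta(successes + 1, failures + 1) prior and the class names are assumptions, not necessarily how the repo implements it.

```python
import numpy as np

class ThompsonSampling:
    """One Beta(successes + 1, failures + 1) posterior per movie (arm)."""

    def __init__(self, n_arms: int, seed: int = 0):
        self.successes = np.zeros(n_arms)
        self.failures = np.zeros(n_arms)
        self.rng = np.random.default_rng(seed)

    def recommend(self) -> int:
        # Sample a score from each arm's Beta posterior and recommend the best one.
        samples = self.rng.beta(self.successes + 1, self.failures + 1)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: int) -> None:
        # Only called when the recommended movie matched the logged one.
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


class RandomPolicy:
    """Baseline: pick an arm uniformly at random; never learns."""

    def __init__(self, n_arms: int, seed: int = 0):
        self.n_arms = n_arms
        self.rng = np.random.default_rng(seed)

    def recommend(self) -> int:
        return int(self.rng.integers(self.n_arms))

    def update(self, arm: int, reward: int) -> None:
        pass
```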
For results, the TS policy gave a cumulative reward of 0.0068 vs. the Random Sampling policy’s cumulative reward of 0.2629, which means a 97% drop in performance LOL. To be honest, it is misleading to compare the cumulative rewards from random sampling and TS, because the random sampling run updates its cumulative reward at every round whether there’s a match or not (sum(rewards) / number_of_interactions), while TS only updates its cumulative reward when there’s a match.
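As a quick illustration of that point, here is a toy example (made-up numbers, not my actual results, and my own reading of how the two runs were scored) showing how the same matched rewards land on very different scales depending on whether you divide by every round or only by the matched rounds.

```python
# Toy numbers only: why the two cumulative-reward figures aren't comparable.
logged = ["A", "B", "A", "C", "B"]       # movies the users actually rated
recommended = ["A", "C", "B", "C", "B"]  # movies a policy recommended
reward = [1, 0, 0, 1, 1]                 # binary reward from the logged rating

matched = [i for i in range(len(logged)) if recommended[i] == logged[i]]

per_round = sum(reward[i] for i in matched) / len(logged)           # 0.6
per_match = sum(reward[i] for i in matched) / max(len(matched), 1)  # 1.0
print(per_round, per_match)
```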
This highlights the shortcomings of offline RL: I am running the bandit algorithm offline against a fixed dataset, not an interactive environment. That means we never know the reward for actions the user never took (the counterfactual problem). In an online world, I could have shown a particular user 10 movies and calculated some reward based on clicks, play time, ratings, etc. But with an offline (also called logged) dataset, I have no idea what the counterfactual is. We see the same problem with our offer engine when we want to calculate incremental margin and incremental trips: how can we tell whether a customer was going to make a trip to CVS with or without the offer? We have come up with some nuanced logic to deal with it for now, but I believe an online RL system running in market would be the right way to handle the counterfactual problem. I could be wrong.
My little project has been like trying to judge a basketball player only on the shots they didn’t take. An offline bandit on a small dataset is a failure.
Books ★ ★ ★ ☆ ☆
Ready Player One - Ernest Cline
I didn’t read much this week, maybe only 20 pages. But the book is certainly getting interesting. It is set in the year 2044. Having demoed the Apple Vision Pro headset and watched Meta dedicate time and resources to the Metaverse, I believe the story in the book could very well be a reality by 2044. I will share more thoughts on the book as I get deeper into the story.
Fitness ★ ★ ☆ ☆ ☆
I went to the gym only 2 times this week. I walked about 8000 steps daily on average; for someone not stuck at a desk job, that may sound like very little. But trust me, if I don’t make an effort I struggle to even get 1000 steps in. That’s just the world we live in now!
I made many blunders in nutrition as well. Kiran and I were invited to two Ganpati festival celebrations this week, and I ate a lot of fried and sweet food. The host even packed us some to eat the next day.