Week 33, 2025
This is a blog of complaints. Complaints against myself. Complaints of not taking control of the things I do. Complaints of letting myself drift through life out of focus. I didn’t get a single workout this week. Many social commitments and my addiction to FIFA. I don’t even know how many hours I spent playing the game. It must be a lot. Even though I blame social commitments for my lack of progress throughout the week, I know the real reason is the time spent on FIFA and social media. Need to fight these addictions.
This is what I wrote on Saturday afternoon, sitting at a cafe. Tired of my lack of discipline. However, with that realization, I unplugged my PS5 and removed Instagram from my phone when I came home. 36 hours later, I had made decent progress on everything I needed to do.
Work ★ ★ ★ ☆ ☆
Worked on one data pipeline project and one MLOps project this week. Nothing exciting to share. My boss took a sudden interest in our ML infrastructure this week, and I was his go-to resource for understanding all the nuances and gaps in our workflow. It felt good to be needed.
Projects ★ ★ ★ ★ ☆
Reinforcement Learning
I spent a lot of time reading and asking theoretical questions about building RL-based Recommender Systems. The reason for these questions is that I keep drawing parallels to my current work on the Offer Engine at CVS. Our entire offer engine uses Supervised ML and optimization to send offers within a specified budget. I want to understand if and how RL could replace the offer engine (I have a strong intuition that, given time, RL may do a better job). My initial impression of RL is mostly agents playing games in simulated environments, so I had no idea how RL applies in the context of historical data. The reason my mind immediately went towards historical data of customer interactions with CVS Retail is that I seriously doubt the business team will let me run RL algorithms online on a test cell so that the agent can learn a policy in real-time. Hence my work on RL, if I have to present it to the business team at CVS, would have to be a simulation on historical data.
This is where the problems with RL begin. RL is not trained the same way a Supervised Machine Learning algorithm is trained (which is what I am used to). In our logged data, we see the rewards or labels (redemption of offers or probability of redemption) only for the action taken (the offer sent) and NOT for all available actions. This question led me down the path of offline Reinforcement Learning concepts and Bandit-based algorithms. With historical data, we don’t know the label for any unchosen action. That is a counterfactual problem. (Side note - while building the offer engine at CVS, we faced issues with counterfactual data that took a lot of brainstorming and adjustments to get right. And the counterfactual problem will remain an ongoing problem for the near future.)
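To make that concrete, here is a toy sketch of what logged offer data looks like (the column names and numbers are made up for illustration, not our actual schema):

```python
# Hypothetical logged offer data: each row records only the offer we actually
# sent and whether it was redeemed. The reward for every offer we did NOT send
# is simply missing -- that's the counterfactual gap.
logged_data = [
    {"customer_features": [0.2, 1.0], "offer_sent": "10%_off", "redeemed": 1},
    {"customer_features": [0.7, 0.3], "offer_sent": "bogo",    "redeemed": 0},
    {"customer_features": [0.5, 0.5], "offer_sent": "10%_off", "redeemed": 1},
]

# A supervised model can fit P(redeemed | features, offer_sent) from these rows,
# but it never observes what "bogo" would have done for the first customer or
# what "10%_off" would have done for the second.
```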
Training an RL agent on logged data (or historical data) has a problem with Exploration vs Exploitation. When running RL online, with algorithms like epsilon-Greedy, we can adjust the value of epsilon to decide how much to explore. However, with logged data, the knowledge is stuck inside the “shadow” of what the past policy (e.g. just sending offers randomly) did.
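For reference, a minimal epsilon-greedy sketch in Python, just to show where the epsilon knob lives when the policy runs online (the actions and reward estimates are made up):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action: explore with probability epsilon, otherwise exploit.

    q_values: dict mapping action -> current reward estimate.
    Online, we can tune epsilon to control exploration; with logged data there
    is no such knob -- we only ever see the actions the old policy chose.
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))    # explore: random action
    return max(q_values, key=q_values.get)      # exploit: best known action

# e.g. epsilon_greedy({"10%_off": 0.12, "bogo": 0.08, "no_offer": 0.02})
```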
Now the question was: how do you train an RL agent on historical (or logged) data? My conversations with Deepseek, ChatGPT, and Perplexity all pointed to a few options -
- Contextual Bandit methods - one of the ways the agent learns from logged data is through Inverse Propensity Scoring (IPS). I have a basic understanding of IPS but I’ll leave the detailed write-up for the next blog (a rough sketch follows this list).
- Offline RL - this method assumes your dataset has user sessions (sequences of interactions), which lets us treat the problem as a Markov Decision Process (MDP).
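Here is a rough, self-contained sketch of the vanilla IPS estimator as I currently understand it (the rewards and propensities below are toy numbers, not real data); the proper deep dive comes in the next blog:

```python
import numpy as np

def ips_value(rewards, logged_probs, target_probs):
    """Inverse Propensity Scoring estimate of a new policy's average reward.

    rewards:      observed reward for the action the old policy actually took
    logged_probs: probability the old (logging) policy assigned to that action
    target_probs: probability the new policy would assign to the same action

    Reweighting each logged reward by target_prob / logged_prob corrects for the
    fact that the data was collected under a different policy.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logged_probs, dtype=float)
    return float(np.mean(weights * rewards))

# Toy example: the new policy puts more weight on the actions that got redeemed,
# so its estimated value comes out higher than the logging policy's average.
print(ips_value(rewards=[1, 0, 1],
                logged_probs=[0.5, 0.5, 0.25],
                target_probs=[0.9, 0.1, 0.8]))
```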
For the next week, I’ll try to build a recommender system on some open-source dataset and play with the above algorithms (and more) to build intuition even further.
AI has made learning very easy assuming that you have the curiosity to quench. And it is much cheaper (if not free) than most of the bootcamps. Learning from humans may just be a thing of the past. Let’s see.
Books ★ ★ ☆ ☆ ☆
✔️ The Forever War - Joe Haldeman
I finished this book over the weekend. It has some good perspective-shifting ideas but nothing ground-breaking. That’s probably because it requires me to stretch my imagination further than I can: time travel, completely upended social structures, speed-of-light travel, unimaginable combat, etc.
The book could have done a better job of gradually immersing me in that imaginary world, but alas, it is rushed, and my ever-shortening attention span cannot meet the author even halfway there.
Fitness ★ ★ ☆ ☆ ☆
- Fast: I fasted 4 days this week, mostly for over 20 hours each of those days. It felt very easy. The only thing I would change is to control my portions during the 2-4 hour eating window. On Friday, I ate a lot of food pretty late in the day and felt uneasy throughout the night.
- Gym: Not even a single workout. This was probably the source of my discomfort and low energy all week.
Having and maintaining an active social life does come at a cost. I need to learn to at least put my fitness above it. I need to learn to say no.