Ranting about the engineering science of ML
This post houses my continuously developing view of ML research. I may update it from time to time.

One of my professors once said: “Rather than pure science, machine learning is really an engineering science.” I feel like I understand this statement more and more as time passes. There are several angles from which I’ve found it resonating with my research experience:

When building ML models, mathematical theory defines the ceiling, but engineering determines your baseline. In principle, the theory of diffusion plus universal function approximation should allow us to model nearly arbitrary distributions with a large enough model. In practice, training dynamics are shaped by your design choices: data preprocessing, noise schedule, model architecture. All of these play a role and form the structural bias that makes the theoretical outcome reachable. Small deviations from theoretical correctness in the model or loss often still lead to workable solutions (e.g. the variational lower bound term weighting in diffusion training), but a difficult data distribution caused by under-invested preprocessing design can completely break training.

Theory often lags behind practice in this field. Many successful methods were discovered empirically first and only later understood in terms of a governing theory, so as a researcher you often have to act on an empirical basis. Batch Normalization is a good example: it was widely adopted before a stable theoretical explanation emerged, and even today it is explained through multiple incomplete perspectives.

When learning a new modelling concept, I find that I develop my understanding more efficiently by starting with the intuitions behind the initial engineering choices and the chronology of their iterations, rather than jumping straight into the latest, more distilled, unified theory.
The former helps me see the rationale and the design space the original authors considered, and build stronger intuitions about which components are fundamental to training dynamics. Again, I have to mention my journey in learning diffusion models. While there are well-written resources for understanding diffusion through a unified stochastic differential equation lens, I found myself understanding the subject better after going through it in order: the discrete latent-variable framing (DDPM, variational diffusion models), then score-based models, and finally the SDE work that unifies these views.

Ok, enough ranting. Time to go back to figuring out why my loss refuses to go down… ...
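(An aside, to make the loss-weighting example above concrete: below is a toy sketch, in plain Python, of the gap between diffusion training's unweighted "simple" loss and a schedule-weighted variant. The function names, the linear beta schedule, and the exact weight form are my own illustrative choices, not taken from any particular paper's codebase.)

```python
def alpha_bar(t, T=1000):
    # Cumulative signal retention for a linear beta schedule:
    # prod_{s <= t} (1 - beta_s). Schedule constants are illustrative.
    betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]
    prod = 1.0
    for s in range(t + 1):
        prod *= 1.0 - betas[s]
    return prod

def simple_loss(eps, eps_pred):
    # The "simple" objective: plain mean squared error between true and
    # predicted noise, with the per-timestep VLB weight dropped entirely.
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)

def weighted_loss(eps, eps_pred, t, T=1000):
    # Schematic VLB-style variant: the same MSE scaled by a
    # timestep-dependent weight derived from the noise schedule.
    # The weight form here is illustrative, not the exact VLB coefficient.
    ab = alpha_bar(t, T)
    weight = (1.0 - ab) / ab  # grows as t increases (noisier timesteps)
    return weight * simple_loss(eps, eps_pred)
```

The point the sketch makes: the two objectives differ only by a timestep-dependent scalar, which is exactly the kind of "theoretically incorrect" deviation that still trains well in practice.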
What I talk about when I talk about asyncio
This piece captures what I’ve learned while going down the rabbit hole of understanding how asyncio works in Python. While a few others have written on the subject, I think it is still valuable to contribute another angle of introduction, as learners with different backgrounds resonate with different motivations and styles of explanation. This is the one I’ve written for my one-year-younger self.

Why this matters

As an AI researcher, I spent most of my career focused on models and data. When it came to optimizing runtime, whether for data processing or model inference, I typically relied on batching or multiprocessing. I had very little exposure to asynchronous execution, and I didn’t think I needed it. That changed when I stepped into the world of scaffolded LLM systems. ...