Ranting about the engineering science of ML
This post houses my continuously developing view of ML research. I may update it from time to time.

One of my professors once said: “Rather than pure science, machine learning is really an engineering science.” I feel like I understand this statement more and more as time passes. There are several angles from which I’ve found it resonating with my research experience:

When building ML models, mathematical theory defines the ceiling, but engineering determines your baseline. In principle, the theory of diffusion plus universal function approximation should allow us to model nearly arbitrary distributions with a large enough model. In practice, training dynamics are shaped by your design choices: data preprocessing, noise schedule, model architecture. All of these play a role and form the structural bias that makes the theoretical outcome reachable. Small deviations from theoretical correctness in the model or loss often still lead to workable solutions (e.g. the variational lower bound term weighting in diffusion training), but a difficult data distribution caused by under-invested preprocessing design can completely break training.

Theory often lags behind practice in this field. Many successful methods were discovered empirically first and only later understood in terms of a governing theory, so as a researcher you often have to act on an empirical basis. Batch Normalization is a good example: it was widely adopted before a stable theoretical explanation emerged, and even today it is explained through multiple incomplete perspectives.

When learning a new modelling concept, I find that I develop my understanding more efficiently by starting with the intuitions behind the initial engineering choices and the chronology of their iterations, rather than jumping straight into the latest, more distilled, unified theory.
The former helps me see the rationale and the design space the original authors considered, and build stronger intuitions about which components are fundamental to training dynamics. Again, I have to mention my journey in learning diffusion models. While there are well-written resources for understanding diffusion through a unified stochastic differential equation lens, I found myself understanding the subject better after going through it in order: the discrete latent-variable framing (DDPM, variational diffusion models), then score-based models, and finally the SDE work that unifies these views.

Ok, enough ranting. Time to go back to figuring out why my loss refuses to go down… ...
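(An aside, to make the loss-weighting example above concrete: below is a toy sketch, in plain Python, of the gap between diffusion training's unweighted "simple" loss and a schedule-weighted variant. The function names, the linear beta schedule, and the exact weight form are my own illustrative choices, not taken from any particular paper's codebase.)

```python
def alpha_bar(t, T=1000):
    # Cumulative signal retention for a linear beta schedule:
    # prod_{s <= t} (1 - beta_s). Schedule constants are illustrative.
    betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]
    prod = 1.0
    for s in range(t + 1):
        prod *= 1.0 - betas[s]
    return prod

def simple_loss(eps, eps_pred):
    # The "simple" objective: plain mean squared error between true and
    # predicted noise, with the per-timestep VLB weight dropped entirely.
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)

def weighted_loss(eps, eps_pred, t, T=1000):
    # Schematic VLB-style variant: the same MSE scaled by a
    # timestep-dependent weight derived from the noise schedule.
    # The weight form here is illustrative, not the exact VLB coefficient.
    ab = alpha_bar(t, T)
    weight = (1.0 - ab) / ab  # grows as t increases (noisier timesteps)
    return weight * simple_loss(eps, eps_pred)
```

The point the sketch makes: the two objectives differ only by a timestep-dependent scalar, which is exactly the kind of "theoretically incorrect" deviation that still trains well in practice.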
What I talk about when I talk about asyncio
This piece captures what I’ve learned while going down the rabbit hole of understanding how asyncio works in Python. While a few others have written on the subject, I think it is still valuable to contribute another angle of introduction, as learners with different backgrounds resonate with different motivations and styles of explanation. This is the one I’ve written for my one-year-younger self.

Why this matters

As an AI researcher, I spent most of my career focused on models and data. When it came to optimizing runtime, whether for data processing or model inference, I typically relied on batching or multiprocessing. I had very little exposure to asynchronous execution, and I didn’t think I needed it. That changed when I stepped into the world of scaffolded LLM systems. ...