The AI world moves fast—but few people think rigorously about how we know what’s actually working. In our latest episode of the Ignite Podcast, we spoke with Minha Hwang, Principal Applied Scientist at Microsoft, to break down the messy, high-stakes world of AI experimentation, model evaluation, and what comes after training your models.
With a unique journey spanning MIT, McKinsey, academia, and Microsoft, Minha combines deep technical expertise with a pragmatic business lens. This blog unpacks the most valuable lessons from our conversation.
🎯 From Data Storage to Decision Science
Minha’s path is anything but linear:
PhD #1 in materials science at MIT
McKinsey consultant working across industries
PhD #2 in marketing science, then a professorship at McGill
Today, leading high-impact experimentation systems at Microsoft
What ties it all together? A relentless focus on data-driven decision making and understanding the real impact behind the numbers.
🧪 Why A/B Testing Isn’t Enough Anymore
Most companies lean heavily on A/B testing. But Minha warns of a harsh reality:
“False positives are shockingly common—especially when teams run too many tests, with too many metrics, on too little traffic.”
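To see why the numbers pile up so quickly, here is a back-of-the-envelope illustration (mine, not a figure from the episode): assuming independent metrics, each tested at a 5% significance level, the chance of at least one spurious "win" grows fast with the number of metrics.

```python
# Why false positives pile up: with many independent metrics each tested at
# alpha = 0.05, the chance of at least one spurious "significant" result under
# the null grows quickly. Purely illustrative arithmetic.
alpha = 0.05
for n_metrics in (1, 5, 20, 100):
    p_any_false_positive = 1 - (1 - alpha) ** n_metrics
    print(f"{n_metrics:>3} metrics -> P(at least one false positive) = {p_any_false_positive:.0%}")
```

With 20 metrics, roughly two out of three null experiments will flag at least one "significant" result, which is exactly why the safeguards below matter.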
He outlines how Microsoft tackles this:
Proxy metrics to detect signal faster
Variance reduction techniques using ML (sketched below)
Repeat experiments to validate surprising results (“solidification”)
These practices help Microsoft scale experimentation without sacrificing trust in the data.
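The episode doesn't spell out the exact machinery, but a widely used pattern for ML-assisted variance reduction is CUPED-style regression adjustment: use pre-experiment data to explain away predictable noise in the metric, so the same traffic yields tighter confidence intervals. The sketch below is illustrative (simulated data, hypothetical variable names), not Microsoft's internal implementation.

```python
# Illustrative CUPED-style variance reduction on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

pre = rng.normal(10, 3, size=n)           # pre-experiment engagement (covariate)
assign = rng.integers(0, 2, size=n)       # 0 = control, 1 = treatment
post = 0.8 * pre + rng.normal(0, 2, size=n) + 0.1 * assign   # small true lift

# CUPED: subtract theta * (pre - mean(pre)) to remove variance predictable
# from pre-experiment behavior; the treatment effect is left untouched.
theta = np.cov(post, pre)[0, 1] / np.var(pre)
adjusted = post - theta * (pre - pre.mean())

for name, y in [("raw", post), ("cuped", adjusted)]:
    lift = y[assign == 1].mean() - y[assign == 0].mean()
    t_stat, p_val = stats.ttest_ind(y[assign == 1], y[assign == 0])
    print(f"{name:>5}: estimated lift = {lift:.3f}, p-value = {p_val:.4f}")
```

The adjusted comparison typically detects the same small lift with a noticeably smaller p-value than the raw one, which is the whole point: more sensitivity without more traffic.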
🔍 Causal Inference: The Most Underrated Skill in ML
While machine learning is great for prediction, Minha argues that causal inference—understanding what actually caused an outcome—is what truly drives business impact.
“Most ML teams are mapping X to Y. But businesses want to know: if I change X, what happens to Y?”
He highlights tools like observational causal inference, counterfactual reasoning, and A/B tests—but notes most data science programs underemphasize them.
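To make that quote concrete, here is a toy simulation (my illustration, not from the episode): a hidden confounder makes X an excellent predictor of Y, yet intervening on X, as a randomized experiment does, changes nothing.

```python
# Toy contrast between "X predicts Y" and "changing X changes Y".
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Observational world: a confounder drives both X and Y, so X predicts Y well.
confounder = rng.normal(size=n)
x_obs = confounder + rng.normal(scale=0.5, size=n)
y_obs = 2.0 * confounder + rng.normal(scale=0.5, size=n)       # X has no direct effect
slope_obs = np.cov(x_obs, y_obs)[0, 1] / np.var(x_obs)
print(f"observational slope (prediction): {slope_obs:.2f}")    # around 1.6

# Interventional world: randomize X, as an A/B test does.
x_rand = rng.normal(size=n)                                    # X assigned independently
y_rand = 2.0 * confounder + rng.normal(scale=0.5, size=n)      # still no direct effect
slope_causal = np.cov(x_rand, y_rand)[0, 1] / np.var(x_rand)
print(f"randomized slope (causal effect): {slope_causal:.2f}")  # around 0.0
```

The predictive fit happily reports a strong slope; the randomized version, which answers the "if I change X, what happens to Y" question, correctly reports roughly zero.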
🤖 Evaluating LLMs: The New Frontier
As Microsoft integrates large language models (LLMs) into more products, experimentation gets trickier:
A/B testing LLM features often lacks clean control groups
Standard metrics don’t always reflect user preference or quality
Evaluation becomes more about human preferences and offline metrics (illustrated below)
This shift demands a new mindset—one that blends rigorous experimentation with deep qualitative insight.
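One common pattern behind preference-based, offline evaluation is pairwise comparison: judges see responses from a baseline and a candidate model side by side, and the votes are summarized as a win rate. The sketch below is a generic illustration with made-up counts, not Microsoft's evaluation pipeline.

```python
# Offline pairwise-preference eval: win rate of a candidate model over a
# baseline, from per-prompt judgments. Counts are made up for illustration.
import math

judgments = ["candidate"] * 142 + ["baseline"] * 98 + ["tie"] * 60

wins = judgments.count("candidate")
losses = judgments.count("baseline")
ties = judgments.count("tie")
decided = wins + losses

# Win rate among decided comparisons, with a normal-approximation 95% CI.
win_rate = wins / decided
se = math.sqrt(win_rate * (1 - win_rate) / decided)
ci_low, ci_high = win_rate - 1.96 * se, win_rate + 1.96 * se

print(f"wins={wins} losses={losses} ties={ties}")
print(f"win rate = {win_rate:.1%} (95% CI: {ci_low:.1%} to {ci_high:.1%})")
```

The same summary works whether the judges are humans or an LLM acting as a judge; the confidence interval keeps small evaluation sets from overstating a "win".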
🧠 The Case for Open Source and Reinforcement Learning
Minha is optimistic about:
Open-weight models like DeepSeek as democratizers of AI innovation
Reinforcement learning as a path beyond the limits of human-labeled data
“If we want models to go beyond human-level intelligence, we’ll need them to learn from experience—not just our data.”
He predicts RL and simulated environments will play a growing role in training next-gen AI.
🚀 What Comes After LLMs?
While LLMs dominate headlines, Minha is thinking ahead:
Smarter pricing agents for small businesses
Non-LLM applications with direct business value
Eventually, robotics and physical AI, where visual and tactile learning replaces pure text-based intelligence
The future, he believes, will demand more than language—it will require systems that understand, act, and adapt.
💡 Final Thought
Amid the AGI debates and benchmark hype, Minha offers a grounded view:
“As an engineer, I don’t care if it’s AGI. What matters is—does it solve the problem? Is it useful?”
That’s a philosophy worth holding onto in today’s rapidly evolving AI landscape.
🎧 Want to Go Deeper?
Listen to the full episode with Minha Hwang for stories, frameworks, and strategies you won’t hear anywhere else. Whether you're building AI systems or evaluating their business impact, this one’s a masterclass.
👂🎧 Watch, listen, and follow on your favorite platform: https://tr.ee/S2ayrbx_fL
🙏 Join the conversation on your favorite social network: https://linktr.ee/theignitepodcast
Chapters:
00:00 Intro
00:40 Minha’s Engineering Roots and PhD at MIT
01:55 Jumping from Engineering to Consulting at McKinsey
03:15 Why He Went Back for a Second PhD
04:35 Transition from Academia to Applied Data Science
06:00 Building McKinsey’s Data Science Arm
07:30 Moving to Microsoft to Explore Unstructured Data
08:40 Making A/B Testing More Sensitive with ML
10:00 Why False Positives Are a Massive Problem
11:05 How to Validate Experiments Through “Solidification”
12:10 The Importance of Proxy and Debugging Metrics
13:35 Model Compression and Quantization Explained
15:00 Balancing Statistical Rigor with Product Speed
16:30 Why Data, Not Model Training, Is the Bottleneck
18:00 Causal Inference vs. Machine Learning
20:00 Measuring What You Can’t Observe
21:15 The Missing Role of Causality in AI Education
22:15 Reinforcement Learning and the Data Scarcity Problem
23:40 The Rise of Open-Weight Models Like DeepSeek
25:00 Can Open Source Overtake Closed Labs?
26:15 IP Grey Areas in Foundation Model Training
27:35 Multimodal Models and the Future of Robotics
29:20 Simulated Environments and Physical AI
30:25 AGI, Overfitting, and the Benchmark Illusion
32:00 Practical Usefulness > Philosophical Debates
33:25 Most Underrated Metrics in A/B Testing
34:35 Favorite AI Papers and Experimentation Tools
36:30 Measuring Preferences with Discrete Choice Models
36:55 Outro