Research and Articles
RESEARCH·MAY 2026
Training a specialized hotel booking model
How general-purpose LLMs drop constraints across multi-turn booking conversations, and what changes when a small model trained on hotel workflows is benchmarked against GPT-4o.
ESSAY·MAY 2026
Lessons from evaluating AI in production
What a year of running evaluation infrastructure across regulated-industry deployments taught us about reliability, eval design, and why production traces matter more than benchmarks.
RESEARCH·APRIL 2026
Domain-specific evals for financial services
Building evaluation frameworks that capture the nuances of compliance, risk assessment, and regulatory requirements in financial AI deployments.
ESSAY·MARCH 2026
Why enterprise AI pilots fail
An analysis of common failure modes in enterprise AI deployments and the systematic approach needed to move from proof-of-concept to production.