Research and Articles

RESEARCH·MAY 2026

Training a specialized hotel booking model

How general-purpose LLMs drop constraints across multi-turn booking conversations, and what changes when a small model specialized for hotel workflows is benchmarked against GPT-4o.

ESSAY·MAY 2026

Lessons from evaluating AI in production

What a year of running evaluation infrastructure across regulated-industry deployments taught us about reliability, eval design, and why production traces matter more than benchmarks.

RESEARCH·APRIL 2026

Domain-specific evals for financial services

Building evaluation frameworks that capture the nuances of compliance, risk assessment, and regulatory requirements in financial AI deployments.

ESSAY·MARCH 2026

Why enterprise AI pilots fail

An analysis of common failure modes in enterprise AI deployments and the systematic approach needed to move from proof-of-concept to production.