Training a specialized hotel booking model
How general-purpose LLMs drop constraints across multi-turn booking conversations, and how a small model specialized for hotel workflows compares when benchmarked against GPT-4o.
The Problem
Hotel booking conversations are deceptively complex. A guest might start with a simple request for a room, then add constraints over multiple turns: specific dates, room preferences, accessibility requirements, loyalty program benefits, and budget limits.
General-purpose LLMs struggle to maintain these constraints across long conversations. Our analysis of 10,000 simulated booking sessions found that GPT-4o dropped at least one critical constraint in 23% of conversations exceeding 8 turns.
Our Approach
We trained a specialized 7B parameter model on a curated dataset of 500,000 hotel booking conversations, annotated with constraint tracking and resolution outcomes. The model was fine-tuned to explicitly maintain a constraint state across conversation turns.
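The constraint state the model maintains can be pictured as a simple key-value store that persists across turns, with later turns able to add or revise entries. This is a minimal illustrative sketch, not the model's actual internal representation; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConstraintState:
    """Tracks active booking constraints across conversation turns.

    Hypothetical sketch of the explicit state a constraint-tracking
    model maintains; field names are illustrative only.
    """
    constraints: dict = field(default_factory=dict)

    def update(self, turn_constraints: dict) -> None:
        # New or revised constraints overwrite earlier values;
        # constraints not mentioned this turn persist unchanged.
        self.constraints.update(turn_constraints)

    def active(self) -> dict:
        # Snapshot of everything currently in force.
        return dict(self.constraints)


# Turn 1: dates and budget; turn 2: accessibility added, budget revised.
state = ConstraintState()
state.update({"check_in": "2024-06-01", "budget_max_usd": 180})
state.update({"accessibility": "wheelchair", "budget_max_usd": 150})
print(state.active())
```

The failure mode described above corresponds to a model implicitly "forgetting" an entry from this store; making the state explicit turns that into a checkable output rather than a silent omission.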
Key innovations include a structured output format that surfaces active constraints at each turn, and a training objective that penalizes constraint violations more heavily than general response quality degradation.
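An objective that penalizes constraint violations more heavily than general quality degradation can be sketched as a token-weighted negative log-likelihood, where tokens inside constraint spans (dates, budgets, accessibility terms) receive a larger weight. This is a hedged illustration only: the function name, the masking scheme, and the 5x weight ratio are assumptions, not the values used in training.

```python
import math


def weighted_nll(token_logprobs, constraint_mask, constraint_weight=5.0):
    """Negative log-likelihood with constraint-span tokens upweighted.

    token_logprobs: per-token log-probabilities under the model.
    constraint_mask: True where a token belongs to a constraint span.
    constraint_weight: illustrative multiplier for constraint tokens.
    """
    total, norm = 0.0, 0.0
    for logprob, is_constraint in zip(token_logprobs, constraint_mask):
        weight = constraint_weight if is_constraint else 1.0
        total += -weight * logprob   # penalize low probability more on constraints
        norm += weight
    return total / norm              # weight-normalized average loss


# A poorly predicted constraint token (p=0.1) costs more than the same
# miss on an ordinary token.
bad_constraint = weighted_nll([math.log(0.9), math.log(0.1)], [False, True])
bad_ordinary = weighted_nll([math.log(0.9), math.log(0.1)], [True, False])
print(bad_constraint > bad_ordinary)
```

The design choice is that the gradient signal concentrates on exactly the tokens whose corruption causes a dropped constraint, rather than spreading uniformly across the response.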
Results
On our held-out evaluation set, the specialized model reduced constraint dropping to 3.2% across conversations of any length, compared to 23% for GPT-4o and 31% for Claude 3.5 Sonnet.
Importantly, the model maintained this reliability while running at 1/10th the inference cost, making it economically viable for high-volume booking applications where margin per transaction is measured in single-digit dollars.
Interested in similar work for your domain? We partner with enterprises to build specialized AI systems. Request access to our whitelist to discuss your use case.
Request Whitelist Access →