Two orthogonal cost optimizations to the Captain agent pipeline:
1. Hierarchical model routing (optimization A)
Captain::Scenario now overrides agent_model to read a dedicated
InstallationConfig CAPTAIN_OPEN_AI_MODEL_SCENARIO, falling back to the
global CAPTAIN_OPEN_AI_MODEL used by the orchestrator (Assistant).
Rationale: the orchestrator (Jasmine) does cheap triage (is this a
reservation intent? a greeting? escalate to human?) — a smaller model
handles this well. Scenarios (Daniela — reserva) run complex flows with
tool calling, strict taxonomies, and JSON schema output — they benefit
from a stronger model.
Config in this install: CAPTAIN_OPEN_AI_MODEL=gpt-4o-mini (orchestrator)
and CAPTAIN_OPEN_AI_MODEL_SCENARIO=gpt-4o (scenarios). Estimated ~60%
cost reduction vs everything on gpt-4o, preserving quality where it
matters for the business flow.
2. Conversation-level memory cache (optimization B)
MemoryPromptInjector now persists the computed memory block on
conversation.custom_attributes[captain_cached_memory_block]. First turn
computes once (embedding + pgvector query + XML formatting); subsequent
turns reuse. The customer's profile does not change during an open
conversation, so re-running the pipeline on every turn was pure waste.
Graceful fallbacks:
- Cache write failure → per-service-instance in-memory fallback still
applies.
- Cache read failure → fresh recall runs (no regression).
- Contact mismatch → invalidates cache, fresh recall runs.
When a new conversation starts, custom_attributes is empty → fresh
recall populates the cache for that conversation's lifetime.
Estimated ~80% reduction in embedding + pgvector calls during
multi-turn conversations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>