Low-latency warm recall
Run a local recall service that keeps the store, vector index, embedder, and reranker warm, so that hook- and agent-loop calls stay under the 500ms budget.
Start the service
The service binds to 127.0.0.1 by default and is intended for local use. Keep bearer auth enabled whenever agent tools can reach localhost.
$env:HEARTWOOD_RECALL_TOKEN = "replace-with-local-secret"
python -m heartwood.cli serve-recall `
--db .\heartwood.db `
--tenant tenant:ops `
--warm-tenant tenant:acme-payments `
--warm-tenant tenant:northwind-retail `
--host 127.0.0.1 --port 8765 `
--token $env:HEARTWOOD_RECALL_TOKEN
Recall
Both embedded one-shot recall and warm-service recall return JSON with recall_id, latency_ms, index_lag, result metadata, provenance validation, ranking signals, and source IDs.
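A caller that consumes this JSON will usually want to check the reported latency before acting on the results. The sketch below is illustrative only: it assumes `recall_id`, `latency_ms`, and `index_lag` are top-level keys and that `latency_ms` is numeric. The field names come from the description above, but their nesting and types are not specified there. The helper `summarize_recall` and the sample payload are hypothetical.

```python
import json

BUDGET_MS = 500  # latency budget named in this guide


def summarize_recall(raw: str, budget_ms: float = BUDGET_MS) -> dict:
    """Pull out the fields a hook would typically log from a recall response.

    ASSUMPTION: recall_id, latency_ms, and index_lag are top-level keys;
    the real response layout may differ.
    """
    payload = json.loads(raw)
    latency_ms = float(payload["latency_ms"])
    return {
        "recall_id": payload["recall_id"],
        "latency_ms": latency_ms,
        "index_lag": payload.get("index_lag"),  # passed through untouched
        "within_budget": latency_ms <= budget_ms,
    }


# Hypothetical payload for illustration only; not real service output.
sample = '{"recall_id": "r-1", "latency_ms": 42.5, "index_lag": 0}'
print(summarize_recall(sample))
```

Reading the fields through one small function keeps any schema assumptions in a single place, which makes them easy to correct once the actual response shape is confirmed.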
python -m heartwood.cli recall `
--url http://127.0.0.1:8765 `
--token $env:HEARTWOOD_RECALL_TOKEN `
--tenant tenant:acme-payments `
--principal-id agent:orchestrator `
--query "what guidance applies to Acme Payments audit details?" `
--k 5 --json
Prove the 500ms budget
Run the benchmark against the warm service before cutting over any latency-sensitive caller. It reports p50, p95, max latency, and pass status.
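To make those numbers concrete, here is a minimal sketch of how p50, p95, max, and a pass flag could be derived from a list of latency samples. It uses the nearest-rank percentile method and treats "pass" as p95 at or below the threshold; both are assumptions, since the method `bench-recall` actually uses is not described here. The function names and sample values are hypothetical.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def bench_summary(latencies_ms: list, max_p95_ms: float) -> dict:
    """Summarize latency samples the way a benchmark report might."""
    p95 = percentile(latencies_ms, 95)
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": p95,
        "max": max(latencies_ms),
        "passed": p95 <= max_p95_ms,  # ASSUMPTION: inclusive threshold
    }


# Ten made-up samples (mirroring --repeat 10); not measured data.
samples = [38, 41, 40, 44, 39, 43, 120, 42, 45, 40]
print(bench_summary(samples, max_p95_ms=500))
```

Under the nearest-rank method, p95 over ten samples is simply the slowest sample, so a single outlier decides the result. A larger repeat count gives a more stable p95.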
python -m heartwood.cli bench-recall `
--url http://127.0.0.1:8765 `
--token $env:HEARTWOOD_RECALL_TOKEN `
--tenant tenant:acme-payments `
--principal-id agent:orchestrator `
--query "Acme Payments audit provenance guidance" `
--repeat 10 --max-p95-ms 500 --require-pass
HTTP surface
• GET /health - readiness and model/index names
• POST /recall - governed recall (send Authorization: Bearer <token> when a token is set)
• GET /metrics - process-local recall latency counters and p95
• POST /warm - warm additional tenants
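For callers that talk to the service directly rather than through the CLI, the sketch below builds a `POST /recall` request with the bearer header using only the Python standard library. The JSON body keys (`tenant`, `principal_id`, `query`, `k`) are an assumption that mirrors the CLI flags; the real request schema is not documented here and may differ. `build_recall_request` is a hypothetical helper.

```python
import json
import os
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8765"  # host and port from the serve-recall example


def build_recall_request(tenant: str, principal_id: str, query: str,
                         k: int = 5,
                         token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /recall request."""
    # ASSUMPTION: body keys mirror the CLI flags; check the real schema.
    body = json.dumps({
        "tenant": tenant,
        "principal_id": principal_id,
        "query": query,
        "k": k,
    }).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Only sent when a token is configured, as the endpoint list notes.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{BASE_URL}/recall", data=body, headers=headers, method="POST"
    )


request = build_recall_request(
    tenant="tenant:acme-payments",
    principal_id="agent:orchestrator",
    query="Acme Payments audit provenance guidance",
    token=os.environ.get("HEARTWOOD_RECALL_TOKEN"),
)
print(request.get_method(), request.full_url)

# To send it against a running service:
# with urllib.request.urlopen(request, timeout=2) as response:
#     print(json.load(response))
```

Separating request construction from sending keeps the token handling easy to inspect and lets the request be checked without a running service.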
Adapted from docs/integrations/warm-recall.md in the open-source core.