Search Efficiency as the AI Execution Bottleneck

While trying to wrap my head around the role of scholarly-record metadata in bolstering LLM use in AI4S, I came across this paper today.

In harness engineering, the core challenge is not the model’s reasoning capability — it’s the quality and efficiency of context retrieval. Semantic search becomes the critical bottleneck because every tool call, every decision, every subtask depends on feeding the agent the right context at the right time.

What “High Quality” Context Means

High quality context must satisfy all four properties simultaneously:

  • High relevance: directly pertinent to the current task or subtask
  • Accuracy: factually correct, not stale or contradicted by more recent state
  • Comprehensive: no critical gaps that would cause the agent to hallucinate missing information
  • Efficient: just enough to execute the task well, without padding that consumes context window unnecessarily

The tension between comprehensive and efficient is the core design challenge. Too little context → agent errors or hallucinations. Too much → context bloat, slower inference, higher cost, and degraded attention over long sequences.

Why This Bottlenecks AI Execution

Harness engineering requires a large volume of context passes — tool descriptions, prior outputs, memory retrievals, plan state, environmental state. Each pass compounds the search problem:

  1. Volume: many independent retrievals per task, each needing precision
  2. Latency: slow or repeated retrieval stalls agent loops
  3. Precision loss at scale: as the knowledge base grows, more near-miss candidates compete for the top results, so precision degrades without careful chunking, embedding strategy, and reranking
  4. Context window pressure: over-retrieved context displaces working memory needed for reasoning

If semantic search is weak, the harness compensates by injecting more context “just in case” — which is exactly the wrong tradeoff. The result is a bloated, expensive, fragile pipeline.
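To make the cost of "just in case" retrieval concrete, here is a back-of-the-envelope sketch. The figures (20 context passes per task, 400-token chunks, 3 versus 12 chunks per pass) are illustrative assumptions, not measurements from any real harness.

```python
def total_context_tokens(passes: int, chunks_per_pass: int, tokens_per_chunk: int) -> int:
    """Tokens spent on retrieved context across all passes of one task."""
    return passes * chunks_per_pass * tokens_per_chunk


# Hypothetical figures: 20 context passes per task, 400-token chunks.
lean = total_context_tokens(passes=20, chunks_per_pass=3, tokens_per_chunk=400)
padded = total_context_tokens(passes=20, chunks_per_pass=12, tokens_per_chunk=400)

print(lean)           # 24000
print(padded)         # 96000
print(padded / lean)  # 4.0
```

Under these assumptions, padding every pass quadruples the retrieved tokens for a single task, before counting the attention degradation that comes with longer sequences.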

Design Implications

The harness should retrieve the minimum sufficient context — not the maximum safe context. This requires high-precision search, not high-recall search.

This inverts the default instinct of “retrieve more to be safe.” Instead:

  • Use reranking (cross-encoder or LLM-as-judge) to cut low-relevance results after initial retrieval
  • Chunk at semantic boundaries, not fixed token counts, to preserve coherence
  • Cache frequently-accessed high-relevance context to reduce repeated retrieval overhead
  • Treat context budget as a first-class constraint in harness design — not an afterthought
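The four points above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `toy_relevance` is a lexical-overlap stand-in for a real cross-encoder or LLM judge, `count_tokens` approximates tokens by whitespace-separated words, paragraph breaks stand in for semantic boundaries, and all function names are hypothetical.

```python
import re
from functools import lru_cache


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines so each chunk is a coherent paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def count_tokens(text: str) -> int:
    """Crude token estimate (assumption): whitespace-separated words."""
    return len(text.split())


def toy_relevance(query: str, chunk: str) -> float:
    """Stand-in for a reranker: fraction of query terms found in the chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q) if q else 0.0


def select_context(query: str, chunks: list[str], budget_tokens: int,
                   min_score: float = 0.3) -> list[str]:
    """Rerank chunks, drop low-relevance ones, and pack greedily under the budget."""
    scored = sorted(((toy_relevance(query, c), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for score, chunk in scored:
        if score < min_score:
            break  # everything after this point is even less relevant
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip chunks that would exceed the context budget
        selected.append(chunk)
        used += cost
    return selected


@lru_cache(maxsize=256)
def cached_context(query: str, corpus: str, budget_tokens: int) -> tuple[str, ...]:
    """Memoize selections so repeated identical requests skip retrieval."""
    return tuple(select_context(query, chunk_by_paragraph(corpus), budget_tokens))


corpus = (
    "The retry policy uses exponential backoff with a cap of 30 seconds.\n\n"
    "Office hours are Monday to Friday.\n\n"
    "Retry attempts are limited to five per request; the policy resets on success."
)
tight = cached_context("retry policy limits", corpus, 20)
roomy = cached_context("retry policy limits", corpus, 30)
print(len(tight), len(roomy))
```

With the tight budget only one of the two relevant paragraphs fits; with the roomier budget both do, and the irrelevant "office hours" paragraph is cut by the score threshold in either case. The budget is an explicit argument to every call, which is the point of treating it as a first-class constraint.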

Open Questions

  • At what retrieval precision does semantic search stop being the bottleneck?
  • How do you define “minimum sufficient” programmatically without task-specific oracles?
  • Can the agent itself assess whether its context is sufficient before executing, reducing error recovery loops?
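As a starting point for the last question, one crude proxy is a pre-execution check that the context mentions everything the task declares it needs. The sketch below assumes the task can list its required facts up front, which is itself a mild form of task-specific oracle, and it uses plain substring matching; `SufficiencyReport` and `check_sufficiency` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SufficiencyReport:
    sufficient: bool
    missing: list[str]


def check_sufficiency(required_facts: list[str], context: str) -> SufficiencyReport:
    """Report which declared requirements are not mentioned in the context."""
    lowered = context.lower()
    missing = [fact for fact in required_facts if fact.lower() not in lowered]
    return SufficiencyReport(sufficient=not missing, missing=missing)


report = check_sufficiency(
    ["api key", "endpoint", "rate limit"],
    "The endpoint is /v1/search and the API key is read from the environment.",
)
print(report.sufficient)  # False
print(report.missing)     # ['rate limit']
```

A harness could use the `missing` list to issue one targeted retrieval instead of executing and recovering from the error afterwards. Whether such a check generalizes beyond tasks with enumerable requirements remains open.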