Agentic Deep Research: Recursive Reasoning and Self-Correction Engine
Overview
An agentic deep research engine that scales inference-time compute to answer complex, multi-faceted questions through iterative reasoning loops with self-correction. The system implements a four-agent, DAG-orchestrated research pipeline built on LangGraph's StateGraph: a Planner decomposes queries into dependency-aware sub-question DAGs, parallel Executors perform web search with context-aware summarization, an Aggregator synthesizes findings, and a Critic evaluates completeness and dynamically extends the research plan when gaps are identified. Unlike static RAG, the research plan grows organically based on intermediate findings. On a 50-question compositional benchmark, this yields a +14.1% factuality improvement and nearly twice the perfect-answer rate of a single-shot baseline.
Multi-Agent Architecture
The coordination is managed through a compiled LangGraph StateGraph with conditional edges and Pydantic-validated structured outputs at every LLM boundary:
| Agent | Model | Role |
|---|---|---|
| Planner | Claude Sonnet 4 | Decomposes complex queries into dependency-aware DAGs of atomic SubQuery nodes (id, question, dependencies, reasoning) via structured output |
| Executor | Tavily Search + GPT-4o-mini | Identifies runnable steps (dependencies satisfied), executes web searches in parallel via ThreadPoolExecutor, and synthesizes results with dependency context |
| Aggregator | Claude Sonnet 4 | Builds structured context from all completed sub-questions and synthesizes a comprehensive, citation-grounded answer |
| Critic | Claude Sonnet 4 | Harsh “Research Director” evaluation — assesses completeness, factuality, and specificity; generates new SubQuery nodes to fill identified gaps |
| Judge | GPT-4o | Independent A/B evaluation with LLM-as-a-Judge scoring (offline, not in the inference loop) |
Model Stratification: Expensive reasoning models (Claude Sonnet) are used only for planning, critique, and synthesis; cost-efficient models (GPT-4o-mini) handle mechanical summarization — keeping costs to ~$0.10/query despite the multi-step pipeline.
Core Research Loop
```
ENTRY → Planner → Executor ──(more runnable steps?)──→ Executor (wave execution)
                     ↓ (all steps done)
                 Aggregator → Critic ──(gaps found?)──→ Executor (new sub-questions)
                                 ↓ (sufficient or max_loops reached)
                                END
```
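Stripped of LangGraph and the LLM calls, the control flow above reduces to a nested loop. This is a minimal pure-Python sketch with all agent internals stubbed out as callables; the state fields and step dictionaries are illustrative, not the system's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class State:
    plan: list = field(default_factory=list)      # sub-question dicts
    results: dict = field(default_factory=dict)   # step id -> summary
    answer: str = ""
    loops: int = 0

def runnable_steps(state):
    # Steps with all dependencies completed and no result yet.
    return [s for s in state.plan
            if s["id"] not in state.results
            and all(d in state.results for d in s["deps"])]

def research_loop(state, planner, executor, aggregator, critic, max_loops=3):
    state.plan = planner()
    while True:
        # Wave execution: keep running until no step is runnable.
        while (wave := runnable_steps(state)):
            for step in wave:
                state.results[step["id"]] = executor(step)
        state.answer = aggregator(state)
        state.loops += 1
        sufficient, new_steps = critic(state)
        if sufficient or state.loops >= max_loops or not new_steps:
            return state
        state.plan += new_steps  # Critic extends the DAG in place
```

In the real pipeline each callable is an LLM-backed node and the transitions are LangGraph conditional edges, but the termination logic is the same.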
Dynamic DAG Planning
The Planner produces a Pydantic-validated ResearchPlan with explicit dependency graphs. The key innovation: the DAG is not static. When the Critic identifies gaps, it generates new SubQuery nodes with IDs starting after the current maximum. Because the AgentState.plan field uses Annotated[List[SubQuery], operator.add], new steps are appended seamlessly without plan reconstruction — implementing genuine inference-time adaptive computation.
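The plan-extension mechanic can be illustrated with plain dataclasses (the production system uses Pydantic models; the field names here mirror the SubQuery schema described above). The key detail is the ID allocation, which guarantees appended nodes never collide with existing ones:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubQuery:
    id: int
    question: str
    dependencies: List[int]
    reasoning: str = ""

def extend_plan(plan: List[SubQuery], gap_questions: List[str]) -> List[SubQuery]:
    # New nodes get IDs starting after the current maximum. With a
    # list reducer like operator.add on the state field, the framework
    # concatenates new nodes onto the shared plan, so no reconstruction
    # of the existing DAG is needed.
    next_id = max((s.id for s in plan), default=0) + 1
    new_nodes = [SubQuery(id=next_id + i, question=q, dependencies=[])
                 for i, q in enumerate(gap_questions)]
    return plan + new_nodes  # mirrors operator.add semantics
```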
Dependency-Aware Parallel Execution
The Executor implements DAG-based scheduling: only steps with satisfied dependencies execute, and they run concurrently via ThreadPoolExecutor(max_workers=min(len(runnable), 5)). Each step’s summarization includes dependency context — later steps leverage findings from earlier steps during synthesis, enabling true multi-hop reasoning where one search informs the interpretation of the next. After each execution wave, newly-runnable steps are identified for the next wave.
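A sketch of this wave scheduler, assuming steps are dictionaries with `id` and `deps` keys and `run_step` stands in for the search-plus-summarization call (both names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def execute_waves(plan, run_step):
    """Run steps in dependency order; independent steps within a wave
    execute concurrently. Completed dependency results are passed to
    each step as context for multi-hop synthesis."""
    done = {}
    while True:
        runnable = [s for s in plan
                    if s["id"] not in done
                    and all(d in done for d in s["deps"])]
        if not runnable:
            break
        # Cap the pool at 5 workers, as in the Executor described above.
        with ThreadPoolExecutor(max_workers=min(len(runnable), 5)) as pool:
            futures = {s["id"]: pool.submit(run_step, s,
                                            {d: done[d] for d in s["deps"]})
                       for s in runnable}
            for sid, fut in futures.items():
                done[sid] = fut.result()
    return done
```

Because each wave waits for its futures before recomputing `runnable`, a step never starts until every result it depends on is available to shape its summarization prompt.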
Self-Correction Loop
The Critic uses a deliberately harsh evaluation prompt and returns a structured Assessment (is_sufficient, feedback, new_sub_questions). Three safety mechanisms bound the recursion: (1) max_loops=3 circuit breaker, (2) completion check when the critic is satisfied, (3) pending-step detection to avoid empty iterations.
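The three bounds compose into a single routing decision after the Critic runs; a sketch of that conditional-edge function, with illustrative state keys (the real system reads a Pydantic Assessment):

```python
MAX_LOOPS = 3

def route_after_critic(state: dict) -> str:
    """Return "executor" to continue researching or "end" to stop."""
    # (1) Circuit breaker: hard cap on refinement iterations.
    if state["loop_count"] >= MAX_LOOPS:
        return "end"
    # (2) Completion: the critic judged the answer sufficient.
    if state["assessment"]["is_sufficient"]:
        return "end"
    # (3) Pending-step detection: no new sub-questions means the next
    # iteration would be empty, so stop rather than loop without progress.
    if not state["assessment"]["new_sub_questions"]:
        return "end"
    return "executor"
```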
Evaluation Framework
Benchmark: A curated Golden Set of 50 hard compositional questions spanning 7 categories — multi-hop reasoning, temporal comparison, temporal-financial, aggregation, temporal-factual, temporal-economic, and direct comparison.
Protocol: A/B comparison against a NaiveRAG baseline (single Tavily search + single GPT-4o-mini answer). GPT-4o judges both systems on completeness, factuality, and coherence (1–5 scale) via Pydantic-validated structured output. The harness supports checkpoint resumability via pickle-based EvaluationCheckpoint with per-question tracking, cost estimation, and category-level analysis.
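Given the per-question judge scores, the headline metrics reduce to simple aggregation. A sketch (field names assumed; the actual harness works over Pydantic-validated judge outputs):

```python
def summarize(scores):
    """scores: list of dicts with 'completeness', 'factuality', and
    'coherence' on a 1-5 scale for one system. Returns per-dimension
    means and the perfect-score rate (all three dimensions at 5)."""
    n = len(scores)
    dims = ("completeness", "factuality", "coherence")
    means = {d: round(sum(s[d] for s in scores) / n, 2) for d in dims}
    perfect = sum(all(s[d] == 5 for d in dims) for s in scores) / n
    return means, perfect
```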
Results
| Metric | RecursiveAgent | NaiveRAG | Improvement |
|---|---|---|---|
| Factuality | 4.04/5 | 3.54/5 | +14.1% |
| Completeness | 4.42/5 | 4.30/5 | +2.8% |
| Coherence | 4.64/5 | 4.54/5 | +2.2% |
| Perfect Scores (5/5/5) | 30% | 16% | +14pp |
| Win Rate | 50% | 14% | — |
Category strengths: Temporal-comparison (80% win rate), temporal-economic (75%), temporal-financial (57%). The agent excels on queries requiring cross-temporal synthesis and multi-source aggregation — precisely the scenarios where iterative DAG expansion and self-correction provide the most value.
Tech Stack
Python, LangGraph, LangChain, Pydantic, Anthropic Claude (Sonnet 4), OpenAI GPT-4o / GPT-4o-mini, Tavily Search API, PyTorch