SearchSwarm

Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Real tasks can grow almost without bound, yet a model's context is finite. We teach agentic LLMs delegation intelligence: the ability to decompose a long-horizon task, delegate bounded subtasks to subagents, and integrate their condensed, evidence-grounded results. This is an active form of context management that lets a single model take on far more than its context alone allows.


SearchSwarm Team

Delegation as active context management

A single model delegates bounded subtasks to subagents in separate contexts, which return only condensed results, keeping the main context clear.

High-quality delegation SFT data

We synthesize and release fine-tuning trajectories that teach when to delegate, how to brief a subagent, and how to verify what comes back.

30B-A3B SOTA

SearchSwarm leads every model at its scale on all four benchmarks: BrowseComp, BrowseComp-ZH, GAIA, and xbench-DeepSearch.

Results

Benchmark Comparisons

SearchSwarm is the state-of-the-art model at the 30B scale, across all four benchmarks.

Demo

Trajectories in Action

Real runs: watch the main agent decompose a question, delegate to subagents, and synthesize a cited final answer.

Short-Answer Deep Research

Eight cases that require multi-hop evidence gathering before returning a concise, source-grounded answer.

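Lokomotiv Moscow Qualifier Chain: A sports-fact puzzle connecting club history, World Cup qualification, refereeing, and player records.
Coomera Connector Infrastructure Chain: An infrastructure puzzle linking state funding, motorway opening, rail construction, and project naming.
Foujita and Montparnasse Triangulation: A relational art-history search across exhibitions, influences, memoirs, and Paris atelier geography.
Wrestling Holds and Great Lakes Titles: A long-chain sports-history case covering submission holds, trainers, territories, and championship timelines.
Advance Australia Foundation Archive Identification: A multi-hop search across SBS, parliamentary paper numbers, licensing records, and government funding.
WMO, Alerta Rio, and Disaster-Response Years: A chain linking WMO global high-temperature records, Rio's Alerta Rio weather-monitoring system, and the city's emergency-response agencies.
Urban Development Clues from Tokyo to Shenzhen: A chain linking Tokyo's integrated station-city development, Shenzhen's transfer of contiguous land parcels, and occupancy policies for high-rise industrial buildings.
Xinjiang Hua'er Folk Musician Clues: A chain linking birth era, the history of the eastern expedition, Tekes County place names, and identity as a Hui Hua'er performer.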

Open-Ended Deep Research

Three long-form synthesis cases with source-grounded reports.

Horizontal Gene Transfer in Plants and Animals: A broad scientific synthesis on eukaryotic HGT, mechanisms, rarity, and evolutionary significance.
Light-Based Aesthetic Medicine: A 2020-2026 survey of laser, IPL, and LED therapies for photoaging and pigmentation.
Evolution of Neutrophil Function in Cerebral Ischemia: Functional changes across the acute and chronic phases, immune-cell interactions, clinical outcomes, and future work.

Method

SearchSwarm Framework

The main agent owns the research mainline: it decomposes the question, delegates bounded evidence-gathering to subagents, and integrates the condensed, source-grounded reports they return.

SearchSwarm architecture and execution flow

SearchSwarm at a glance. The main agent dispatches bounded subtasks to subagents that run in their own fresh contexts and return condensed, cited reports, which re-enter the main agent's context for verification and synthesis.
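The flow described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the SearchSwarm implementation: `Report`, `run_subagent`, `main_agent`, and the toy `decompose` and `search` stand-ins are all hypothetical names, and first-sentence extraction stands in for LLM summarization.

```python
from dataclasses import dataclass
from typing import Callable

# All names below are hypothetical; nothing here is the SearchSwarm API.
Page = tuple[str, str]  # (url, raw page text)


@dataclass
class Report:
    subtask: str
    summary: str        # condensed finding that re-enters the main context
    sources: list[str]  # URLs cited for the summary


def run_subagent(subtask: str, search: Callable[[str], list[Page]]) -> Report:
    """Work in a fresh scope: raw pages stay here, only a short report leaves."""
    pages = search(subtask)
    # Stand-in for LLM summarization: keep the first sentence of each page.
    summary = " ".join(text.split(".")[0].strip() + "." for _, text in pages)
    return Report(subtask, summary, [url for url, _ in pages])


def main_agent(question: str, decompose, search) -> list[str]:
    """Decompose, delegate each subtask, and admit only cited reports."""
    context = [question]
    for subtask in decompose(question):
        report = run_subagent(subtask, search)
        if report.sources:  # minimal verification: drop uncited findings
            context.append(f"{report.summary} [{'; '.join(report.sources)}]")
    return context


# Toy stand-ins so the sketch runs without a model or a search engine.
def toy_decompose(question: str) -> list[str]:
    return [f"{question} (part A)", f"{question} (part B)"]


def toy_search(query: str) -> list[Page]:
    url = f"https://example.org/{len(query)}"
    return [(url, f"Finding for {query}. " + "raw page text " * 200)]


question = "Who refereed the match"
context = main_agent(question, toy_decompose, toy_search)
raw_chars = sum(
    len(text) for sub in toy_decompose(question) for _, text in toy_search(sub)
)
print(len(context), sum(len(c) for c in context), raw_chars)
```

The point of the design is visible in the two character counts printed at the end: the raw retrieval volume never enters `context`, which only grows by one short, cited line per subtask.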

1. Encourage delegation: Every token spent on raw retrieval is one not spent on reasoning, so the harness pushes the main agent to delegate multi-step gathering and reserve its context for decomposition, verification, and synthesis.

2. Comprehensive briefing: The main agent briefs each subagent like a new collaborator: not just the subtask, but why it matters, what is already established, and what is still uncertain, so it works on target.

3. Main agent retains core judgment: Subagents gather; the main agent decides. It checks each finding against its sources, adjudicates conflicts, and chooses which hypotheses to pursue or drop.

4. Citation-grounded reporting: Every subagent conclusion carries inline citations to its sources, and the main agent propagates them into a final answer whose explanation is traceable end to end.
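Two of these principles, comprehensive briefing and citation-grounded reporting, lend themselves to a small data-structure sketch. Everything here is an assumption made for illustration: the `Brief` fields, the `[n]` inline citation format, and the `uncited_claims` check are hypothetical and are not taken from the SearchSwarm harness.

```python
import re
from dataclasses import dataclass


@dataclass
class Brief:
    """Hypothetical briefing record mirroring the four items a brief should carry."""
    subtask: str
    why_it_matters: str
    established: list[str]
    uncertain: list[str]

    def render(self) -> str:
        lines = [
            f"Subtask: {self.subtask}",
            f"Why it matters: {self.why_it_matters}",
            "Already established:",
        ]
        lines += [f"- {fact}" for fact in self.established]
        lines.append("Still uncertain:")
        lines += [f"- {item}" for item in self.uncertain]
        return "\n".join(lines)


# Assumed citation format: numeric ids in square brackets, e.g. "[2]".
CITATION = re.compile(r"\[(\d+)\]")


def uncited_claims(report: str, sources: dict[int, str]) -> list[str]:
    """Return sentences that carry no citation or cite an unknown source id."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        if not sentence:
            continue
        ids = [int(m) for m in CITATION.findall(sentence)]
        if not ids or any(i not in sources for i in ids):
            flagged.append(sentence)
    return flagged


brief = Brief(
    subtask="Confirm the opening year of the motorway section",
    why_it_matters="The year selects which funding round to trace next",
    established=["The project is state funded"],
    uncertain=["Whether the opening was staged"],
)
sources = {1: "https://example.org/a", 2: "https://example.org/b"}
report = "Claim one [1]. Claim two [3]. Claim three."
flagged = uncited_claims(report, sources)
print(brief.render())
print(flagged)
```

In this sketch the main agent would treat anything in `flagged` as unverified: the second sentence cites a source id that was never returned, and the third cites nothing.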

Leaderboard

Performance Table

Baseline numbers are taken from the respective technical reports or model cards; an asterisk (*) marks results that use context management, and a double dash (--) marks a value that is not available.

Model	Size	BrowseComp	BrowseComp-ZH	GAIA	xbench-DeepSearch-2505
Closed-source models
GPT-5.2-Thinking	--	65.8	76.1	--	--
GPT-5	--	54.9	65.0	76.4	77.8
Claude-4.5-Opus	--	67.8	62.4	71.5	--
Claude-4.5-Sonnet	--	24.1	42.4	66.0	66.5
Gemini-3.0-Pro	--	59.2	66.8	74.8	--
Seed-2.0-Pro	--	77.3*	82.4*	78.6	--
Open-source models
Kimi-K2.5	1T-A32B	78.4*	--	--	--
GLM-4.7	355B-A32B	67.5*	66.6*	--	72.0
GLM-5.0	744B-A40B	75.9*	72.7*	--	--
DeepSeek V3.2	671B-A37B	67.6*	65.0*	75.1	78.0
LongCat-Flash-Thinking-2601	560B-A27B	73.1*	77.7*	--	--
MiniMax-M2	230B-A10B	44.0	--	75.7	72.0
MiniMax-M2.5	230B-A10B	76.3*	--	--	--
Step-3.5-Flash	196B-A11B	69.0*	66.9	84.5	83.7
Open-source lightweight models
Tongyi DeepResearch	30B-A3B	43.4	46.7	70.9	75.0
Tongyi DR Swarm	30B-A3B	≈43.4	≈46.7	≈70.9	≈75.0
RedSearcher	30B-A3B	57.4*	58.2*	80.1	--
LongSeeker	30B-A3B	61.5*	62.5*	77.7*	78.0*
MiroThinker-1.5-mini	30B-A3B	56.1*	66.8*	72.0*	73.1*
MiroThinker-1.7-mini	30B-A3B	67.9*	72.3*	80.3*	--
SearchSwarm (Ours)	30B-A3B	68.1*	73.3*	82.5*	80.8*
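As a quick arithmetic check on the lightweight group, the sketch below computes SearchSwarm's margin over the best other 30B-A3B entry on each benchmark, using only the numbers in the table above. The Tongyi DR Swarm row is left out because its values are given only approximately, missing entries are skipped, and the asterisk distinction is ignored, so the margins mix results with and without context management.

```python
BENCHMARKS = ["BrowseComp", "BrowseComp-ZH", "GAIA", "xbench-DeepSearch-2505"]

# Scores copied from the 30B-A3B rows of the table; None marks a "--" entry.
scores_30b = {
    "Tongyi DeepResearch": [43.4, 46.7, 70.9, 75.0],
    "RedSearcher": [57.4, 58.2, 80.1, None],
    "LongSeeker": [61.5, 62.5, 77.7, 78.0],
    "MiroThinker-1.5-mini": [56.1, 66.8, 72.0, 73.1],
    "MiroThinker-1.7-mini": [67.9, 72.3, 80.3, None],
    "SearchSwarm": [68.1, 73.3, 82.5, 80.8],
}


def margins(scores: dict, ours: str = "SearchSwarm") -> dict:
    """Margin of `ours` over the best other reported score, per benchmark."""
    result = {}
    for i, bench in enumerate(BENCHMARKS):
        best_other = max(
            row[i]
            for name, row in scores.items()
            if name != ours and row[i] is not None
        )
        result[bench] = round(scores[ours][i] - best_other, 1)
    return result


lead = margins(scores_30b)
print(lead)
```

A positive margin on every benchmark is what the claim of leading at this scale amounts to; the smallest margin is on BrowseComp.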

Generalization

Open-Ended Deep Research

Trained only on short-answer queries, SearchSwarm still transfers to long-form, multi-source synthesis.

Model	ScholarQA-v2	HealthBench	ResearchQA	DeepResearchBench	Average
Closed-source systems
OpenAI DeepResearch	79.6	53.8	79.2	46.9	64.9
Perplexity DeepResearch	67.3	--	75.3	42.3	--
Gemini-3.1-Pro + search	--	47.5	74.5	44.4	--
Open-source models
Qwen3-8B	40.4	16.5	56.1	33.3	36.6
QwQ-32B	41.9	24.5	60.9	40.3	41.9
Tongyi DeepResearch	46.5	46.2	66.7	40.6	50.0
WebThinker-32B-DPO	46.7	39.4	74.2	40.6	50.2
Dr.Tulu	88.3	52.8	75.7	45.4	65.6
SearchSwarm (Ours)	79.2	52.8	80.2	44.4	64.2
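The Average column appears to be the unweighted mean of the four benchmark scores, which is an inference from the numbers rather than something the page states. The sketch below checks that reading for every row that reports all four scores, allowing a rounding tolerance of about half a unit in the last printed digit.

```python
# (four benchmark scores, reported average), copied from the complete rows above.
rows = {
    "OpenAI DeepResearch": ([79.6, 53.8, 79.2, 46.9], 64.9),
    "Qwen3-8B": ([40.4, 16.5, 56.1, 33.3], 36.6),
    "QwQ-32B": ([41.9, 24.5, 60.9, 40.3], 41.9),
    "Tongyi DeepResearch": ([46.5, 46.2, 66.7, 40.6], 50.0),
    "WebThinker-32B-DPO": ([46.7, 39.4, 74.2, 40.6], 50.2),
    "Dr.Tulu": ([88.3, 52.8, 75.7, 45.4], 65.6),
    "SearchSwarm": ([79.2, 52.8, 80.2, 44.4], 64.2),
}


def mean_matches(scores: list[float], reported: float, tol: float = 0.051) -> bool:
    """True if the unweighted mean agrees with the reported value within tol."""
    return abs(sum(scores) / len(scores) - reported) <= tol


mismatches = [
    name for name, (scores, reported) in rows.items()
    if not mean_matches(scores, reported)
]
print(mismatches)
```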