diff --git a/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/Kimi K2.5 Visual Agentic Intelligence Technical Report.html b/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/Kimi K2.5 Visual Agentic Intelligence Technical Report.html
new file mode 100644
index 00000000..990c19bb
--- /dev/null
+++ b/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/Kimi K2.5 Visual Agentic Intelligence Technical Report.html
@@ -0,0 +1,616 @@
Kimi K2.5: Visual Agentic Intelligence
Technical Summary • Open Source Model Release

Abstract

Kimi K2.5 is introduced as the most powerful open-source model to date, building on Kimi K2 with continued pretraining over approximately 15T mixed visual and text tokens. As a native multimodal model, K2.5 delivers state-of-the-art coding and vision capabilities alongside a self-directed agent swarm paradigm. For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls, reducing execution time by up to 4.5× compared to single-agent setups without predefined workflows. Key capabilities include coding with vision (front-end development, image/video-to-code generation), autonomous visual debugging, and office productivity tasks spanning documents, spreadsheets, PDFs, and slide decks.

Key Contributions
  • Native multimodal architecture with continued pretraining on 15T visual and text tokens
  • Self-directed agent swarm paradigm supporting up to 100 sub-agents and 1,500 parallel tool calls
  • State-of-the-art coding capabilities with vision (image/video-to-code generation)
  • Parallel-Agent Reinforcement Learning (PARL) for training parallel orchestrators
  • Critical Steps metric for evaluating parallel execution efficiency
  • Advanced office productivity capabilities handling documents, spreadsheets, PDFs, and slide decks
  • Kimi Code: open-source coding product with IDE integration and autonomous visual debugging

Agent Swarm Architecture

The animation below illustrates the self-directed swarm paradigm: a trainable Orchestrator dynamically instantiates specialized subagents, distributes tasks in parallel (orange), and aggregates results (green) to minimize total latency via the Critical Steps metric.
[Animated diagram: a trainable Orchestrator distributes subtasks to six specialized subagents (AI Research, Physics, Code Gen, Fact Check, Web Dev, Life Sci) and aggregates their results. Labels: Task Distribution, Result Aggregation, Parallel-Agent RL (PARL), and the Critical Steps formula CS = T_orchestration + max_i(T_subagent_i).]
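The page does not ship reference code for the swarm itself; the sketch below is a rough, hypothetical illustration of the pattern in the diagram (the subagent roles, function names, and timings are invented for the example, not Moonshot AI's implementation). An orchestrator fans independent subtasks out concurrently and then aggregates the results, so wall-clock time tracks the slowest subagent rather than the sum of all of them.

    import asyncio
    import time

    # Hypothetical stand-in for a subagent invocation; in the real system each
    # subagent would run its own tool-calling loop against the model.
    async def run_subagent(role: str, subtask: str) -> str:
        await asyncio.sleep(1.0)  # pretend the subagent works for one second
        return f"[{role}] result for: {subtask}"

    async def orchestrate(task: str) -> str:
        # 1. Task distribution: decompose the task and assign one specialized
        #    subagent per subtask (the real orchestrator chooses this split itself).
        subtasks = {
            "AI Research": f"survey prior work for: {task}",
            "Physics": f"check physical plausibility of: {task}",
            "Fact Check": f"verify claims in: {task}",
            "Web Dev": f"draft a results page for: {task}",
        }

        # 2. Parallel processing: launch every subagent concurrently, so latency
        #    is roughly max(subagent time) rather than the sum (the critical path).
        results = await asyncio.gather(
            *(run_subagent(role, sub) for role, sub in subtasks.items())
        )

        # 3. Result aggregation: merge the subagent outputs into one answer.
        return "\n".join(results)

    if __name__ == "__main__":
        start = time.perf_counter()
        print(asyncio.run(orchestrate("summarize the Kimi K2.5 technical report")))
        print(f"elapsed: {time.perf_counter() - start:.1f}s (about the slowest subagent, not the sum)")

With four one-second subagents the call finishes in roughly one second instead of four; the same effect, applied to up to 100 subagents and 1,500 tool calls, is what the reported up-to-4.5× speedup refers to.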

Core Concepts
  • Agent Swarm: Self-directed coordination of up to 100 specialized subagents without predefined workflows, dynamically created for specific subtasks
  • Parallel-Agent Reinforcement Learning (PARL): Training methodology using a trainable orchestrator with frozen subagents. Uses staged reward shaping, R = α·R_task + (1−α)·R_parallel, where α anneals from 0 to 1 to prevent serial collapse (a short numeric sketch follows this list)
  • Critical Steps: Latency-oriented metric measuring total execution time as CS = T_orchestration + max_i(T_subagent_i), optimizing the critical path rather than accuracy alone
  • Serial Collapse: Failure mode in which the orchestrator defaults to single-agent execution despite having parallel capacity; prevented via auxiliary rewards for subagent instantiation
  • Coding with Vision: Native multimodal capability converting images and videos into functional code with interactive layouts and animations
  • Staged Reward Shaping: Training technique that gradually shifts the reward from encouraging parallelism (early training) to optimizing end-to-end task quality (late training)
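The reward and the metric above can be made concrete with a short numeric sketch; the linear annealing schedule and the example values are illustrative assumptions, since the summary does not specify them beyond the definitions given.

    def staged_reward(r_task: float, r_parallel: float, step: int, total_steps: int) -> float:
        """R = alpha * R_task + (1 - alpha) * R_parallel, with alpha annealed from 0 to 1.

        Early in training alpha is near 0, so the parallelism reward dominates and the
        orchestrator is pushed to actually instantiate subagents (avoiding serial collapse);
        late in training alpha is near 1, so end-to-end task quality dominates.
        """
        alpha = min(1.0, step / total_steps)  # linear anneal; the real schedule may differ
        return alpha * r_task + (1.0 - alpha) * r_parallel

    def critical_steps(t_orchestration: float, t_subagents: list[float]) -> float:
        """CS = T_orchestration + max_i(T_subagent_i): latency of the critical path.

        Subagents run in parallel, so only the slowest one adds to wall-clock time.
        """
        return t_orchestration + max(t_subagents)

    # One unit of orchestration plus subagents taking 2, 3, and 5 units:
    # serially this would cost 1 + 2 + 3 + 5 = 11, while the critical path costs 1 + 5 = 6.
    print(critical_steps(1.0, [2.0, 3.0, 5.0]))                                   # 6.0
    print(staged_reward(r_task=0.8, r_parallel=1.0, step=100, total_steps=1000))  # 0.98, parallelism still dominates

Optimizing the critical path rather than the serial sum is what lets the swarm cut end-to-end runtime even when the total amount of work stays the same.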

Results
Kimi K2.5 achieves state-of-the-art performance across HLE-Full, BrowseComp, SWE-Bench Verified, MMMU Pro, MathVision, and VideoMMMU benchmarks. The Agent Swarm demonstrates up to 4.5× reduction in execution time and 80% reduction in end-to-end runtime compared to single-agent setups. The model delivers strong performance at significantly lower cost (up to 21.1× savings on BrowseComp compared to GPT-5.2).

Conclusions
Kimi K2.5 represents a meaningful step toward AGI for the open-source community, demonstrating strong capability on real-world tasks under real-world constraints. The integration of coding with vision, agent swarms, and office productivity capabilities positions the model as a comprehensive solution for knowledge work. The research demonstrates that at scale, the trade-off between vision and text capabilities disappears, with both improving in unison through continued multimodal pretraining. Future work will push further into the frontier of agentic intelligence.

Methodology Summary: Continued pretraining on ~15T mixed visual/text tokens; PARL training with staged reward shaping to prevent serial collapse; Critical Steps optimization for latency reduction; evaluation on AI Office Benchmark and General Agent Benchmark showing 59.3% and 24.3% improvements over K2 Thinking respectively.
\ No newline at end of file
diff --git a/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/kimi_vision_paper_to_page_workflow.ipynb b/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/kimi_vision_paper_to_page_workflow.ipynb
index 1d057e72..2d6d7c9d 100644
--- a/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/kimi_vision_paper_to_page_workflow.ipynb
+++ b/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/kimi_vision_paper_to_page_workflow.ipynb
@@ -25,7 +25,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
    "metadata": {},
    "outputs": [
     {
@@ -59,7 +59,7 @@
    "# Model configuration\n",
    "MODEL_NAME = \"moonshotai/kimi-k2.5\"\n",
    "DEFAULT_TEMPERATURE = 0.7\n",
-    "DEFAULT_MAX_TOKENS = 32768 # Large for HTML generation\n",
+    "DEFAULT_MAX_TOKENS = 32768 \n",
    "\n",
    "# Get API key\n",
    "NVIDIA_API_KEY = os.getenv(\"NVIDIA_API_KEY\")\n",
@@ -79,7 +79,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [
     {
@@ -140,7 +140,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
@@ -186,7 +186,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [
     {
@@ -273,7 +273,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 6,
    "metadata": {},
    "outputs": [
     {
@@ -411,7 +411,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 7,
    "metadata": {},
    "outputs": [
     {
@@ -451,7 +451,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [
     {
@@ -515,7 +515,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 14,
    "metadata": {},
    "outputs": [
     {
@@ -533,7 +533,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 15,
    "metadata": {},
    "outputs": [
     {
@@ -546,12 +546,26 @@
    "Extracting PDF: /home/chris/Code/NVIDIA/GenerativeAIExamples/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/Kimi K2.5 Visual Agentic Intelligence Technical Report.pdf\n",
    "  Extracted 15 pages\n",
    "  Converted to base64\n",
-    "Analyzing paper with vision model...\n",
+    "Analyzing paper with vision model...\n"
+   ]
+  },
+  {
+   "name": "stderr",
+   "output_type": "stream",
+   "text": [
+    "/home/chris/Code/NVIDIA/GenerativeAIExamples/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/.venv/lib/python3.13/site-packages/langchain_nvidia_ai_endpoints/_common.py:243: UserWarning: Found moonshotai/kimi-k2.5 in available_models, but type is unknown and inference may fail.\n",
+    "  warnings.warn(\n"
+   ]
+  },
+  {
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
    "  Sending 15 pages to vision model...\n",
    "  Analysis complete: Kimi K2.5: Visual Agentic Intelligence\n",
    "Generating minimal webpage with Kimi...\n",
    "  Calling Kimi to generate HTML (minimal)...\n",
-    "  Generated 13126 characters of HTML\n",
+    "  Generated 13667 characters of HTML\n",
    "==================================================\n",
    "Processing complete!\n",
    "\n",
@@ -566,321 +580,42 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": null,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/html": [
-       [... removed: the old cell output, which rendered the generated report summary inline as HTML ...]
+       [... added: the new cell output, an <iframe> element embedding the generated report via a base64 data: URI ...]
       ],
       "text/plain": [
        ""
       ]
      },
      "execution_count": 16,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "# Display the paper in the notebook\n",
     "from IPython.display import IFrame\n",
     "from pathlib import Path\n",
     "import base64\n",
     "\n",
     "html_content = Path(\"Kimi K2.5 Visual Agentic Intelligence Technical Report.html\").read_text()\n",
     "encoded = base64.b64encode(html_content.encode()).decode()\n",
     "\n",
     "IFrame(src=f\"data:text/html;base64,{encoded}\", width=\"100%\", height=800)"
    ]
   }
  ],