diff --git a/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/Kimi K2.5 Visual Agentic Intelligence Technical Report.html b/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/Kimi K2.5 Visual Agentic Intelligence Technical Report.html new file mode 100644 index 00000000..990c19bb --- /dev/null +++ b/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/Kimi K2.5 Visual Agentic Intelligence Technical Report.html @@ -0,0 +1,616 @@ + + +
+ + +Kimi K2.5 is introduced as the most powerful open-source model to date, building on Kimi K2 with continued pretraining over approximately 15T mixed visual and text tokens. As a native multimodal model, K2.5 delivers state-of-the-art coding and vision capabilities alongside a self-directed agent swarm paradigm. For complex tasks, Kimi K2.5 can self-direct an agent swarm of up to 100 sub-agents without predefined workflows, executing parallel workflows across up to 1,500 tool calls and reducing execution time by up to 4.5× compared to single-agent setups.
+ +The animation below illustrates the self-directed swarm paradigm: a trainable Orchestrator dynamically instantiates specialized subagents, distributes tasks in parallel (orange), and aggregates results (green) to minimize total latency via the Critical Steps metric.
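+
+The sketch below is a minimal, illustrative fan-out/fan-in loop in Python; it is not the actual K2.5 orchestrator, and the subagent names, step counts, and timings are invented for illustration. It shows why the pattern in the animation cuts wall-clock time: subagents run concurrently, so latency follows the longest branch (the critical path captured by the Critical Steps metric) rather than the sum of all steps.
+import asyncio
+import time
+
+async def run_subagent(name: str, steps: int) -> dict:
+    # Stand-in for a subagent's sequential tool-call loop (hypothetical timing).
+    await asyncio.sleep(steps * 0.01)
+    return {"agent": name, "steps": steps, "result": f"{name} done"}
+
+async def orchestrate(tasks: dict[str, int]) -> list[dict]:
+    # Fan out: the orchestrator instantiates one subagent per task and launches them concurrently.
+    jobs = [run_subagent(name, steps) for name, steps in tasks.items()]
+    # Fan in: aggregate every subagent's result before composing the final answer.
+    return await asyncio.gather(*jobs)
+
+tasks = {"ai_researcher": 40, "physics_researcher": 25, "report_writer": 15}
+start = time.perf_counter()
+results = asyncio.run(orchestrate(tasks))
+# Wall-clock time tracks the slowest subagent (40 steps, the critical path),
+# not the 80 steps a single agent would execute one after another.
+print(f"aggregated {len(results)} results in {time.perf_counter() - start:.2f}s")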
+ +
+Kimi K2.5 achieves state-of-the-art performance across the HLE-Full, BrowseComp, SWE-Bench Verified, MMMU Pro, MathVision, and VideoMMMU benchmarks. The Agent Swarm demonstrates up to a 4.5× reduction in execution time and an 80% reduction in end-to-end runtime compared to single-agent setups, and the model delivers this performance at significantly lower cost (up to 21.1× savings on BrowseComp compared to GPT-5.2).
+Kimi K2.5 represents a meaningful step toward AGI for the open-source community, demonstrating strong capability on real-world tasks under real-world constraints. Grounded in advances in coding with vision, agent swarms, and office productivity, the model redefines the boundaries of AI in knowledge work at significantly lower cost than proprietary alternatives. At scale, the trade-off between vision and text capabilities disappears, with both improving in unison through continued multimodal pretraining. Future work will push further into the frontier of agentic intelligence.
+ + + + \ No newline at end of file diff --git a/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/kimi_vision_paper_to_page_workflow.ipynb b/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/kimi_vision_paper_to_page_workflow.ipynb index 1d057e72..2d6d7c9d 100644 --- a/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/kimi_vision_paper_to_page_workflow.ipynb +++ b/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/kimi_vision_paper_to_page_workflow.ipynb @@ -25,7 +25,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -59,7 +59,7 @@ "# Model configuration\n", "MODEL_NAME = \"moonshotai/kimi-k2.5\"\n", "DEFAULT_TEMPERATURE = 0.7\n", - "DEFAULT_MAX_TOKENS = 32768 # Large for HTML generation\n", + "DEFAULT_MAX_TOKENS = 32768 \n", "\n", "# Get API key\n", "NVIDIA_API_KEY = os.getenv(\"NVIDIA_API_KEY\")\n", @@ -79,7 +79,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "metadata": {}, "outputs": [ { @@ -140,7 +140,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -186,7 +186,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -273,7 +273,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -411,7 +411,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -451,7 +451,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -515,7 +515,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 14, "metadata": {}, "outputs": [ { @@ -533,7 +533,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 15, "metadata": {}, "outputs": [ { @@ -546,12 +546,26 @@ "Extracting PDF: /home/chris/Code/NVIDIA/GenerativeAIExamples/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/Kimi K2.5 Visual Agentic Intelligence Technical Report.pdf\n", " Extracted 15 pages\n", " Converted to base64\n", -"Analyzing paper with vision model...\n", +"Analyzing paper with vision model...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/chris/Code/NVIDIA/GenerativeAIExamples/oss_tutorials/Kimi_K2.5_Paper_to_Page_Vision_Workflow/.venv/lib/python3.13/site-packages/langchain_nvidia_ai_endpoints/_common.py:243: UserWarning: Found moonshotai/kimi-k2.5 in available_models, but type is unknown and inference may fail.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Sending 15 pages to vision model...\n", " Analysis complete: Kimi K2.5: Visual Agentic Intelligence\n", "Generating minimal webpage with Kimi...\n", " Calling Kimi to generate HTML (minimal)...\n", - " Generated 13126 characters of HTML\n", + " Generated 13667 characters of HTML\n", "==================================================\n", "Processing complete!\n", "\n", @@ -566,321 +580,42 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ - "\n", - "\n", - "\n", - " \n", - " \n", - "\n", - "
Built as a native multimodal model, K2.5 delivers state-of-the-art coding and vision capabilities alongside a self-directed agent swarm paradigm. For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls, reducing execution time by up to 4.5× compared to single-agent setups.\n", - "
\n", - "\n", - " The animation below illustrates the core innovation: a trainable Orchestrator dynamically instantiates specialized subagents (AI Researcher, Physics Researcher, etc.), assigns tasks in parallel, and aggregates results—enabling up to 100 concurrent subagents without predefined workflows.\n", - "
\n", - " \n", - "\n", - " Kimi K2.5 represents a significant advancement toward AGI for the open-source community, demonstrating that vision and coding capabilities improve in unison at scale. The Agent Swarm paradigm enables practical parallel execution for complex real-world tasks, achieving 80% reduction in end-to-end runtime through self-directed orchestration of up to 100 subagents. By eliminating the traditional trade-off between vision and text capabilities, K2.5 establishes a new standard for multimodal agentic intelligence in knowledge work.\n", - "
\n", - "