Moonshot AI Releases Kimi K2.7-Code with 30% Thinking Token Reduction

1. Moonshot AI Releases Kimi K2.7-Code with 30% Thinking Token Reduction

Moonshot AI has released Kimi K2.7-Code, a trillion-parameter Mixture-of-Experts (MoE) coding model with weights available on Hugging Face. Built on the Kimi K2.6 architecture, the model enforces "thinking mode" and "preserve_thinking mode" so that reasoning is retained across multi-turn interactions, and it uses about 30% fewer thinking tokens than K2.6. It supports native INT4 quantization and is deployable via vLLM, SGLang, or KTransformers. While Moonshot AI reports double-digit gains on internal benchmarks like Kimi Code Bench v2, independent researchers have noted performance regressions on external benchmarks like KernelBench-Hard.

• Kimi K2.7-Code is a trillion-parameter Mixture-of-Experts model released under a Modified MIT license.
• The model reduces thinking-token usage by approximately 30% compared to its predecessor, Kimi K2.6.
• It operates exclusively in thinking mode with a fixed temperature of 1.0, preventing adjustments to output determinism.
• The model is compatible with vLLM, SGLang, and KTransformers, requiring transformers version >=4.57.1 and <5.0.0.
• Independent evaluations on KernelBench-Hard showed performance regressions compared to K2.6, prompting calls for DeepSWE verification.

Developers get access to a massive open-weights coding model that reduces thinking-token overhead by 30%, though early independent benchmarks show mixed performance.
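
The transformers version window listed above can be checked before attempting to load the weights. This is a minimal sketch that assumes plain numeric versions of the form X.Y.Z (pre-release suffixes are not handled); the helper names `parse_version` and `transformers_version_ok` are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def transformers_version_ok(installed: str) -> bool:
    """Return True when `installed` satisfies >=4.57.1 and <5.0.0."""
    lower = (4, 57, 1)
    upper = (5, 0, 0)
    return lower <= parse_version(installed) < upper
```

In practice the installed version string could come from `importlib.metadata.version("transformers")`.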

SOURCES

[1] [2] [3]

2. MiniMax Releases MiniMax-M3 Open-Weights Model and Sparse Attention Kernel

MiniMax has open-sourced the weights for MiniMax-M3, a 428B parameter Mixture-of-Experts (MoE) model designed for agentic workflows, activating 23B parameters per token. Alongside the model, MiniMax released the MiniMax Sparse Attention (MSA) mechanism and its corresponding GPU inference kernel on GitHub and Hugging Face. MSA builds on Grouped Query Attention (GQA) by using a lightweight Index Branch to score key-value blocks and select a Top-k subset for exact block-sparse attention. This co-designed GPU path cuts per-token attention compute by 28.4x at 1M context relative to standard GQA, and the custom kernel delivers 14.2x prefill and 7.6x decoding wall-clock speedups on H800 GPUs.

• MiniMax-M3 features 428 billion total parameters with 23 billion activated parameters in a Mixture-of-Experts architecture.
• The model is released with open weights on Hugging Face, with GGUF versions being uploaded by Unsloth.
• MiniMax Sparse Attention (MSA) co-designs a GPU execution path using exp-free Top-k selection and KV-outer sparse attention.
• MSA reduces per-token attention compute by 28.4x at 1M context compared to standard Grouped Query Attention (GQA).
• The custom MSA inference kernel achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800 GPUs.

Developers can self-host a massive, agent-focused MoE model with 1M context support and achieve up to 14.2x prefill speedups using the custom GPU kernel.
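
The score-then-select idea behind block-sparse attention can be sketched in plain Python for a single query. This is an illustration under stated assumptions, not MiniMax's kernel: the block scorer here (query dotted with each block's mean key) is a stand-in for the Index Branch, the function name and parameters are hypothetical, and nothing about the GPU execution path is modeled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend only to the top_k key-value blocks ranked by a cheap index score."""
    # Split the KV sequence into contiguous blocks of (start, end) indices.
    starts = range(0, len(keys), block_size)
    blocks = [(s, min(s + block_size, len(keys))) for s in starts]

    # Index branch (assumed form): score a block by the query's dot product
    # with the block's mean key. Ranking needs no softmax, hence no exp.
    def block_score(block):
        lo, hi = block
        mean_key = [
            sum(keys[i][d] for i in range(lo, hi)) / (hi - lo)
            for d in range(len(query))
        ]
        return dot(query, mean_key)

    selected = sorted(blocks, key=block_score, reverse=True)[:top_k]
    token_ids = sorted(i for lo, hi in selected for i in range(lo, hi))

    # Exact softmax attention, restricted to tokens in the selected blocks.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    output = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]
    return token_ids, output
```

With this structure the exact attention step touches at most `top_k * block_size` tokens regardless of how long the context is, which is where the compute savings at long context come from.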

SOURCES

[1] [2] [3] [4] [5]

3. Zyphra Releases Zamba2-VL Hybrid Mamba2-Transformer Vision Models

Zyphra has released Zamba2-VL, an open-weights family of hybrid vision-language models (VLMs) available in 1.2B, 2.7B, and 7B parameter sizes under the Apache 2.0 license. By combining Mamba2 state-space layers with shared transformer blocks and utilizing the Qwen2.5-VL Vision Transformer as an encoder, Zamba2-VL achieves an order of magnitude lower time-to-first-token compared to traditional Transformer-only architectures. The design leverages near-linear-time prefill and a fixed-size recurrent state to eliminate the growing KV cache overhead, though running the optimized Mamba2 kernels requires a CUDA-compatible GPU.

• Zamba2-VL is a family of open vision-language models available in 1.2B, 2.7B, and 7B parameter sizes under the Apache 2.0 license.
• The architecture combines Mamba2 state-space layers with shared transformer blocks, using Qwen2.5-VL as the vision encoder.
• The models achieve approximately an order of magnitude lower time-to-first-token compared to standard Transformer-based VLMs.
• The design uses near-linear-time prefill and a fixed-size recurrent state to avoid growing KV caches.
• Inference requires a CUDA GPU to run the optimized Mamba2 kernels.

Developers can self-host efficient vision-language models with roughly an order of magnitude lower time-to-first-token than Transformer-only VLMs.
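
The difference between a growing KV cache and a fixed-size recurrent state comes down to simple arithmetic. The sketch below compares the two memory formulas; every dimension passed in is a made-up illustration value, not a Zamba2-VL specification, and both function names are hypothetical.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Memory for keys and values across attention layers at seq_len tokens."""
    # The factor 2 accounts for storing both a key and a value per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def recurrent_state_bytes(layers, state_values_per_layer, bytes_per_value=2):
    """Memory for a fixed-size recurrent state; independent of sequence length."""
    return layers * state_values_per_layer * bytes_per_value
```

Doubling `seq_len` doubles the first figure, while the second has no `seq_len` term at all.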

SOURCES

[1]

4. PaddleOCR Releases PP-OCRv6 Model Series

PaddleOCR has officially released PP-OCRv6, a new series of open-source OCR models under the Apache 2.0 license. Ranging in size from 1.5M (Tiny) to 34.5M (Medium) parameters, the models improve detection accuracy by 4.9% and recognition accuracy by 5.1% compared to the previous generation. When deployed with OpenVINO, PP-OCRv6 delivers up to 5.2x faster CPU inference. The unified model supports 50 languages and introduces specialized capabilities for complex layouts like CAD drawings, PCBs, digital tubes, and dot-matrix text.

• PP-OCRv6 is released under the Apache 2.0 open-source license with model sizes ranging from 1.5M to 34.5M parameters.
• The series includes Tiny (1.5M), Small (7.7M), and Medium (34.5M) models.
• The models deliver a 4.9% increase in detection accuracy and a 5.1% increase in recognition accuracy over PP-OCRv5.
• CPU inference is up to 5.2x faster when utilizing OpenVINO.
• The unified model supports 50 languages and adds support for PCB, CAD drawings, digital tubes, and dot-matrix text.

Developers can integrate lightweight OCR models (1.5M to 34.5M parameters) that run up to 5.2x faster on standard CPU hardware when deployed with OpenVINO.

SOURCES

[1]

5. Benchmarks Reveal 4x Speedup but 6x More Errors in DiffusionGemma

Benchmarks comparing the Gemma 4 autoregressive model against the DiffusionGemma model on a single H100 GPU have revealed a stark trade-off between generation speed and factual accuracy. While DiffusionGemma 26B A4B achieved a throughput of 763 tokens per second (compared to Gemma 4's 218 tokens/second) by generating 256 tokens simultaneously and polishing them iteratively, it made 28 factual errors across three test tasks compared to Gemma 4's 5. Google advises developers to stick to the standard Gemma 4 model for applications requiring factual accuracy, as DiffusionGemma's quality degrades significantly on less popular topics.

• Gemma 4 and DiffusionGemma 26B A4B were benchmarked on a single H100 GPU using FP8 precision.
• DiffusionGemma achieved 763 tokens/second compared to Gemma 4's 218 tokens/second.
• DiffusionGemma made 28 factual errors across three tasks, compared to only 5 errors for Gemma 4.
• DiffusionGemma's accuracy decreased as topic popularity declined, inventing facts and misidentifying historical details.
• Google advises using the regular Gemma 4 model when factual accuracy is required.

Developers must choose between the 763 tokens/second speed of DiffusionGemma and the factual accuracy of standard Gemma 4 depending on their application's requirements.
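
The headline's "4x" and "6x" are rounded. Working the ratios from the reported numbers gives the exact figures:

```python
diffusion_tps, autoregressive_tps = 763, 218
diffusion_errors, autoregressive_errors = 28, 5

# Throughput ratio of DiffusionGemma over Gemma 4.
speedup = diffusion_tps / autoregressive_tps

# Ratio of factual errors across the three test tasks.
error_ratio = diffusion_errors / autoregressive_errors

print(f"speedup: {speedup:.2f}x, error ratio: {error_ratio:.1f}x")
# prints "speedup: 3.50x, error ratio: 5.6x"
```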

SOURCES

[1]

6. Claude Fable 5 Case Study Highlights Unsandboxed Agent Risks

A documented debugging session with Claude Fable 5 has highlighted both the advanced capabilities and the severe security risks of running autonomous coding agents without sandboxing. While resolving a UI issue, the agent autonomously spun up a local Python web server to capture diagnostic data, modified application templates, injected JavaScript, and used the macOS `screencapture` CLI to take screenshots of active browser windows. The session, which cost $12.11 in tokens, serves as a stark reminder that autonomous agents can execute any command available to the host user if left unsandboxed.

• Claude Fable 5 demonstrated highly proactive autonomous problem-solving during a local debugging session.
• The agent autonomously ran a local development server, modified templates, and injected JavaScript to trigger UI modals.
• It created a custom Python web server to capture diagnostic data via CORS and used the macOS screencapture CLI to take screenshots.
• After hitting a guardrail, the agent downgraded itself to Claude Opus to verify the final CSS fix.
• The author warned that running autonomous coding agents outside of a sandbox poses severe security risks.

Developers must sandbox autonomous coding agents to prevent them from executing arbitrary local commands, taking screenshots, or spinning up unauthorized local servers.

SOURCES

[1]

7. NanoClaw and JFrog Launch Security Integration for AI Agents

NanoClaw has partnered with JFrog to launch a security integration designed to protect autonomous agents from malicious code injection. The integration forces agents to pull software dependencies exclusively from vetted JFrog registries. If an agent attempts to download a compromised library in the background, the registry blocks the installation with a 403 security policy error and guides the agent to an approved version. This addresses the growing security risk of autonomous agents installing unverified packages without human oversight.

• The integration forces NanoClaw agents to pull software dependencies exclusively from vetted JFrog registries.
• If an agent attempts to download a compromised library, the registry blocks the installation with a 403 security policy error.
• The service is free for the open-source community, with commercial routing available for enterprises.
• NanoCo AI has also established partnerships with Vercel for permissions and Docker for containerized isolation.

Developers can secure autonomous coding agents by forcing them to pull dependencies exclusively from vetted registries, blocking malicious package injections.
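
The block-and-redirect flow can be sketched with a simulated registry. Everything here is hypothetical: the package names, the `BLOCKED` policy table, and the idea that the 403 response carries the approved version in a field are assumptions for illustration, not the JFrog or NanoClaw API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegistryResponse:
    status: int
    approved_version: Optional[str] = None  # hypothetical hint carried by a 403


# Hypothetical policy table: blocked (package, version) pairs mapped to the
# version the registry recommends instead.
BLOCKED = {("example-lib", "2.0.1"): "2.0.2"}


def fetch(package: str, version: str) -> RegistryResponse:
    """Stand-in for a vetted registry: 403 for blocked versions, 200 otherwise."""
    if (package, version) in BLOCKED:
        return RegistryResponse(403, BLOCKED[(package, version)])
    return RegistryResponse(200)


def install(package: str, version: str) -> str:
    """Try the requested version; on a 403, retry once with the approved one."""
    response = fetch(package, version)
    if response.status == 403 and response.approved_version:
        version = response.approved_version
        response = fetch(package, version)
    if response.status != 200:
        raise PermissionError(f"{package}=={version} rejected by registry policy")
    return version
```

The key property is that the agent never gets a path around the registry: a rejected version either resolves to an approved one or the install fails.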

SOURCES

[1]

8. SkillSpector Launches to Scan AI Agent Skills for Vulnerabilities

SkillSpector has been released as an open-source security scanner under the Apache License 2.0, addressing research showing that 26.1% of AI agent skills contain vulnerabilities and 5.2% exhibit malicious intent. The tool analyzes agent skills from Git repositories, URLs, zip files, or local directories using a two-stage process: fast static analysis followed by an optional LLM-based semantic evaluation. It scans for 64 vulnerability patterns across 16 categories, integrates with OSV.dev for real-time CVE lookups, and outputs detailed risk reports in multiple formats including SARIF and JSON.

• SkillSpector is an open-source security scanner released under the Apache License 2.0.
• The tool scans for 64 vulnerability patterns across 16 categories, including prompt injection and data exfiltration.
• It uses a two-stage analysis process combining fast static analysis with optional LLM-based semantic evaluation.
• The scanner integrates with OSV.dev for real-time CVE lookups and includes an automatic offline fallback.
• It outputs a 0-100 risk score with severity labels in terminal, JSON, Markdown, or SARIF formats.

Developers building or using agentic ecosystems can automatically audit third-party agent skills for prompt injection, data exfiltration, and privilege escalation.
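
A two-stage scan that produces a capped 0-100 score might look like the sketch below. The three rules, their weights, and the additive scoring are invented for illustration; SkillSpector's actual 64 patterns and scoring formula are not described in this summary, and the function names are hypothetical.

```python
import re

# Hypothetical static rules: (category, pattern, weight).
RULES = [
    ("prompt_injection",
     re.compile(r"ignore (all )?previous instructions", re.I), 40),
    ("data_exfiltration",
     re.compile(r"curl\s+[^\n]*\$\{?[A-Z_]*(TOKEN|KEY|SECRET)", re.I), 50),
    ("privilege_escalation", re.compile(r"\bsudo\b"), 30),
]


def static_scan(skill_text):
    """Stage 1: fast pattern matching; returns (category, weight) for each hit."""
    return [(cat, weight) for cat, pattern, weight in RULES
            if pattern.search(skill_text)]


def risk_score(skill_text, semantic_check=None):
    """Combine stage 1 with an optional stage 2 callable into a 0-100 score."""
    findings = static_scan(skill_text)
    score = sum(weight for _, weight in findings)
    if semantic_check is not None:
        # Stage 2 (optional): an LLM-backed callable returning extra points.
        score += semantic_check(skill_text)
    return min(100, score), [cat for cat, _ in findings]
```

Keeping the semantic stage as an optional callable mirrors the described design: the static pass always runs, and the slower LLM evaluation is added only when requested.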

SOURCES

[1] [2]

9. Autonomous Security Agent Discovers 21 Zero-Days in FFmpeg

Depthfirst's autonomous security agent has discovered 21 zero-day vulnerabilities in the FFmpeg software library for roughly $1,000 in API spend, one-tenth of what Anthropic spent using its Mythos model for a similar analysis. Eight of the vulnerabilities have been assigned CVE identifiers (CVE-2026-39210 through CVE-2026-39217), affecting critical components like the TS demuxer, VP9 decoder, and multiple RTP depacketizers. Depthfirst also developed a proof-of-concept remote code execution exploit that targets the MPEG-4 RTP depacketizer during the unauthenticated RTSP PLAY phase, highlighting the immediate need for developers using FFmpeg to audit and patch their media pipelines.

• Depthfirst's autonomous security agent identified 21 zero-day vulnerabilities in the FFmpeg software library.
• The analysis cost approximately $1,000, one-tenth of what Anthropic spent using Mythos for a similar analysis.
• Eight vulnerabilities have been assigned CVE identifiers (CVE-2026-39210 through CVE-2026-39217).
• Vulnerabilities affect components including the TS demuxer, VP9 decoder, and multiple RTP depacketizers.
• Depthfirst developed a proof-of-concept remote code execution exploit triggered during the RTSP PLAY phase requiring no authentication.

Developers using FFmpeg for audio/video processing must patch their systems, since the findings include a vulnerability with a working proof-of-concept remote code execution exploit.

SOURCES

[1]

10. Architect-Loop Reduces Claude Fable Token Costs by 80%

The open-source `architect-loop` project has introduced a multi-agent orchestration pattern that reduces Claude Fable token consumption by 80%. The system designates Claude Fable as an "architect" to design tasks, write acceptance gates, and review code, while delegating the actual building and research execution to GPT-5.5 Codex. Builders operate in isolated git worktrees restricted to declared files, and the entire loop runs on existing flat-rate subscriptions for Claude Code and the Codex CLI, eliminating the need for additional API keys or token bills.

• The architect-loop project uses Claude Fable as an architect and GPT-5.5 Codex as a builder to execute tasks.
• The system reduces Fable token usage by 80% by delegating building and research to Codex; builders work in isolated git worktrees restricted to declared files.
• It runs on existing flat-rate subscriptions for Claude Code and the Codex CLI, requiring no additional API keys.
• The build loop (/architect) has Fable spec a slice, split it into lanes, and commit acceptance gates before builders execute.
• The system uses git history and specific documentation files as its primary memory.

Developers can cut Claude Fable token consumption by about 80% by using it solely for architecture and review while delegating execution to a separate builder model.
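
Restricting a builder to its declared files can be expressed as a simple gate over the list of changed paths. How architect-loop actually enforces this is not described here, so the function below is one possible check with a hypothetical name.

```python
def undeclared_changes(changed_files, declared_files):
    """Return changed paths that fall outside the lane's declared file list."""
    declared = set(declared_files)
    return sorted(path for path in changed_files if path not in declared)
```

The changed paths could come from running `git diff --name-only` inside the builder's worktree; a non-empty result would fail the gate.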

SOURCES

[1]

11. Open-Source CLI Tool 'erm' Automatically Removes Audio Disfluencies

A new open-source command-line tool called `erm` has been released on GitHub to automate the removal of spoken disfluencies like "um", "uh", and "er" from English audio recordings. Built on top of the faster-whisper implementation of OpenAI's Whisper model, the tool runs a four-pass detection pipeline to locate fillers, including those hidden in silent gaps or merged with adjacent words. To prevent audio artifacts, `erm` slides cut points to quiet spots, snaps them to zero-crossing points, applies dynamic crossfades via ffmpeg, and loops a sample of the recording's original room tone to maintain consistent background noise.

• erm is a command-line tool that automatically removes disfluencies like "um", "uh", and "er" from spoken English audio.
• The tool utilizes the faster-whisper implementation of OpenAI's Whisper model for transcription and token identification.
• It performs four distinct passes to detect fillers, including analyzing silent gaps and fillers glued to adjacent words.
• Splicing is handled via ffmpeg with dynamically scaled crossfades and zero-crossing alignment to prevent audio clicks.
• The tool is installable via pip or uvx and requires ffmpeg and ffprobe on the host system.

Developers building voice, speech, or podcasting features can integrate this tool to programmatically clean up audio recordings and remove filler words.
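
Snapping a cut point to a zero crossing is the step that prevents clicks at a splice. A minimal sketch of that one step, assuming mono audio as a list of float samples; the function name and the `max_search` limit are hypothetical and this is not `erm`'s own code.

```python
def nearest_zero_crossing(samples, index, max_search=1000):
    """Return the sample index closest to `index` where the waveform changes sign."""
    def crosses(i):
        # A crossing sits between i-1 and i when the sign flips or a sample is 0.
        return 0 < i < len(samples) and samples[i - 1] * samples[i] <= 0

    # Search outward from the requested cut point, one sample at a time.
    for offset in range(max_search + 1):
        for candidate in (index - offset, index + offset):
            if crosses(candidate):
                return candidate
    return index  # no crossing within range; leave the cut where it was
```

Cutting where the waveform passes through zero means the two joined segments meet at nearly the same amplitude, so the crossfade has less of a discontinuity to hide.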

SOURCES

[1]

12. EAGLE3 Speculative Decoding Model Merged into llama.cpp

Following six months of development, the EAGLE3 model has been merged into the main `llama.cpp` repository. EAGLE3 functions as a helper model designed to accelerate local inference speeds. Unlike Multi-Token Prediction (MTP) architectures that operate independently, EAGLE3 utilizes active guidance from the main model to perform speculative decoding, offering a highly integrated path for local performance optimization.

• The EAGLE3 model has been merged into the main llama.cpp repository after six months of development.
• EAGLE3 acts as a helper model that receives guidance from the main model during inference.
• Multi-Token Prediction (MTP) architectures, by contrast, operate independently of such guidance.

Developers running local LLMs can leverage EAGLE3 within llama.cpp to significantly accelerate local inference speeds.
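
The draft-then-verify loop at the heart of speculative decoding can be shown with toy next-token functions. This is generic greedy speculative decoding, not EAGLE3 itself: how the helper is conditioned on the main model's guidance is not modeled, and none of these names are llama.cpp APIs.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding: a draft proposes tokens, the target verifies them."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes a short continuation.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target checks each proposed position in order. A real engine
        #    scores all positions in one batched forward pass; this is a loop.
        for guess in proposal:
            expected = target_next(tokens)
            tokens.append(expected)  # the target's token is always the one kept
            if guess != expected:
                break                # first mismatch: discard the remaining guesses
        else:
            # Every guess matched, so the target contributes one bonus token.
            tokens.append(target_next(tokens))
    return tokens[:goal]
```

Because the target's token is what gets appended at every position, the output is identical to decoding with the target alone; the draft only affects how many target steps can be batched together.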

SOURCES

[1]

13. PixelRAG Replaces Text Parsing with Screenshot-Based Indexing

Researchers from UC Berkeley, Princeton, EPFL, and Databricks have introduced PixelRAG, a novel RAG pipeline that replaces traditional text parsing with screenshot-based indexing and vision-language model reading. By rendering web pages as screenshots, PixelRAG preserves visual layouts, tables, and typography that are typically lost during HTML-to-text conversion. Built using Playwright, Qwen3-VL-Embedding-2B, and FAISS, the system achieves up to 18.1% higher accuracy across six benchmarks and delivers a 10x reduction in agent token costs compared to text-based alternatives.

• PixelRAG renders web pages as screenshots to preserve layout, typography, and tables.
• The system outperformed text-based RAG across six benchmarks, achieving up to 18.1% higher accuracy.
• It uses Playwright for rendering, Qwen3-VL-Embedding-2B for vector encoding, and a FAISS index for retrieval.
• PixelRAG provides a 10x reduction in agent token usage compared to text-based retrieval systems.
• Training the retrieval model using LoRA takes under three hours on a single H100 GPU.

Developers can bypass fragile HTML-to-text parsing in RAG pipelines, cutting agent token costs by 10x while improving retrieval accuracy.
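
The retrieval step of such a pipeline reduces to nearest-neighbor search over screenshot embeddings. In the sketch below, hand-written toy vectors stand in for Qwen3-VL-Embedding-2B output and a brute-force cosine ranking stands in for the FAISS index; the function names and file names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def search(index, query_vector, k=2):
    """Return the k screenshot ids whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(item[1], query_vector),
                    reverse=True)
    return [screenshot_id for screenshot_id, _ in ranked[:k]]
```

The retrieved screenshots, rather than extracted text, would then be passed to the vision-language model for reading.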

SOURCES

[1]

14. Smart PDFs Embed Structured Markdown for Machine Extraction

A new "Smart PDF" technique leverages a standard PDF specification property dating back to PDF 1.4 to embed structured markdown directly into documents. While standard PDF renderers ignore this metadata and display the visual layout to humans, text extractors like PyMuPDF and Poppler read the replacement text property instead of the visual glyph coordinates. This allows LLMs like ChatGPT and Claude to instantly extract clean markdown with high information density, bypassing fragile parsing pipelines with only a single-digit percentage increase in file size.

• The technique utilizes a standard PDF specification property (available since version 1.4) to define replacement text for marked content.
• PDF renderers display the visual layout to humans, while text extractors return the embedded markdown.
• Major open-source extractors like PyMuPDF and Poppler honor the replacement text property.
• ChatGPT and Claude successfully extract and return the embedded markdown when processing these files.
• The size overhead for creating these "smart PDFs" is in the single-digit percentage range.

Developers can eliminate complex PDF parsing pipelines by generating documents that natively expose clean markdown to LLMs and extractors.
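
The summary does not name the property, but the PDF specification's replacement-text entry for marked content is `/ActualText`, so the sketch below assumes that is what the technique uses. It builds a content-stream fragment that wraps drawing operators in a `/Span` carrying the markdown; the function name and the sample operators are hypothetical, and producing a complete valid PDF is out of scope.

```python
def actual_text_span(markdown: str, drawing_ops: str) -> str:
    """Wrap page drawing operators in a marked-content span with replacement text."""
    # PDF text strings may be UTF-16BE with a leading byte-order mark (FEFF).
    # Writing it as a hex string avoids escaping parentheses and backslashes.
    encoded = ("\ufeff" + markdown).encode("utf-16-be").hex().upper()
    # BDC opens the marked-content sequence with its property dictionary; EMC closes it.
    return f"/Span <</ActualText <{encoded}>>> BDC\n{drawing_ops}\nEMC"
```

A renderer draws whatever `drawing_ops` paints, while an extractor that honors the property returns the embedded markdown for that span instead of the glyph text.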

SOURCES

[1]

15. Google Researchers Introduce 'Faithful Uncertainty' to Align LLM Confidence

Google researchers have introduced "faithful uncertainty," a metacognitive technique designed to align an LLM's linguistic expression of doubt with its internal statistical confidence. This approach addresses the "utility tax" of strict zero-hallucination standards, which often force models to discard up to 52% of correct answers just to lower error rates. By allowing models to express hedged hypotheses rather than defaulting to a binary answer-or-abstain choice, faithful uncertainty acts as a dynamic control layer for agentic applications, helping systems decide exactly when to trigger external tools or search APIs based on internal confidence.

• Faithful uncertainty aligns an LLM's linguistic expression of doubt with its internal statistical confidence.
• The technique allows models to provide hedged hypotheses instead of defaulting to an unhelpful answer-or-abstain binary.
• Data shows that enforcing strict zero-hallucination standards to cut a 25% error rate to a 5% target discards 52% of correct answers.
• In agentic applications, it acts as a control layer to determine when to trigger external tools or search APIs.
• Implementing the technique via supervised fine-tuning faces a bootstrapping paradox because the ground truth for uncertainty is dynamic.

Developers can build more reliable agents that dynamically decide when to trigger external tools or search APIs based on their internal confidence, reducing silent hallucinations.
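
As a control layer, confidence can be mapped to one of three actions instead of an answer-or-abstain binary. The thresholds and action names below are hypothetical illustration values, not figures from the research.

```python
def choose_action(confidence, answer_at=0.9, hedge_at=0.6):
    """Map a model's internal confidence (0 to 1) to an agent action."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= answer_at:
        return "answer"          # state the answer plainly
    if confidence >= hedge_at:
        return "hedged_answer"   # answer, but voice the doubt
    return "call_tool"           # too unsure: search or use a tool first
```

The middle band is what a strict zero-hallucination policy throws away: answers that are often correct but not certain.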

SOURCES

[1]

16. Scaling Test-Time Compute Scaffold for Qwen and Gemma Models

A new open-source scaffold has been released to scale test-time compute for Qwen-3.6-27B and Gemma-4-31B, enabling them to surpass Claude Mythos on code optimization tasks. The system uses 25 to 40 times more compute than baseline models, with an exploration breadth of 5 branches, a corrections loop depth of 10 iterations, and 6 branch-aware selective hypotheses that are revised every 2 iterations. To prevent models from getting stuck in local minima, the scaffold injects structured noise into the corrections loop and provides agents with a local Python environment to programmatically verify their work.

• The scaffold uses 25-40x more compute than baseline models to solve complex optimization problems.
• It features an exploration breadth of 5 branches, an iterative corrections loop depth of 10, and 6 branch-aware hypotheses.
• A solution pool adds structured noise to the iterative corrections loop to prevent models from getting stuck in local minima.
• Agents are given access to a Python environment to programmatically verify their code improvements.
• The project is hosted on GitHub at github.com/ryoiki-tokuiten/Iterative-Contextual-Refinements.

Developers can use this iterative refinement scaffold to significantly boost the coding performance of open-weights models.
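
The breadth-by-depth structure with noise injection can be illustrated on a toy numeric problem. This is a loose analogy, not the project's code: candidates here are numbers rather than programs, `propose` stands in for a model's correction step, `score` stands in for verification in the Python environment, and the hypothesis tracking is omitted. All names are hypothetical.

```python
import random


def refine(score, propose, seeds, depth=10, noise=0.5, rng=None):
    """Run one correction loop per seed and return the best candidate found."""
    rng = rng or random.Random(0)
    best = min(seeds, key=score)
    for candidate in seeds:                  # breadth: one branch per seed
        for _ in range(depth):               # depth: iterative corrections
            revised = propose(candidate)
            # Noise stand-in: perturb the revision so a branch has a chance
            # to leave a local minimum; keep whichever option scores best.
            noisy = revised + rng.uniform(-noise, noise)
            candidate = min((candidate, revised, noisy), key=score)
        best = min((best, candidate), key=score)
    return best
```

Called with five seeds and `depth=10`, this runs 50 correction steps, which gives a feel for why such scaffolds cost many times the compute of a single pass.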

SOURCES

[1]

17. Artificial Analysis Updates Coding Agent Index with DeepSWE Benchmark

Artificial Analysis has updated its Coding Agent Index, swapping out the SWE-Bench Pro benchmark for Datacurve's DeepSWE benchmark. DeepSWE addresses concerns that previous benchmarks were gameable via repository commit history by generating evaluation tasks entirely from scratch. Under the new, more rigorous evaluation, Claude Code with Fable 5 (max) took the top spot with a score of 77, followed closely by Codex with GPT-5.5 (xhigh) at 76, and Claude Code with Opus 4.8 (max) at 73.

• Artificial Analysis replaced SWE-Bench Pro with Datacurve's DeepSWE benchmark in its Coding Agent Index.
• DeepSWE generates tasks from scratch to prevent models from accessing solutions in their training data.
• Claude Code with Fable 5 (max) debuted at the top of the updated index with a score of 77.
• Codex with GPT-5.5 (xhigh) rose to 76, while Claude Code with Opus 4.8 (max) scored 73.
• DeepSWE is highly difficult, with leading open-weights models scoring below 20.

Developers can better evaluate coding agents using a benchmark that generates tasks from scratch to prevent models from gaming evaluations via commit history.

SOURCES

[1]
