Audesso | Daily: AI

Fable-5 and Kimi-K2.7-Code Top Autoresearch Benchmarks

1. Fable-5 and Kimi-K2.7-Code Top Autoresearch Benchmarks

A new benchmark evaluating seven frontier models across three categories of autonomous research tasks—ML engineering, harness/prompt engineering, and algorithmic discovery—has named Anthropic's Fable-5 the overall winner, even when factoring in cost constraints. However, for developers focused specifically on ML engineering, the open-weights model Kimi-K2.7-Code outperformed all tested frontier models, making it a strong candidate for local or specialized coding pipelines.

  • Seven frontier models were benchmarked across three autoresearch categories: ML engineering, harness/prompt engineering, and algorithmic discovery.
  • Anthropic's Fable-5 was the overall winner of the benchmark, even under cost constraints.
  • The open-weights model Kimi-K2.7-Code outperformed frontier models specifically in the ML engineering category.

Developers building autonomous research or advanced coding agents can use these benchmarks to select the most capable model for algorithmic discovery and ML engineering tasks.

2. Benchmarking Nemotron Super 120B Against Qwen and GPT-OSS

Local benchmarks run on a Strix Halo 128GB shared memory system compare the performance of Nemotron Super 120B against GPT-OSS 120B, Qwen 3.5 122B, and Qwen 3.6 35B. The results show that Nemotron Super excels at prompt processing, outperforming GPT-OSS 120B at 32K context and Qwen 3.5 122B at 16K context. However, while Nemotron Super supports a massive 400K context window, its token generation speed degrades to barely usable levels at maximum depth, making the smaller Qwen 3.6 35B a highly competitive alternative for general use.

  • Benchmarks were conducted on a Strix Halo 128GB shared memory system running Ubuntu 26.04 and Lemonade Server.
  • Models compared: GPT-OSS 120B, Qwen 3.5 122B, Nemotron Super 120B, and Qwen 3.6 35B.
  • Nemotron Super has a maximum context depth of 400K, compared to 128K for GPT-OSS and 256K for Qwen 3.5/3.6.
  • Nemotron Super surpasses GPT-OSS 120B in prompt processing speed at 32K context, and Qwen 3.5 122B at 16K context.
  • Nemotron Super token generation speed starts above 10 TPS and degrades to barely usable levels at 400K context depth.

Developers choosing a local model in the 120B class or smaller can use these benchmarks to balance prompt processing speed against generation latency at deep context lengths.

3. The Rise of Standardized Agent Protocols: MCP, ACP, A2A, and ANP

The AI agent ecosystem is consolidating around four major protocols released between late 2024 and early 2025. Anthropic's Model Context Protocol (MCP) has seen massive adoption, with the Linux Foundation reporting over 10,000 active public servers and 164 million monthly Python SDK downloads by April 2026. While application-layer protocols like Google's Agent2Agent (A2A) and IBM's Agent Communication Protocol (ACP) solve coordination and messaging, the underlying transport layer remains a bottleneck, still relying on HTTP and requiring relay infrastructure for agents behind NAT.

  • Four significant agent protocols were published between late 2024 and early 2025: MCP, ACP, A2A, and ANP.
  • Anthropic's Model Context Protocol (MCP) reached over 10,000 active public servers and 164 million monthly Python SDK downloads by April 2026.
  • Google's Agent2Agent (A2A) task coordination interface was donated to the Linux Foundation in June 2025.
  • IBM Research's Agent Communication Protocol (ACP) and the independent Agent Network Protocol (ANP) address messaging and discovery.
  • Current protocols rely on HTTP, leaving the transport layer for agent networks 18 to 24 months behind application-layer protocols.

Developers building multi-agent systems can leverage emerging open standards to ensure interoperability, tool-calling compatibility, and structured coordination.
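
To make "tool-calling compatibility" a little more concrete: MCP messages are JSON-RPC 2.0 objects, and a tool invocation uses the `tools/call` method. The sketch below only builds such a request and prints it. The tool name `search_docs` and its arguments are hypothetical, and no transport, session setup, or server is involved.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any value unique per session works


def make_tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request in the shape MCP uses for tool calls."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# "search_docs" and its arguments are hypothetical, for illustration only.
request = make_tool_call("search_docs", {"query": "agent transport"})
print(json.dumps(request, indent=2))
```

Because every compliant server accepts this same envelope, a client written against one MCP server can in principle call tools on another without changes to its message format.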

4. Managing LLM Performance Degradation Beyond 100k Tokens

Despite massive advertised context windows, empirical studies like RULER and Chroma's report show that LLM performance degrades significantly once context exceeds roughly 100,000 tokens. This 'dumb zone' is easily reached by coding agents during multi-file debugging sessions. To combat this, developers are moving away from relying on raw context size and are instead adopting 'breadcrumb' workflows—using tools like obra/superpowers or mattpocock/skills to structure agent tasks around small, named artifacts like specs and PRDs.

  • LLM context windows split into a 'smart zone' and, beyond roughly 100,000 tokens, a 'dumb zone'.
  • Studies like RULER and Chroma's report confirm that effective context is smaller than advertised.
  • Coding agents quickly cross the 100,000-token threshold during file reads and debugging tasks.
  • Tools like Claude Code use auto-compaction to summarize history, but often after the model has already degraded.
  • Developers are adopting a 'breadcrumb approach' using tools like obra/superpowers to structure workflows around small, named artifacts.

Developers building coding agents and RAG pipelines must design workflows that keep critical context under 100k tokens to avoid severe model degradation.
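
One way to picture a workflow that stays under the threshold is a simple budget guard that refuses to load an artifact once the running total would pass roughly 100,000 tokens. This is a minimal sketch, not how any of the tools named above work: `ContextBudget`, `estimate_tokens`, and the 4-characters-per-token estimate are all assumptions made for illustration, and a real agent would use its model's tokenizer.

```python
from dataclasses import dataclass, field

SMART_ZONE_TOKENS = 100_000  # rough threshold from the text, not a hard limit


def estimate_tokens(text: str) -> int:
    """Crude estimate assuming ~4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


@dataclass
class ContextBudget:
    """Tracks estimated tokens for the named artifacts loaded into context."""
    limit: int = SMART_ZONE_TOKENS
    used: int = 0
    artifacts: list = field(default_factory=list)

    def try_add(self, name: str, text: str) -> bool:
        """Admit the artifact only if it keeps the total within the limit."""
        cost = estimate_tokens(text)
        if self.used + cost > self.limit:
            return False  # caller should summarize or split the artifact
        self.used += cost
        self.artifacts.append(name)
        return True


budget = ContextBudget()
spec_ok = budget.try_add("spec.md", "x" * 40_000)        # ~10,000 tokens
log_ok = budget.try_add("full_log.txt", "x" * 400_000)   # ~100,000 tokens
print(spec_ok, log_ok, budget.used, budget.artifacts)
```

In this run the small spec is admitted and the large log is rejected, which is the point at which a breadcrumb-style workflow would write a short summary artifact instead of loading the raw file.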

5. Building a Local, Real-Time Voice-to-Voice Chatbot

A developer has successfully built a local, real-time voice-to-voice chatbot that supports Server-Sent Events (SSE) streaming and natural conversation interruptibility. The system is powered by Qwen3.5-397B (using Unsloth's UD-Q3_K_XL quantization), Whisper-small for speech-to-text, and Orpheus TTS with a custom SNAC decoder on ONNX. Running on a single 24 GB GPU, the setup utilizes 21.3 GB of VRAM and requires 150 GB of system RAM to handle Qwen's MoE experts, maintaining a 131k token context window.

  • The local chatbot supports SSE streaming, interruptibility, and conversation context.
  • It is powered by Qwen3.5-397B (UD-Q3_K_XL), Whisper-small STT, and Orpheus Q4_K_XL TTS with a custom SNAC decoder on ONNX.
  • The system requires approximately 21.3 GB of VRAM on a 24 GB GPU and 150 GB of system RAM for Qwen's MoE experts.
  • The model runs with a bf16 KV cache and supports a context window of 131,072 tokens.

Developers can reference this architecture to build highly responsive, local voice agents that support natural conversation flow and interruption.

6. Heretic 1.4 Launches Grimoire for Local Model Reproducibility

The Heretic project has released version 1.4, introducing the Heretic Grimoire system to ensure local model reproducibility and resilience against platform takedowns. By utilizing lightweight 9 KB reproduce.json files, developers can restore models locally in about one minute without repeating multi-hour computations. The update also adds support for exporting LoRAs to minimize storage costs and transitions the project's infrastructure to decentralized hosting over IPFS.

  • Heretic version 1.4 introduces the Heretic Grimoire system for model reproducibility.
  • The system uses 9 KB reproduce.json files containing the necessary metadata to recreate models locally.
  • Model restoration takes about one minute and bypasses the original multi-hour computations.
  • The project has expanded to decentralized hosting, making release archives and signatures available over IPFS.
  • Heretic 1.4 also adds the capability to export a LoRA instead of a full model to reduce storage costs.

Developers can protect their workflows against Hugging Face model takedowns by maintaining lightweight, decentralized local backups of their fine-tuned models.

7. Running Gemma 4 12B Locally on Google Pixel 10 Pro

A community test has demonstrated the feasibility of running Google's Gemma 4 12B model entirely on-device on a Google Pixel 10 Pro. Utilizing llama.cpp within a Termux environment, the setup ran a quantized version of the model alongside a draft model for speculative decoding. Drawing less than 10 watts, the system achieved a prompt processing speed of 6.5 tokens per second and a generation speed of 1.3 tokens per second at a prompt depth of 10,000 tokens.

  • A user tested llama.cpp (v9639) on a Google Pixel 10 Pro using the Termux environment.
  • The setup ran the gemma-4-12b-it-UD-Q3_K_XL.gguf model with a draft model (mtp-gemma-4-12b-it.gguf).
  • The configuration utilized a 32,000 context window and q8_0 cache types.
  • At a prompt depth of 10,000 tokens, the system achieved 6.5 t/s prompt speed and 1.3 t/s generation speed.
  • The entire setup operated under a power draw of less than 10 watts.

Developers building on-device mobile AI applications can reference these power and token-throughput benchmarks for running 12B-class models on flagship mobile hardware.
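
To translate those rates into wall-clock time, the arithmetic below estimates how long a request would take. It assumes the speeds measured at 10,000-token depth hold across the whole prompt, which probably overstates prefill time since shallower context is usually faster. The 200-token reply length is a hypothetical value chosen for illustration.

```python
PROMPT_TOKENS = 10_000   # prompt depth from the reported test
PREFILL_TPS = 6.5        # reported prompt processing speed (tokens/s)
DECODE_TPS = 1.3         # reported generation speed (tokens/s)
REPLY_TOKENS = 200       # hypothetical reply length, not from the test


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a constant rate."""
    return tokens / tokens_per_second


prefill_seconds = seconds_for(PROMPT_TOKENS, PREFILL_TPS)
reply_seconds = seconds_for(REPLY_TOKENS, DECODE_TPS)
prefill_minutes = prefill_seconds / 60

print(f"prefill: {prefill_minutes:.1f} min, reply: {reply_seconds / 60:.1f} min")
```

Under these assumptions a full 10,000-token prompt takes roughly 25.6 minutes to process, so the setup is better suited to short prompts or cached context than to interactive use at depth.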

8. Dual DGX Spark Benchmarks for DeepSeek-V4-Flash MoE

A new open-source guide and benchmark suite on GitHub outlines how to run DeepSeek-V4-Flash MoE models on a dual DGX Spark configuration. By linking two units with a $180 cable to achieve 200 Gb/s over ConnectX-7, the setup achieves 41 t/s decode and 1785 t/s prefill using vLLM FP8. The benchmarks also include single-stream results on other hardware, showing the RTX Pro 6000 reaching 46.9 t/s decode and a Mac Studio M2 Ultra reaching 29.7 t/s decode.

  • A new guide and benchmark suite on GitHub details running DeepSeek-V4-Flash MoE on two DGX Spark units.
  • The setup requires a $180 cable to achieve 200 Gb/s over ConnectX-7.
  • Using vLLM FP8, the dual-unit configuration achieves 41 t/s decode and 1785 t/s prefill speed.
  • The dual-unit setup reaches 350 aggregate t/s with 32 concurrent requests at 256k context each.
  • Single-stream benchmarks show the RTX Pro 6000 achieving 46.9 t/s decode and the Mac Studio M2 Ultra achieving 29.7 t/s decode.

Developers looking to self-host DeepSeek-V4-Flash can reference concrete multi-GPU and single-stream hardware benchmarks to plan their local deployment infrastructure.
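
For capacity planning, it helps to separate aggregate throughput from what each user sees. The short calculation below assumes the 350 t/s figure is aggregate decode throughput shared evenly across the 32 requests; the benchmark summary does not state how it is distributed, so treat the per-request number as an approximation.

```python
AGGREGATE_TPS = 350       # reported aggregate t/s at 32 concurrent requests
CONCURRENT_REQUESTS = 32
SINGLE_STREAM_TPS = 41    # reported single-stream decode speed

# Assumption: throughput is split evenly across concurrent requests.
per_request_tps = AGGREGATE_TPS / CONCURRENT_REQUESTS

# How much more total work batching delivers than one stream alone.
batching_gain = AGGREGATE_TPS / SINGLE_STREAM_TPS

print(f"per request: {per_request_tps:.1f} t/s, batching gain: {batching_gain:.1f}x")
```

Read this way, each of the 32 users sees about 11 t/s, while the pair of machines does about 8.5 times the total work of a single stream.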

9. Run DeepSeek-V4-Flash on Mac Using SSD Streaming

Antirez's ds4 engine introduces an --ssd-streaming flag that allows developers to run models larger than their physical RAM on local hardware. Tested on an M3 Max with 96GB of RAM, the engine runs DeepSeek-V4-Flash at 11-13 tokens per second. Time to first token after a cold boot is 3-5 seconds, and prefilling 36,000 tokens takes 2.5 minutes, but the technique opens up local testing of massive models on standard developer workstations.

  • Antirez's ds4 engine allows running machine learning models larger than available RAM using the --ssd-streaming flag.
  • On an M3 Max 96GB system, the engine sustains a performance of 11-13 tokens per second.
  • The time to first token is approximately 3-5 seconds after a cold boot.
  • Prefilling 36,000 tokens takes approximately 2 minutes and 30 seconds.

Developers can run models larger than their system's physical RAM on local Apple Silicon hardware, albeit with a performance trade-off.
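
The size of that trade-off can be estimated from the reported figures. The calculation below derives the implied prefill rate and the time to generate a reply at the reported 11-13 t/s; the 500-token reply length is a hypothetical value, not part of the test.

```python
PREFILL_TOKENS = 36_000
PREFILL_SECONDS = 150          # 2 minutes 30 seconds, as reported
DECODE_TPS_RANGE = (11, 13)    # reported generation speed range (tokens/s)
REPLY_TOKENS = 500             # hypothetical reply length, not from the test

# Implied average prefill rate in tokens per second.
prefill_tps = PREFILL_TOKENS / PREFILL_SECONDS

# Reply time at the slow and fast ends of the reported decode range.
slowest_reply_s = REPLY_TOKENS / DECODE_TPS_RANGE[0]
fastest_reply_s = REPLY_TOKENS / DECODE_TPS_RANGE[1]

print(f"prefill: {prefill_tps:.0f} t/s")
print(f"500-token reply: {fastest_reply_s:.1f}-{slowest_reply_s:.1f} s")
```

The implied prefill rate of about 240 t/s is the main cost of streaming from SSD: decode speed is workable, but long prompts take minutes before the first token appears.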
