Multi-Token Prediction Merged into llama.cpp

1. Multi-Token Prediction Merged into llama.cpp

The llama.cpp project has merged Multi-Token Prediction (MTP) into its master branch. MTP lets the inference engine predict several future tokens at once rather than one per step, which can improve generation throughput for supported models in local LLM deployments.

• Pull request 22673 has been merged into the master branch.
• MTP support is now officially part of the llama.cpp codebase.
• Generation throughput may improve for models trained with MTP capabilities.

MTP is a significant optimization for local inference, offering a path to higher token generation speeds without requiring larger hardware footprints.
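
One common way inference engines turn multi-token predictions into speed is draft-then-verify: the extra heads propose a few tokens, the main model checks them, and the longest agreeing prefix is kept. The sketch below illustrates only that general idea. It is not llama.cpp's implementation, and the toy model and function names (toy_main_model, verify_draft) are hypothetical. A real engine verifies the whole draft in one batched forward pass; the loop here calls the model once per token purely for readability.

```python
from typing import Callable, List

Token = int


def toy_main_model(tokens: List[Token]) -> Token:
    """Hypothetical stand-in for the main model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def verify_draft(
    context: List[Token],
    draft: List[Token],
    next_token: Callable[[List[Token]], Token],
) -> List[Token]:
    """Keep the longest draft prefix the main model agrees with.

    On the first disagreement, the main model's own token is used instead
    and the rest of the draft is discarded. If the whole draft is accepted,
    the main model contributes one extra token.
    """
    accepted: List[Token] = []
    for proposed in draft:
        expected = next_token(context + accepted)
        if proposed != expected:
            accepted.append(expected)  # correction from the main model
            return accepted
        accepted.append(proposed)
    accepted.append(next_token(context + accepted))  # bonus token
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3]
    # Draft with a wrong third token: two tokens accepted, one corrected.
    print(verify_draft(context, [4, 5, 7], toy_main_model))  # [4, 5, 6]
    # Fully correct draft: all three accepted plus one bonus token.
    print(verify_draft(context, [4, 5, 6], toy_main_model))  # [4, 5, 6, 7]
```

In this toy run each verification step yields three or four tokens instead of one, which is where the throughput gain would come from. Actual gains depend on how often the drafted tokens are accepted.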

SOURCES

[1] [2]

2. Repowise for Repository-Level Code Intelligence

Repowise enables developers to build a deeper understanding of their codebases by indexing repositories and performing graph-based analysis. Using tools like NetworkX, it calculates PageRank scores to identify key components and detects dead code. It also supports the generation of CLAUDE.md files to provide AI agents with better context for development tasks.

• Supports graph analysis to identify architectural dependencies.
• Includes dead-code detection and architectural decision tracking.
• Generates CLAUDE.md files to improve AI agent performance on specific codebases.

As AI agents take on more coding tasks, providing them with accurate, repository-wide context is critical for reducing hallucinations and improving code quality.
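
To make the graph analysis concrete, here is a minimal sketch of ranking modules by PageRank and flagging unreferenced ones. It does not use Repowise's code or API. The file names and the import graph are hypothetical, and the "dead code" check is a simple heuristic (a module nothing imports and that is not an entry point). PageRank is written as a plain power iteration so the example has no dependencies; the same graph could be handed to a library such as NetworkX instead.

```python
from typing import Dict, List, Set


def pagerank(
    graph: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Power-iteration PageRank over an importer -> imported edge list."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A module that imports nothing spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


def unreferenced_modules(
    graph: Dict[str, List[str]], entry_points: Set[str]
) -> List[str]:
    """Modules that nothing imports and that are not declared entry points."""
    referenced = {target for targets in graph.values() for target in targets}
    return sorted(set(graph) - referenced - entry_points)


# Hypothetical repository: each key imports the modules in its list.
IMPORTS: Dict[str, List[str]] = {
    "cli.py": ["core.py", "utils.py"],
    "api.py": ["core.py"],
    "core.py": ["utils.py"],
    "utils.py": [],
    "legacy.py": ["utils.py"],
}

if __name__ == "__main__":
    ranks = pagerank(IMPORTS)
    for name, score in sorted(ranks.items(), key=lambda item: -item[1]):
        print(f"{name:10s} {score:.3f}")
    print("possibly dead:", unreferenced_modules(IMPORTS, {"cli.py", "api.py"}))
```

Here utils.py ranks highest because everything depends on it, and legacy.py is flagged because no other module imports it.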

SOURCES

[1]

3. Frontier AI Models Disrupting CTF Competitions

The rise of advanced AI models like Claude Opus 4.5 and GPT-5.5 has enabled the automation of medium and hard CTF challenges, shifting the competitive landscape from human skill to AI orchestration. Security experts argue that public leaderboards are no longer reliable measures of human capability, as agents can solve complex challenges with minimal intervention.

• AI models can now solve medium and hard CTF challenges with minimal human input.
• The CTFtime leaderboard is no longer considered a reliable metric for human security skill.
• Security practitioners are shifting toward educational platforms like picoGym and HackTheBox.

This shift forces a re-evaluation of how security skills are measured and validated, as traditional competitive formats are increasingly vulnerable to AI-driven automation.

SOURCES

[1]

4. NVIDIA Releases SANA-WM World Model

SANA-WM is a new open-source world model that generates minute-long, 720p videos from a single image and a 6-DoF camera trajectory. The model uses a hybrid architecture with Gated DeltaNet blocks, which keep a constant-size recurrent state and so make long video generation efficient. It is available under an Apache 2.0 license and can generate clips in under a minute on high-end consumer hardware.

• Generates 60-second, 720p video from a single image.
• Features a hybrid architecture for efficient recurrent state management.
• Available under an Apache 2.0 license via the NVlabs/Sana repository.

This release provides developers with a high-performance, open-source tool for video generation and world modeling, significantly lowering the barrier for creating long-form synthetic video content.
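
The appeal of a constant recurrent state is that memory does not grow with the number of frames processed. The sketch below shows a simplified gated delta-rule update of the kind described in the Gated DeltaNet literature: the state is a fixed matrix that is decayed by a gate (alpha) and corrected toward a new key/value pair (beta). This is an illustration under those assumptions, not SANA-WM's actual code, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # d_v rows by d_k columns


def gated_delta_step(
    state: Matrix, key: Vector, value: Vector, alpha: float, beta: float
) -> Matrix:
    """One update: S <- alpha * (S - beta * (S k) k^T) + beta * v k^T."""
    new_state: Matrix = []
    for row, v_i in zip(state, value):
        # What the current memory would return for this key (one output dim).
        predicted = sum(s * k for s, k in zip(row, key))
        new_state.append(
            [alpha * (s - beta * predicted * k) + beta * v_i * k
             for s, k in zip(row, key)]
        )
    return new_state


def read(state: Matrix, query: Vector) -> Vector:
    """Retrieve the value associated with a query: S q."""
    return [sum(s * q for s, q in zip(row, query)) for row in state]


if __name__ == "__main__":
    state: Matrix = [[0.0, 0.0], [0.0, 0.0]]
    key = [1.0, 0.0]  # unit-length key
    state = gated_delta_step(state, key, [3.0, 5.0], alpha=1.0, beta=1.0)
    print(read(state, key))  # [3.0, 5.0]
    # Writing again under the same key replaces the old value.
    state = gated_delta_step(state, key, [7.0, 9.0], alpha=1.0, beta=1.0)
    print(read(state, key))  # [7.0, 9.0]
    print(len(state), len(state[0]))  # 2 2
```

However many updates are applied, the state stays the same 2-by-2 matrix, in contrast to an attention cache that grows with sequence length.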

SOURCES

[1]

5. DeepSeek-V4-Flash and Local LLM Steering

DwarfStar 4, a version of llama.cpp, allows developers to run DeepSeek-V4-Flash locally with built-in steering functionality. Steering modifies the model's internal activations during inference to push its behavior in a chosen direction. It offers a way to influence outputs, but it remains a niche technique compared to standard prompt engineering.

• Steering requires direct access to model activations, limiting it to open-weights models.
• DwarfStar 4 integrates steering directly into the llama.cpp inference workflow.
• Most steering applications are currently outcompeted by simpler prompt engineering techniques.

Direct activation steering provides a powerful, albeit complex, method for controlling model behavior that is only possible with open-weights models.
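
As a rough illustration of what activation steering means, the sketch below builds a steering direction as the difference between mean activations for two sets of examples, then adds a scaled copy of it to a hidden state. This is one common recipe, assumed here for illustration. It is not DwarfStar 4's interface, the function names are hypothetical, and the short lists of floats stand in for real model activations.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_direction(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Direction pointing from the 'negative' examples to the 'positive' ones."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Shift a hidden state along the steering direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Toy activations recorded for prompts with and without the target trait.
    with_trait = [[1.0, 0.0], [3.0, 0.0]]
    without_trait = [[0.0, 1.0], [0.0, 3.0]]
    direction = steering_direction(with_trait, without_trait)
    print(direction)  # [2.0, -2.0]
    print(apply_steering([1.0, 1.0], direction, strength=0.5))  # [2.0, 0.0]
```

The strength parameter is the knob: zero leaves the model untouched, and larger values push harder at the risk of degrading output quality. Because the shift is applied to activations, it needs access that hosted, closed-weights models do not expose.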

SOURCES

[1]

6. AI Coding Agents Targeted at Pwn2Own 2026

The Pwn2Own Berlin 2026 event highlighted the growing attack surface of AI-integrated developer tools. Researchers earned significant bounties for discovering zero-day exploits in the Cursor AI coding agent and OpenAI's Codex. These findings underscore the security risks inherent in deploying AI agents that interact with local development environments.

• Zero-day vulnerabilities were identified in Cursor AI and OpenAI Codex.
• Researchers earned $50,000 in total for AI-specific exploits.
• The event reinforces the need for security audits of AI-integrated developer platforms.

As AI coding agents gain deeper access to local files and systems, they become high-value targets for attackers, necessitating more robust security practices for AI-native tooling.

SOURCES

[1]

7. LiteLLM Agent Platform Released

The LiteLLM Agent Platform offers a self-hosted infrastructure layer designed to manage multiple AI agents in production. It provides per-team and per-context sandbox isolation and preserves session continuity across pod restarts. The platform integrates with the existing LiteLLM AI Gateway to handle model routing and cost tracking while adding persistent storage and runtime management.

• Provides isolated runtime environments for agent sessions.
• Ensures session continuity across pod restarts and upgrades.
• Built on Kubernetes and integrates with the LiteLLM AI Gateway.

This platform addresses the operational challenges of scaling AI agents in production, specifically regarding isolation, persistence, and infrastructure management.

SOURCES

[1]

8. Lighthouse Attention for Long-Context Pretraining

Lighthouse Attention reduces the computational cost of scaled dot-product attention by using a multi-level pyramid to pool queries, keys, and values. This approach lowers attention complexity from O(N²d) to O(S²d), with S presumably the shorter length that remains after pooling, resulting in significant wall-clock speedups during pretraining. The method is designed for training-only use: a model trained this way can be resumed under dense attention, so it stays compatible with standard inference.

• Reduces attention complexity from O(N²d) to O(S²d).
• Delivers 1.4–1.7× end-to-end speedup during pretraining.
• Compatible with dense attention for inference after training.

Efficient long-context pretraining is a major bottleneck for modern LLMs; this method provides a way to scale to 1M+ tokens without the prohibitive costs of standard dense attention.
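
The complexity claim can be checked with simple arithmetic. The sketch below assumes N is the full sequence length, S is the length after pooling, and d is the head dimension; the specific numbers and the single-level mean pooling are hypothetical stand-ins, since the actual method uses a multi-level pyramid.

```python
from typing import List

Vector = List[float]


def attention_cost(seq_len: int, head_dim: int) -> int:
    """Proportional cost of dense attention: seq_len^2 * head_dim."""
    return seq_len * seq_len * head_dim


def mean_pool(sequence: List[Vector], factor: int) -> List[Vector]:
    """Average each run of `factor` consecutive vectors into one vector."""
    if factor <= 0 or len(sequence) % factor != 0:
        raise ValueError("sequence length must be a positive multiple of factor")
    pooled: List[Vector] = []
    for start in range(0, len(sequence), factor):
        window = sequence[start:start + factor]
        pooled.append([sum(column) / factor for column in zip(*window)])
    return pooled


if __name__ == "__main__":
    n, s, d = 1_048_576, 65_536, 128  # hypothetical N, S, and head dimension
    ratio = attention_cost(n, d) // attention_cost(s, d)
    print(ratio)  # 256, i.e. (N / S) squared

    keys = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(mean_pool(keys, 2))  # [[2.0, 3.0], [6.0, 7.0]]
```

The attention-only saving scales with (N/S)², so a 16× shorter pooled sequence cuts attention cost by 256× in this example. Attention is only one part of a training step, which is consistent with the reported end-to-end speedup being much smaller than that ratio.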

SOURCES

[1]
