Audesso | Daily: AI

Google Releases Gemma 4 Quantization-Aware Training Checkpoints

00:00 / --:--

← Back to home

Google Releases Gemma 4 Quantization-Aware Training Checkpoints

1. Google Releases Gemma 4 Quantization-Aware Training Checkpoints

Google DeepMind's new Quantization-Aware Training (QAT) checkpoints simulate quantization during training to minimize precision loss. The release includes a Q4_0 format and a specialized mobile schema that optimizes embedding and KV cache layers. These models are available on Hugging Face in GGUF and compressed tensor formats, compatible with popular local runtimes like llama.cpp, Ollama, and vLLM.

  • Google DeepMind released Gemma 4 QAT checkpoints in Q4_0 and a specialized mobile format.
  • The Q4_0 format reduces the Gemma 4 E2B model's memory footprint to 3.2 GB and the E4B model to 5 GB.
  • The mobile QAT schema reduces the E2B model to under 1 GB using static activations, channel-wise quantization, and targeted 2-bit compression.
  • The checkpoints are available on Hugging Face with support for llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM.
  • Performance testing on an AMD 7900 XTX showed a 45% reduction in generation time and 5.7GB VRAM savings for the 12B QAT model compared to Q8_0.

Developers can run Gemma 4 models locally with significantly reduced VRAM requirements and minimal quality loss compared to standard post-training quantization.

2. Open-Weights Explosion: 25+ Notable Models Released Across Modalities

A remarkable week for open-source AI saw over 25 notable open-weight releases. Highlights include NVIDIA's massive 550B Nemotron 3 Ultra, Google's dense any-to-any Gemma 4 12B, and StepFun's Step-3.7-Flash. Edge developers also received new options like Liquid AI's LFM2.5-8B-A1B and RedNote's dots.tts pipeline.

  • NVIDIA released Nemotron 3 Ultra (550B hybrid Mamba-MoE, 1M context) and Nemotron-3.5 ASR (600M streaming model).
  • Google released Gemma 4 12B (dense any-to-any, 256k context, 140+ languages).
  • StepFun released Step-3.7-Flash (198B sparse MoE VLM, Apache 2.0).
  • Liquid AI released LFM2.5-8B-A1B (edge MoE, 1.5B active parameters, MLX compatible).
  • Other releases include Ideogram 4 (9.3B flow-matching DiT), RedNote dots.tts, and NVIDIA Cosmos3-Super (64B omnimodal world model).

This consolidated list provides developers with a quick reference to the latest self-hostable models, including massive hybrid models and specialized edge models.

SOURCES

3. Fixing Gemma 4 12B Tool Calling and Coding Failures

While developers initially reported that Gemma 4 12B frequently failed tool calls in evaluation harnesses, a community-discovered fix resolves the issue. By compiling llama.cpp from source and applying a custom chat template via the --jinja and --chat-template-file flags, developers can restore proper tool calling. This enables reliable local deployment of the model for agentic workflows.

  • Users reported frequent tool call failures with Gemma 4 12B, preventing its use in harnesses like OpenCode.
  • The fix requires compiling llama.cpp from source and using the --jinja and --chat-template-file flags with a custom template.
  • A developer reported achieving 50 tokens per second with the Unsloth Q5_K_XL model (8.6GB) using a 32k context window and Q8 KV cache.
  • Google AI Edge is also enabling local deployment of Gemma 4 12B on laptops for agentic workflows.

Applying this custom template allows developers to successfully evaluate and use Gemma 4 12B for local agentic workflows and coding tasks without tool call failures.

4. Offloading KV Cache to System RAM with llama.cpp

The llama.cpp -nkvo (no KV offload) option allows developers to offload the KV cache to system RAM rather than VRAM. In testing with a Qwen3.6 27B model on a 16GB GPU, this option allowed the entire model to fit on the GPU with an f16 KV cache, expanding the context window to 128k. The performance trade-off was minimal, dropping from 23 tps to 19 tps at peak.

  • The -nkvo (--no-kv-offload) option in llama.cpp offloads the KV cache to system RAM instead of VRAM.
  • Testing with Qwen3.6 27B on an RTX 5060 Ti (16GB) and DDR5 RAM enabled a 128k context window by keeping 63 layers on the GPU.
  • Enabling -nkvo achieved 19 tps peak and 14 tps during long generation, compared to 23 tps peak and 16 tps with quantized q4_0 KV cache on GPU.
  • Quantizing the KV cache when offloaded to RAM provided no performance improvement and occasionally degraded performance.

Developers can dramatically increase context windows (up to 128k) on limited VRAM GPUs by offloading the KV cache to DDR5 RAM instead of quantizing it.

SOURCES

5. OpenLumara: A Modular, Token-Efficient Local AI Agent Framework

OpenLumara is a modular, open-source AI agent framework built from scratch for local models. Unlike "vibecoded" frameworks, it focuses on token efficiency, allowing developers to disable modules to shrink the system prompt from 4k to under 1k tokens. It features a sandboxed shell environment via Docker or Podman, automatic sensitive data masking, and a coder module that targets specific functions or classes.

  • OpenLumara is designed for local models, llama.cpp, and koboldcpp, and is licensed under GPL2.
  • The default system prompt is ~4k tokens but can be reduced to under 1k tokens by disabling unused modules.
  • Security features include a sandboxed shell environment using Docker/Podman and automatic masking of sensitive data.
  • The coder module targets specific functions or classes in code files rather than using search-and-replace.
  • It includes a web-based UI, a CLI mode, and is integrated into the esobold fork of koboldcpp.

It offers a highly modular, token-efficient alternative to heavy agent frameworks, allowing developers to shrink system prompts to under 1k tokens.

SOURCES

6. Alibaba Open-Sources Open Code Review CLI Tool

Alibaba has open-sourced Open Code Review, an Apache-2.0 licensed CLI tool used internally for two years to identify millions of code defects. The tool reads Git diffs and sends changed files to a configurable LLM to generate structured, line-level comments. It can be integrated into CI/CD pipelines, used as a slash command in coding agents, and includes a local viewer for browsing session history.

  • Open Code Review is an open-source, Apache-2.0 licensed CLI tool developed and used internally by Alibaba.
  • It reads Git diffs and sends changed files to a configurable LLM to generate structured review comments with line-level precision.
  • The architecture combines deterministic engineering (file selection/bundling) with an agent for dynamic decision-making.
  • It supports integration into CI/CD pipelines, works as a slash command in AI coding agents, and includes a local viewer for history.
  • Installation is supported via NPM, GitHub binary releases, or building from source.

Developers can integrate this tool into their local workflows, coding agents, or CI/CD pipelines to automate code reviews using configurable LLMs.

SOURCES

7. Microsoft Announces IQ Context Layer and Agent Optimizer at Build 2026

At its Build 2026 conference, Microsoft announced several tools for enterprise agent development. The Microsoft IQ context layer provides secure data access across Fabric, Foundry, Web, and Work data sources. Additionally, Microsoft launched the Agent Optimizer, which uses rubric-based evaluation to automate prompt modifications, and enabled agent identity via the Entra system.

  • The Microsoft IQ suite includes Fabric IQ, Foundry IQ, Web IQ, and Work IQ (APIs releasing June 16).
  • The Agent Optimizer tool uses rubric-based evaluation to provide granular feedback and automated prompt modifications.
  • Microsoft is enabling agent identity through the Entra system, giving agents their own email and Teams access.
  • Microsoft also introduced Scout, a personal work agent built on open-source OpenClaw technology.

These enterprise-focused tools provide structured context, automated prompt modifications, and agent identity management for developers building on Microsoft's ecosystem.

SOURCES

8. Microsoft Open-Sources pg_durable for In-Database Workflows

Microsoft has open-sourced pg_durable, a PostgreSQL extension designed for in-database durable execution. Built using the pgrx framework and Rust, the extension allows developers to define long-running, fault-tolerant workflows using a SQL-based DSL. By managing state and retries natively within PostgreSQL 17 or 18, pg_durable eliminates the need for external queues or workers.

  • pg_durable is a PostgreSQL extension (currently in preview) that manages state and retries natively.
  • It eliminates the need for external cron jobs, workers, or queues.
  • The system uses a SQL-based DSL with operators like ~> and |=> to define workflows.
  • Built using the pgrx framework, it relies on Rust libraries duroxide and duroxide-pg.
  • Requires PostgreSQL 17 or 18 and must be added to shared_preload_libraries.

Developers can build durable, agentic, or transactional workflows that survive crashes and restarts natively in PostgreSQL without external queue infrastructure.

SOURCES

9. Optimizing Qwen 3.6 MoE on an 8GB VRAM Laptop GPU

A developer successfully ran the Qwen3.6-35B-A3B MoE model on an 8GB VRAM laptop GPU by offloading experts to the CPU. Key optimizations included using --no-mmap to prevent page faults and maintaining 1.5GB of VRAM headroom. Surprisingly, speculative decoding with a Qwen3.5-0.8B draft model provided a 26% speedup, contradicting full-GPU benchmarks where speculative decoding is often net-negative.

  • The setup offloaded MoE experts to the CPU, using --no-mmap and maintaining 1.5GB VRAM headroom to avoid Windows system memory fallback.
  • Speculative decoding using a Qwen3.5-0.8B draft model provided a 26% speed increase, achieving ~39 tps.
  • K-quants outperformed i-quants for CPU-offloaded experts due to optimized CPU kernels.
  • TurboQuant, Flash Attention, and i-quants provided no benefit or decreased performance due to the hybrid architecture.

This demonstrates a viable configuration for running large MoE models on consumer hardware, achieving 39 tokens per second with a 26% speedup from speculative decoding.

SOURCES

10. NVIDIA Introduces Dynamo Snapshot for Fast Kubernetes AI Startup

NVIDIA's Dynamo Snapshot is a checkpoint/restore system designed to eliminate cold-start latency for AI inference on Kubernetes. By combining cuda-checkpoint for GPU state and CRIU for host process state, the system serializes running containers. It utilizes CUDA Virtual Memory Management to unmap the KV cache, shrinking checkpoint sizes and enabling a gpt-oss-120b model to start in under 5 seconds.

  • Dynamo Snapshot uses cuda-checkpoint for GPU state and CRIU for host-side process state.
  • It is deployed as a privileged snapshot-agent DaemonSet without modifying the underlying runc container runtime.
  • KV cache unmap and release via CUDA Virtual Memory Management reduces checkpoint sizes (e.g., from 190 GiB to 6 GiB for Qwen3-0.6B).
  • In a proof-of-concept, it reduced startup time for a gpt-oss-120b model to under 5 seconds.
  • It currently requires x86_64 GPU nodes, NVIDIA driver 580.xx or newer, and supports vLLM workers in limited preview.

Developers deploying large models on Kubernetes can drastically reduce cold-start times and scale-up latency by serializing GPU and host process states.

SOURCES

11. Lowfat CLI Tool Filters Verbose Output to Save LLM Tokens

The open-source tool 'lowfat' is a pluggable CLI filter designed to reduce the verbosity of terminal outputs sent to AI agents. Operating as a local-first single binary, it acts as an agent hook or shell wrapper. It features a customizable plugin system for specific commands, helping developers avoid token limits on platforms like Amazon Bedrock.

  • 'lowfat' is a single-binary, local-first tool with no telemetry that functions as an agent hook or shell wrapper.
  • It features a plugin system to customize filters for specific commands and supports UNIX-style composable pipes.
  • The developer reported a 91.8% total token reduction over two months of personal use.
  • The tool helps avoid hitting token limits for services like Amazon Bedrock.

Developers can use this tool as an agent hook or shell wrapper to prevent coding agents from consuming excessive tokens on long CLI outputs.

SOURCES

12. KVarN KV-Cache Quantization Implemented in BeeLlama.cpp

A developer has implemented Huawei's KVarN KV-cache quantization method in a fork of llama.cpp called BeeLlama.cpp (v0.3.2 Preview). KVarN provides 3–5 compression of the KV cache, delivering q5 quality at 4-bit and q4 quality at 3.5-bit. The implementation currently supports Qwen 3.6 27B and Gemma 4 31B models on NVIDIA hardware.

  • KVarN is a Huawei-developed KV-cache quantization method offering 3–5 compression.
  • It is implemented in the BeeLlama.cpp v0.3.2 Preview release, supporting Qwen 3.6 27B and Gemma 4 31B.
  • Users can enable it using the --cache-type-k and --cache-type-v flags.
  • Benchmarks show KVarN delivers q5 quality at 4-bit and q4 quality at 3.5-bit, with higher precision than TurboQuant.

This implementation allows developers to run Qwen 3.6 27B and Gemma 4 31B with significantly reduced memory footprints while maintaining high precision.

SOURCES

13. Braintrust Launches Topics for Large-Scale Agent Trace Analysis

Braintrust has launched Topics, an intelligence layer designed to analyze production agent traces at scale. Standard NLP tools often break when processing million-token traces with hundreds of spans due to non-uniform document shapes. Topics solves this by using an LLM summary to make the analysis tractable, processing traces through a pipeline of preprocessing, embedding, clustering, and classification.

  • Braintrust founder Ankur Goyal introduced Topics, inspired by Anthropic's Clio paper.
  • The pipeline handles million-token traces with hundreds of spans that typically break standard NLP tools.
  • It processes data through preprocessing, faceting, embedding, clustering, naming, and classification.
  • The pipeline uses an LLM summary to avoid fitting raw traces into an embedding model's context window.

This allows developers to analyze million-token agent traces with hundreds of spans by using LLM summaries to make the data tractable for embedding and clustering.

SOURCES

14. RedNote Releases dots.tts 2B Open-Source Text-to-Speech Model

RedNote (Xiaohongshu) has released dots.tts, an open-source 2-billion-parameter text-to-speech model under the Apache 2.0 license. The model features a fully continuous architecture that bypasses both codec tokens and phoneme pipelines, synthesizing 48 kHz audio directly from text. It also supports zero-shot voice cloning.

  • dots.tts is a 2B parameter open-source TTS model released under the Apache 2.0 license.
  • It utilizes a fully continuous architecture that does not rely on codec tokens.
  • The model supports 48 kHz audio synthesis and zero-shot voice cloning.
  • It performs direct text-to-speech synthesis without a phoneme pipeline.

Developers can self-host a high-quality, Apache 2.0-licensed TTS model capable of 48 kHz audio synthesis without a phoneme pipeline.

SOURCES

15. Microsoft Fara Tutorial Demonstrates Browser-Use Agents in Colab

A new tutorial outlines how to run Microsoft Fara browser-use agents in Google Colab. By utilizing a mock OpenAI-compatible endpoint, developers can test and verify browser automation loops without deploying the full Fara-7B model. The setup clones the Fara repository, configures Playwright, and provides options to transition to real deployments via vLLM, LM Studio, or Azure Foundry.

  • The tutorial guides users through cloning the Fara repository, installing dependencies, and configuring Playwright.
  • It uses a mock OpenAI-compatible endpoint to test the agent loop, avoiding the need for a full Fara-7B deployment.
  • Configuration options allow switching to real Fara-7B deployments via Azure Foundry, vLLM, LM Studio, or Ollama.
  • The agent can be executed via fara-cli or the fara.run_fara Python module.

Developers can quickly test and verify browser automation agent loops in a sandboxed environment without deploying a full Fara-7B model.

SOURCES

16. llama.cpp Server Now Supports Under-30-Second Model Hot Swapping

The llama.cpp project has introduced a model hotswap API that allows developers to swap active models in under 30 seconds. This API is compatible with OpenWebUI and Hermes, offering a major performance improvement over older PyTorch-based swapping methods. Developers can deploy the server via Podman using the official CUDA 13 server image.

  • The llama.cpp model hotswap API is compatible with OpenWebUI and Hermes.
  • Model swapping performance is significantly faster than older PyTorch-based methods.
  • A podman command is available to run the server container using the ghcr.io/ggml-org/llama.cpp:server-cuda13 image.
  • The configuration supports a models preset file and a maximum model limit.

Developers running local LLM servers can dynamically switch models on the fly without restarting the container, improving resource utilization.

SOURCES

17. Unsloth Releases Gemma 4 MTP GGUF and QAT Weights

Unsloth has released Multi-Token Prediction (MTP) GGUF weights for Gemma 4 models on Hugging Face. The weights are available for the 31B, 26B-A4B, and 12B model sizes in Q8, F16, and BF16 formats. Additionally, Unsloth has published a collection of Gemma 4 QAT models and a corresponding technical guide.

  • Unsloth released MTP GGUF weights for Gemma 4 in 31B, 26B-A4B, and 12B sizes.
  • Available formats for the MTP GGUF weights include Q8, F16, and BF16.
  • Unsloth also published a collection of Gemma 4 QAT models on Hugging Face along with a technical guide.

This release provides developers with optimized, ready-to-run GGUF formats of Gemma 4 models for local deployment using tools like llama.cpp.

SOURCES

18. NVIDIA Releases Nemotron 3.5 Content Safety Model

NVIDIA has released Nemotron 3.5 Content Safety, a model designed for enterprise safety enforcement. Built to be integrated into production moderation pipelines, the model supports multimodal and multilingual inputs. It features auditable reasoning capabilities and can be customized to meet specific enterprise safety guidelines.

  • NVIDIA released Nemotron 3.5 Content Safety for enterprise safety enforcement.
  • The model supports multimodal and multilingual inputs.
  • It features auditable reasoning capabilities and is customizable for specific enterprise needs.

Developers can integrate this model into production moderation pipelines to enforce safety with auditable reasoning capabilities.

SOURCES

Daily AI signal in your inbox

5 minutes a day. Free, unsubscribe anytime.

Daily AI signal in your inbox

5 minutes a day. Free, unsubscribe anytime.