Audesso | Daily: AI

Google Releases Gemma 4 12B Encoder-Free Multimodal Model

1. Google Releases Gemma 4 12B Encoder-Free Multimodal Model

Google DeepMind has launched Gemma 4 12B, the first mid-sized open-weights model in its family to process native audio inputs. The model uses an encoder-free architecture: 16 kHz audio frames and visual patches are processed directly in the backbone LLM rather than by separate vision or audio encoders. Released under the Apache 2.0 license on Hugging Face, Kaggle, and the Google AI Edge Gallery, the 11.95-billion-parameter model runs locally on devices with 16GB of VRAM or unified memory. Google also released an accompanying Multi-Token Prediction (MTP) drafter model to reduce inference latency on local hardware.

  • Gemma 4 12B is an 11.95B parameter decoder-only transformer with a 256K context window licensed under Apache 2.0.
  • Features a unified, encoder-free architecture that routes raw audio (up to 30 seconds) and video patches (up to 60 seconds) directly into the LLM.
  • Requires 16GB of VRAM or unified memory, making it compatible with consumer GPU laptops and Apple Silicon.
  • Google released an accompanying Multi-Token Prediction (MTP) drafter model to reduce local inference latency.
  • Compatible out of the box with llama.cpp, vLLM, SGLang, Ollama, MLX, and Unsloth.

Developers can now deploy a mid-sized local model that handles text, images, video, and native audio directly in the core LLM backbone without separate encoders.
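
As a rough check on the 16GB figure, the sketch below estimates weight storage for 11.95 billion parameters at a few precisions. The bit widths are illustrative assumptions, since the summary does not say which precision the 16GB requirement refers to, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope weight memory for an 11.95B-parameter model.
# Bit widths are illustrative assumptions; KV cache and runtime
# overhead are deliberately left out.
PARAMS = 11.95e9  # parameter count quoted in the item above


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


estimates = {bits: weight_gb(PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    verdict = "fits" if gb < 16 else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.2f} GB ({verdict} in 16 GB before overhead)")
```

Under these assumptions, 16-bit weights alone exceed 16GB, so the quoted requirement presumably refers to a quantized build.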

2. Mnemo Launches Local-First Knowledge Graph Memory Layer for LLMs

Released on Hacker News, mnemo is an open-source, local-first AI memory layer designed to give LLMs persistent knowledge graph capabilities. Operating as a sidecar service with zero cloud dependencies, mnemo uses an LLM to extract named entities and their relationships from text and stores them in a local SQLite database. The engine keeps the graph in memory using the petgraph library, which lets it apply updates atomically and answer retrievals in under 50 milliseconds. Its features are exposed via a Python SDK, a REST API, and a CLI tool.

  • mnemo is a local-first AI memory layer distributed as a single static binary.
  • Extracts named entities and relationships from input text using an LLM and stores them in SQLite.
  • Retrievals complete in under 50 milliseconds, served from an in-memory petgraph graph that is updated atomically.
  • Integrates with Ollama, OpenAI, Anthropic, and other OpenAI-compatible APIs.
  • Provides a CLI tool, a Python SDK, and a REST API for developer integration.

App developers can integrate a local persistent memory layer into LLM-driven applications with sub-50ms retrieval latency and no external cloud API requirements.
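
To make the entity-and-relationship storage concrete, here is a minimal sketch of triples persisted in SQLite. The schema, table names, and helper functions are hypothetical (mnemo's actual layout and SDK are not described here), and the triples are hard-coded where a real pipeline would take them from an LLM extraction step.

```python
import sqlite3

# Hypothetical schema for illustration; not mnemo's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE relations (
    subject_id INTEGER NOT NULL REFERENCES entities(id),
    predicate  TEXT NOT NULL,
    object_id  INTEGER NOT NULL REFERENCES entities(id),
    UNIQUE (subject_id, predicate, object_id)
);
""")


def entity_id(name: str) -> int:
    """Return the id for an entity, creating the row if it is new."""
    conn.execute("INSERT OR IGNORE INTO entities (name) VALUES (?)", (name,))
    return conn.execute(
        "SELECT id FROM entities WHERE name = ?", (name,)
    ).fetchone()[0]


def store_triples(triples) -> None:
    """Store (subject, predicate, object) triples in a single transaction."""
    with conn:  # commits on success, rolls back if any insert fails
        for subj, pred, obj in triples:
            conn.execute(
                "INSERT OR IGNORE INTO relations VALUES (?, ?, ?)",
                (entity_id(subj), pred, entity_id(obj)),
            )


def neighbors(name: str):
    """Return (predicate, object) pairs for one subject, sorted for stability."""
    rows = conn.execute(
        """
        SELECT r.predicate, o.name
        FROM relations r
        JOIN entities s ON s.id = r.subject_id
        JOIN entities o ON o.id = r.object_id
        WHERE s.name = ?
        ORDER BY r.predicate, o.name
        """,
        (name,),
    )
    return rows.fetchall()


# Stand-in for LLM extraction output.
store_triples([("Ada", "works_at", "Acme"), ("Ada", "knows", "Grace")])
print(neighbors("Ada"))
```

The uniqueness constraint makes repeated extraction of the same fact idempotent, which matters when the same text is ingested more than once.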

3. Sandboxed Releases Open-Source Local Engine for Agent Playgrounds

The open-source engine sandboxed has been released to help developers build hosting backends for AI app builders and coding playgrounds. Running on a single server powered by Docker, Traefik, and SQLite, the system avoids the complexity of Kubernetes or message queues while providing isolated Linux containers for coding agents. It supports live preview URLs with automatic routing and TLS, alongside stop-on-idle and wake-on-request mechanisms to optimize memory usage. The platform comes pre-configured with OpenCode and Claude Code CLIs inside its environments.

  • sandboxed runs on a single server using Docker, Traefik, and SQLite, bypassing Kubernetes and complex message queues.
  • Includes automatic routing and TLS for live preview URLs of running sandbox applications.
  • Features a stop-on-idle and wake-on-request mechanism to optimize memory usage and reduce hosting costs.
  • Pre-installs OpenCode and Claude Code CLIs to facilitate AI-driven coding tasks within isolated Linux containers.
  • Released under the MIT license, tailored for multi-tenant AI playgrounds and agent builders.

Developers can build multi-tenant AI app builders or coding agent environments without the complexity or cost of orchestrating Kubernetes.
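
The stop-on-idle and wake-on-request behavior can be sketched as a small state tracker. This is an illustration of the general pattern only: the class name, the timeout value, and the injected clock are assumptions, and a real engine would start and stop containers where the comments indicate.

```python
import time


class SandboxState:
    """Tracks whether one sandbox is running and when it was last used."""

    def __init__(self, idle_timeout_s: float, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock
        self.running = False
        self.last_request = clock()

    def on_request(self) -> bool:
        """Record a request; return True if the sandbox had to be woken."""
        woke = not self.running
        self.running = True  # a real engine would start the container here
        self.last_request = self.clock()
        return woke

    def reap_if_idle(self) -> bool:
        """Stop the sandbox if it has been idle too long; return True if stopped."""
        idle_for = self.clock() - self.last_request
        if self.running and idle_for >= self.idle_timeout_s:
            self.running = False  # a real engine would stop the container here
            return True
        return False


# Demo with a fake clock so the behavior is deterministic.
now = [0.0]
box = SandboxState(idle_timeout_s=300, clock=lambda: now[0])
first_wake = box.on_request()      # first request wakes the sandbox
now[0] = 100.0
stopped_early = box.reap_if_idle() # still within the idle window
now[0] = 400.0
stopped_late = box.reap_if_idle()  # idle past the timeout, so it stops
second_wake = box.on_request()     # next request wakes it again
print(first_wake, stopped_early, stopped_late, second_wake)
```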

4. Nous Research Launches Hermes Desktop Public Preview for Local Agents

Nous Research has released Hermes Desktop in public preview, providing a native GUI for the autonomous Hermes Agent v0.15.2 on macOS, Windows, and Linux. The desktop application shares its core agent configuration, session storage, and memory with existing CLI versions. It enables developers to run autonomous planning loops across five sandboxed execution backends, including local, Docker, SSH, Singularity, and Modal. The MIT-licensed platform supports tool integration via the Model Context Protocol (MCP) and maintains persistent memory via FTS5 session search.

  • Hermes Desktop is a native cross-platform application in public preview for macOS, Windows, and Linux.
  • Provides a graphical interface for the autonomous, MIT-licensed Hermes Agent v0.15.2.
  • Supports five sandboxed execution backends: local, Docker, SSH, Singularity, and Modal.
  • Integrates Model Context Protocol (MCP) for tool support and features streaming responses and a file browser.
  • Implements persistent agent-curated memory with cross-session recall using FTS5 search and LLM summarization.

Developers get an out-of-the-box UI and local agent environment that integrates MCP tools and sandboxed execution across local, Docker, or cloud runtimes.
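
For readers unfamiliar with FTS5, the sketch below shows full-text search over stored session notes using SQLite's FTS5 extension from Python. The table name, columns, and sample notes are hypothetical and do not reflect Hermes Agent's actual storage layout; the example also assumes an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table; requires an SQLite build with the FTS5 extension.
conn.execute("CREATE VIRTUAL TABLE session_notes USING fts5(session_id, content)")
conn.executemany(
    "INSERT INTO session_notes (session_id, content) VALUES (?, ?)",
    [
        ("s1", "Configured the Docker backend and fixed a failing build"),
        ("s2", "Drafted release notes for the CLI"),
        ("s3", "Investigated SSH backend timeouts on the build server"),
    ],
)


def recall(query: str):
    """Return ids of sessions whose notes match an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT session_id FROM session_notes "
        "WHERE session_notes MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


print(recall("build"))
print(recall("docker AND build"))
```

Matching is case-insensitive with the default tokenizer, so the query term "docker" finds the note that mentions "Docker".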

5. Llama.cpp Optimizes Multi-Token Prediction for Qwen Models

The llama.cpp project has shipped version b9495, delivering key performance optimizations and bug fixes for Multi-Token Prediction (MTP) on Qwen3.5 and Qwen3.6 models. A merged pull request (PR #24025) specifically introduces support for using post-norm hidden states to accelerate MTP execution. Benchmarks shared by community members using the updated runner on the Qwen3.6-35B-A3B-MTP-UD-Q5_K_XL model demonstrate a draft acceptance rate of 0.52614, paving the way for faster text-generation speeds during local execution.

  • Llama.cpp version b9495 introduces optimizations and bug fixes for Qwen Multi-Token Prediction (MTP).
  • A merged pull request (PR #24025) adds support for post-norm hidden states for Qwen3.5 MTP.
  • Optimizations target Qwen3.5 and Qwen3.6 model families, including the Qwen3.6-35B-A3B-MTP-UD-Q5_K_XL variant.
  • A shared benchmark using the optimized MTP configuration reported a draft acceptance rate of 0.52614.

This release increases local inference throughput and decreases latency for developers running Qwen models locally.
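
To relate an acceptance rate to throughput, the sketch below applies the standard speculative-decoding estimate of tokens produced per target-model pass. It rests on two assumptions that the benchmark does not confirm: that each drafted token is accepted independently with the reported probability, and that the draft lengths shown are representative. Treat the results as a rough guide, not as a measurement of llama.cpp.

```python
ACCEPTANCE_RATE = 0.52614  # draft acceptance rate from the shared benchmark


def expected_tokens_per_step(alpha: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability alpha, and that one extra token always comes from the
    target model itself: (1 - alpha**(draft_len + 1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(draft_len + 1)
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)


# Draft lengths are hypothetical; the benchmark does not state one.
for draft_len in (1, 2, 4):
    tokens = expected_tokens_per_step(ACCEPTANCE_RATE, draft_len)
    print(f"draft length {draft_len}: {tokens:.3f} tokens per pass")
```

The estimate counts tokens per pass only; actual speedup also depends on how much the drafting step itself costs.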

6. Developer Configures Android Device as Vulkan-Accelerated Local LLM Node

A developer has successfully configured a Samsung Galaxy Z Fold 6 as a portable, Vulkan-accelerated GGUF inference node within a self-hosted AI mesh. By offloading 89 layers to the mobile GPU via Vulkan, the setup exposes an OpenAI-compatible API endpoint routed locally through LiteLLM. Using Tailscale, the mobile device is linked to a private network that automatically falls back to larger nodes like a Mac Studio or an RTX-equipped machine, allowing the phone to function as a standalone server when disconnected from the mesh.

  • An Android device (Z Fold 6) was configured as a portable GGUF inference node.
  • Uses Vulkan acceleration to offload 89 layers to the mobile GPU.
  • Exposes an OpenAI-compatible endpoint that routes through LiteLLM.
  • Integrated into a self-hosted AI mesh via Tailscale with fallback routing to Mac Studio or RTX-equipped machines.
  • Allows standalone mobile inference when other local nodes are offline.

The setup demonstrates how developers can use high-end mobile hardware as a cost-effective, portable node in a self-hosted inference mesh with fallback routing.
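
Fallback routing of this kind can be sketched as trying nodes in priority order until one answers. The node names, their order, and the stand-in functions below are hypothetical; in the described setup LiteLLM handles routing, and each call would be an HTTP request to an OpenAI-compatible endpoint.

```python
from typing import Callable, Sequence


class AllNodesDown(Exception):
    """Raised when no node in the mesh could serve the request."""


def route(
    prompt: str, nodes: Sequence[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try nodes in priority order; return (node_name, reply) from the first success."""
    errors = []
    for name, call in nodes:
        try:
            return name, call(prompt)
        except ConnectionError as exc:  # node offline or unreachable
            errors.append(f"{name}: {exc}")
    raise AllNodesDown("; ".join(errors))


# Stand-ins for HTTP calls to OpenAI-compatible endpoints.
def offline(_prompt: str) -> str:
    raise ConnectionError("unreachable")


def mac_studio(prompt: str) -> str:
    return f"mac-studio handled: {prompt}"


# Hypothetical priority order: the first node is unreachable, so the
# request falls through to the next one.
nodes = [("z-fold-6", offline), ("mac-studio", mac_studio)]
print(route("hello", nodes))
```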

7. Alibaba Fun-Realtime-TTS Takes Top Spot on Speech Arena Leaderboard

Alibaba's Fun-Realtime-TTS model has claimed the top spot on Artificial Analysis's Speech Arena Leaderboard, achieving an Elo score of 1,219 over 962 arena matchups. The model outperformed several major commercial alternatives, including Google's Gemini 3.1 Flash TTS and Cartesia Sonic 3.5. Available to developers via Alibaba Cloud API, the model is priced at $27.60 per 1 million characters and supports real-time text-to-speech generation, voice cloning, voice design, and multilingual outputs.

  • Fun-Realtime-TTS reached the #1 spot on the Artificial Analysis Speech Arena Leaderboard with an Elo of 1,219 across 962 appearances.
  • Surpassed Gemini 3.1 Flash TTS, Inworld Realtime TTS-2 Research Preview, and Cartesia Sonic 3.5.
  • Priced at approximately $27.60 per 1 million characters on Alibaba Cloud.
  • Features include real-time speech generation, voice cloning, voice design, multilingual output, and support for regional accents.

Developers have a new top-performing, cost-competitive option for low-latency voice synthesis and real-time audio interaction.
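
For budgeting, per-character pricing converts to a job cost with simple arithmetic. The sketch below uses the $27.60 figure quoted above; the character counts are made-up examples, and actual billing rules (minimums, rounding, tiers) are not covered by the source.

```python
from decimal import Decimal

PRICE_PER_MILLION_CHARS = Decimal("27.60")  # USD, figure quoted in this item


def tts_cost_usd(num_chars: int) -> Decimal:
    """Estimate synthesis cost in USD, rounded to the nearest cent."""
    cost = Decimal(num_chars) / Decimal(1_000_000) * PRICE_PER_MILLION_CHARS
    return cost.quantize(Decimal("0.01"))


print(tts_cost_usd(250_000))  # 6.90
```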

8. Step-by-Step Guide Details QLoRA and DPO Fine-Tuning for LFM2-1.2B

A step-by-step developer tutorial on Google Colab walks through fine-tuning Liquid AI's LFM2-1.2B model using QLoRA, Supervised Fine-Tuning (SFT), and Direct Preference Optimization (DPO). Built on PyTorch, Transformers, TRL, PEFT, and bitsandbytes, the pipeline uses 4-bit quantization to conserve VRAM. The SFT stage trains on 500 samples from the 'smoltalk' dataset for 60 steps with a maximum sequence length of 1024, followed by adapter merging and a 40-step DPO phase that aligns the model's responses with preference data.

  • Demonstrates fine-tuning LFM2-1.2B on Google Colab using QLoRA, supervised fine-tuning (SFT), and direct preference optimization (DPO).
  • Utilizes standard libraries including Transformers, TRL, PEFT, datasets, bitsandbytes, and PyTorch.
  • Uses 500 samples from the 'smoltalk' dataset over 60 training steps for SFT with a 1024 max sequence length.
  • Applies 4-bit quantization to reduce GPU memory requirements during training.
  • Merges LoRA adapters into the base model and executes a 40-step DPO training phase to refine model response alignment.

The tutorial provides a practical blueprint for fine-tuning compact models such as LFM2-1.2B on modest GPU hardware using open-source libraries.
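
To show what the DPO phase optimizes, here is the standard DPO loss for a single preference pair, written in plain Python. It is a sketch of the published objective, not the tutorial's code: the log-probability values are invented, and beta = 0.1 is an assumed setting that the summary does not state.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not taken from the tutorial
) -> float:
    """DPO loss for one pair, from summed sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    where w is the chosen response and l is the rejected one.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Invented log-probabilities for illustration.
no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
prefers_chosen = dpo_loss(-5.0, -15.0, -10.0, -10.0)
prefers_rejected = dpo_loss(-15.0, -5.0, -10.0, -10.0)
print(no_preference, prefers_chosen, prefers_rejected)
```

The loss falls as the policy raises the chosen response's likelihood relative to the frozen reference model and lowers the rejected one's.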

9. Vercel Outlines BotID Defenses Against AI Inference Theft

Vercel has published an analysis of AI inference theft, detailing how attackers exploit exposed developer endpoints to scrape and resell LLM access. Because standard rate limits fail to stop sophisticated, distributed extraction attempts, Vercel advises developers to implement BotID analysis. This mechanism verifies the legitimacy of each client request prior to forwarding it to the upstream LLM API, helping developers protect their API keys and avoid unexpected cloud bills.

  • Vercel published a deep dive explaining how attackers exploit exposed application endpoints to resell stolen AI inference.
  • Notes that standard rate-limiting controls are often insufficient to prevent organized inference resale operations.
  • Recommends integrating BotID analysis to verify every incoming AI request and block unauthorized scrapers.

The guidance helps developers secure their API endpoints and prevent escalating bills caused by malicious actors who scrape or resell LLM access.
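
The verify-before-forwarding pattern can be sketched generically as a gate in front of the upstream call. Everything below is hypothetical: `is_legitimate` stands in for whatever bot-detection check is used and is not Vercel's BotID API, and the token comparison is a toy placeholder, not a real defense.

```python
from typing import Callable


def guarded_completion(
    request: dict,
    is_legitimate: Callable[[dict], bool],
    call_llm: Callable[[str], str],
) -> dict:
    """Check the client first so rejected requests never reach the paid upstream API."""
    if not is_legitimate(request):
        return {"status": 403, "body": "request rejected"}
    return {"status": 200, "body": call_llm(request["prompt"])}


# Toy stand-ins for demonstration only.
upstream_calls = []


def fake_llm(prompt: str) -> str:
    upstream_calls.append(prompt)  # records each billable upstream call
    return f"echo: {prompt}"


def has_valid_token(request: dict) -> bool:
    return request.get("client_token") == "demo-token"


allowed = guarded_completion(
    {"prompt": "hi", "client_token": "demo-token"}, has_valid_token, fake_llm
)
blocked = guarded_completion({"prompt": "hi"}, has_valid_token, fake_llm)
print(allowed, blocked, len(upstream_calls))
```

The property the sketch illustrates is that a blocked request incurs no upstream cost, since the check runs before any inference is requested.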

10. Angular v22 Released with Built-In Agentic Tooling and WebMCP Support

Angular v22 has officially been released, delivering a wave of production-ready APIs alongside dedicated agentic tooling. In addition to stabilizing features like Signal Forms and introducing the @Service decorator, the release includes updated Model Context Protocol (MCP) integrations and Angular Agent Skills to help AI assistants navigate modern Angular codebases. Crucially, the update adds experimental support for WebMCP, which enables browser-based AI agents to interact directly with web-based debugging and development tools.

  • Angular v22 features production-ready Signal Forms, Angular Aria, and Asynchronous Reactivity APIs.
  • Includes new agentic tooling, specifically updated MCP offerings and Angular Agent Skills to provide AI assistants with code context.
  • Introduces experimental support for WebMCP, letting agents interact directly with browser tools.
  • Adds a new @Service decorator and asynchronous dependency injection via injectAsync.

Web developers using Angular can now build applications that interface more seamlessly with local and web-based AI coding agents.
