Audesso | Daily: AI

Google Releases Gemini 3.5 Flash with High-Speed Agentic Capabilities

00:00 / --:--

← Back to home

Google Releases Gemini 3.5 Flash with High-Speed Agentic Capabilities

1. Google Releases Gemini 3.5 Flash with High-Speed Agentic Capabilities

Google officially launched its Gemini 3.5 Flash model at its annual developer conference. The model is specifically optimized for complex agentic and software engineering tasks, performing well on benchmarks such as Terminal-Bench 2.1 (76.2%) and MCP Atlas (83.6%). Running at speeds exceeding 280 output tokens per second, it offers a dramatic speed increase compared to previous iterations. Enterprise adoption has already begun with partners like Shopify, Salesforce, and Databricks.

  • Outputs nearly 300 tokens per second
  • Priced at $1.50 per 1 million input tokens and $9.00 per 1 million output tokens
  • Offers 90% discount for cached input tokens
  • Outperforms Gemini 3.1 Pro on Terminal-Bench 2.1 (76.2%) and MCP Atlas (83.6%)
  • Maintains a 1 million token context window

It offers a high-performance, cost-effective alternative for high-throughput coding and agent tasks.

2. Google Announces Antigravity 2.0 Desktop Platform and CLI

Google has unveiled Antigravity 2.0, converting its developer tools into a standalone desktop application. The ecosystem features a CLI for terminal-based operations, a developer SDK, and the Gemini Enterprise Agent Platform. Developers can leverage Managed Agents within the Gemini API to run agent executions in isolated, stateful Linux environments. The platform defaults to Gemini 3.5 Flash, allowing for rapid and parallel background tasks.

  • Includes a CLI and SDK for custom agent behaviors
  • Provides Managed Agents in Gemini API for isolated Linux environments
  • Gemini 3.5 Flash is the default model across the ecosystem
  • Supports multi-agent orchestration and parallel task execution

It provides a native, secure infrastructure for running multi-agent orchestrations with persistent state.

3. Anthropic Launches Self-Hosted Sandboxes and MCP Tunnels for Claude Agents

Anthropic has addressed a major enterprise security concern by introducing self-hosted sandboxes and MCP tunnels for Claude Managed Agents. This architecture cleanly separates the core agent logic (running on Anthropic's cloud infrastructure) from tool execution (running securely inside the developer's local environment). MCP tunnels allow agents to securely connect to private MCP servers without passing sensitive authentication tokens inside the LLM prompt context.

  • Self-hosted sandboxes currently in public beta
  • MCP tunnels in research preview
  • Separates agent loop from tool execution on local enterprise systems
  • Prevents exposure of authentication credentials in agent context

It solves the critical security risk of leaking API credentials in agent context windows during tool execution.

SOURCES

4. Supply-Chain Campaign Compromises Over 600 npm Packages Targeting AI Coding Agents

A sophisticated supply-chain attack on the npm registry has compromised over 600 versions across 323 unique packages, predominantly targeting the @antv visualization ecosystem and libraries like timeago.js. The payload, known as Mini Shai-Hulud, harvests highly sensitive developer credentials from local vaults, Kubernetes, and AWS. Crucially, the malware establishes persistence by hijacking configurations for Claude Code and Codex via injected startup hooks, while also modifying local VS Code tasks to re-execute on session startup.

  • Targeted @antv ecosystem including @antv/g2 and high-download libraries
  • Exfiltrates credentials for AWS, Kubernetes, HashiCorp Vault, and local password managers
  • Hijacks Claude Code and Codex via SessionStart hooks
  • Over 2,900 GitHub repositories generated by the campaign

Developers must immediately audit their dependencies to prevent malicious scripts from hijacking local coding assistants and stealing cloud keys.

SOURCES

5. Forge Reliability Layer Boosts Local 8B Model Tool-Calling to 99% Accuracy

Antoine Zambelli, AI Director at Texas Instruments, has released Forge, an open-source reliability layer designed for self-hosted LLM tool-calling. Forge implements robust error-recovery mechanisms, retry prompts, and step enforcement to protect local models from failing multi-step workflows. Additionally, it dynamically prevents out-of-memory errors by using nvidia-smi at startup to calculate strict token budgets based on available VRAM.

  • Brings Ministral 8B to 99.3% multi-step agentic accuracy
  • Prevents VRAM overflow by querying nvidia-smi for token budgets
  • Introduces ToolResolutionError exception class
  • Includes proxy server mode for OpenAI-compatible clients

It lets developers deploy small, cost-effective 8B local models for complex multi-step workflows without sacrificing reliability.

SOURCES

6. Claude Code Plugins Enable Bundled Agentic Subagents and Custom Skills

A deep-dive into Claude Code plugins shows that agent capabilities are structured around a central plugin.json manifest. These plugins can distribute custom slash commands, subagents with isolated context, and specific skills outlined in a SKILL.md file that models auto-invoke via descriptions. Currently, Claude Code and the open-source Qwen Code are the only major agents capable of utilizing this format.

  • Uses a directory with a plugin.json manifest
  • Skills are configured in markdown via a SKILL.md file
  • Allows bundling of auto-invoked skills, slash commands, and subagents
  • Supported by Claude Code and the open-source Qwen Code agent

It provides a concrete pattern for distributing and versioning custom agent capabilities across teams.

SOURCES

7. Developer Transitions Large-Scale Python Codebase to Local Qwen 3.6 35B

A developer building a Pygame project shared their transition from Claude Sonnet 3.5 to Qwen3.6-35B running locally with Ollama and Cline. Sonnet 3.5 reportedly struggled with codebase context limits and repeated bug resolutions. By deploying the 35B Qwen model at Q6_K quantization with a 250k context window on local hardware, the developer successfully debugged complex multi-module issues that the commercial APIs failed to resolve.

  • Developed Pygame project with 30,000 lines across 55 modules
  • Switched from Claude Sonnet 3.5 to Qwen3.6-35B-A3B-UD-Q6_K
  • Ran a 250k context window on custom local GPUs with 56 GB of VRAM
  • Avoided context length limits and excessive API costs of commercial models

It demonstrates that local open-weights setups are now viable alternatives to Claude Sonnet for maintaining large codebases.

SOURCES

8. Comparing Upstash, Supabase, and Neon for Agentic Developer Workflows

An analysis of backend databases for agent-driven software development highlights the distinct roles of Upstash, Supabase, and Neon. Neon excels in agent environments through copy-on-write database branching and scale-to-zero properties, resulting in over 80% of its databases being provisioned autonomously by AI agents. Upstash acts as a high-speed caching and rate-limiting tier on top of transactional databases like Supabase's PostgreSQL.

  • Over 80% of Neon databases are provisioned by AI agents
  • Neon offers compute-storage separation and copy-on-write database branching
  • Supabase free tier provides 50,000 MAU and 1GB storage
  • Upstash offers HTTP-based Redis caching and rate-limiting for serverless

Knowing which database architectures cater to AI agent environments helps optimize developer workflows and infrastructure costs.

SOURCES

9. Blueprint for Building a Multi-Role Agent Pipeline with OpenAI APIs

A newly published tutorial provides developers with a clear architecture for constructing advanced agentic systems using the OpenAI API. The workflow separates concerns into three distinct model roles: a planner that produces a structured JSON task plan, an executor that executes specific python tools, and a critic that reviews and refines output prior to finalizing. State tracking is managed robustly using an AgentState dataclass to log tool execution history and memory.

  • Organizes pipeline into planner, tool-using executor, and critic roles
  • Uses an AgentState dataclass to log goal, memory, and tool trace
  • Implements 4 tools: safe calculator, search, JSON extractor, and file writer
  • Utilizes structured JSON plans to direct execution flow

It provides a practical, production-ready design pattern for multi-step tasks with built-in error handling and self-critique.

SOURCES

10. NVIDIA Releases Fast Nemotron-Labs-Diffusion Language Models

NVIDIA has released the Nemotron-Labs-Diffusion language model family, designed with a novel tri-mode architecture that allows dynamic switching between autoregressive decoding, parallel diffusion decoding, and self-speculation. The open-weights family is available in 3B, 8B, and 14B sizes. Benchmarks indicate that the 8B parameter variant reaches 850 tokens per second on GB200 hardware, yielding a 3.3x speedup over traditional autoregressive models.

  • Supports autoregressive, diffusion parallel decoding, and self-speculation
  • Available in 3B, 8B, and 14B sizes on Hugging Face
  • Reaches 850 tokens per second on GB200 hardware at 8B parameters
  • Achieves 3x higher acceptance length than Qwen3-8B-Eagle3 in SGLang

These models provide extremely high-speed local inference options for cost-sensitive developers.

SOURCES

11. MiniCPM-V 4.6 Vision-Language Model Hits Hugging Face Trending

MiniCPM-V 4.6 has secured the number one spot on the Hugging Face Trending list, drawing attention for its high-efficiency vision-language processing. The model delivers fine-grained OCR, complex image reasoning, and multi-turn conversations while using only 2.5% of the token budget of comparable models. It is fully open-sourced with immediate support across popular runtimes like llama.cpp, vLLM, and Ollama.

  • Outperforms Gemma4-E2B-it and Qwen3.5-0.8B on key multimodal benchmarks
  • Uses only 2.5% of the token budget compared to Qwen3.5-0.8B
  • Supports SGLang, vLLM, llama.cpp, and Ollama out of the box
  • Optimized for mobile deployment and fine-tuning on consumer GPUs

Its small footprint, high OCR accuracy, and wide framework support make it ideal for local mobile and consumer GPU deployments.

SOURCES

12. Speculative Decoding and Precision Choices Unlock Local Qwen 3.6 27B Coding

A developer documented their success using local Qwen 3.6 27B at 16-bit precision to generate a fully functioning Pacman webpage clone, including a complex web audio synthesizer. Running the model on an Apple Silicon M2 Max with 96GB of RAM, the developer noted that 16-bit precision drastically outperformed 8-bit quantizations on reasoning-heavy code generation. By utilizing Multi-Token Prediction (MTP) speculative decoding, generation speeds improved from 6.6 to nearly 18 tokens per second.

  • Ran Qwen 3.6 27b F16 on Apple Silicon M2 Max with 96GB of RAM
  • MTP speculative decoding improved speeds from 6.6 to up to 18 tokens/sec
  • 16-bit precision showed significantly better results than 8-bit quantization
  • Implemented a custom Jinja chat template to improve agentic performance

It highlights the exact quantization and runtime configurations required to extract complex reasoning from local models on Apple Silicon.

SOURCES

13. Optimal Config for Running Qwen 3.6 27B on 16GB GPU VRAM

A real-world configuration guide showcases how to run the Qwen 3.6 27B model on a consumer graphics card with only 16GB of VRAM. By utilizing the Q3_K_S GGUF quantization and offloading 64 layers to the GPU, the developer maintained prompt evaluation speeds above 800 tokens per second. The setup achieves generation speeds over 50 tokens per second by pairing the model with draft-mtp speculative decoding and offloading the rarely used vision component entirely to the CPU.

  • Uses Qwen3.6-27B-Q3_K_S.gguf with 64 layers offloaded to the GPU
  • Utilizes draft-mtp for high-speed speculative decoding
  • Targeted over 50 tokens/sec generation and 800 tokens/sec prompt eval
  • Offloads the vision model to CPU to save GPU memory

It provides a real-world blueprint for deploying large reasoning models on consumer-grade hardware.

SOURCES

14. Developer Implements Bubblewrap Sandboxing After Agent Executes Command

While testing a command whitelist designed to let an agent run terminal commands, a developer encountered a worst-case scenario when the agent executed a destructive rm -rf / instruction. The incident resulted in system damage, highlighting the risk of letting agents run commands directly on host machines. The developer immediately integrated bubblewrap (bwrap) to guarantee isolated Linux execution environments for subsequent agent operations.

  • Agent executed 'rm -rf /' during bash command whitelist testing
  • Resulted in immediate system damage
  • Developer integrated bubblewrap (bwrap) for secure agent isolation

Running untrusted agent outputs without rigorous sandbox isolation risks complete system compromise.

SOURCES

15. BeeLlama Benchmarks Assess Precision and VRAM Savings in KV Cache Quantization

Benchmark tests conducted with BeeLlama v0.1.2 on an RTX 3090 provide key guidelines for setting up KV cache configurations. Testing Qwen 3.6 27B at context lengths up to 128k showed that asymmetric KV cache quantization (such as q5_0/q4_0) achieves far lower quality degradation than symmetric configurations of the same memory footprint. Additionally, while standard 4-bit quantization shows tail degradation, Turbo Cache Quantization (TCQ) successfully stabilizes extreme 2-bit and 3-bit cache compression.

  • Tested Qwen 3.6 27B model using BeeLlama v0.1.2 on an RTX 3090
  • Asymmetric KV quantization (q5_0/q4_0) outscores symmetric (q4_1/q4_1) at identical memory footprints
  • Turbo Cache Quantization (TCQ) offers major quality gains at 2-bit and 3-bit compression
  • Full symmetric q8_0/q8_0 quantization offers negligible benefits over q8_0/q5_0

Optimizing KV cache quantization allows developers to fit longer context windows into limited GPU VRAM.

SOURCES

16. Google Announces Gemini Spark with Third-Party App and MCP Integration

At Google I/O, Google announced Gemini Spark, an always-on agent designed to perform complex personal workflows like scheduling and billing analysis. Built on Gemini 3.5 Flash and the Antigravity agent harness, Spark supports deep system integrations using the Model Context Protocol (MCP) to interact with partners like Canva and Instacart. Crucially, the platform introduces the Agent Payments Protocol (AP2), providing a programmatic framework and approval process to allow AI agents to safely complete financial transactions within set spending limits.

  • Powered by Gemini 3.5 Flash and the Google Antigravity agent harness
  • Integrates Model Context Protocol (MCP) with over 30 partners including Canva and OpenTable
  • Employs Agent Payments Protocol (AP2) to allow agents to make secure purchases
  • Rolls out to trusted testers this week, with U.S. beta next week

The inclusion of MCP connections and transaction controls enables developers to integrate their services directly into consumer agent networks.

17. Google Debuts Natively Multimodal Gemini Omni Model Family

At the annual I/O conference, Google announced Gemini Omni, a natively multimodal model family that processes and generates content across text, images, audio, and video simultaneously. Designed with built-in physics awareness and contextual knowledge, the model allows users to generate and modify video content through conversational instructions. The roll-out begins with the Omni Flash model, which will expand to developers via the Vertex AI API in the near future.

  • Natively multimodal across video, image, audio, and text
  • Begins rollout with Gemini Omni Flash
  • Will be available to developers via Vertex AI APIs in the coming weeks
  • Incorporates mandatory SynthID watermarking and C2PA Content Credentials

It expands the boundaries of multimodal content generation and interactive video editing through simple conversational APIs.

18. Google and Partners Launch Universal Commerce Protocol for AI-Driven Shopping

Google has introduced the Universal Commerce Protocol (UCP) as an open standard for AI shopping, developed in partnership with major tech and retail leaders including Walmart, Shopify, Amazon, Stripe, and Salesforce. Working alongside this is the Agent Payments Protocol (AP2), which defines a structured digital paper trail and approval workflow for autonomous AI agent transactions. This lets agents manage cross-platform shopping carts, track price drops, and securely complete checkouts.

  • UCP developed in collaboration with Shopify, Walmart, Target, and Amazon
  • Features a 'Universal Cart' aggregating items across platforms
  • Agent Payments Protocol (AP2) provides secure approvals for autonomous purchases
  • Google does not charge a commission on Universal Cart sales

Standardized protocols allow developers to build agents that autonomously track prices, verify compatibility, and check out across different e-commerce platforms.

SOURCES

19. Google AI Edge Gallery Adds Gemma 4 Multi-Token Prediction and MCP Support

Google has released versions 1.0.13 and 1.0.14 of the AI Edge Gallery. These updates bring notable performance and compatibility enhancements, including support for Gemma 4 Multi-Token Prediction (MTP) and native optimization for Pixel TPUs. Developers can also take advantage of experimental Model Context Protocol (MCP) support, new skill modules, and automatic chat history storage.

  • Introduces support for Gemma 4 Multi-Token Prediction (MTP)
  • Adds native hardware support for Pixel TPUs
  • Includes experimental Model Context Protocol (MCP) support
  • Enables chat history saving and new skills features

It lets developers deploy high-speed local models and standard MCP tools directly to edge devices and mobile hardware.

SOURCES

20. A Structured Four-Part Framework for the AI SDLC

A proposed four-part AI Software Development Lifecycle (SDLC) details how to maintain large AI-generated codebases. The methodology utilizes visual regression tests analyzed via computer vision across mobile, desktop, iPad, and ultrawide resolutions to verify UI layouts. From there, the developer isolates hot paths with explicit logging, relies on aggressive continuous integration loops to handle backward-compatibility breakages, and implements human-in-the-loop steering to guide agents through subsequent bugs.

  • Part 1: Maintains ~50 tests using computer vision to check designs across 4 screen resolutions
  • Part 2: Refactors hot paths with isolation, logging, and error boundaries
  • Part 3: Allows breaking backward compatibility via continuous deploy/test loops
  • Part 4: Focuses on spot-checking deployed systems and steering the AI agent

It offers a concrete workflow pattern to maintain quality and avoid regressions when relying heavily on AI coding agents.

SOURCES

21. Actionable Workflows for Optimizing Agent Productivity in Codex

A guide to optimizing workflows with coding agents (termed "Codex-maxxing") shares strategies for managing long-running agent contexts. By utilizing thread compaction, developers can compress historical conversations to save context limits without losing core project details. Furthermore, storing an Obsidian vault inside a GitHub repository creates a durable shared memory system that developers can review and audit using standard git diffs.

  • Uses compaction to compress long threads while keeping context
  • Integrates Obsidian vault on GitHub for shared agent memory and diff reviews
  • Implements heartbeats to schedule recurring monitoring of Slack and PRs
  • Uses $browser, @chrome, and @computer tools for varying execution depths

Applying structured compaction, shared vaults, and automated execution loops increases the continuous productivity of coding agents.

SOURCES

22. Qwen 3.7 Preview Text and Vision Models Added to Chatbot Arena

The LMSYS Chatbot Arena has added preview versions of Alibaba’s upcoming Qwen 3.7 model family for testing. Early performance is promising, with Qwen 3.7 Max Preview making its debut at 13th overall in the Text Arena. Meanwhile, Qwen 3.7 Plus Preview has secured the 16th spot in the Vision Arena, offering developers an early look at the upcoming iteration of this popular open-weights line.

  • Qwen3.7 Max Preview is ranked 13th overall in the Text Arena
  • Qwen3.7 Plus Preview is ranked 16th overall in the Vision Arena
  • Models are available for evaluation across Text and Vision on Arena

Knowing where upcoming model variants rank helps developers plan their future LLM API and deployment selections.

SOURCES

23. Cursor Updates Coding Assistant with Composer 2.5

Cursor has launched Composer 2.5, introducing the latest iteration of its built-in coding agent. The update was trained using targeted reinforcement learning, synthetically generated training datasets, and newly designed distributed training techniques, aiming to provide smoother and more accurate contextual code suggestions directly in the editor.

  • Features Composer 2.5, an updated coding agent
  • Trained using targeted reinforcement learning and synthetic data
  • Employs new distributed training techniques

The update directly enhances the speed and accuracy of code generation in one of the most widely used developer IDEs.

SOURCES

24. Sapient Releases Low-Compute HRM-Text 1B Model

Sapient Inc. has released its HRM-Text model family, featuring a 1 billion parameter text generation model built on the novel HRM architecture. According to the release, the model requires 130-600x less compute and 150-900x less data compared to traditional foundation models. For teams looking to train specialized local models, the 1B variant can be trained across 16 H100 GPUs in roughly 46 hours for a total compute cost of $1,472.

  • Requires 130-600x less compute and 150-900x less data than traditional baselines
  • 1B parameter model can be trained on 16 H100 GPUs in 46 hours for $1,472
  • 0.6B version trains on 8 H100 GPUs in 50 hours for $800
  • Available on Hugging Face and GitHub

Extremely low resource requirements allow developers to quickly and cheaply fine-tune specialized text models on local hardware.

SOURCES

25. Google CodeMender Invites Experts to Test Code Vulnerability-Fixing API

Google has opened up API testing invitations to select security experts for CodeMender, its dedicated cybersecurity AI agent. Developed by Google DeepMind and first shown in October, the tool is built specifically to find and automatically repair vulnerabilities inside large code repositories. Google is actively positioning CodeMender to compete with security-focused models from rivals like OpenAI and Anthropic, initiating enterprise and public-sector pilot audits.

  • Designed to identify and fix security vulnerabilities in codebases
  • First debuted in October and developed by Google DeepMind
  • Positions CodeMender to compete with security offerings from Anthropic and OpenAI
  • Initiated discussions with government agencies and enterprises for system audits

Automating vulnerability scanning and remediation in active codebases improves deployment security with minimal engineering overhead.

SOURCES

26. Bytedance Releases Lance 3B Multimodal Model

Bytedance Research has released Lance, a lightweight, native unified multimodal model designed to handle image and video workflows. Despite its small footprint of 3 billion active parameters, the model handles both understanding and editing tasks within a single pipeline. The model was trained from scratch using a multi-task training sequence and is now publicly available on Hugging Face.

  • Supports image and video understanding, generation, and editing
  • Operates with 3B active parameters
  • Trained from scratch using a staged multi-task recipe on a 128-A100 budget
  • Available on Hugging Face

It offers an exceptionally lightweight open-source alternative for local multimodal applications running on modest hardware.

SOURCES

Daily AI signal in your inbox

5 minutes a day. Free, unsubscribe anytime.