Apple Unveils Siri AI and Foundation Models Framework at WWDC 2026

1. Apple Unveils Siri AI and Foundation Models Framework at WWDC 2026

At WWDC 2026, Apple announced a major overhaul of its AI ecosystem, introducing Siri AI, a rebuilt assistant powered by Google Gemini models. For developers, the most significant updates are in the expanded Foundation Models framework, which now supports image input, custom skills, and server-side execution. Apple is also giving indie developers with fewer than 2 million first-time App Store downloads access to its Foundation Models in Private Cloud Compute with no cloud API costs, significantly lowering the cost of AI experimentation. Additionally, Xcode's coding assistant has been upgraded to support agentic coding, app localization, and interaction with simulated devices.

• Apple introduced Siri AI, a rebuilt assistant utilizing Google Gemini models for advanced conversational and systemwide app-interaction capabilities.
• The updated Foundation Models framework now supports image input, custom skills, and server-side model execution.
• Developers with fewer than 2 million first-time App Store downloads can access Apple's Foundation Models in Private Cloud Compute with no cloud API costs.
• Xcode's coding assistant has been updated to handle app localization, interact with simulated devices, and support custom skills.
• Apple expanded App Intents support to allow third-party applications to integrate directly with Siri.

Developers can now build agentic workflows using Apple's updated Foundation Models framework, use custom skills in Xcode, and access Private Cloud Compute with no cloud API costs if they have fewer than 2 million first-time App Store downloads.

2. Xiaomi and TileRT Push 1-Trillion-Parameter MoE Model Past 1000 TPS

Xiaomi's MiMo team, in collaboration with the TileRT systems group, has released MiMo-V2.5-Pro-UltraSpeed, a high-speed serving mode for their 1-trillion-parameter Mixture-of-Experts (MoE) model. By combining MXFP4 quantization, DFlash speculative decoding, and the TileRT persistent engine runtime, the system achieves decoding speeds over 1000 tokens per second on a standard 8-GPU commodity node without relying on custom hardware like Cerebras or Groq. The team has open-sourced the model checkpoint on Hugging Face and released select TileRT modules on GitHub, while also offering a limited API trial.

• Xiaomi and TileRT released MiMo-V2.5-Pro-UltraSpeed, achieving decoding speeds exceeding 1000 tokens per second on a single 8-GPU commodity node.
• The performance is driven by MXFP4 quantization on MoE experts, DFlash speculative decoding, and the TileRT persistent engine runtime.
• DFlash speculative decoding uses block-level masked parallel prediction to achieve an average acceptance length of 6.30 in coding tasks.
• Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face and released select TileRT modules on GitHub.
• An application-based API trial is available from June 9 to June 23, 2026, priced at three times the standard MiMo-V2.5-Pro rate.

Developers can now run ultra-fast inference on a massive 1-trillion-parameter Mixture-of-Experts model using standard commodity hardware rather than specialized wafer-scale chips.
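
As a rough back-of-the-envelope check on how acceptance length relates to throughput, the sketch below assumes that each verification step of the target model emits, on average, the reported acceptance length of 6.30 tokens, and that the 1000 tokens-per-second figure applies to the same coding workload. Neither assumption is confirmed by the announcement, and the function names are illustrative only.

```python
def steps_per_second(tokens_per_second: float, acceptance_length: float) -> float:
    """Verification steps per second needed to sustain a given decode rate.

    Assumes (not confirmed by the source) that every verification step
    yields `acceptance_length` tokens on average.
    """
    return tokens_per_second / acceptance_length


def step_budget_ms(tokens_per_second: float, acceptance_length: float) -> float:
    """Wall-clock budget per verification step, in milliseconds."""
    return 1000.0 / steps_per_second(tokens_per_second, acceptance_length)


if __name__ == "__main__":
    rate = steps_per_second(1000.0, 6.30)
    budget = step_budget_ms(1000.0, 6.30)
    print(f"{rate:.1f} steps/s, {budget:.2f} ms per step")
```

Under those assumptions, the whole draft-and-verify cycle would have to finish in roughly 6.3 ms, which suggests why the quantization and runtime work matter alongside the speculative decoder.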

3. DeepSeek V4 Pro Outperforms GPT-5.5 Pro on Precision Benchmark

In a recent head-to-head benchmark matchup, DeepSeek V4 Pro outscored GPT-5.5 Pro 38.0 to 33.0, showing greater precision and reliability. On a Python log-redactor task, DeepSeek V4 Pro adhered strictly to the task's constraints, using a single regex and replacer to handle overlapping patterns. GPT-5.5 Pro handled the task less effectively, splitting the work across multiple regexes.

• DeepSeek V4 Pro defeated GPT-5.5 Pro in a benchmark matchup with a score of 38.0 to 33.0.
• The model demonstrated higher reliability and stricter adherence to constraints compared to GPT-5.5 Pro.
• On a Python log-redactor task, DeepSeek V4 Pro successfully used a single regex and replacer to handle overlapping patterns, whereas GPT-5.5 Pro split the work across multiple regexes.

Developers seeking high-precision code generation and strict constraint adherence have a highly competitive alternative to GPT-5.5 Pro.
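
The benchmark's actual task specification is not given here, so the following is only a minimal sketch of what a "single regex and replacer" redactor can look like in Python. The three rules (email, IPv4 address, token) and all names are hypothetical. The key idea is one alternation with named groups: the regex engine takes the leftmost match, so when two rules overlap the one that starts earlier (or is listed first at the same position) wins, and a single replacer function picks the label from the group that matched.

```python
import re

# One compiled pattern for every rule (hypothetical rule set). Alternation
# order breaks ties when two rules could match at the same position.
REDACT = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<ipv4>\b(?:\d{1,3}\.){3}\d{1,3}\b)"
    r"|(?P<token>\b(?:token|api_key)=\S+)"
)


def _replace(match: re.Match) -> str:
    """Single replacer: label the redaction with the name of the matching rule."""
    return f"[{match.lastgroup.upper()}]"


def redact(line: str) -> str:
    """Redact every sensitive span in one pass over the line."""
    return REDACT.sub(_replace, line)


if __name__ == "__main__":
    # "bob@10.0.0.1" overlaps the email and IPv4 rules; the email rule
    # starts earlier in the text, so the whole span is redacted once.
    print(redact("user bob@10.0.0.1 from 192.168.1.5 token=abc123"))
    # prints: user [EMAIL] from [IPV4] [TOKEN]
```

Running the rules as separate regexes in sequence would instead let a later pass rewrite text that an earlier pass had already matched or partially replaced, which is the failure mode a single-pass design avoids.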

4. xAI Releases grok-imagine-video-1.5-preview with Native Audio

xAI has released grok-imagine-video-1.5-preview, a new video generation model available via its API. The model supports image-to-video generation with native audio for durations up to 15 seconds, and is capable of complex stylistic transformations such as turning real-world images into anime-style animations. It currently ranks second in the Artificial Analysis Video Arena's Image to Video (With Audio) category, trailing only ByteDance's Seedance 2.0. The API service is priced at $8.40 per minute of generated video.

• xAI released grok-imagine-video-1.5-preview, an image-to-video generation model supporting native audio.
• The model generates videos up to 15 seconds long and is capable of stylistic transformations like anime-style rendering.
• It ranks #2 in the Artificial Analysis Video Arena's Image to Video (With Audio) category, trailing only ByteDance's Seedance 2.0.
• The API service is priced at $8.40 per minute of generated video.
• The model is currently available via xAI's API, with a rollout to the Grok app and X underway.

Developers can now programmatically generate high-quality, short-form videos with synchronized native audio via xAI's API.
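
For budgeting, the quoted rate can be turned into a per-clip estimate. The sketch below assumes billing is pro rata by the second, which the announcement does not state; the constant and function names are illustrative.

```python
PRICE_PER_MINUTE_USD = 8.40  # rate quoted above
MAX_CLIP_SECONDS = 15        # maximum clip duration quoted above


def clip_cost_usd(seconds: float) -> float:
    """Estimated cost of one clip, assuming pro-rata per-second billing."""
    if not 0 < seconds <= MAX_CLIP_SECONDS:
        raise ValueError(f"duration must be in (0, {MAX_CLIP_SECONDS}] seconds")
    return round(PRICE_PER_MINUTE_USD * seconds / 60.0, 2)


if __name__ == "__main__":
    print(clip_cost_usd(15))  # a full-length clip: 8.40 * 15 / 60 = 2.1
```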

5. Malicious Microsoft Packages Target AI Coding Agents in Supply-Chain Attack

In a sophisticated supply-chain attack, dozens of cryptographically verified open-source packages from Microsoft were compromised to include credential-stealing code. The malware, tracked as Miasma (a clone of the Mini Shai-Hulud toolkit), is specifically designed to be triggered by AI coding agents. The 28 KB payload harvests credentials from AWS, Azure, GCP, Kubernetes, password managers, and over 90 developer tool configurations, as well as OIDC tokens used in SLSA provenance attestation. GitHub has disabled 73 affected packages, and Microsoft has removed the repositories to investigate the breach, which occurred after a threat actor compromised Microsoft publishing credentials.

• Dozens of cryptographically verified open-source packages from Microsoft were compromised to include credential-stealing code.
• The malicious payload, tracked as Miasma, is triggered specifically by AI coding agents.
• GitHub disabled 73 malicious packages, and Microsoft acknowledged the compromise, removing the affected repositories.
• The 28 KB payload steals credentials from AWS, Azure, GCP, Kubernetes, password managers, and over 90 developer tool configurations.
• The threat actor, tracked as TeamPCP, bypassed build pipelines by compromising Microsoft publishing credentials.

Developers using AI coding assistants must audit their dependencies immediately, as compromised packages are specifically designed to trigger malicious payloads during automated agent execution.

6. LangSmith Launches Sandboxes for Secure Agent Execution

LangSmith has launched Sandboxes, a new feature providing hardware-virtualized microVMs designed to give AI agents a secure computing environment. Sandboxes allow agents to execute dynamic tasks, manage persistent state, and run complex workflows without compromising production infrastructure. This feature directly addresses the security risks of running untrusted, LLM-generated code by isolating execution within secure, lightweight virtual machines.

• LangSmith introduced Sandboxes, which are hardware-virtualized microVMs designed for AI agents.
• The Sandboxes provide a secure computing environment to execute dynamic tasks and run complex workflows.
• The feature allows agents to manage persistent state without compromising production infrastructure.
• Sandboxes are designed to mitigate the security risks associated with running untrusted code generated by LLMs.

Developers can safely allow AI agents to execute untrusted code and run complex workflows without risking production infrastructure.

7. Cursor Updates Design Mode with Direct Element Interaction

Cursor has rolled out an update to its Design Mode, enhancing how developers interact with running applications. The updated mode allows users to point, draw, and click directly on UI elements, as well as narrate desired changes. This visual-first approach makes it easier to prototype and iterate on front-end designs directly within the editor.

• Cursor updated its Design Mode to support pointing, drawing, and clicking on UI elements.
• The update enables users to narrate changes directly on a running product.
• The feature streamlines visual editing and front-end development workflows.

Developers can accelerate UI prototyping and front-end iterations by interacting visually with their running applications inside Cursor.

8. Intuned Launches Code-First Browser Automation Platform with Self-Healing AI

Intuned (YC S22) has launched a code-first platform designed for building, deploying, and maintaining browser automations for websites lacking APIs. Developers write automations using Playwright-based TypeScript or Python, while Intuned's managed runtime handles infrastructure tasks like authentication, session reuse, and concurrency. To address the fragility of web scraping, the platform integrates an AI agent built on the Claude Agent SDK that automatically detects failures, analyzes execution traces, and deploys self-healing fixes when website structures change.

• Intuned is a code-first platform for building, deploying, and maintaining browser automations using Playwright-based TypeScript or Python.
• The platform provides a managed runtime that handles authentication, session reuse, scheduling, and concurrency.
• An integrated AI agent, built on the Claude Agent SDK, assists in creating automations and automatically proposes or deploys fixes when failures are detected.
• Intuned captures execution context (logs, traces, parameters) to facilitate debugging and AI-assisted repairs.
• A Web Task API allows programmatic access to the platform's infrastructure and agent capabilities.

Developers can build robust web scrapers and browser automations that automatically heal when target website structures change, reducing maintenance overhead.

9. OpenEnv Transitions to Open-Source Agentic Execution Environment

OpenEnv, a tool designed for creating agentic execution environments like terminals and browsers, is transitioning to an open-source model. The project will be governed by a committee featuring members from Meta-PyTorch, Unsloth, Modal, Prime Intellect, Nvidia, Hugging Face, and others. OpenEnv provides a standardized environment for training and running AI agents, and has already seen adoption and support from major organizations including the PyTorch Foundation, vLLM, Lightning AI, and Scale AI.

• OpenEnv is a tool designed for creating agentic execution environments such as terminals and browsers.
• The project is transitioning to an open-source model governed by a committee.
• Committee members include representatives from Meta-PyTorch, Reflection, Unsloth, Modal, Prime Intellect, Nvidia, Mercor, Fleet AI, and Hugging Face.
• The project has been adopted and supported by organizations including the PyTorch Foundation, vLLM, SkyRL, Lightning AI, and Scale AI.

Developers building AI agents gain access to a standardized, open-source sandbox environment supported by PyTorch, Hugging Face, Unsloth, and Modal.

10. Amazon Bedrock Optimizes Console for Anthropic and OpenAI APIs

Amazon Bedrock has introduced a redesigned console optimized specifically for Anthropic- and OpenAI-compatible APIs. The new console includes a comprehensive model catalog, project-based workflows, and live documentation that automatically generates code snippets. Available across multiple AWS Regions, the update is designed to streamline the process of evaluating models and deploying them into production environments.

• Amazon Bedrock launched a new console optimized for Anthropic- and OpenAI-compatible APIs.
• The console features a comprehensive model catalog, project-based workflows, and live documentation with automatic code snippets.
• The tool is available in multiple AWS Regions to simplify the transition from evaluation to production.

Developers can more easily evaluate, deploy, and transition models to production within AWS using standardized API formats.

11. OpenAI Introduces Lockdown Mode to Prevent Prompt Injection

OpenAI has introduced a new security feature called Lockdown Mode, designed to mitigate the risk of prompt injection attacks originating from untrusted external content and webpages. When enabled, Lockdown Mode disables high-risk dynamic features including live browsing, web image retrieval, deep research, and agent mode. The feature maintains core functionality for cached content and image generation, allowing users to safely interact with external data.

• OpenAI introduced Lockdown Mode to reduce the risk of prompt injection attacks from external content and webpages.
• The mode disables live browsing, web image retrieval, deep research, and agent mode.
• It maintains core functionality for cached content and image generation while active.

Developers and enterprise users can secure their LLM interactions against malicious external content by selectively disabling high-risk dynamic features.

12. Google Research Introduces Agentic RAG for Multi-Hop Queries

Google Research has introduced an agentic RAG framework integrated into the Gemini Enterprise Agent Platform, now available in public preview. The framework powers a new Cross-Corpus Retrieval feature designed for complex, multi-hop enterprise queries. It utilizes a multi-agent architecture where a specialized "Sufficient Context Agent" iteratively identifies missing information and logs gaps to ensure complete context before generating a response. Google reports that this approach improves factuality accuracy by up to 34% compared to standard RAG systems, while keeping latency overhead within 3% of single-corpus setups.

• Google Research introduced an agentic RAG framework featuring a new Cross-Corpus Retrieval capability in public preview.
• The framework uses a multi-agent architecture including an Orchestrator, Planner, Query Rewriter, Search Fanout, Sufficient Context, and Synthesis Agent.
• The Sufficient Context Agent enables iterative searching by identifying missing information and logging gaps before generating a response.
• The system improved factuality accuracy by up to 34% compared to standard RAG systems, achieving 90.1% accuracy on the FramesQA benchmark.
• Latency for cross-corpus retrieval remained within 3% of single-corpus settings during testing.

Developers can build more reliable enterprise search systems with up to 34% higher factuality accuracy for complex, multi-hop queries.
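
The iterative "check for gaps, then search again" loop described above can be illustrated with a toy example. This is not Google's implementation: the agent roles are reduced to plain functions, keyword matching stands in for both retrieval and the LLM-based sufficiency check, and every name and data item is hypothetical. The example question is "Who owns the budget of Alice's manager?", which needs two hops across two corpora.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy corpora standing in for separate enterprise data stores (hypothetical).
CORPORA: Dict[str, Dict[str, str]] = {
    "hr": {"manager of alice": "Alice reports to Bob."},
    "finance": {"budget owner bob": "Bob owns the Q3 travel budget."},
}

# Each required fact and the phrase that counts as evidence for it.
REQUIRED: Dict[str, str] = {"manager": "reports to", "budget": "budget"}


def search_fanout(query: str) -> List[str]:
    """Run one query against every corpus and pool the hits."""
    words = query.split()
    return [text for docs in CORPORA.values()
            for key, text in docs.items()
            if all(word in key for word in words)]


def find_gaps(context: List[str]) -> Set[str]:
    """Stand-in sufficiency check: required facts with no evidence yet."""
    joined = " ".join(context).lower()
    return {fact for fact, phrase in REQUIRED.items() if phrase not in joined}


def rewrite_query(gap: str, context: List[str]) -> Optional[str]:
    """Form a query for a gap, or None if an earlier hop is still missing."""
    if gap == "manager":
        return "manager of alice"
    for text in context:  # the budget hop needs the manager's name first
        if "reports to" in text:
            return "budget owner " + text.rstrip(".").split()[-1].lower()
    return None


def gather_context(max_rounds: int = 3) -> Tuple[List[str], List[Set[str]]]:
    """Search, check for gaps, log them, and repeat until context suffices."""
    context: List[str] = []
    gap_log: List[Set[str]] = []
    for _ in range(max_rounds):
        gaps = find_gaps(context)
        if not gaps:
            break
        gap_log.append(gaps)
        for gap in sorted(gaps):
            query = rewrite_query(gap, context)
            if query is not None:
                context.extend(search_fanout(query))
    return context, gap_log


if __name__ == "__main__":
    context, gap_log = gather_context()
    print(context)
    print(gap_log)
```

In this toy run the first round can only resolve the manager, the second round uses that result to resolve the budget owner, and the third round finds no gaps and stops, so a final synthesis step would only ever see complete context.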

13. Luce Spark Runs 35B MoE Models on 16GB GPUs Without Offload Penalties

The open-source project Luce Spark has been released under the Apache 2.0 license, offering a way to run 33-35B Mixture-of-Experts (MoE) models, such as Qwen3.6 35B-A3B, on consumer-grade 16GB GPUs. Instead of paying a heavy offload tax, Spark keeps active experts on the GPU and swaps others from system RAM using a bounded asynchronous cache. The system dynamically self-tunes expert placement based on live routing data, achieving roughly 100 tokens per second (about 85% of the performance of an all-GPU configuration) without requiring offline calibration.

• Luce Spark is an Apache 2.0 licensed open-source project that reduces VRAM requirements for 33-35B MoE models to under 16 GiB.
• The system keeps active experts on the GPU while swapping inactive ones from system RAM using a bounded asynchronous cache.
• Spark self-tunes expert placement based on live routing data, eliminating the need for offline calibration.
• The system achieves approximately 100 tokens per second at 60% residency, compared to 119 tokens per second for full GPU residency.
• The project is available on GitHub but currently lacks extensive testing on physical 16 GB hardware.

Developers can self-host and run larger, highly capable MoE models on consumer-grade 16GB GPUs with only a minor performance trade-off.
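
To make the "bounded cache of resident experts" idea concrete, here is a minimal sketch. It is not Spark's code: the least-recently-used eviction policy is an assumption chosen for simplicity, the cache is synchronous rather than asynchronous, and the class and method names are hypothetical. A hit means the expert is already on the GPU; a miss stands in for a copy from system RAM.

```python
from collections import OrderedDict
from typing import Iterable


class ExpertCache:
    """Bounded set of GPU-resident experts with LRU eviction (assumed policy)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.resident: "OrderedDict[int, bool]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id: int) -> None:
        """Record one routing decision for the given expert."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # would trigger a RAM-to-GPU copy in a real system
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[expert_id] = True

    def replay(self, trace: Iterable[int]) -> float:
        """Feed a routing trace through the cache and return the hit rate."""
        for expert_id in trace:
            self.fetch(expert_id)
        return self.hits / (self.hits + self.misses)


if __name__ == "__main__":
    cache = ExpertCache(capacity=2)
    print(cache.replay([0, 1, 0, 2, 0, 1]), list(cache.resident))
```

For reference, the reported figures of roughly 100 versus 119 tokens per second work out to 100 / 119 ≈ 0.84 of full-residency throughput.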

14. Gemma 4 Performance Nearly Doubles on Consumer GPUs via QAT and MTP

Recent optimizations combining Quantization Aware Training (QAT) and Multi-Token Prediction (MTP) have significantly improved local LLM performance on GPUs with 24GB of VRAM or less. Support for Gemma 4 MTP was recently merged into llama.cpp (starting with release b9551), lifting Gemma 4 31B performance from 40 to 70-80 tokens per second on an NVIDIA RTX 3090. Additionally, developers are implementing MTP support for smaller Gemma models to target low-power hardware like mobile devices and Raspberry Pi.

• Gemma 4 31B performance increased from 40 to 70-80 tokens per second on an NVIDIA RTX 3090 GPU.
• Multi-Token Prediction (MTP) support for Gemma 4 was merged into llama.cpp starting with release b9551.
• Testing on a 26B model showed a 1.26x speedup (from 143 to 180 tokens per second) using MTP with an n-max of 1.
• The llama.cpp project is also implementing MTP support for tiny Gemma models targeted at low-power hardware like Raspberry Pi and mobile devices.
• The performance gains are driven by a combination of Quantization Aware Training (QAT) and MTP.

Developers running local models can achieve a roughly 1.75x to 2x speedup on consumer-grade hardware like the RTX 3090.
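
The speedup factors follow directly from the reported tokens-per-second figures. The short sketch below simply redoes that arithmetic; the helper name is illustrative.

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of new throughput to old throughput."""
    return after_tps / before_tps


if __name__ == "__main__":
    # 31B model on an RTX 3090: 40 -> 70-80 tokens per second
    print(speedup(40, 70), speedup(40, 80))  # 1.75 2.0
    # 26B model: 143 -> 180 tokens per second
    print(round(speedup(143, 180), 2))       # 1.26
```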

15. Compiling llama.cpp with Custom Flag Saves 1.5 GB VRAM

Developers running local models via llama.cpp can reclaim up to 1.5 GB of VRAM by compiling the project with a custom flag. By default, llama.cpp enables pipeline parallelism when offloading all layers to the GPU, allocating four compute buffer copies in VRAM (GGML_SCHED_MAX_COPIES=4). However, testing shows that this default configuration provides no inference speed benefit over a single copy. Compiling with -DGGML_SCHED_MAX_COPIES=1 prevents this extra allocation, saving significant VRAM and preventing the bloat from negating savings achieved through context cache quantization.

• By default, llama.cpp enables pipeline parallelism when offloading all model layers to the GPU, allocating four compute buffer copies in VRAM.
• Compiling llama.cpp with the -DGGML_SCHED_MAX_COPIES=1 option prevents the allocation of extra compute buffers.
• Testing indicates that pipeline parallelism with four copies provides no inference speed benefit compared to using one copy or disabling it entirely.
• The default four-copy configuration consumed an additional 1.5 GB of VRAM, partially negating VRAM savings from context cache quantization.
• The testing was conducted on a mixed AMD Radeon RX 6800 XT and RX 6700 XT setup.

Developers running local models can reclaim up to 1.5 GB of VRAM on multi-GPU or fully GPU-offloaded setups without sacrificing inference speed.
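
As a sketch of how the flag fits into a build, the helper below assembles CMake command lines without running them. Only the -DGGML_SCHED_MAX_COPIES=1 definition comes from the report above; the build directory name and the helper functions are illustrative, and any GPU backend options your hardware needs would have to be added to the configure step.

```python
from typing import List


def configure_args(build_dir: str = "build", max_copies: int = 1) -> List[str]:
    """CMake configure command that overrides GGML_SCHED_MAX_COPIES.

    Intended to be run from the root of a llama.cpp checkout. GPU backend
    options are deliberately omitted and must be added for your hardware.
    """
    if max_copies < 1:
        raise ValueError("max_copies must be at least 1")
    return ["cmake", "-B", build_dir, f"-DGGML_SCHED_MAX_COPIES={max_copies}"]


def build_args(build_dir: str = "build") -> List[str]:
    """CMake build command for the configured tree."""
    return ["cmake", "--build", build_dir, "--config", "Release"]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    print(" ".join(configure_args()))
    print(" ".join(build_args()))
```

Printing the commands first makes it easy to append backend-specific options before running them, and to compare VRAM use against a default build.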
