Microsoft Launches MAI Model Family Led by MAI-Thinking-1 Reasoning Model

1. Microsoft Launches MAI Model Family Led by MAI-Thinking-1 Reasoning Model

At Build 2026, Microsoft unveiled seven new in-house AI models, signaling a shift toward proprietary model development. The flagship, MAI-Thinking-1, is a reasoning model with 35 billion active parameters and a 128K context window; it was trained from scratch on clean data and matches leading models on software engineering benchmarks. For developers, the release also introduces MAI-Code-1-Flash, which is integrated directly into VS Code and GitHub Copilot, and MAI-Transcribe-1.5, a speech-to-text model available on Microsoft Foundry that runs at roughly 276x real-time.

• MAI-Thinking-1 is a medium-sized reasoning model with 35 billion active parameters and a 128K context window, trained from scratch on clean data.
• MAI-Code-1-Flash is an inference-efficient coding model integrated into GitHub Copilot and Visual Studio Code.
• MAI-Transcribe-1.5 is a speech transcription model achieving a speed factor of approximately 276x real-time and a 2.4% Word Error Rate on the Artificial Analysis leaderboard.
• MAI-Transcribe-1.5 is priced at $6 per 1,000 minutes of audio via Microsoft Foundry and supports 43 languages.
• Other models include MAI-Image 2.5 for text-to-image and MAI-Voice-2 with 15 new languages.

Developers get access to a new suite of specialized, in-house Microsoft models integrated directly into tools like GitHub Copilot and Visual Studio Code.
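
The MAI-Transcribe-1.5 figures translate into a quick budget estimate. The sketch below is only back-of-the-envelope arithmetic: it assumes billing scales linearly with audio length (the source does not mention minimums or rounding) and that the roughly 276x speed factor holds for an entire job. The function names are illustrative, not part of any Microsoft SDK.

```python
PRICE_PER_1000_MIN_USD = 6.0  # list price from the announcement
SPEED_FACTOR = 276.0          # approximate multiple of real-time


def transcription_cost_usd(audio_minutes: float) -> float:
    """Estimated cost, assuming strictly linear billing (an assumption)."""
    return audio_minutes * PRICE_PER_1000_MIN_USD / 1000.0


def processing_seconds(audio_minutes: float) -> float:
    """Estimated processing time at the reported speed factor."""
    return audio_minutes * 60.0 / SPEED_FACTOR


if __name__ == "__main__":
    minutes = 600  # ten hours of recordings
    print(f"cost: ${transcription_cost_usd(minutes):.2f}")
    print(f"time: {processing_seconds(minutes):.0f} s")
```

Under these assumptions, ten hours of audio costs a few dollars and processes in a couple of minutes, excluding upload and queueing time.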

SOURCES

[1] [2] [3] [4] [5] [6]

2. Alibaba Releases Qwen3.7-Plus Multimodal Agent Model

Alibaba's Qwen team has released Qwen3.7-Plus, a closed-source multimodal model optimized for agentic workflows. The model integrates vision and language capabilities to act as a hybrid agent that can blend GUI and CLI interactions. It features a 1M token context window with 256K tokens dedicated to internal chain-of-thought, supported by a new preserve_thinking API parameter to maintain reasoning state across turns. Standard input is priced at $0.40 per million tokens, with cached reads at $0.04 per million.

• Qwen3.7-Plus is priced at $0.40 per million input tokens and $0.04 per million cached read tokens.
• The model features a 1-million token context window, including 256K tokens for internal chain-of-thought processing.
• A new 'preserve_thinking' API parameter lets developers keep the model's reasoning state across multi-turn conversations.
• The model is proprietary, closed-source, and accessible via Alibaba Cloud's Bailian platform (Model Studio) international endpoints.
• It scored 70.3 on Terminal Bench 2.0-Terminus and 79.0 on ScreenSpot Pro, ranking 16th overall in the LM Arena Vision Arena.

Developers can build complex, multimodal agent loops combining visual and command-line interfaces at a significantly lower cost than previous models.
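
For agent loops that resend a long prefix every turn, the gap between the standard and cached input rates dominates the bill. A rough sketch of that arithmetic follows, assuming every repeated prefix token is billed at the cached-read rate after the first turn (the source does not describe cache eligibility rules) and ignoring output tokens, whose price is not given above. The helper name is hypothetical.

```python
INPUT_USD_PER_M = 0.40        # standard input tokens
CACHED_READ_USD_PER_M = 0.04  # cached read tokens


def input_cost_usd(fresh_tokens: int, cached_tokens: int = 0) -> float:
    """Input-side cost only; output pricing is not covered by the source."""
    total = fresh_tokens * INPUT_USD_PER_M + cached_tokens * CACHED_READ_USD_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    turns, prefix = 20, 200_000  # hypothetical agent loop
    uncached = input_cost_usd(turns * prefix)
    # Assumption: the first turn is billed fresh, the other 19 hit the cache.
    cached = input_cost_usd(prefix, (turns - 1) * prefix)
    print(f"no cache:   ${uncached:.3f}")
    print(f"with cache: ${cached:.3f}")
```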

SOURCES

[1] [2] [3]

3. Microsoft Introduces MXC OS-Level Sandbox for AI Agents

Microsoft has introduced Microsoft Execution Containers (MXC), a security layer designed to safely run AI agents on Windows. MXC allows developers to define access boundaries that are enforced at the OS kernel level at runtime, isolating agent execution from the user's desktop, clipboard, and input devices to prevent UI spoofing and input injection. The system supports a range of sandboxing options, from lightweight process isolation to micro-VMs and full cloud instances, and is currently available in early preview with SDK support.

• MXC provides a policy-driven execution layer for AI agents enforced at the Windows OS kernel level at runtime.
• The sandbox spectrum ranges from lightweight process isolation to micro-virtual machines and full cloud instances.
• MXC isolates agent execution from the user's desktop, clipboard, UI, and input devices.
• Every agent is bound to a local or Microsoft Entra-backed identity for auditability and governance.
• Partners including OpenAI, Nvidia, Manus, Nous Research, and OpenClaw are integrating MXC into their frameworks.
• MXC is currently available in early preview for developers to test containment policies and build against the SDK.

Developers can build and run autonomous agents with strict, kernel-enforced security boundaries, mitigating risks like UI spoofing and input injection.

SOURCES

[1]

4. Microsoft Releases ASSERT Open-Source AI Evaluation Framework

Microsoft has open-sourced ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), a framework designed to simplify AI application testing. ASSERT uses AI to translate natural-language descriptions of goals, policies, or constraints into structured, scored test cases. During execution, the framework records the target system's intermediate actions and tool calls, allowing developers to inspect failure points and enforce custom agent policies using portable policy files.

• ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) is an open-source framework.
• It converts natural-language descriptions of goals, policies, or behaviors into structured, scored tests.
• The framework generates problem scenarios, runs them against a target system, and records intermediate actions and tool calls.
• Developers can customize evaluations by providing system context, tools, and constraints.
• It enables teams to define custom agent policies using portable policy files.

Developers can easily spin up repeatable regression tests and behavior evaluations for their AI applications using plain text.

SOURCES

[1] [2] [3]

5. Mistral Releases Open-Source Search Toolkit in Public Preview

Mistral has released its Search Toolkit in public preview. The open-source framework is designed to streamline production AI pipelines by unifying data ingestion, retrieval, and evaluation within a single, shared interface, making it easier for developers to build and benchmark RAG systems.

• Search Toolkit is released in public preview.
• It is an open-source framework.
• It unifies data ingestion, retrieval, and evaluation within a shared interface.

The toolkit provides a standardized, open-source interface for building and evaluating production RAG pipelines.

SOURCES

[1]

6. Perplexity Introduces Search as Code SDK for Agentic Search

Perplexity has introduced Search as Code (SaC), an SDK designed to modernize search architectures for AI applications. SaC shifts away from monolithic search systems by giving AI models direct control over the search process, allowing them to dynamically configure search pipelines tailored to specific tasks. Perplexity reports that this agentic search approach improves performance and cost-efficiency, outperforming competitors on complex benchmarks like WANDR.

• Search as Code (SaC) provides an SDK for direct model control over the search process.
• Models can configure search pipelines tailored to specific tasks.
• SaC has outperformed competitors in benchmarks, particularly in complex tasks like WANDR.
• Perplexity positions SaC as a robust, cost-effective agentic search capability.

Developers can build highly customized, task-specific search pipelines controlled directly by LLMs, an approach Perplexity reports outperforms monolithic search systems.

SOURCES

[1]

7. TinyFish Launches BigSet Open-Source Multi-Agent Data Extraction System

TinyFish has released BigSet, an open-source, AGPL-3.0 licensed multi-agent system designed to generate structured datasets from natural-language descriptions. Operating via Docker, BigSet uses a two-tier architecture where a schema inference model defines the data structure and an orchestrator coordinates parallel sub-agents to extract data. The system features scheduled refreshes, exports to CSV/XLSX, and secures against prompt injection by isolating the dataset ID in a JavaScript closure.

• BigSet is an open-source multi-agent system licensed under AGPL-3.0.
• It uses a two-tier agent architecture: a schema inference model defines the structure, and an orchestrator agent manages parallel sub-agents for extraction.
• The system supports scheduled dataset refreshes (from 30 minutes to weekly) and exports to CSV or XLSX.
• It is self-hosted via Docker and requires API keys for TinyFish, OpenRouter, and Clerk.
• Prompt injection is prevented by capturing the dataset ID in a JavaScript closure inaccessible to the LLM.

Developers can deploy a self-hosted, parallelized agent system to extract and refresh structured data from web sources with built-in prompt injection protection.
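
The closure technique is easy to illustrate. BigSet does this in JavaScript; the sketch below is a Python analogue written for this digest, not BigSet's actual code, and every name in it (`DATASETS`, `make_insert_tool`, the argument shape) is hypothetical. The point is that the dataset ID is bound when the tool is created, so nothing the model puts in its tool-call arguments can redirect the write.

```python
from typing import Callable, Dict, List

# In-memory stand-in for a datastore (hypothetical, for illustration only).
DATASETS: Dict[str, List[dict]] = {"ds_alice": [], "ds_bob": []}


def make_insert_tool(dataset_id: str) -> Callable[[dict], str]:
    """Return a tool whose target dataset is fixed at creation time."""

    def insert_row(arguments: dict) -> str:
        # Only the 'row' field is read from model-supplied arguments.
        # Any 'dataset_id' the model tries to pass is simply ignored.
        row = arguments.get("row", {})
        DATASETS[dataset_id].append(row)
        return "inserted 1 row"

    return insert_row


tool = make_insert_tool("ds_alice")
# A prompt-injected call that tries to redirect the write to another dataset:
tool({"dataset_id": "ds_bob", "row": {"name": "example"}})
print(DATASETS)
```

The row lands in `ds_alice` and `ds_bob` stays empty, because the model never had a parameter that selects the dataset.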

SOURCES

[1]

8. Microsoft Unveils Surface RTX Spark Dev Box for Local AI Development

Microsoft has announced the Surface RTX Spark Dev Box, a miniature desktop PC designed to shift AI workloads from cloud-based per-token pricing to local, fixed-cost hardware. Powered by Nvidia's Blackwell-architecture Arm-based RTX Spark processor, the device delivers one petaflop of AI compute and features 128GB of unified memory, allowing developers to run models exceeding 120 billion parameters locally. The unit runs Windows 11 Pro and comes pre-configured with essential developer tools including WSL 2, VS Code, and Python.

• The Surface RTX Spark Dev Box features Nvidia's Blackwell-architecture RTX Spark processor and 128GB of unified memory.
• It is rated at one petaflop of AI compute and designed to run models exceeding 120B parameters locally.
• The device features a 100-watt thermal envelope with a passively cooled, metal 3D-printed chassis acting as a heatsink.
• It ships with Windows 11 Pro pre-configured with WSL 2, VS Code, GitHub Copilot, Git, Python, and Node.js.
• It will be available in the US via Microsoft.com later in 2026; pricing has not yet been disclosed.

Developers get a dedicated local hardware option capable of running models exceeding 120 billion parameters, avoiding cloud per-token costs.
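
Whether a 120B-parameter model fits in 128GB depends heavily on weight precision, which the announcement does not specify. The following is a rough weights-only estimate under assumed precisions of 16, 8, and 4 bits per weight; it ignores KV cache, activations, and memory used by the OS, all of which reduce real headroom.

```python
UNIFIED_MEMORY_GB = 128  # from the announcement


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed precisions, not stated in the source
        gb = weight_memory_gb(120, bits)
        verdict = "fits" if gb < UNIFIED_MEMORY_GB else "does not fit"
        print(f"120B at {bits}-bit: {gb:.0f} GB of weights -> {verdict}")
```

By this estimate, 16-bit weights are out of reach and 8-bit weights leave very little room, so running models of this size locally most likely implies quantization to around 4 bits.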

SOURCES

[1] [2]

9. OpenAI Updates Codex Platform with Sites and Role-Specific Plugins

OpenAI has rolled out a significant update to its Codex agentic AI platform, introducing features aimed at streamlining document editing and application creation. The update features 'Sites,' a tool in preview for Business and Enterprise tiers that converts static text or data into interactive, OpenAI-hosted web applications. Additionally, a new 'Annotations' tool enables localized editing of specific document sections to prevent full-document regeneration, while six new role-specific plugins bundle integrations for tasks like data analytics and product design.

• The new Sites feature (in preview for Business/Enterprise) converts static data or text documents into functional internal web applications hosted by OpenAI.
• The Annotations tool allows for localized editing of documents (spreadsheets, slides) without full-document regeneration.
• Six role-specific plugin categories aggregate 62 business applications and 110 automated skills (e.g., data analytics, product design).
• Features are accessible via the Codex CLI and desktop app under OpenAI's proprietary enterprise licensing.
• Pricing is integrated into existing subscription tiers ($20/month Plus, $100/month Pro) or a seat-free pay-as-you-go model.

Developers and enterprise teams can build and share interactive, hosted web applications directly from static data and documents.

SOURCES

[1] [2] [3]

10. Cursor Increases Teams Limits and Introduces Premium Seat

Cursor has announced updates to its Teams plan, expanding usage limits to better accommodate intensive development workflows. The company has introduced a new Premium seat tier tailored specifically for heavy agent users, alongside new administrative spending controls to help teams manage and monitor their usage costs.

• Cursor increased usage limits for its Teams plan.
• It introduced a new Premium seat designed for heavy agent users.
• It added new spending controls for administrators to manage usage.

Teams using Cursor for heavy agentic coding workflows get higher limits and better cost-management controls.

SOURCES

[1]

11. Kapa Shares Indexing-Time Image Processing Strategy for RAG

Kapa has shared details on its image indexing strategy for RAG systems handling technical documentation. Rather than performing expensive multimodal processing at query time, which often hits payload limits, Kapa uses a vision model to describe images at indexing time and stores the captions as separate text chunks. A zero-shot classifier filters out non-essential images first. Kapa reports that the approach keeps per-query overhead at 1% to 6% relative to a text-only system and achieved 94% to 99% correct image placement in early customer projects.

• Kapa processes images at indexing time using a vision model to generate text descriptions, rather than processing images at query time.
• This method keeps per-query overhead at 1% to 6% relative to a text-only system and avoids query-time payload limits.
• A zero-shot classifier is used to filter out non-essential images like logos and banners.
• Caption quality is improved by providing the vision model with surrounding text context.
• Storing captions as separate chunks is more cost-effective than embedding them inline.

The approach gives developers a cost-effective architectural pattern for building RAG systems over image-heavy technical documentation.
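
The indexing-time pattern can be sketched in a few lines. This is an illustration of the idea as described above, not Kapa's implementation: the `Chunk` and `Image` types, `index_page`, and the two stub callables standing in for the zero-shot classifier and the vision model are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Image:
    url: str
    surrounding_text: str  # nearby prose, passed to the captioner as context


@dataclass
class Chunk:
    doc_id: str
    kind: str  # "text" or "image_caption"
    content: str


def index_page(
    doc_id: str,
    text: str,
    images: List[Image],
    is_essential: Callable[[Image], bool],
    describe: Callable[[Image, str], str],
) -> List[Chunk]:
    """Caption images once at indexing time; store captions as separate chunks."""
    chunks = [Chunk(doc_id, "text", text)]
    for img in images:
        if not is_essential(img):  # drop logos, banners, and similar
            continue
        caption = describe(img, img.surrounding_text)
        chunks.append(Chunk(doc_id, "image_caption", f"[image: {img.url}] {caption}"))
    return chunks


# Stubs standing in for a zero-shot classifier and a vision model.
def stub_is_essential(img: Image) -> bool:
    return "logo" not in img.url


def stub_describe(img: Image, context: str) -> str:
    return f"Diagram illustrating: {context}"


page_chunks = index_page(
    "install-guide",
    "Connect the device as shown.",
    [Image("/img/logo.png", ""), Image("/img/wiring.png", "wiring of the sensor")],
    stub_is_essential,
    stub_describe,
)
```

At query time the retriever sees only text chunks, so no image bytes travel with the request.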

SOURCES

[1]

12. Microsoft Announces Microsoft IQ and Rayfin SDK for Fabric

To address data silos and context challenges in enterprise AI, Microsoft has announced Microsoft IQ and the open-source Rayfin SDK. Microsoft IQ unifies context across Work, Foundry, Fabric, and Web sources. The Rayfin SDK and CLI allow developers to deploy agent-built applications directly to Microsoft Fabric, routing application data into Microsoft OneLake to maintain a governed production backend.

• Microsoft IQ unifies four context sources: Work IQ, Foundry IQ, Fabric IQ, and Web IQ.
• Rayfin is an open-source SDK and CLI designed to deploy agent-built applications directly to Microsoft Fabric.
• Rayfin routes application data into Microsoft OneLake to prevent data silos outside the organization's context layer.
• Ontologies within Fabric IQ are expected to reach general availability in the coming months.

Developers building enterprise agents can prevent data silos by routing application data directly into a governed OneLake backend.

SOURCES

[1]

13. OpenAI Cookbook Guides Production Workflows on Amazon Bedrock

The OpenAI Cookbook has published a new guide on building production workflows with OpenAI models hosted on Amazon Bedrock. The tutorial uses the Responses API to demonstrate structured outputs, tool calling, and file inputs, and covers operational best practices such as state management and prompt caching.

• The guide demonstrates building production workflows using OpenAI models hosted on Amazon Bedrock.
• It utilizes the Responses API to showcase structured outputs, tool calling, and file inputs.
• It covers operational best practices including state management and prompt caching.

The guide provides concrete code patterns for implementing structured outputs, tool calling, and prompt caching on Bedrock.

SOURCES

[1]

14. Gemma 4 E4B Achieves 2.4x Speedup with LiteRT Engine

A community benchmark found that running Gemma 4 E4B on Google's LiteRT engine yields a 2.4x speedup in text generation over llama.cpp running the GGUF format. Tested on an NVIDIA RTX 4060 Ti, the LiteRT-LM 4B model with multi-token prediction (MTP) achieved 157.2 tokens per second, compared with 66.3 for llama.cpp. A Python wrapper that exposes an OpenAI-compatible endpoint is available on GitHub, but developers should note current engine limitations: output is deterministic regardless of temperature, only a single session is supported, requests are not batched, and the engine runs on Linux only.

• Testing on an NVIDIA RTX 4060 Ti GPU showed LiteRT-LM 4B with multi-token prediction (MTP) achieved 157.2 tok/s, compared to 66.3 tok/s for llama.cpp GGUF.
• Image captioning performance showed a minor 1.1x speedup, bottlenecked by the vision encoder.
• A Python wrapper was developed to create an OpenAI-compatible endpoint for the LiteRT model, available on GitHub.
• Current LiteRT-LM limitations include deterministic output regardless of temperature, single-session operation, lack of request batching, and Linux-only support.

Developers running local models on Linux can leverage LiteRT and multi-token prediction to significantly reduce inference latency.

SOURCES

[1]

15. Benchmark Evaluates 20 Small LLMs on 6GB GPU for Repetitive Tasks

A developer has published a benchmark evaluating 20 small LLMs on a budget 6GB RTX 4050 GPU for repetitive tasks like file organization and log triage. The study tested models on tool-calling, JSON strictness, and instruction adherence, finding that all models suffered a 20% to 35% speed drop when scaling context from 1k to 32k tokens. LFM2.5-1.2B-Instruct was highlighted as a fast, low-VRAM option, Granite-4.1-3B as a solid baseline, and LiquidAI's LFM2.5-8B-A1B as the top orchestrator.

• The evaluation tested 20 small models on a 6GB GPU using a 6-probe set for tool-calling, JSON strictness, instruction adherence, and path hallucination.
• All models experienced a 20% to 35% reduction in generation speed when context scaled from 1k to 32k tokens.
• LFM2.5-1.2B-Instruct was identified as a fast, low-VRAM option, while Granite-4.1-3B (instruct) served as the quality-per-VRAM baseline.
• The LiquidAI build of LFM2.5-8B-A1B was selected as the top orchestrator, outperforming dense 8B models in speed and usable context.
• Third-party fine-tunes frequently exhibited issues like hallucinated function names and broken chat templates.

The benchmark provides concrete performance and quality baselines for developers selecting small, local models for background automation tasks.

SOURCES

[1]

16. Developer Evaluates Qwen3.6-27B as Local Multi-Agent Orchestrator

A developer shared a two-week evaluation of replacing Claude with a local Qwen3.6-27B model (Q6_K quantization on an RTX 3090) in a multi-agent orchestrator across 47 coding workflows. While Qwen achieved 95% schema validity for plan generation, it exhibited a 12% format error rate in JSON tool-calls (compared to Claude's 0.5%) and suffered long-context drift after 14K tokens. The author concluded that Qwen3.6-27B is viable for local multi-agent systems only if paired with structured-output enforcement and re-plan-on-failure logic.

• Qwen3.6-27B (Q6_K quantization, 22GB VRAM on RTX 3090) was tested over two weeks in the OpenYabby orchestrator across 47 multi-step coding workflows.
• The model achieved 95% schema validity for plan generation and successfully extracted facts every 6 turns.
• It exhibited a 12% format error rate in JSON tool-calls, compared to 0.5% with Claude.
• An auto-review process using a second Qwen instance caught roughly 60% of the bugs Claude would have identified.
• The model suffered long-context drift after 14K tokens and failed to detect sub-agent failures in 3 out of 47 runs.

The evaluation highlights the practical limitations and necessary guardrails (like structured-output enforcement) when replacing commercial APIs with local models for agent workflows.
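
The guardrails the author recommends can be approximated with a validate-and-retry wrapper. The sketch below is not the OpenYabby code: it uses post-hoc JSON validation rather than constrained decoding, a simple retry with feedback stands in for full re-plan-on-failure logic, and the tool-call shape (`tool`, `arguments`) and the `generate` callable are assumptions.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"tool": str, "arguments": dict}  # assumed tool-call shape


def parse_tool_call(raw: str) -> Optional[dict]:
    """Return the parsed call if it is valid JSON with the expected keys."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(call.get(key), expected_type):
            return None
    return call


def call_with_replan(generate: Callable[[str], str], prompt: str,
                     max_attempts: int = 3) -> dict:
    """Ask the model again, with corrective feedback, until the call validates."""
    feedback = ""
    for _ in range(max_attempts):
        call = parse_tool_call(generate(prompt + feedback))
        if call is not None:
            return call
        feedback = ('\n\nYour previous reply was not a valid tool call. '
                    'Reply with JSON only, shaped like '
                    '{"tool": "...", "arguments": {...}}.')
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts")


# Fake model for demonstration: first reply is truncated, second is valid.
_replies = iter([
    '{"tool": "read_file", "arguments": ',
    '{"tool": "read_file", "arguments": {"path": "a.py"}}',
])


def fake_model(prompt: str) -> str:
    return next(_replies)


result = call_with_replan(fake_model, "Open a.py")
print(result)
```

With a 12% format error rate per call, a wrapper like this turns most malformed outputs into one extra round trip instead of a failed workflow.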

SOURCES

[1]

17. Optimizing DeepSeek-V4-Flash on AMD MI300X Accelerators

A technical deep dive details the process of bringing up DeepSeek-V4-Flash on AMD's MI300X accelerator, which offers 192GB of HBM3 memory at a lower rental price than equivalent Nvidia hardware. Because the MI300X uses the 'fnuz' FP8 dialect, which is incompatible with the OCP-standard FP8 used in newer AMD chips, the developers implemented custom software workarounds and ROCm-specific helpers. These changes resolved compatibility issues in vLLM and reached 2,699 output tokens per second per GPU, an 8.6% performance improvement.

• The AMD MI300X features 192GB of HBM3 memory, more than double the 80GB capacity of the NVIDIA H100.
• Running vLLM with DeepSeek-V4-Flash on the MI300X required custom software workarounds due to FP8 dialect incompatibilities.
• The MI300X uses the 'fnuz' FP8 dialect, which differs in exponent bias from the OCP-standard FP8 used in newer AMD chips.
• Custom ROCm-specific helpers were required due to uneven coverage in AMD's AITER tuned-kernel library.
• Optimizations achieved 2,699 output tokens per second per GPU, representing an 8.6% performance improvement.

The write-up shows a path to high-throughput, cost-effective self-hosting of DeepSeek models on readily available AMD hardware.
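
The exponent-bias difference is easy to see by decoding the same byte under both dialects. The sketch below assumes the E4M3 layout (1 sign, 4 exponent, 3 mantissa bits); the article only says the dialects differ in exponent bias, so the layout, the bias values (7 for the OCP variant, 8 for fnuz), and the NaN encodings come from the published format definitions rather than from the write-up. The function name is illustrative.

```python
import math


def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode one 8-bit E4M3 value under the OCP or the fnuz dialect."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if fnuz:
        bias = 8
        if byte == 0x80:  # the single NaN encoding; fnuz has no negative zero
            return math.nan
    else:
        bias = 7
        if exp == 0xF and man == 0x7:  # NaN in the OCP variant
            return math.nan
    if exp == 0:  # zero and subnormals
        return sign * (man / 8) * 2.0 ** (1 - bias)
    return sign * (1 + man / 8) * 2.0 ** (exp - bias)


if __name__ == "__main__":
    for byte in (0x38, 0x40, 0x7E):
        ocp = decode_e4m3(byte, fnuz=False)
        fnuz = decode_e4m3(byte, fnuz=True)
        print(f"0x{byte:02X}: OCP={ocp:g}  fnuz={fnuz:g}")
```

The same bits decode to half the value under fnuz for normal numbers, which is why weights quantized for one dialect cannot simply be reinterpreted on hardware that expects the other without adjusting scales or re-encoding.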

SOURCES

[1]

18. Developer Comparison of Web Search APIs for Local RAG Parsing

A community discussion has compiled evaluations of web search APIs optimized for returning clean, noise-free Markdown for local RAG parsing. Brave Search's LLM Context API and Exa (formerly Metaphor) were highlighted for providing pre-formatted Markdown chunks directly, while Parallel AI's Extract API was noted for compressing JavaScript-heavy pages. Other tools discussed include You.com's developer index, Tavily, and URL-to-Markdown converters like Firecrawl and Jina Reader.

• Brave Search offers an LLM Context API returning relevance-ranked, pre-formatted Markdown chunks.
• Parallel AI provides an Extract API designed to compress JavaScript-heavy pages into token-dense Markdown.
• You.com API offers a developer index with raw Markdown output.
• Exa (formerly Metaphor) features native Markdown extraction built specifically for LLMs.
• Tavily is popular but received mixed feedback regarding token overhead and noise filtering, while Firecrawl and Jina Reader are noted for URL-to-Markdown conversion.

The comparison helps developers select search APIs that minimize token overhead and eliminate the need for custom scraping middleware in RAG pipelines.

SOURCES

[1]

19. Hugging Face Introduces Hardware Compatibility Check for Models

Hugging Face has introduced a hardware compatibility check feature directly on its model repository pages. This tool allows developers to instantly verify whether a specific model size or quantization level can run on their local hardware configuration, streamlining the model selection process for local deployment.

• Hugging Face introduced a hardware compatibility check feature for models on its platform.
• The feature helps developers assess if models (such as the 6B parameter dphnAI X1 Trinity Nano) can run on their hardware configurations.

The feature simplifies local model selection by showing developers whether a specific model size or quantization will run on their hardware.

SOURCES

[1]
