Audesso | Daily: AI

Microsoft Launches Seven In-House MAI Models Led by MAI-Transcribe-1.5

1. Microsoft Launches Seven In-House MAI Models Led by MAI-Transcribe-1.5

Microsoft announced seven new models in its MAI family at Build 2026. The release includes MAI-Image 2.5 (with a flash variant), MAI-Voice-2 with support for 15 new languages, and the flagship MAI-Thinking-1 reasoning model, which reportedly matches top-tier models on software engineering benchmarks. On the transcription side, MAI-Transcribe-1.5 stands out as the fastest speech-to-text option in the top 10 on the Artificial Analysis leaderboard, and it adds keyword biasing for domain-specific vocabularies.

  • MAI-Transcribe-1.5 runs at a 276x real-time speed factor, achieving 2.4% Word Error Rate on the AA-WER leaderboard.
  • Transcribe-1.5 is priced at $6 per 1,000 audio minutes via Microsoft Foundry and supports 43 languages.
  • MAI-Thinking-1 is a 35B parameter reasoning model trained from scratch with a 128K context window.
  • MAI-Code-1-Flash is an inference-efficient coding model integrated directly into GitHub Copilot and VS Code.

Offers developers high-speed speech transcription and new reasoning-focused alternatives, marking Microsoft's shift toward in-house models.
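
The price and speed figures above convert into rough per-job estimates. The sketch below is a back-of-envelope calculation only: it assumes cost scales linearly with audio length and that the 276x real-time factor holds for every file, neither of which the announcement guarantees. The function names are made up for illustration.

```python
# Figures taken from the item above.
PRICE_USD_PER_1000_MIN = 6.0   # Foundry list price per 1,000 audio minutes
REAL_TIME_FACTOR = 276.0       # audio seconds processed per wall-clock second


def transcription_cost_usd(audio_minutes: float) -> float:
    """Estimated cost, assuming strictly linear pricing."""
    return audio_minutes / 1000.0 * PRICE_USD_PER_1000_MIN


def processing_seconds(audio_minutes: float) -> float:
    """Estimated wall-clock time, assuming the speed factor is constant."""
    return audio_minutes * 60.0 / REAL_TIME_FACTOR


if __name__ == "__main__":
    minutes = 600.0  # e.g. ten hours of recorded meetings
    print(f"cost: ${transcription_cost_usd(minutes):.2f}")
    print(f"time: {processing_seconds(minutes):.0f} s")
```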

2. Alibaba Launches Qwen3.7-Plus with 1M Token Context and Deep Reasoning

Qwen3.7-Plus is a multimodal agent model designed to interpret text, video, and image inputs. It integrates GUI and CLI interactions into a unified agent loop, scoring 70.3 on the Terminal Bench 2.0-Terminus benchmark and 79.0 on ScreenSpot Pro. The text-only sibling model, Qwen3.7-Max, scored 56.6 on the Artificial Analysis Intelligence Index. The model does not support local weight deployment.

  • Supports a 1-million token context window, including 256K tokens for internal chain-of-thought processing.
  • Priced at $0.40 per million input tokens, with cached reads at $0.04 per million tokens.
  • Includes a 'preserve_thinking' API parameter to retain internal reasoning loops across multi-turn chats.
  • Requires access through Alibaba Cloud's international endpoints under a closed commercial license.

Offers a highly affordable, long-context multimodal model with a dedicated thinking parameter for robust multi-turn conversations.
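
To see what the cached-read discount means in practice, here is a small cost sketch using only the two input prices quoted above. Output-token pricing is not given in this item, so it is left out, and the function name is hypothetical.

```python
# Input-side prices from the item above, in USD per million tokens.
INPUT_USD_PER_M = 0.40
CACHED_USD_PER_M = 0.04


def input_cost_usd(total_input_tokens: int, cached_tokens: int = 0) -> float:
    """Input cost for one request when part of the prompt is a cache hit."""
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    fresh_tokens = total_input_tokens - cached_tokens
    total = fresh_tokens * INPUT_USD_PER_M + cached_tokens * CACHED_USD_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # A full 1M-token prompt, first uncached, then with 90% served from cache.
    print(f"uncached:   ${input_cost_usd(1_000_000):.3f}")
    print(f"90% cached: ${input_cost_usd(1_000_000, 900_000):.3f}")
```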

3. AWS Bedrock Hosts OpenAI Models with Responses API Support

The new guidance in the OpenAI cookbook bridges OpenAI's model capabilities with AWS's cloud-native infrastructure. By utilizing the Responses API, developers can maintain standard patterns like structured data outputs and function calling under AWS Bedrock's hosting umbrella.

  • Demonstrates building production workflows with Bedrock-hosted OpenAI models.
  • Leverages the Responses API to support structured outputs, tool calling, and file inputs.
  • Provides operational guides for state management and prompt caching.

Enables AWS developers to run OpenAI models on Bedrock while keeping the Responses API's structured outputs and tool calling.

4. TinyFish Releases BigSet: Open-Source Multi-Agent Dataset Builder

TinyFish's BigSet framework streamlines data extraction by allowing developers to describe their target data in natural language. The system takes between 2 and 5 minutes to spin up sub-agents, gather details, and produce a fully attributed data table. To run the self-hosted Docker container, developers need API keys for TinyFish, OpenRouter, and Clerk.

  • Licensed under AGPL-3.0 and self-hosted via Docker.
  • Uses a schema inference model to define data structures and an orchestrator agent to coordinate parallel sub-agents.
  • Prevents prompt injection by isolating the dataset ID in an inaccessible JavaScript closure.
  • Supports scheduled data refreshes at intervals from 30 minutes to weekly, exporting results with source attribution.

Gives developers a self-hosted, secure tool to easily automate the collection and structuring of web data into clean CSV or XLSX files.
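
The closure-based isolation mentioned above can be illustrated with a short sketch. This is not BigSet's code (which is JavaScript); it is a Python analogue of the general pattern, with hypothetical names throughout. The point is that the tool handed to the agent takes no dataset ID argument, so injected text cannot redirect a write to another dataset. Python closures can still be inspected through `__closure__`, so this shows the shape of the idea rather than a hard security boundary.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def make_insert_tool(dataset_id: str, store: Dict[str, List[Row]]) -> Callable[[Row], int]:
    """Build a row-insert tool bound to one dataset at creation time."""

    def insert_row(row: Row) -> int:
        # dataset_id comes from the enclosing scope, never from model output.
        # Any "dataset_id" field the model tries to smuggle in is dropped.
        clean = {key: value for key, value in row.items() if key != "dataset_id"}
        store.setdefault(dataset_id, []).append(clean)
        return len(store[dataset_id])

    return insert_row


if __name__ == "__main__":
    store: Dict[str, List[Row]] = {}
    tool = make_insert_tool("ds-123", store)
    tool({"company": "Example Co", "dataset_id": "someone-elses-dataset"})
    print(store)
```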

5. Microsoft Launches Execution Containers for Kernel-Level AI Agent Sandboxing

Microsoft Execution Containers (MXC) provide developers and administrators with a structured framework to safely execute AI agents. Partners like OpenAI, Nvidia, Manus, Nous Research, and the OpenClaw project are actively integrating MXC into their developer frameworks. Additionally, Microsoft announced Agent 365, scheduled for preview in July, to tie MXC operations with enterprise security suites like Defender and Purview.

  • Enforces policy-driven execution boundaries for AI agents at the Windows OS kernel level at runtime.
  • Supports a scalable isolation spectrum from lightweight process isolation to micro-virtual machines.
  • Binds each agent to a local or Microsoft Entra-backed identity for auditable action tracking.
  • Isolates agent execution from the desktop, clipboard, and input UI to prevent UI spoofing and cross-session leaks.

Enables developers to safely run potentially untrusted agent code by confining actions to a highly customizable OS-level sandbox.

6. Perplexity Introduces Search as Code (SaC) SDK for Custom Search Pipelines

Search as Code (SaC) shifts search architecture from static API calls to a model-driven process. By giving the orchestrating AI model direct control over search parameters, SaC allows for task-specific pipeline configuration, enabling highly robust and contextually accurate agentic searches.

  • Provides an SDK allowing AI models to programmatically configure search pipelines.
  • Designed to improve performance and cost-efficiency over monolithic search APIs.
  • Outperformed competitors in complex search benchmarks, specifically WANDR.

Enables developers to replace rigid search APIs with flexible pipelines configured dynamically by their LLMs.

7. Mistral Releases Open-Source Search Toolkit for AI Retrieval Pipelines

The Mistral Search Toolkit aims to simplify the engineering overhead of building production AI pipelines. By standardizing ingestion and retrieval interfaces, developers can more easily switch, optimize, and evaluate components in their search-grounded architectures.

  • Released in public preview as an open-source framework.
  • Designed to unify three core steps: data ingestion, retrieval, and evaluation.
  • Provides a shared interface for managing retrieval operations.

Gives developers a structured, open-source library to streamline data ingestion, retrieval, and evaluation within their RAG pipelines.

8. Microsoft Launches IQ and Rayfin SDK to Unify Agent Context and Data

Announced at Build 2026, Microsoft IQ and Rayfin solve a major hurdle for developers building complex enterprise agents: fragmented data storage and drifting user context. By standardizing the backend on OneLake via the Rayfin SDK, organizations can ensure that all agent-generated applications feed back into a centralized, governed organizational knowledge layer. Ontologies within Fabric IQ are expected to reach general availability soon.

  • Rayfin is an open-source SDK and CLI that deploys agent applications directly to Microsoft Fabric.
  • Microsoft IQ consolidates four context sources: Work IQ, Foundry IQ, Fabric IQ, and Web IQ.
  • Routes app data directly into Microsoft OneLake to prevent siloed storage.
  • Addresses the market shift where hybrid retrieval intent grew from 10.3% in January to 33.3% in March 2026.

Enables developers to deploy agent-built apps directly to a governed Microsoft Fabric backend while keeping context centralized.

9. Microsoft Open-Sources ASSERT for Spec-Driven AI Evaluation

ASSERT addresses the growing demand for rigorous, application-specific AI evaluation. The framework automatically generates scenario test cases, evaluates target system responses, and assigns regression scores based on user-defined constraints. Developers can provide custom system context and tools to tailor the testing environment to their specific integration needs.

  • Stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT).
  • Translates natural-language goals, policies, and behavior guidelines into portable, scored test suites.
  • Saves detailed execution traces, intermediate actions, and tool calls to simplify debugging.
  • Applicable across the entire development lifecycle, including pre-deployment building and continuous post-deployment monitoring.

Allows developers to quickly generate and run repeatable regression tests on agent behaviors using simple English descriptions.

10. Running Gemma 4 via LiteRT Delivers 2.4x Text Generation Speedup

Testing reveals that deploying Gemma 4 E4B models with Google's LiteRT engine offers a dramatic speed boost for text-generation tasks compared to standard llama.cpp implementations. The benchmark emphasizes that the speedup is largely on the text decoder side, as the vision encoder bottleneck remains mostly unchanged. Developers can use the author's open-source Python wrapper to spin up a compatible API endpoint locally.

  • LiteRT-LM 4B with multi-token prediction (MTP) achieved 157.2 tok/s, compared to 66.3 tok/s for llama.cpp Q4 GGUF on an RTX 4060 Ti.
  • Image captioning showed a modest 1.1x speedup, with the vision encoder acting as the main bottleneck.
  • OpenAI-compatible Python wrapper is available on GitHub to simplify integration.
  • Current limitations include deterministic output (ignores temperature), single-session execution, no batching, and Linux-only support.

Provides a clear local performance optimization path for developers integrating Gemma 4 E4B models into Linux environments.
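
The gap between the 2.4x text speedup and the 1.1x captioning speedup is what Amdahl's law predicts when only part of a pipeline gets faster. The sketch below solves the law for the share of captioning time spent in the text decoder. It assumes the vision encoder is completely unchanged and treats the rounded 1.1x figure as exact, so the result is a rough indication, not a measurement.

```python
def decoder_time_fraction(decoder_speedup: float, overall_speedup: float) -> float:
    """Solve Amdahl's law, overall = 1 / ((1 - p) + p / s), for p.

    p is the fraction of the original run time spent in the part that
    was accelerated (here: the text decoder), s is that part's speedup.
    """
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / decoder_speedup)


# Throughput figures from the bullets above.
text_speedup = 157.2 / 66.3   # roughly 2.4x
caption_speedup = 1.1         # rounded figure from the benchmark

p = decoder_time_fraction(text_speedup, caption_speedup)

if __name__ == "__main__":
    print(f"text decoder speedup: {text_speedup:.2f}x")
    print(f"implied decoder share of captioning time: {p:.0%}")
```

Under these assumptions the decoder accounts for only about a sixth of captioning time, which is consistent with the vision encoder being the bottleneck.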

11. Benchmarks Rank Small LLMs for Repetitive Local Task Automation

The benchmark study evaluated small LLMs for specific system utility tasks, noting that models typically suffer a 20% to 35% reduction in generation speed when scaling context from 1k to 32k tokens. Additionally, the researcher observed that third-party fine-tunes frequently introduce issues like broken chat templates and hallucinated function names, reinforcing the value of relying on well-engineered base models for automation workflows.

  • Tested 20 models on a 6GB RTX 4050 using a custom 6-probe set targeting tool calls, instruction adherence, and plan decomposition.
  • LFM2.5-1.2B-Instruct identified as a fast, low-VRAM option, and Granite-4.1-3B served as the quality baseline.
  • Gemma-4-agentic-e2b recommended for long-context tasks with its 256k token support.
  • Liquid AI's LFM2.5-8B-A1B selected as the top orchestrator, outperforming dense 8B models in speed and context utilization.

Helps developers select the most efficient and robust small-footprint model for local agent sub-tasks and background execution.

12. Evaluating Qwen3.6-27B as a Local Alternative to Claude for Agents

The evaluation confirms that while Qwen3.6-27B can serve as a viable local reasoning layer, it demands strict software mitigations to match cloud-based API models. To prevent cascading agent failures—which occurred in 3 out of 47 runs due to undetected sub-agent errors—developers must implement structured-output enforcement, plan-approval gates, and explicit failure-handling logic.

  • Tested Qwen3.6-27B at Q6_K quantization on an RTX 3090 (24GB VRAM) across 47 coding workflows using OpenYabby.
  • Achieved 95% schema validity for plan generation but exhibited a high 12% format error rate in JSON tool-calls.
  • Caught roughly 60% as many bugs as Claude when a secondary Qwen instance was used for auto-review.
  • Experienced long-context drift after 14k tokens, suggesting a practical working limit of about 12k tokens.

Offers concrete metrics and architectural recommendations for developers trying to replace cloud LLM APIs with self-hosted reasoning models.
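
Structured-output enforcement with explicit failure handling can be sketched in a few lines. The code below is a generic illustration, not the evaluation's or OpenYabby's implementation: the two-key tool-call schema, the retry prompt, and all names are assumptions. It validates each reply, re-prompts on malformed JSON, and raises instead of letting a bad call flow silently into the next agent step.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"tool", "arguments"}  # hypothetical tool-call schema


class ToolCallError(RuntimeError):
    """Raised when the model never produces a valid tool call."""


def parse_tool_call(raw: str) -> Optional[dict]:
    """Return the parsed call, or None if it is malformed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or not REQUIRED_KEYS.issubset(call):
        return None
    if not isinstance(call["tool"], str) or not isinstance(call["arguments"], dict):
        return None
    return call


def call_with_enforcement(generate: Callable[[str], str], prompt: str,
                          max_attempts: int = 3) -> dict:
    """Ask the model for a tool call, retrying on format errors."""
    for _ in range(max_attempts):
        call = parse_tool_call(generate(prompt))
        if call is not None:
            return call
        # Feed the failure back so the next attempt can correct itself.
        prompt += "\nReply with JSON only, with keys 'tool' and 'arguments'."
    # Fail loudly so a parent agent cannot mistake this for success.
    raise ToolCallError(f"no valid tool call after {max_attempts} attempts")
```

Here `generate` stands in for whatever function sends a prompt to the local model and returns its text. A plan-approval gate would sit one level up, checking the validated call against an allow-list before executing it.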

13. Reducing Query-Time RAG Overhead with Ingestion-Time Image Description

According to Kapa's findings, performing query-time multimodal processing is economically inefficient and prone to payload limit errors. Storing image descriptions as separate text chunks rather than embedding them inline proved to be much more cost-effective. The system, which is currently rolling out in preview, is designed to handle technical documentation containing millions of images.

  • Describes images using a vision model at indexing time and stores the output as text chunks rather than processing images at query time.
  • Uses a zero-shot classifier at ingestion to filter out non-essential images like logos and banners.
  • Improves caption quality by providing the vision model with the surrounding text context during generation.
  • Achieved 94% to 99% correct image placement across three customer documentation assistant projects.

Provides a highly cost-effective pattern for implementing multimodal RAG over millions of documentation images without hitting query payload limits.
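
The ingestion-time pattern described above can be sketched as a three-step loop: filter, describe with context, store as text. This is an illustrative outline under stated assumptions, not Kapa's code. The classifier and vision model are passed in as plain callables, and the data classes are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ImageRef:
    url: str
    surrounding_text: str  # nearby prose from the source page


@dataclass
class TextChunk:
    text: str
    source_url: str


def index_images(
    images: Iterable[ImageRef],
    is_essential: Callable[[ImageRef], bool],
    describe: Callable[[str, str], str],
) -> List[TextChunk]:
    """Turn images into standalone text chunks at indexing time."""
    chunks: List[TextChunk] = []
    for image in images:
        # Step 1: drop logos, banners and other decorative images.
        if not is_essential(image):
            continue
        # Step 2: caption the image, giving the model the surrounding text.
        description = describe(image.url, image.surrounding_text)
        # Step 3: store the caption as its own chunk, not inline in the page text.
        chunks.append(TextChunk(text=description, source_url=image.url))
    return chunks
```

At query time the retriever then sees only text chunks, so no image payloads are sent with the user's question.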

14. Comparing Web Search APIs for Clean Markdown RAG Processing

Choosing the right search API is critical for avoiding excessive token consumption and parsing noise in retrieval-augmented generation. While Tavily is widely used for agents, developers report mixed success regarding token overhead. For self-hosted, budget-friendly setups, SearXNG remains an option, though it requires custom post-processing to clean raw HTML before embedding.

  • Brave Search offers an LLM Context API providing pre-formatted, relevance-ranked Markdown chunks.
  • Parallel AI's Extract API compresses JS-heavy pages into dense Markdown tokens.
  • Exa features native Markdown extraction explicitly built for direct LLM ingestion.
  • Firecrawl and Jina Reader are dedicated tools for converting raw URLs into clean Markdown.

Helps developers select search endpoints that eliminate heavy scraping middleware and reduce token overhead in RAG pipelines.

15. Speeding Up Transformer Training with NVIDIA Apex Fused Kernels

This tutorial provides a clear path for modernizing training pipelines. Rather than relying on Apex's deprecated mixed-precision components, developers are guided to use PyTorch's native AMP while taking advantage of Apex's highly optimized fused CUDA kernels. Verifying kernel availability during runtime is highlighted as critical to prevent silent execution fallbacks to slower standard implementations.

  • Uses Apex primarily for high-performance fused kernels like FusedAdam, FusedLayerNorm, and FusedRMSNorm.
  • Advises pairing with native PyTorch torch.amp (autocast and GradScaler) rather than the deprecated apex.amp library.
  • Requires building Apex from source with CUDA and C++ extensions to ensure kernel availability.
  • Demonstrates throughput gains by benchmarking FusedAdam against PyTorch AdamW.

Helps developers optimizing custom model fine-tuning runs to achieve higher training throughput.
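
One way to guard against silent fallbacks is to check for Apex's compiled extensions before training starts. The sketch below is not taken from the tutorial. It assumes, based on how Apex is commonly built, that FusedAdam depends on an extension module named `amp_C` and the fused norm layers on `fused_layer_norm_cuda`; verify those names against your Apex build. It only looks the modules up, so it runs even where Apex is not installed.

```python
import importlib.util
from typing import Dict

# Assumed mapping from Apex feature to its compiled extension module.
APEX_EXTENSIONS = {
    "FusedAdam": "amp_C",
    "FusedLayerNorm": "fused_layer_norm_cuda",
}


def fused_kernel_status() -> Dict[str, bool]:
    """Report which assumed extension modules can be found, without importing them."""
    return {
        feature: importlib.util.find_spec(module) is not None
        for feature, module in APEX_EXTENSIONS.items()
    }


def require_fused(feature: str) -> None:
    """Fail fast instead of training on a slower fallback path."""
    if not fused_kernel_status().get(feature, False):
        raise RuntimeError(
            f"{feature} kernels not found; rebuild Apex with CUDA and C++ extensions"
        )


if __name__ == "__main__":
    for feature, found in fused_kernel_status().items():
        print(f"{feature}: {'available' if found else 'missing'}")
```

Calling `require_fused("FusedAdam")` at the top of a training script turns a quiet slowdown into an immediate, visible error.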

16. Optimizing DeepSeek-V4-Flash on AMD MI300X Hardware

While the AMD MI300X is available on-demand at lower rental prices than equivalent Nvidia hardware, deploying cutting-edge models like DeepSeek-V4-Flash with vLLM has historically required custom software workarounds. By developing tailored ROCm helpers and accounting for FP8 exponent bias differences, engineers worked around the uneven tuned-kernel library coverage for the chip to deliver high-throughput local inference.

  • AMD MI300X features 192GB of HBM3 memory, more than double the capacity of the NVIDIA H100 (80GB).
  • Optimizations bypassed 'fnuz' FP8 dialect incompatibilities with OCP-standard FP8 on newer AMD chips.
  • Utilized custom ROCm helpers to overcome uneven coverage in AMD's AITER tuned-kernel library for CDNA3 cores.
  • Achieved 2699 output tokens per second per GPU, representing an 8.6% performance improvement.

Provides a practical path for developers looking to slash hosting costs by running large open models on cheaper AMD hardware.
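
The exponent bias issue is easiest to see by decoding the same byte under both FP8 dialects. The sketch below assumes the 8-bit E4M3 layout (1 sign, 4 exponent, 3 mantissa bits), with a bias of 7 for the OCP variant and 8 for the "fnuz" variant. The write-up summarized above does not state which FP8 format was involved, so treat this as a general illustration of why weights stored in one dialect cannot be reinterpreted in the other without rescaling.

```python
def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode one FP8 E4M3 byte as OCP (fnuz=False) or fnuz (fnuz=True)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if fnuz:
        # fnuz has no negative zero; that bit pattern is its only NaN.
        if byte == 0x80:
            return float("nan")
        bias = 8
    else:
        # OCP E4M3 reserves exponent and mantissa all ones for NaN.
        if exponent == 0x0F and mantissa == 0x07:
            return float("nan")
        bias = 7
    if exponent == 0:
        # Subnormal: no implicit leading 1.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - bias)


if __name__ == "__main__":
    for byte in (0x38, 0x40, 0x7E):
        print(f"0x{byte:02X}: OCP={decode_e4m3(byte, False)}  fnuz={decode_e4m3(byte, True)}")
```

Apart from the NaN and negative-zero encodings, every bit pattern decodes to exactly half its OCP value under fnuz, which is why a common workaround is to double the accompanying scale factor when loading OCP-format FP8 weights on fnuz hardware.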

17. Microsoft Debuts Surface RTX Spark Dev Box with 128GB Unified Memory for Local AI

Unveiled at Build 2026, the Surface RTX Spark Dev Box represents Microsoft's push to transition intensive AI workloads from cloud API billing to fixed-cost local hardware. The compact machine acts as a spiritual successor to Qualcomm's canceled Snapdragon Dev Kit and is optimized for local-first AI development. It will be available in the US on the Microsoft Store later this year, though official pricing has not yet been announced.

  • Features an Nvidia Blackwell-architecture RTX Spark chip and 128GB of unified memory.
  • Rated at one petaflop of AI compute with a 100-watt thermal envelope.
  • Ships preconfigured with Windows 11 Pro, WSL 2, VS Code, Git, Python, and Node.js.
  • Designed with a 3D-printed metal chassis that acts as a passive heatsink.

Enables developers to run models with up to 120 billion parameters locally, bypassing per-token cloud costs.
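
A quick memory estimate shows where the 120-billion-parameter figure plausibly comes from. The sketch counts weight storage only and assumes a uniform number of bits per parameter. It ignores the KV cache, activations, and the operating system's own share of the 128GB, so real headroom is smaller than these numbers suggest.

```python
UNIFIED_MEMORY_GB = 128.0  # from the spec list above


def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        needed = weight_memory_gb(120, bits)
        verdict = "fits" if needed <= UNIFIED_MEMORY_GB else "does not fit"
        print(f"120B params at {bits}-bit: {needed:.0f} GB ({verdict})")
```

At 16 bits per parameter a 120B model needs about 240GB for weights alone, so the claim implies quantized weights at 8 bits or fewer.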
