MiniMax Releases M3 Model with 1M-Token Context and Desktop Control

1. MiniMax Releases M3 Model with 1M-Token Context and Desktop Control

MiniMax officially released the M3 model on June 1, 2026, introducing the MiniMax Sparse Attention (MSA) architecture. This architecture uses a KV outer gather Q approach to significantly reduce compute demands compared to previous generations. In addition to its 1-million-token context window, M3 supports native image and video input and can operate a desktop computer. The model is accessible through MiniMax Code, the Token Plan, and MiniMax's API services, with subscription plans starting at $20 per month.

• MiniMax M3 features a 1M-token context window powered by the new MiniMax Sparse Attention (MSA) architecture.
• The MSA architecture reduces per-token compute to 1/20th of previous models, yielding a 9x prefill and 15x decoding speedup at 1M tokens.
• The model scored 59.0% on SWE-Bench Pro and 70.06% on OSWorld-Verified, and can natively process image/video inputs and operate a desktop.
• Weights will be released under an open-weights license within 10 days, and API pricing is temporarily discounted to $0.30/M input and $1.20/M output tokens.

M3 gives developers an efficient long-context model with strong agentic performance at a heavily discounted introductory API price.
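
As a rough guide to what the discounted rates mean per request, the arithmetic can be sketched as follows. The two rates are the ones quoted above; the function name and the example token counts are illustrative assumptions, not part of MiniMax's API.

```python
# Discounted M3 API rates quoted above, in USD per million tokens.
INPUT_USD_PER_M = 0.30
OUTPUT_USD_PER_M = 1.20


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the quoted introductory rates."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: a full 1M-token prompt with a 4,000-token reply.
    print(f"${request_cost_usd(1_000_000, 4_000):.4f}")
```

At these rates a prompt that fills the whole context window costs about 30 cents in input tokens, so the input side dominates long-context workloads.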

SOURCES

[1] [2] [3] [4]

2. NVIDIA Releases Cosmos 3 Open-Weights Physical AI Models

NVIDIA's Cosmos 3 integrates physical reasoning, world generation, and action generation. The architecture consists of a Reasoner tower for multimodal observation interpretation and a Generator tower for physics-aware video and action output. The release includes open-source model checkpoints on Hugging Face, training scripts on GitHub, and six synthetic data generation datasets. The models are available as NVIDIA NIM microservices, with the Reasoner NIM currently available and the Generator NIM forthcoming.

• NVIDIA released Cosmos 3 in 16B (Nano) and 64B (Super) parameter variants under the OpenMDW 1.1 license.
• The models use a Mixture-of-Transformers architecture pairing an autoregressive reasoner with a diffusion generator.
• Cosmos 3 Super variants achieved the #1 open-weights ranking for Text-to-Image and Image-to-Video on the Artificial Analysis Leaderboards.
• Model weights, training scripts, and datasets are available on Hugging Face and GitHub, with NIM microservices supporting NVFP4 quantization for a 2x speedup.

Developers can self-host these models for multimodal observation and video generation from the Hugging Face checkpoints, under the OpenMDW 1.1 license.

SOURCES

[1] [2] [3]

3. NVIDIA Announces 550B Nemotron 3 Ultra Open-Weights Model

Announced during Jensen Huang's Computex keynote, Nemotron 3 Ultra is positioned as the most intelligent US open-weights model currently available. The model utilizes 90% sparsity to maintain 55 billion active parameters out of its 550 billion total parameters. While it scored lower than the Kimi K2.6 model (54) on the Artificial Analysis Intelligence Index, it outperformed several other open-weights models.

• Nemotron 3 Ultra features 550B total parameters with 55B active parameters via 90% sparsity.
• The model scored 48 on the Artificial Analysis Intelligence Index, outperforming Gemma 4 31B (39) and Nemotron 3 Super (36).
• It achieved speeds exceeding 300 tokens per second on a pre-release DeepInfra endpoint.
• The model is released in BF16 weights, with NVFP4 quantization planned for future release.

Nemotron 3 Ultra gives developers a capable US open-weights option that exceeded 300 tokens per second on a pre-release DeepInfra endpoint.
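
The sparsity figures above are easy to check, and they also hint at the hardware needed to self-host. The sketch below is back-of-the-envelope arithmetic, assuming every parameter is stored in BF16 (2 bytes each) and ignoring KV cache, activations, and runtime overhead; the function names are illustrative.

```python
TOTAL_PARAMS = 550e9   # total parameters, from the announcement
SPARSITY = 0.90        # fraction of parameters inactive for a given token
BYTES_PER_BF16 = 2     # BF16 stores each weight in 16 bits


def active_params(total: float, sparsity: float) -> float:
    """Parameters used per token when `sparsity` of them are inactive."""
    return total * (1.0 - sparsity)


def weight_bytes(total: float, bytes_per_param: int) -> float:
    """Storage for the weights alone; all experts must still be resident."""
    return total * bytes_per_param


if __name__ == "__main__":
    print(f"active: {active_params(TOTAL_PARAMS, SPARSITY) / 1e9:.0f}B parameters")
    print(f"BF16 weights: {weight_bytes(TOTAL_PARAMS, BYTES_PER_BF16) / 1e12:.1f} TB")
```

Sparsity lowers compute per token, not the memory footprint, which is why the planned NVFP4 quantization matters for deployment.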

SOURCES

[1] [2]

4. OpenAI Frontier Models and Codex Launch on Amazon Bedrock

The general availability of OpenAI frontier models and Codex on AWS enables enterprise customers to utilize these capabilities within their existing cloud infrastructure. This integration supports AWS-native security and governance controls. Additionally, OpenAI's upcoming Daybreak suite on AWS is designed to assist cyber defenders with secure code review, threat modeling, patch validation, dependency risk analysis, and remediation guidance.

• OpenAI frontier models and Codex are now generally available on Amazon Bedrock in both Commercial and GovCloud regions.
• The integration allows AWS customers to run OpenAI models within their existing compliance, procurement, and security workflows.
• Codex on Bedrock is available to assist development teams with code review, debugging, and modernization.
• OpenAI plans to bring its Daybreak cyber security suite, including Codex Security, to AWS in a future release.

Developers deploying to AWS can now integrate OpenAI models directly via Amazon Bedrock, leveraging AWS-native governance controls and unified billing.

SOURCES

[1]

5. Compromised Red Hat npm Packages Target Claude Code and VS Code

On June 1, 2026, StepSecurity discovered malware within the @redhat-cloud-services npm scope. The malicious payload is contained in a 4.2 MB index.js file triggered by a preinstall script in the package.json file, utilizing four layers of obfuscation. Exfiltration traffic is routed through api.github.com to mimic legitimate GitHub API activity. StepSecurity has filed disclosure issues for three affected repositories, including RedHatInsights/platform-frontend-ai-toolkit.

• StepSecurity identified malware in the @redhat-cloud-services npm scope that executes automatically during npm install via a preinstall script.
• The malware targets credentials for AWS, GCP, Azure, Kubernetes, HashiCorp Vault, GitHub Actions, and CircleCI.
• It establishes persistence by injecting a SessionStart hook into Claude Code settings and a folderOpen task into VS Code workspace configurations.
• The self-propagating worm uses stolen npm tokens and the bypass_2fa parameter to republish backdoored versions of 32 affected packages.

Developers using these packages must immediately audit their environments, as the malware injects malicious hooks directly into Claude Code settings and VS Code workspace configurations.
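
A first-pass audit of a single project could look like the sketch below. It only checks the three indicators described above, and the file locations (`.claude/settings.json`, `.claude/settings.local.json`, `.vscode/tasks.json`) are assumptions based on the tools' usual project-level defaults. A hit is a prompt for manual review rather than proof of compromise, since legitimate hooks and tasks exist, and a clean result does not rule out infection.

```python
import json
from pathlib import Path

AFFECTED_SCOPE = "@redhat-cloud-services"  # scope named in the StepSecurity report


def load_json(path: Path):
    """Return parsed JSON, or None if the file is missing or not strict JSON."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


def as_dict(value) -> dict:
    """Treat anything that is not a JSON object as empty."""
    return value if isinstance(value, dict) else {}


def audit_project(root: Path) -> list[str]:
    """List files worth a manual look; a hit is not proof of compromise."""
    findings = []

    # Installed packages in the affected scope that declare a preinstall script.
    scope_dir = root / "node_modules" / AFFECTED_SCOPE
    if scope_dir.is_dir():
        for manifest in sorted(scope_dir.glob("*/package.json")):
            scripts = as_dict(as_dict(load_json(manifest)).get("scripts"))
            if "preinstall" in scripts:
                findings.append(f"preinstall script: {manifest}")

    # Claude Code settings files (assumed locations) with a SessionStart hook.
    for name in ("settings.json", "settings.local.json"):
        settings = root / ".claude" / name
        hooks = as_dict(as_dict(load_json(settings)).get("hooks"))
        if "SessionStart" in hooks:
            findings.append(f"SessionStart hook: {settings}")

    # VS Code workspace tasks configured to run automatically on folder open.
    tasks_file = root / ".vscode" / "tasks.json"
    tasks = as_dict(load_json(tasks_file)).get("tasks")
    for task in tasks if isinstance(tasks, list) else []:
        if as_dict(as_dict(task).get("runOptions")).get("runOn") == "folderOpen":
            findings.append(f"folderOpen task: {tasks_file}")

    return findings


if __name__ == "__main__":
    for finding in audit_project(Path.cwd()):
        print(finding)
```

Files that contain comments (common in `tasks.json`) are not strict JSON and are skipped by this sketch, so they need to be inspected by hand. User-level settings outside the project directory are also out of scope here.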

SOURCES

[1] [2]

6. xAI Launches Grok Build 0.1 API for Agentic Coding

The grok-build-0.1 model is now available in public beta, providing a high-speed option for developers building agentic coding workflows. The model is designed to process data at speeds exceeding 100 tokens per second, making it highly responsive for real-time debugging and web development tasks.

• xAI released the grok-build-0.1 model in public beta via its API.
• The model is optimized for agentic coding, specifically web development and debugging, processing over 100 tokens per second.
• API pricing is set at $1 per million input tokens and $2 per million output tokens.
• It integrates directly with developer platforms including Cursor, Grok Build, and OpenClaw.

Developers can immediately integrate this model into their daily workflows via Cursor or OpenClaw for fast, agentic coding at $1 per million input tokens.

SOURCES

[1]

7. Microsoft to Unveil MAI-Thinking-1 Reasoning Model at Build

Microsoft's Build conference in San Francisco is set to focus heavily on AI and developer-focused Windows improvements. Microsoft AI chief Mustafa Suleyman is expected to lead the announcements, which include new local AI models and a rewritten Windows 11 user experience. Demonstrations will also feature Microsoft Scout, an AI agent based on the company's OpenClaw work.

• Microsoft is expected to unveil MAI-Thinking-1, its first native, non-distilled reasoning model, at its Build conference.
• The company will introduce MAI-Image-2.5 and MAI-Image-2.5-Flash models, alongside a preview of a Copilot "super app" in late summer.
• Microsoft will emphasize running local AI models on Windows to leverage local compute instead of cloud APIs.
• A new developer-optimized Windows 11 experience will provide a distraction-free environment with pre-installed developer tools and scripts.

Developers will gain access to new native reasoning and image models, plus a developer-optimized Windows 11 environment designed for local AI execution.

SOURCES

[1]

8. Memory OS Launches 6-Layer Open-Source Memory Stack for Hermes Agents

Released on May 31, 2026, Memory OS offers a structured approach to managing agent memory. The system utilizes a four-level fallback cascade for retrieval and a weekly decay scanner to manage memory bloat. While the project is in its early development stages and requires a complex setup of multiple services, it provides a highly customizable memory stack for developers building with Hermes.

• Memory OS is an MIT-licensed memory architecture designed to run locally using Docker, Qdrant, Redis, and Python.
• The system implements six memory layers, including workspace files, session history, structured facts, and an auto-curated LLM wiki.
• It features a gated, deduplicated retrieval process during pre_llm_call and captures new information during post_llm_call and on_session_end.
• The architecture is compatible with any LLM provider supported by the Hermes Agent, though it currently lacks published benchmarks.

Memory OS gives developers an early-stage, local memory stack with automatic curation, vector storage, and decay scanning to limit memory bloat.
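
The retrieval and decay ideas can be illustrated with a generic sketch. This is not Memory OS code: the item above does not describe the four cascade levels or the decay policy, so the retriever callables, the first-non-empty rule, and the age-based cutoff are all assumptions chosen to show the pattern.

```python
from typing import Callable, Iterable

# A retriever takes a query and returns matching memory snippets (hypothetical type).
Retriever = Callable[[str], list[str]]


def cascade_retrieve(query: str, levels: Iterable[Retriever]) -> list[str]:
    """Try each level in order and return the first non-empty result,
    with duplicates removed while preserving order."""
    for retrieve in levels:
        hits = retrieve(query)
        if hits:
            return list(dict.fromkeys(hits))
    return []


def decay_scan(last_used: dict[str, float], now: float, max_age_days: float) -> dict[str, float]:
    """Keep only memories used within max_age_days (timestamps in seconds)."""
    cutoff = now - max_age_days * 86_400
    return {key: ts for key, ts in last_used.items() if ts >= cutoff}


if __name__ == "__main__":
    # Four hypothetical levels, from most to least specific.
    levels = [
        lambda q: [],                          # exact structured fact: miss
        lambda q: [],                          # session history: miss
        lambda q: ["note A", "note A", "note B"],  # vector search: hit, with a duplicate
        lambda q: ["wiki fallback"],           # never reached for this query
    ]
    print(cascade_retrieve("deploy steps", levels))
```

A cascade like this keeps cheap, precise lookups in front of broader searches, while a periodic scan bounds how much stale material can accumulate.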

SOURCES

[1]

9. Pi Assistants Gain Multi-Agent Workflows via pi-dynamic-workflows

The pi-dynamic-workflows extension allows Pi assistants to orchestrate complex tasks by fanning out work to subagents. These subagents operate in isolation to perform specific actions, such as reading files or executing shell commands, before the main assistant synthesizes the final results.

• The pi-dynamic-workflows extension introduces a workflow tool for Pi assistants.
• Assistants can execute JavaScript scripts to distribute tasks across multiple isolated subagents.
• Subagents are capable of reading files, executing shell commands, and generating structured output.
• The tool is designed for codebase audits, multi-perspective reviews, large refactors, and fan-out research.

The extension gives developers a structured way to automate complex, multi-step tasks such as codebase audits and large-scale refactoring.
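
The fan-out pattern itself is simple to sketch. The extension runs JavaScript scripts; the Python below only illustrates the shape of the workflow, and `run_subagent`, `fan_out`, and `synthesize` are hypothetical stand-ins rather than the extension's API.

```python
import asyncio


async def run_subagent(task: str) -> dict:
    """Stand-in for an isolated subagent; a real one would read files or run commands."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return {"task": task, "summary": f"reviewed {task}"}


async def fan_out(tasks: list[str]) -> list[dict]:
    """Run one subagent per task concurrently and collect results in task order."""
    return await asyncio.gather(*(run_subagent(task) for task in tasks))


def synthesize(results: list[dict]) -> str:
    """Merge the structured subagent outputs into one report for the main assistant."""
    return "\n".join(f"- {r['task']}: {r['summary']}" for r in results)


if __name__ == "__main__":
    results = asyncio.run(fan_out(["auth.py", "billing.py", "api.py"]))
    print(synthesize(results))
```

Having each subagent return structured output, rather than free text, is what makes the final synthesis step straightforward.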

SOURCES

[1]

10. AgentControl Launches Production Monitoring and Steering for AI Agents

AgentControl addresses the challenges of managing autonomous agents in live environments. By providing real-time visibility and steering capabilities, the platform allows developers to block undesirable behaviors and experiment with variations of agent behavior dynamically.

• AgentControl is a production monitoring and management tool for AI agents.
• The platform allows developers to view agent activities, block unwanted behaviors, and steer responses in real time.
• Users can experiment with and iterate on agent behaviors without undergoing a full deployment cycle.
• The tool is currently available for a free trial.

AgentControl gives developers immediate visibility and runtime control over autonomous agents without needing to redeploy code.

SOURCES

[1]

11. JetBrains Open-Sources Mellum-2 Coding-Focused MoE Models

The Mellum-2 series is designed as a fast model option for AI workflows. Hosted on Hugging Face and documented in an arXiv technical report, the models are optimized for coding tasks, though JetBrains notes they perform worse than Qwen 3.5 4B in non-coding domains.

• JetBrains open-sourced the Mellum-2 small Mixture-of-Experts (MoE) model series on Hugging Face.
• The reasoning model in the series delivers coding performance comparable to Qwen 3.5 9B.
• Mellum-2 is optimized for speed in AI workflows but performs worse than Qwen 3.5 4B on non-coding tasks.

Mellum-2 offers developers a lightweight, local option for code generation, with a reasoning variant that is comparable to Qwen 3.5 9B on coding tasks.

SOURCES

[1] [2]

12. llama.cpp b9455 Fixes Multi-GPU Quantized KV Cache Bug

The llama.cpp release b9455 addresses a critical multi-GPU issue on master. By extending the ggml_backend_meta_split_state specification, the fix allows the meta backend to restore the correct data layout after reshaping, resolving the shape information loss caused by tensor flattening.

• llama.cpp release b9455 fixes a bug that occurred when using --sm tensor with a quantized KV cache on multi-GPU setups.
• The bug was caused by tensor flattening during KV cache rotation, which stripped shape information needed by the meta backend.
• The fix extends the ggml_backend_meta_split_state specification to track segment repetition frequency, restoring the correct layout.
• The implementation is fully backwards-compatible and requires no changes to existing llama.cpp compute graphs.

Developers running local multi-GPU inference can now safely use quantized KV caches with tensor splitting without modifying their existing compute graphs.

SOURCES

[1]

13. ByteDance Releases Bernini Video Generation Model on Hugging Face

ByteDance has released Bernini on Hugging Face. The model is designed to generate or edit videos using text, images, or reference videos as input, and is claimed to rival the performance of leading closed-source video generation models.

• ByteDance released the Bernini video generation and editing model on Hugging Face.
• The model accepts text, images, or reference videos as inputs to generate or edit video content.
• Bernini is positioned to rival the performance of leading closed-source video generation models.

Developers can access and host a highly capable video generation model locally or via Hugging Face to build custom video editing workflows.

SOURCES

[1]

14. Snowflake CSTO Advocates Intent-Based Permissions and MCP Gateways for AI Agents

Mayank Upadhyay, Chief Security & Trust Officer at Snowflake, argues that security adoption fails when the secure path is more difficult to use than the insecure one. He notes that AI agents often possess excessive permissions, increasing the attack surface. To address this, he advocates for task-specific credentials and suggests MCP gateways as a practical method for centrally encoding governance rules.

• Snowflake's CSTO recommends granting agents task-specific, short-lived credentials based on intent rather than static, excessive permissions.
• MCP gateways are highlighted as a practical tool to centrally encode governance rules across multiple agent-to-tool connections.
• He also advises using cloud workload identity to eliminate static keys and close the 20% visibility gaps that AI agents can exploit.

Developers building agentic workflows can use MCP gateways to centrally manage and govern multiple agent-to-tool connections, reducing the attack surface of over-privileged agents.
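
A minimal sketch of the idea: a gateway maps a declared intent to a small tool set and issues a credential that expires quickly. This is not an MCP or Snowflake API; the class, function names, policy shape, tool names, and the 5-minute default are all assumptions for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskCredential:
    agent: str
    allowed_tools: frozenset
    expires_at: float  # Unix timestamp


def issue_credential(agent: str, intent: str, policy: dict,
                     ttl_seconds: float = 300, now: Optional[float] = None) -> TaskCredential:
    """Grant only the tools the central policy lists for this intent, for a short time."""
    now = time.time() if now is None else now
    tools = frozenset(policy.get(intent, ()))  # unknown intent: no tools at all
    return TaskCredential(agent, tools, now + ttl_seconds)


def authorize(cred: TaskCredential, tool: str, now: Optional[float] = None) -> bool:
    """Gateway check applied to every agent-to-tool call."""
    now = time.time() if now is None else now
    return now < cred.expires_at and tool in cred.allowed_tools


if __name__ == "__main__":
    # Hypothetical central policy: intent -> tools that intent may use.
    policy = {"summarize_tickets": {"tickets.read"}}
    cred = issue_credential("support-bot", "summarize_tickets", policy)
    print(authorize(cred, "tickets.read"))    # allowed while the credential is fresh
    print(authorize(cred, "tickets.delete"))  # denied: outside the task's intent
```

Because the policy lives in one place, tightening a rule changes what every connected agent can do without touching the agents themselves.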

SOURCES

[1]

