1. Anthropic Suspends Claude Fable 5 and Mythos 5 Globally Following US Export Control Order
Following an emergency export control directive from the US government, Anthropic has taken its Claude Fable 5 and Claude Mythos 5 models offline globally. The directive, which limits access to US citizens, was triggered by cybersecurity concerns and reports of a jailbreak that bypassed safety guardrails. Because the restrictions bar foreign nationals, including Anthropic's own international researchers, from accessing the systems, the company disabled the models entirely for all customers. Anthropic has disputed the severity of the jailbreak, arguing that similar capabilities exist in other frontier models such as OpenAI's GPT-5.5, and is working with the administration to resolve the issue.
- The US Commerce Department issued an export control directive barring all foreign nationals from accessing Claude Fable 5 and Mythos 5.
- Anthropic disabled global access to both models for all users, including enterprise customers and its own internal staff, to ensure immediate compliance.
- The directive was issued just three days after the public release of Fable 5 and Mythos 5.
- The government's action was reportedly triggered by a jailbreak method that bypassed safety guardrails for cybersecurity, chemistry, and biology prompts.
- Active sessions for the affected models now return errors, and API requests are being automatically routed to older models such as Opus 4.8.
Developers using or planning to integrate Claude Fable 5 or Mythos 5 must immediately migrate to other models as global access has been completely suspended.
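
Because API requests are reportedly being routed to older models automatically, an application pinned to one of the suspended models may want to choose its fallback explicitly and notice when the served model differs from the requested one. The following is a minimal sketch only: the model ID strings are placeholders invented for illustration (not real API identifiers), and it assumes the provider's response reports which model actually served the request.

```python
# Placeholder IDs for illustration only; substitute the identifiers your provider uses.
SUSPENDED_MODELS = {"fable-5-placeholder", "mythos-5-placeholder"}


def pick_model(preferences: list[str]) -> str:
    """Return the first preferred model that is not on the suspended list."""
    for model_id in preferences:
        if model_id not in SUSPENDED_MODELS:
            return model_id
    raise LookupError("every preferred model is currently suspended")


def was_rerouted(requested: str, served: str) -> bool:
    """True when the model that answered is not the one that was requested.

    Assumes the API response exposes the serving model's ID.
    """
    return requested != served


preferences = ["fable-5-placeholder", "older-model-placeholder"]
chosen = pick_model(preferences)
print(chosen)
print(was_rerouted("fable-5-placeholder", chosen))
```

Logging a warning whenever `was_rerouted` returns true makes a silent downgrade visible in evaluation results and cost reports.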
2. GLM 5.2 Released with 1M Context Window and Upcoming Open-Weight MIT License
Zhipu AI has announced GLM 5.2, a new model featuring a 1-million-token context window and thinking modes designed for complex coding tasks. The model is currently available via API, with an open-weight version scheduled for release next week under the permissive MIT license. Early developer testing shows strong performance, with the model generating a nearly functional Pac-Man clone in a single shot. At roughly 70 tokens per second it is slightly slower than GLM 5.1, but its reasoning capabilities and planned MIT-licensed open weights make it a strong candidate for local deployment.
- GLM 5.2 features a 1-million-token context window and is currently deployed in the GLM Coding Plan.
- The model will be released as an open-weight model under the permissive MIT license next week.
- It introduces two thinking modes, 'max' and 'high', with 'max' recommended for complex coding tasks.
- In early developer testing, GLM 5.2 successfully generated a nearly functional Pac-Man clone in a single shot.
- The model operates at approximately 70 tokens per second, making it slightly slower than its predecessor, GLM 5.1.
Developers get API access now to a model with a 1M context window and strong coding capabilities, and will be able to self-host it under the MIT license once the open weights are released.
3. Open-Source LLMOps Platform TensorZero Archived Overnight Following $7.3M Seed Round
TensorZero, an open-source, self-hosted LLMOps gateway built in Rust, archived its GitHub repository overnight. The move came immediately after the company announced a $7.3 million seed funding round. TensorZero is widely used for gateway routing, observability, and prompt optimization; it supports major API providers and reportedly accounts for roughly 1% of global LLM API spend. While the company offers a paid complementary product called TensorZero Autopilot, the sudden archiving of the core repository leaves self-hosted deployments without an actively maintained open-source upstream.
- TensorZero archived its open-source repository overnight following a $7.3 million seed funding announcement.
- The platform is a self-hosted LLMOps gateway built in Rust, achieving sub-1ms p99 latency overhead.
- TensorZero supports major LLM providers including OpenAI, Anthropic, AWS Bedrock, and Google Vertex AI.
- The platform reportedly handles approximately 1% of global LLM API spend.
- The company also offers TensorZero Autopilot, a paid automated AI engineer that optimizes prompts and models.
Developers relying on the open-source TensorZero gateway for LLMOps need to be aware that the repository has been abruptly archived following its seed funding round.
4. Pi-Setup Offers an Open-Source, Local Alternative to Claude Code
The open-source Pi-Setup project has emerged as a highly customizable, local-first alternative to Claude Code. Designed to run local models like Qwen 3.6 27B, the terminal interface integrates an advisor extension (typically configured with GPT-5.5) and provides a custom footer that tracks token usage, cost, and inference speed in real time. It also features a context breakdown command, a configurable permission system, custom skills, and a sync script for multi-environment setups.
- Pi-Setup is an open-source terminal interface designed to run local models like Qwen 3.6 27B.
- The setup features a custom footer displaying real-time token usage, cost, and inference speed.
- It includes a context breakdown command similar to the one built into Claude Code.
- The system provides a configurable permission system, support for custom skills, and 10 built-in themes.
- A sync and backup script is included to facilitate deployment across multiple development environments.
Developers looking for an alternative to Claude Code can use this open-source terminal setup to run local models with token tracking, custom extensions, and permission controls.
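
The footer's three figures (token usage, cost, and inference speed) can all be derived from per-turn token counts and elapsed time. The sketch below is illustrative only and is not taken from the Pi-Setup source: the `TurnStats` type, the `footer_line` function, and the per-million-token price parameter are all hypothetical names and assumptions.

```python
from dataclasses import dataclass


@dataclass
class TurnStats:
    prompt_tokens: int
    completion_tokens: int
    elapsed_seconds: float  # wall-clock time spent generating the completion


def footer_line(stats: TurnStats, usd_per_million_tokens: float = 0.0) -> str:
    """Format a one-line status footer from a single turn's statistics."""
    total_tokens = stats.prompt_tokens + stats.completion_tokens
    # A local model would typically use a price of 0; a hosted advisor model would not.
    cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    # Speed counts generated tokens only; guard against a zero-length interval.
    speed = (
        stats.completion_tokens / stats.elapsed_seconds
        if stats.elapsed_seconds > 0
        else 0.0
    )
    return f"tokens: {total_tokens} | cost: ${cost_usd:.4f} | speed: {speed:.1f} tok/s"


print(footer_line(TurnStats(1200, 800, 10.0), usd_per_million_tokens=2.0))
# tokens: 2000 | cost: $0.0040 | speed: 80.0 tok/s
```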
5. Dual-GPU Setup Achieves 80+ Tokens/Sec on Qwen 3.6 27B Using Speculative Decoding
A developer has detailed a hardware and software configuration that achieves 80 to 90+ tokens per second running the Qwen 3.6 27B Q8 model locally. The setup pairs an NVIDIA RTX 5080 with a refurbished RTX 3090 on an Asus Prime X570-Pro motherboard, splitting the PCIe lanes across two x8 slots. It uses llama.cpp compiled with support for both Ampere and Blackwell architectures, with speculative decoding enabled and the workload distributed across both GPUs to maximize local inference performance.
- The hardware configuration pairs an NVIDIA RTX 5080 with a refurbished RTX 3090 on an Asus Prime X570-Pro motherboard.
- The setup achieves 80 to 90+ tokens per second running the Qwen 3.6 27B Q8 model.
- Speculative decoding is enabled via llama.cpp compiled with support for both Ampere and Blackwell architectures.
- The required BIOS adjustments are disabling CSM, enabling Above 4G Decoding, enabling Resizable BAR, and setting PCIe link modes to Gen 4.
- The llama-server configuration uses the '-ts 2,3' flag to distribute the workload across the two GPUs.
Developers running local models can configure a mixed-generation dual-GPU setup to achieve high-speed inference on 27B models using speculative decoding.
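
To see what the '-ts 2,3' flag implies, note that llama.cpp's tensor-split option takes a comma-separated list of proportions, one per GPU. The sketch below converts those proportions into per-GPU shares and compares them with the cards' memory sizes (16 GB on the RTX 5080, 24 GB on the RTX 3090). The device order, with the RTX 5080 listed first, is an assumption the source does not confirm, and `split_shares` is a hypothetical helper name.

```python
from fractions import Fraction


def split_shares(tensor_split: str) -> list[Fraction]:
    """Turn a proportion list such as "2,3" into each GPU's share of the model."""
    weights = [Fraction(part) for part in tensor_split.split(",")]
    total = sum(weights)
    return [weight / total for weight in weights]


shares = split_shares("2,3")
print([float(share) for share in shares])  # [0.4, 0.6]

# Assumed device order: RTX 5080 (16 GB) first, RTX 3090 (24 GB) second.
vram_gb = [16, 24]
vram_shares = [Fraction(size, sum(vram_gb)) for size in vram_gb]
print(shares == vram_shares)  # True
```

Under that assumed ordering, the 2:3 split is exactly proportional to the two cards' VRAM, so each GPU holds a share of the model matching its capacity.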