1. Harness-1 20B Retrieval Subagent Released with Stateful Search Harness
Harness-1 introduces a stateful cognitive offloading architecture for retrieval agents. By separating the policy's semantic search decisions from the harness's bookkeeping tasks, the agent can efficiently manage document pools and evidence graphs. The model was trained using supervised fine-tuning on GPT-5.4 trajectories followed by on-policy CISPO reinforcement learning on SEC queries, resulting in state-of-the-art open-weights retrieval performance.
- Harness-1 is a 20B retrieval subagent built on the gpt-oss-20b model by researchers from UIUC, UC Berkeley, and Chroma.
- The agent separates semantic search decisions (handled by the policy) from routine bookkeeping (managed by a stateful harness).
- The stateful harness maintains a candidate pool of up to 30 documents, an evidence graph using regex extraction, and a full-text store.
- The policy utilizes eight specific tools including fan_out_search, search_corpus, grep_corpus, and read_document.
- Harness-1 achieved an average curated recall of 0.730 across eight benchmarks, outperforming Tongyi DeepResearch 30B by 11.4 points.
- Model weights and harness code are publicly available on Hugging Face and GitHub.
It provides developers with an open-weights agentic model specifically optimized for complex document search and retrieval, outperforming existing open alternatives.
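The bookkeeping side of that split can be sketched in a few lines. This is a minimal illustration, not the released harness code: the class name `Harness`, the `add` method, the score-based eviction rule, and the entity regex are all assumptions made for the example. Only the 30-document cap, the regex-based evidence extraction, and the full-text store come from the summary above.

```python
import re
from dataclasses import dataclass, field

MAX_POOL = 30  # candidate pool cap reported in the summary

# Hypothetical extraction pattern: dollar amounts and four-digit years.
ENTITY_RE = re.compile(r"\$\d[\d,]*(?:\.\d+)?|\b(?:19|20)\d{2}\b")


@dataclass
class Harness:
    pool: dict = field(default_factory=dict)      # doc_id -> best score seen
    store: dict = field(default_factory=dict)     # doc_id -> full text
    evidence: dict = field(default_factory=dict)  # entity -> set of doc_ids

    def add(self, doc_id, text, score):
        """Record a search hit; the policy never has to track this state itself."""
        self.store[doc_id] = text
        self.pool[doc_id] = max(score, self.pool.get(doc_id, float("-inf")))
        for entity in ENTITY_RE.findall(text):
            self.evidence.setdefault(entity, set()).add(doc_id)
        # Assumed eviction rule: drop the lowest-scoring candidates over the cap.
        while len(self.pool) > MAX_POOL:
            worst = min(self.pool, key=self.pool.get)
            del self.pool[worst]
```

In this sketch the full-text store and evidence links outlive pool eviction, so a document that drops out of the top 30 can still be found through an entity it mentions. Whether the real harness behaves that way is not stated in the summary.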
2. Silurus Releases Browser-Based OOXML Viewer with Agent-Ready MCP Server
The @silurus/ooxml library provides a pixel-faithful rendering engine for Office Open XML documents, using Rust-based parsers compiled to WebAssembly and the Canvas 2D API. The codebase was written entirely by Claude, and the library ships with a dedicated MCP server that lets developers feed parsed document structures directly to LLM agents.
- The @silurus/ooxml library renders DOCX, XLSX, and PPTX files directly to an HTML Canvas element in the browser.
- The entire codebase, including Rust parsers and TypeScript renderers, was implemented by Anthropic's Claude AI assistant.
- The project includes a Rust-based Model Context Protocol (MCP) server to allow AI agents to parse and read Office documents.
- Security features include a default 512 MiB limit on uncompressed ZIP entries to prevent zip-bomb attacks and XXE safety via roxmltree.
- The library is fully open-source under the MIT license and does not perform network requests by default.
It enables developers to build secure, client-side Office document rendering and easily expose document contents to AI agents via a pre-built MCP server.
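The zip-bomb guard mentioned above is a general technique that can be illustrated with Python's standard `zipfile` module. This is not the library's Rust implementation. The function name `read_entry` and the double check (declared size, then actual bytes read) are assumptions made for the sketch; only the 512 MiB default comes from the summary.

```python
import io
import zipfile

MAX_ENTRY_BYTES = 512 * 1024 * 1024  # 512 MiB default described above


def read_entry(archive, name, limit=MAX_ENTRY_BYTES):
    """Read one ZIP entry, refusing anything that inflates past `limit` bytes."""
    info = archive.getinfo(name)
    # First check: the uncompressed size declared in the ZIP header.
    if info.file_size > limit:
        raise ValueError(f"{name}: declared size {info.file_size} exceeds {limit}")
    # Second check: headers can lie, so cap the bytes actually decompressed.
    with archive.open(info) as fh:
        data = fh.read(limit + 1)
    if len(data) > limit:
        raise ValueError(f"{name}: decompressed data exceeds {limit} bytes")
    return data


# Usage: OOXML files are ZIP archives, so a DOCX part is read like any entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("word/document.xml", b"<w:document/>")
with zipfile.ZipFile(buf) as z:
    xml_bytes = read_entry(z, "word/document.xml")
```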
3. Open-Source 'Automated Doubt' Pipeline Audits LLM Code with Subagents
To address the reliability problems of LLM-generated code, the 'automated doubt' development process introduces a structured, three-phase auditing pipeline. Instead of using subagents to write code, the workflow relies on a single Claude Code terminal instance for development and deploys specialized validator agents to aggressively audit the design, implementation, and API contracts before shipping.
- The 'automated doubt' process uses specialized subagents to audit code, specifications, and documentation across three phases.
- Phase 1 (Design) uses agents like the Pre-Implementation Architect, Documentation Validator, and Assumption Excavator.
- Phase 2 (Development) employs a Code Validator, Type Safety Validator, and Security Analyst to audit code quality.
- Phase 3 (Ship) uses an API Contract Validator and a Release Readiness Validator as the final checks before release.
- The author recommends the Assumption Excavator as a universally applicable agent and has made the pipelines available on GitHub.
It provides a concrete, multi-agent auditing pattern that developers can adopt to mitigate trust and reliability issues with AI-generated code.
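One way to picture the three phases is as an ordered table of validators. The sketch below is an assumption-heavy illustration rather than the author's pipeline: the validator names come from the summary, but `run_pipeline`, the `audit` callback (a stand-in for dispatching to a subagent), and the rule of stopping at the first phase that reports findings are all hypothetical.

```python
# Phase -> validator names, taken from the summary above.
PHASES = {
    "design": [
        "Pre-Implementation Architect",
        "Documentation Validator",
        "Assumption Excavator",
    ],
    "development": [
        "Code Validator",
        "Type Safety Validator",
        "Security Analyst",
    ],
    "ship": [
        "API Contract Validator",
        "Release Readiness Validator",
    ],
}


def run_pipeline(artifact, audit):
    """Run phases in order and stop at the first one that reports findings.

    `audit(validator_name, artifact)` is a hypothetical callback that returns
    a list of finding strings (empty when the validator has no objections).
    """
    for phase, validators in PHASES.items():
        findings = [(v, f) for v in validators for f in audit(v, artifact)]
        if findings:
            return phase, findings
    return None, []
```

Gating each phase on the previous one keeps later validators from auditing work that already has unresolved design or code findings.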
4. Nightwatch Launches Open-Source, Local-First AI SRE Agent
Nightwatch provides a secure, read-only AI SRE agent designed to investigate live systems and form root-cause hypotheses for on-call engineers. By keeping credentials local and masking sensitive data like secrets and IP addresses before making remote LLM calls, the tool ensures production security while leveraging tool-calling models to automate incident triage.
- Nightwatch is a local-first, read-only monitoring layer that groups alert storms into incidents and identifies noisy checks.
- The architecture uses 'baby owl' agents residing in local environments that make outbound connections to a central brain.
- The system operates without requiring inbound access into production environments.
- For remote LLM calls, Nightwatch masks sensitive data (secrets, IPs, hostnames, paths) with reversible placeholders.
- Clustering and recommendation features function fully offline without the use of LLMs.
It offers developers an agentic, privacy-preserving SRE tool that can troubleshoot production systems without requiring inbound access or exposing raw credentials.
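Reversible masking of the kind described above can be sketched as a substitution that remembers its own mapping. This is an illustration of the general idea, not Nightwatch's code: the function names, the `<IP_n>` placeholder format, and the simplified IPv4 pattern are assumptions, and a real tool would also need patterns for secrets, hostnames, and paths.

```python
import re

# Simplified IPv4 pattern for the example; it does not validate octet ranges.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mask(text, pattern=IPV4_RE, label="IP"):
    """Replace each distinct match with a stable placeholder such as <IP_1>."""
    seen = {}     # original value -> placeholder
    mapping = {}  # placeholder -> original value, kept locally for unmasking

    def repl(match):
        value = match.group(0)
        if value not in seen:
            token = f"<{label}_{len(seen) + 1}>"
            seen[value] = token
            mapping[token] = value
        return seen[value]

    return pattern.sub(repl, text), mapping


def unmask(text, mapping):
    """Restore original values in text that comes back from the remote model."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text
```

Because the same address always maps to the same placeholder, the remote model can still reason about which hosts recur in a log while the mapping itself stays on the local machine.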
5. GEPA Framework Automates Multi-Component Prompt Optimization
The GEPA framework automates the tedious process of prompt engineering by treating prompt optimization as an evolutionary loop. By pairing a task model with a reflection model, GEPA evaluates performance against a training set, generates structured feedback on reasoning and formatting failures, and refines the prompt components to ensure generalization to a held-out validation set.
- GEPA is a reflective prompt-evolution framework that simultaneously evolves instruction fields and output-format rules.
- The optimization process utilizes a weak seed prompt, a deterministic benchmark dataset, a structured evaluator, and a reflection model.
- The framework uses gpt-4o-mini as the task model and gpt-4.1 as the reflection model.
- The evaluator scores outputs based on correctness and strict adherence to formatting rules.
- GEPA provides structured feedback to the reflection model to identify failures related to reasoning, formatting, or both.
It gives developers a systematic, programmatic method to evolve and validate complex prompts on deterministic datasets rather than relying on manual trial-and-error.
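A structured evaluator of the kind described above might look like the following sketch. It does not use the GEPA library's API. The `ANSWER: ...` format rule, the equal 0.5/0.5 weighting, and the `evaluate` function are assumptions chosen to show how one score plus one failure label can tell the reflection model whether a run failed on reasoning, formatting, or both.

```python
import re

# Hypothetical output-format rule: a single line of the form "ANSWER: <value>".
FORMAT_RE = re.compile(r"^ANSWER: (.+)$")


def evaluate(output, expected):
    """Score one task-model output on correctness and format adherence."""
    text = output.strip()
    match = FORMAT_RE.match(text)
    format_ok = match is not None
    # Without the required format, fall back to comparing the raw output.
    answer = match.group(1).strip() if match else text
    correct = answer == expected

    if correct and format_ok:
        failure = "none"
    elif correct:
        failure = "formatting"
    elif format_ok:
        failure = "reasoning"
    else:
        failure = "both"

    # Assumed weighting: half credit for each criterion.
    score = 0.5 * correct + 0.5 * format_ok
    return {"score": score, "failure": failure}
```

Averaging `score` over the training set gives the number to optimize, while the `failure` labels are the structured feedback a reflection model could use to decide whether to revise the instruction field or the output-format rule.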
6. Dockerized Nemotron 3.5 ASR Achieves 4.5x Realtime Speed on CPU
Transitioning from Parakeet to Nemotron 3.5 ASR enables native streaming speech recognition without the latency of buffering entire audio files. The newly shared Docker container and API examples make it easy for developers to deploy this multilingual model on standard CPU hardware using the onnxruntime-genai backend.
- Nemotron 3.5 ASR has been packaged into a Docker container with example files for API calls.
- The model supports over 40 locales within a single model, offering improved multilingual support over Parakeet.
- It utilizes a native streaming architecture that eliminates the need to buffer entire audio files.
- Testing on CPU using the onnxruntime-genai backend achieved approximately 4.5x realtime speed.
It provides a self-hostable speech-to-text pipeline that covers more than 40 locales and runs at roughly 4.5x realtime on standard CPU hardware, with no GPU required.
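To make the 4.5x figure concrete: a realtime speed multiple is audio duration divided by processing time. The helper names below are made up for illustration, and the numbers in the usage note are worked arithmetic, not additional measurements.

```python
def realtime_speed(audio_seconds, processing_seconds):
    """Return how many seconds of audio are transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds, speed=4.5):
    """Estimate compute time for a clip at a given realtime speed multiple."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return audio_seconds / speed
```

At 4.5x realtime, a 60-second clip takes about 13.3 seconds of compute (60 / 4.5), and a one-hour recording takes 800 seconds, a little over 13 minutes.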
7. NVIDIA Details Defensive LLM Red-Teaming Workflow with garak
NVIDIA's tutorial on the garak framework outlines a structured approach to LLM security. Developers can inspect the garak plugin ecosystem to discover available probes, detectors, and generators, run scans against their model endpoints via a REST configuration template, and analyze the resulting safety scores to harden their applications against prompt injection and other vulnerabilities.
- NVIDIA garak is a framework designed for defensive LLM red-teaming.
- The workflow covers plugin discovery, dry runs, real-model scans, multi-probe evaluations, and custom probe/detector creation.
- Reports produced by garak can be analyzed using pandas and NumPy to calculate safety scores and attack success rates.
- The framework supports exporting vulnerability reports in the structured AVID format.
- A REST configuration template is provided to connect garak to external model endpoints.
It helps developers systematically scan their LLM integrations for vulnerabilities, calculate safety scores, and export structured reports before shipping.
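The report-analysis step can be sketched as follows. The tutorial uses pandas and NumPy; this version swaps in the standard library to stay dependency-free. The record layout is an assumption about the JSONL report (lines with `entry_type` equal to `"eval"` carrying `probe`, `passed`, and `total` fields) and should be checked against the report produced by the installed garak version.

```python
import json
from collections import defaultdict


def attack_success_rates(jsonl_lines):
    """Aggregate eval records per probe and return the fraction of failed checks.

    Assumed schema: {"entry_type": "eval", "probe": str, "passed": int, "total": int}
    """
    totals = defaultdict(lambda: [0, 0])  # probe -> [passed, total]
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("entry_type") != "eval":
            continue  # skip setup, attempt, and other non-eval records
        totals[record["probe"]][0] += record["passed"]
        totals[record["probe"]][1] += record["total"]
    return {
        probe: (total - passed) / total
        for probe, (passed, total) in totals.items()
        if total  # avoid dividing by zero for probes with no attempts
    }
```

A safety score in this framing is simply one minus the attack success rate, so a probe with 8 of 10 checks passed has an attack success rate of 0.2 and a safety score of 0.8.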