Build Complete LLM Observability Pipelines with Langfuse

1. Build Complete LLM Observability Pipelines with Langfuse

Langfuse provides an open-source LLM engineering platform designed to handle tracing, prompt management, and automated evaluation. This comprehensive pipeline supports dataset-based experiments with custom item-level and aggregate evaluators, helping developers iterate on their applications with confidence. By using either the LangChain callback handler or native decorator-based tracing, developers can track session metadata and scoring metrics seamlessly in production.

• Supports both decorator-based tracing and manual instrumentation for RAG pipelines.
• Enables central management of prompts and numeric, categorical, and boolean evaluation scores.
• Includes a dedicated callback handler for easy integration with LangChain.
• Supports propagating metadata like user IDs, session IDs, and tags across LLM traces.
• Compatible with both real OpenAI API keys and deterministic mock LLMs.

Allows developers to easily implement robust telemetry, run dataset-based experiments, and centrally manage prompts using either live APIs or mock LLMs.

SOURCES

[1]

2. Microsoft Releases Webwright, a Terminal-Native Web Agent Framework

Microsoft Research has open-sourced Webwright, a highly efficient, terminal-native framework for web agents. Instead of predicting step-by-step UI actions, agents built with Webwright write and run Playwright code and bash commands in a terminal environment. The framework features dynamic history compaction to handle long sequences and enforces a mandatory validation cycle to ensure task completion before exiting.

• Achieved 86.7% on Online-Mind2Web and 60.1% on Odysseys with GPT-5.4.
• Consists of three core components: Runner, Model Endpoint, and terminal Environment under 1,000 lines.
• Compacts prompt history every 20 steps to mitigate context-length limitations.
• Prevents premature completion by requiring self-reflection and validation processes.
• Allows smaller models like Qwen3.5-9B to hit 66.2% accuracy when using pre-built scripts.
• Scripts are reusable and compatible with tools like Claude Code, Codex, and OpenClaw.

Improves web agent reliability and avoids context limits by replacing basic step prediction with full Playwright code execution and automated history compaction.

SOURCES

[1]

3. StepFun Releases StepAudio 2.5 Realtime End-to-End Voice Model

Shanghai-based StepFun has launched StepAudio 2.5 Realtime, a speech model that natively bypasses separate STT and TTS steps by processing raw audio-to-audio. Trained using algorithmic augmentation on over 10,000 seed personas, the model exhibits robust persona consistency and can analyze acoustic nuances to read user mood and intent. Developers can easily hook their apps into this low-latency voice capabilities using standard WebSocket streams.

• Accessible via WebSocket at wss://api.stepfun.com/v1/realtime with model identifier step-2.5-realtime.
• Functions as a unified system processing direct audio input to direct audio output.
• Supports both English and Chinese languages.
• Trained using roleplay-specific RLHF to maintain persona consistency across dialogues.
• Capable of paralinguistic perception, interpreting tone, speed, and laughter.
• Achieved a subjective human evaluation score of 80.41 in April 2026 benchmarks.

Enables low-latency, native audio-to-audio streaming interfaces with advanced paralinguistic perception for voice-driven AI applications.

SOURCES

[1]

4. hipEngine Delivers Fast ROCm-Native Inference on AMD RDNA3

hipEngine is a new open-source, ROCm-native local inference engine designed specifically for AMD's RDNA3 hardware. By bypassing heavy PyTorch dependencies and utilizing native libraries like hipGraph and AOTriton, hipEngine achieves high-efficiency execution. Its native INT8 KVCache optimization unlocks ultra-long context capabilities, making it a viable alternative to llama.cpp for local development pipelines.

• Built natively using Python and HIP/C++ with AMD libraries hipBLASLt, hipGraph, and AOTriton.
• Supports ParoQuant and GGUF model formats, including Q4_K_M and Q4_K_S variants.
• Includes near-lossless INT8 KVCache, letting Qwen 3.6 run with a 256K context in under 24GB of memory.
• Performs competitively with llama.cpp on gfx1100 hardware benchmarks.
• Includes KERNELS.md, ROOFLINE.md, and LESSONS-LEARNED.md documentation.
• Kernel optimizations were generated using AI-assisted development tools.

Allows developers using AMD consumer hardware, like Strix Halo or 7900 XTX, to run massive context models locally without heavy PyTorch dependencies.

SOURCES

[1]

5. Uncensored Genesis Qwen 3.6 35B Local Quantized Formats Released

The newly released uncensored variant of Qwen 3.6 35B offers high-context stability for local deployments. Testing shows that under optimal settings, the model maintains reliable behavior across massive 200k token sessions. To prevent performance degradation, developers must initialize the model with its specific Alibaba Cloud system prompt and adhere to the recommended sampler parameters.

• Available in GGUF, FP8 Safetensors, and FP8 MTP-Safetensors.
• Tested successfully on Strix Halo hardware using Q8_K_P MTP quantization with no loops or glitches up to 200k context.
• Retains task-switching stability past 120k tokens in benchmark runs.
• Supports APEX, APEX Compact quantization, MTP, and MLX conversions.
• Requires a specific system prompt beginning with 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' to function optimally.
• Recommended settings include 0.7 temperature, 20 Top K, 1.5 Presence Penalty, and 1.0 Repeat Penalty.

Provides developers with a highly stable uncensored model capable of long-context tasks without repeating loops when configured correctly.

SOURCES

[1]

6. IBM Releases granite-docling-2stage-258m for Robust Document Parsing

IBM has updated its open OCR and document-parsing line with granite-docling-2stage-258m. The model enhances layout detection by dynamically precomputing page structures within its prompt, making it more resilient when parsing atypical PDF layouts and complex document geometries.

• An evolutionary update to the existing Granite Docling parsing architecture.
• Introduces a dynamic prompt that precomputes layout objects on a given page.
• Specifically designed to handle out-of-distribution document layouts robustly.

Improves OCR and document structural understanding when working with out-of-distribution layouts.

SOURCES

[1]

1. Build Complete LLM Observability Pipelines with Langfuse

2. Microsoft Releases Webwright, a Terminal-Native Web Agent Framework

3. StepFun Releases StepAudio 2.5 Realtime End-to-End Voice Model

4. hipEngine Delivers Fast ROCm-Native Inference on AMD RDNA3

5. Uncensored Genesis Qwen 3.6 35B Local Quantized Formats Released

6. IBM Releases granite-docling-2stage-258m for Robust Document Parsing

Inference Brew in your inbox