MLflow DAIS 2026

MLflow 3.0: Unified AI Experimentation, Observability & Governance

MLflow 3.0 solves the three core obstacles of GenAI development: production observability, quality measurement at scale, and continuous improvement with structured feedback loops.

What's new

  • 30M+ monthly downloads; version 3.0 available with no migration for existing users
  • Production traces for 20+ GenAI libraries built on OpenTelemetry
  • LLM judges for systematic evaluation of safety, groundedness, and relevance
  • Review App for expert annotation without code
  • Prompt Registry with Git-style version control and visual diffs

By the numbers

  • 30M+ monthly downloads
  • 3.0: current version
  • GenAI: native support
[Figure: MLflow 3.0 experiment tracking. Metric: Accuracy vs Epoch for run-001, run-002, and run-003.]
Run      Accuracy  F1 Score  Model             Status
run-001  0.891     0.883     llama-3-8b        FINISHED
run-002  0.923     0.918     databricks-agent  FINISHED (BEST)
run-003  0.874     0.867     gpt-4o-mini       FINISHED
MLflow Judge: Automatic LLM Evaluation

  • Relevance: 4.2/5
  • Accuracy: 4.7/5
  • Coherence: 4.5/5
Full analysis

Overview

Databricks announced MLflow 3.0, a major platform evolution extending MLOps capabilities to generative AI while maintaining support for traditional ML and deep learning. With over 30 million monthly downloads, MLflow is the world’s most widely used ML experiment tracking platform.

Version 3.0 consolidates into one platform what was previously fragmented across multiple tools: production observability, quality evaluation, and continuous improvement cycles for GenAI applications.

The Three GenAI Obstacles MLflow 3.0 Solves

1. Production-Scale Observability

MLflow 3.0 captures detailed traces from 20+ GenAI libraries and custom business logic. The mlflow-tracing package is optimized for production performance and built on OpenTelemetry for enterprise-grade observability.

Timeline visualization reveals performance bottlenecks and inefficiencies. In the blog’s case study, traces revealed that a retail chatbot made sequential warehouse inventory checks and retrieved excessive order history, each causing 5+ second delays. Parallelizing warehouse checks and filtering recent orders reduced response time by over 50%.
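
The two fixes from the case study can be sketched with the standard library alone. This is a minimal illustration, not MLflow code and not the retailer's implementation: check_warehouse, the simulated latency, the order record shape, and the 90-day window are all assumptions made for the example.

```python
import asyncio
import time
from datetime import datetime, timedelta


async def check_warehouse(warehouse_id: str, sku: str, delay: float = 0.05) -> dict:
    # Hypothetical inventory lookup; the sleep stands in for network latency.
    await asyncio.sleep(delay)
    return {"warehouse": warehouse_id, "sku": sku, "in_stock": True}


async def check_sequential(warehouses, sku, delay=0.05):
    # One lookup at a time: total latency is roughly the sum of all delays.
    return [await check_warehouse(w, sku, delay) for w in warehouses]


async def check_parallel(warehouses, sku, delay=0.05):
    # All lookups in flight together: total latency is roughly the slowest one.
    results = await asyncio.gather(
        *(check_warehouse(w, sku, delay) for w in warehouses)
    )
    return list(results)


def recent_orders(orders, now: datetime, max_age_days: int = 90):
    # Keep only orders newer than the cutoff (the 90-day default is an assumption).
    cutoff = now - timedelta(days=max_age_days)
    return [o for o in orders if o["placed_at"] >= cutoff]


def timed(coro):
    # Run a coroutine to completion and return (result, elapsed_seconds).
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

Calling timed(check_sequential(...)) and timed(check_parallel(...)) on the same warehouse list returns identical results, with the parallel version finishing in about the time of a single lookup. That gap is what a trace timeline makes visible.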

Traces link to the exact prompts, code, and application versions, eliminating ambiguity about what change caused what behavior.

2. LLM Judges for Quality Evaluation

Research-backed MLflow evaluators measure GenAI output quality systematically: safety, groundedness, retrieval relevance, and response quality. They provide detailed rationales for identified issues.

Custom judges can incorporate business-specific guidelines. In the case study, the judges scored retrieval relevance at only 65% and pinpointed it as the root cause of poor recommendations, even though safety and groundedness scores were good. Without systematic evaluation, this issue would have stayed invisible.
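
To show how per-metric judge verdicts surface a weak dimension, here is a small sketch that aggregates pass/fail verdicts into pass rates. The Verdict type and both helper functions are hypothetical names for this example; they are not MLflow's judge API, and no real judge model is called.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Verdict:
    metric: str      # e.g. "safety", "groundedness", "retrieval_relevance"
    passed: bool     # the judge's yes/no decision for one response
    rationale: str   # the judge's explanation, kept for later review


def pass_rates(verdicts):
    # Fraction of passing verdicts per metric.
    totals = defaultdict(int)
    passes = defaultdict(int)
    for v in verdicts:
        totals[v.metric] += 1
        passes[v.metric] += int(v.passed)
    return {metric: passes[metric] / totals[metric] for metric in totals}


def weakest_metric(rates):
    # The metric with the lowest pass rate is the first place to investigate.
    return min(rates, key=rates.get)
```

With verdicts collected across many responses, weakest_metric(pass_rates(verdicts)) points at the dimension dragging quality down, even when the other metrics look healthy.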

3. Integrated Feedback Collection

The Review App is a web interface for expert annotation of AI outputs without coding requirements. End-user and expert insights feed directly into evaluation and observability stacks, transforming production data into training sets for continuous improvement.

In the case study, product specialists used the Review App to annotate which products matched customer requirements, creating training data to improve retrieval from 65% to 91% relevance.
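
As a rough illustration of how expert annotations become reusable labels, the sketch below groups annotations by query and scores a retrieval result against them. The annotation fields (query, product_id, matches) and both function names are assumptions for this example, not the Review App's export format.

```python
from collections import defaultdict


def build_relevance_labels(annotations):
    # Group annotations into {query: {"relevant": [...], "irrelevant": [...]}}.
    labels = defaultdict(lambda: {"relevant": [], "irrelevant": []})
    for a in annotations:
        bucket = "relevant" if a["matches"] else "irrelevant"
        labels[a["query"]][bucket].append(a["product_id"])
    return dict(labels)


def retrieval_relevance(retrieved, query_labels):
    # Fraction of retrieved products that experts marked as relevant.
    if not retrieved:
        return 0.0
    relevant = set(query_labels["relevant"])
    return sum(1 for p in retrieved if p in relevant) / len(retrieved)
```

The same labels can serve two purposes: as ground truth for measuring retrieval relevance over time, and as training data for improving the retriever.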

Core Features

Application Version Tracking: Captures complete application snapshots including code, prompts, LLM parameters, retrieval logic, and algorithms. Connects traces and metrics to specific versions with instant rollback capability.
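
One way to picture linking traces to an application snapshot is a content hash over the configuration. This is a conceptual sketch using only the standard library; version_id, tag_trace, and the snapshot fields are hypothetical and do not reflect how MLflow stores versions internally.

```python
import hashlib
import json


def version_id(snapshot: dict) -> str:
    # Canonical JSON (sorted keys) so the same settings always hash the same.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def tag_trace(trace: dict, snapshot: dict) -> dict:
    # Return a copy of the trace record annotated with the version that produced it.
    return {**trace, "app_version": version_id(snapshot)}
```

Because the identifier changes whenever the prompt, LLM parameters, or retrieval settings change, any trace or metric tagged with it can be traced back to one exact configuration.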

Prompt Registry: Git-style version control specifically for prompt management. Visual diffs between versions highlight changes. Integration with DSPy optimizers for automatic prompt improvement.
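
The idea of a diff between prompt versions can be illustrated with Python's difflib. This sketch produces a text diff, not the registry's visual one, and prompt_diff and its version labels are names invented for the example.

```python
import difflib


def prompt_diff(old: str, new: str, old_label: str = "v1", new_label: str = "v2") -> str:
    # Unified diff of two prompt versions, line by line.
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(lines)
```

Removed lines are prefixed with "-" and added lines with "+", and identical prompts yield an empty string.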

Deployment Jobs: Automated quality gates ensuring only validated applications reach production. Unity Catalog integration for governance and audit trails. Requires stakeholder approval before production deployment.
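
A quality gate of this kind reduces to a threshold check plus an approval check. The sketch below is a hypothetical stand-in, not the Deployment Jobs API: evaluate_gate, the metric names, and the thresholds are assumptions for illustration.

```python
def evaluate_gate(metrics: dict, thresholds: dict, approved_by=None):
    # Collect every reason the candidate should be blocked.
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif value < minimum:
            failures.append(f"{name}: {value} is below required {minimum}")
    if not approved_by:
        failures.append("missing stakeholder approval")
    # Deployable only when there is nothing left to block on.
    return (not failures, failures)
```

Returning the list of failures alongside the decision gives reviewers an audit trail of why a deployment was blocked.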

Unified support for all AI types: The same tracking infrastructure supports GenAI applications and traditional ML model serving. The LoggedModel abstraction simplifies deep learning checkpoint tracking.

Key Points

  • 30M+ monthly downloads; version 3.0 available with no migration for existing users
  • Production traces for 20+ GenAI libraries built on OpenTelemetry
  • LLM judges for systematic evaluation of safety, groundedness, and relevance
  • Review App for expert annotation without code
  • Prompt Registry with Git-style version control and visual diffs
  • Deployment Jobs with automated quality gates and Unity Catalog integration
  • LoggedModel for unified tracking of deep learning and GenAI workloads
  • Case study: retrieval relevance improved from 65% to 91% using MLflow 3.0

Why It Matters

GenAI development has a fundamental problem: without systematic observability, teams don’t know why their applications fail. Without quality evaluation at scale, they can’t measure whether a prompt change improved or degraded quality. Without feedback loops, every production issue demands manual intervention.

MLflow 3.0 makes GenAI quality measurable and systematic. Rather than relying on manual evaluation (“does this look good?”), teams can define quality metrics specific to their use case and track them over time, exactly as they would for any production software system.

The approach transforms quality assurance from aspirational to practical, converting production issues into permanent test cases and establishing continuous feedback loops that progressively strengthen applications over time.

Based on official content from Databricks.