Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
339 changes: 339 additions & 0 deletions docs/reference/mlflow.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,339 @@
# MLflow Integration

Feast provides **native integration** with [MLflow](https://mlflow.org/) for automatic feature lineage tracking alongside ML experiments. When enabled, every feature retrieval is logged to the active MLflow run.

## Overview

- **Which features did this model use?** -- auto-logged on every `get_historical_features()` / `get_online_features()` call
- **Which feature service should I use to serve this model?** -- resolved from model URI via `store.mlflow.resolve_features()`
- **Can I reproduce the exact training data?** -- entity DataFrame saved as an MLflow artifact
- **Which models break if I change a feature view?** -- reverse index via the Feast UI `/api/mlflow-feature-usage` endpoint
- **When was the feature store last updated?** -- `feast apply` and `feast materialize` logged to a separate ops experiment

### Capabilities

| Capability | How |
|---|---|
| Auto-log feature metadata | Tags on every retrieval inside an active MLflow run |
| Entity DataFrame archival | `entity_df.parquet` artifact for full reproducibility |
| Model registration with lineage | `feast.feature_service` tag propagated to model versions |
| Training-to-prediction linkage | `store.mlflow.load_model()` links prediction runs back to training runs |
| Model-to-feature resolution | Map any model URI back to its Feast feature service |
| Operation audit trail | `feast apply` / `feast materialize` logged to `{project}-feast-ops` |
| `store.mlflow` API | Single entry point — zero `import mlflow`, zero client objects |
| Feast UI integration | Per-feature-view usage stats and registered model associations |

## Installation

MLflow is an optional dependency:

```bash
pip install feast[mlflow]
```

## Configuration

Add the `mlflow` section to your `feature_store.yaml`:

```yaml
project: my_project
registry: data/registry.db
provider: local
online_store:
type: sqlite
path: data/online_store.db

mlflow:
enabled: true
tracking_uri: http://127.0.0.1:5000 # optional, falls back to MLFLOW_TRACKING_URI env var
auto_log: true # default
auto_log_entity_df: false # default
entity_df_max_rows: 100000 # default
log_operations: false # default
ops_experiment_suffix: "-feast-ops" # default
```

### Configuration options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `enabled` | bool | `false` | Master switch for the entire integration |
| `tracking_uri` | string | *(none)* | MLflow tracking server URI. Falls back to `MLFLOW_TRACKING_URI` env var, then MLflow default (`./mlruns`) |
| `auto_log` | bool | `true` | Automatically log feature metadata on every retrieval when an active MLflow run exists |
| `auto_log_entity_df` | bool | `false` | Save the entity DataFrame as `entity_df.parquet` artifact on historical retrieval |
| `entity_df_max_rows` | int | `100000` | Skip entity DataFrame artifact upload for DataFrames exceeding this limit |
| `log_operations` | bool | `false` | Log `feast apply` and `feast materialize` to a separate MLflow experiment |
| `ops_experiment_suffix` | string | `"-feast-ops"` | Suffix appended to project name for the operations experiment |

### Tracking URI resolution

The tracking URI is resolved in this order:

1. `tracking_uri` field in `feature_store.yaml`
2. `MLFLOW_TRACKING_URI` environment variable
3. MLflow's default (`./mlruns` local directory)

This means you can omit `tracking_uri` from the YAML and set `MLFLOW_TRACKING_URI` in your environment instead, or it would be pulled from `./mlruns` automatically when both are not set.

## What gets logged

### Tags on retrieval runs

When `auto_log: true` and an active MLflow run exists, each `get_historical_features()` or `get_online_features()` call records:

| Tag | Example | Description |
|-----|---------|-------------|
| `feast.project` | `my_project` | Feast project name |
| `feast.retrieval_type` | `historical` / `online` | Type of feature retrieval |
| `feast.feature_service` | `driver_activity_v1` | Auto-resolved feature service name (if matched) |
| `feast.feature_views` | `driver_hourly_stats` | Comma-separated feature view names |
| `feast.feature_refs` | `driver_hourly_stats:conv_rate,...` | All feature references |
| `feast.entity_count` | `200` | Number of entities in the request |
| `feast.feature_count` | `5` | Number of features retrieved |

### Metrics

| Metric | Example | Description |
|--------|---------|-------------|
| `feast.job_submission_sec` | `0.4321` | Feature retrieval duration in seconds |

### Artifacts

When `auto_log_entity_df: true` and the entity DataFrame has fewer than `entity_df_max_rows` rows:

| Artifact | Description |
|----------|-------------|
| `entity_df.parquet` | Full entity DataFrame used in the retrieval |

When a model is logged via `store.mlflow.log_model()`:

| Artifact | Description |
|----------|-------------|
| `feast_features.json` | JSON list of feature references the model was trained on |

### Entity DataFrame metadata

Regardless of `auto_log_entity_df`, the following metadata is logged when present:

| Tag / Param | When | Description |
|-------------|------|-------------|
| `feast.entity_df_type` | Always | `dataframe`, `sql`, or `range` |
| `feast.entity_df_rows` | DataFrame input | Row count |
| `feast.entity_df_columns` | DataFrame input | Column names |
| `feast.entity_df_query` | SQL input | The SQL query string |
| `feast.start_date` / `feast.end_date` | Range-based input | Date range |

### Operation logs

When `log_operations: true`, `feast apply` and `feast materialize` create self-contained runs in the `{project}{ops_experiment_suffix}` experiment (default: `my_project-feast-ops`):

**Apply runs:**

| Tag / Metric | Example |
|--------------|---------|
| `feast.operation` | `apply` |
| `feast.project` | `my_project` |
| `feast.feature_views_changed` | `driver_hourly_stats,order_stats` |
| `feast.feature_services_changed` | `driver_activity_v1` |
| `feast.entities_changed` | `driver,restaurant` |
| `feast.apply.feature_views_count` | `2` |
| `feast.apply.feature_services_count` | `1` |
| `feast.apply.entities_count` | `2` |

**Materialize runs:**

| Tag / Metric | Example |
|--------------|---------|
| `feast.operation` | `materialize` / `materialize_incremental` |
| `feast.project` | `my_project` |
| `feast.materialize.feature_views` | `driver_hourly_stats` |
| `feast.materialize.start_date` | `2024-01-01T00:00:00` |
| `feast.materialize.end_date` | `2024-01-02T00:00:00` |
| `feast.materialize.duration_sec` | `12.3456` |

## Usage

### Automatic logging (zero code)

With the configuration above, feature metadata is logged automatically whenever there is an active MLflow run. No explicit `import mlflow` is needed — just use `store.mlflow`:

```python
from feast import FeatureStore

store = FeatureStore(".")

with store.mlflow.start_run(run_name="my_training"):
training_df = store.get_historical_features(
features=store.get_feature_service("driver_activity_v1"),
entity_df=entity_df,
).to_df()
# The run is now tagged with feast.feature_refs, feast.feature_views, etc.

model = train(training_df)
store.mlflow.log_model(model, "model")
```

No extra code needed — the tags are written automatically.

### `store.mlflow` API (recommended)

`store.mlflow` is the primary way to interact with the Feast–MLflow integration. It provides Feast-enhanced versions of common MLflow operations, and delegates everything else to the raw `mlflow` module:

```python
from feast import FeatureStore
from sklearn.linear_model import LogisticRegression

store = FeatureStore(".")

# Training
with store.mlflow.start_run(run_name="v1_training"):
df = store.get_historical_features(
features=store.get_feature_service("driver_activity_v1"),
entity_df=entity_df,
).to_df()

model = LogisticRegression().fit(X, y)
store.mlflow.log_model(model, "model") # Feast-enhanced: saves feast_features.json
train_run_id = store.mlflow.active_run_id

# Register model (auto-tags version with feast.feature_service)
store.mlflow.register_model(f"runs:/{train_run_id}/model", "driver_model")

# Prediction (auto-links to training run)
with store.mlflow.start_run(run_name="prediction"):
model = store.mlflow.load_model("models:/driver_model/1")
online_features = store.get_online_features(
features=store.get_feature_service("driver_activity_v1"),
entity_rows=[{"driver_id": 1001}],
)
predictions = model.predict(...)
```

### `feast.mlflow` module API (alternative)

For users who prefer a module-level import, `feast.mlflow` is a **drop-in replacement for `import mlflow`** that delegates to the same `store.mlflow` client under the hood:

```python
import feast.mlflow
from feast import FeatureStore

store = FeatureStore(".") # auto-registers with feast.mlflow

with feast.mlflow.start_run(run_name="training"):
df = store.get_historical_features(...).to_df()
feast.mlflow.log_params({"lr": "0.01"}) # plain passthrough
feast.mlflow.log_metrics({"f1": 0.85}) # plain passthrough
feast.mlflow.log_model(model, "model") # Feast-enhanced
```

#### Store resolution

`feast.mlflow` resolves its `FeatureStore` in this order:

1. **Explicit `feast.mlflow.init(store)`** — if called, overrides everything
2. **Auto-registered** — the most recently created `FeatureStore` with `mlflow.enabled=true` registers itself automatically
3. **Auto-discovery** — falls back to `FeatureStore(".")` from the current directory

In most cases, simply creating a `FeatureStore(...)` is enough — no `init()` needed.

#### Error handling

`feast.mlflow` raises clear errors on first use if something is misconfigured:

| Condition | Error |
|-----------|-------|
| No `feature_store.yaml` in cwd and no store created | `RuntimeError` with guidance to call `feast.mlflow.init(store)` |
| `mlflow.enabled` is not set to `true` | `RuntimeError` with guidance to set `mlflow.enabled=true` |
| `mlflow` pip package not installed | `ImportError` with guidance to run `pip install feast[mlflow]` |

When `mlflow.enabled` is `false` (or omitted), `store.mlflow` returns `None`, allowing callers to guard with `if store.mlflow:`. The `feast.mlflow` module raises `RuntimeError` only when you attempt to use it without an enabled store.

### Feast-enhanced functions

These functions add automatic Feast tagging and lineage on top of their MLflow counterparts:

| Function | Enhancement |
|----------|-------------|
| `store.mlflow.start_run(run_name, tags)` | Auto-tags run with `feast.project` |
| `store.mlflow.log_model(model, path, flavor)` | Auto-attaches `feast_features.json` artifact |
| `store.mlflow.register_model(model_uri, name)` | Auto-tags model version with `feast.feature_service` |
| `store.mlflow.load_model(model_uri)` | Auto-tags prediction run with training lineage |

**Supported model flavors for `log_model()`:** `sklearn`, `pytorch`, `xgboost`, `lightgbm`, `tensorflow`, `keras`, `pyfunc`.

### Feast-only functions

These are unique to the Feast integration and have no `mlflow` equivalent:

| Function | Description |
|----------|-------------|
| `store.mlflow.resolve_features(model_uri)` | Resolve model URI to Feast feature service name |
| `store.mlflow.get_training_entity_df(run_id, ...)` | Recover entity DataFrame from a past MLflow run |
| `store.mlflow.active_run_id` | Current active MLflow run ID (or `None`) |
| `feast.mlflow.init(store)` | Explicitly bind `feast.mlflow` module to a `FeatureStore` (optional) |

### Passthrough behavior

Any attribute not listed above delegates to the raw `mlflow` module. This means you can use `store.mlflow` or `feast.mlflow` everywhere you previously used `import mlflow`:

```python
store.mlflow.mlflow.log_params(params) # via escape hatch
feast.mlflow.log_params(params) # via module API
feast.mlflow.log_metrics(metrics)
feast.mlflow.set_tag("env", "staging")
feast.mlflow.MlflowClient()
```

### Resolve a model back to its feature service

```python
from feast import FeatureStore

store = FeatureStore(".")
fs_name = store.mlflow.resolve_features("models:/driver_model/1")
# Returns: "driver_activity_v1"
```

Resolution order:
1. Model version tag `feast.feature_service` (set by `register_model()`)
2. Training run tag `feast.feature_service` (set by auto-logging)

### Reproduce training from a past run

```python
from feast import FeatureStore

store = FeatureStore(".")

entity_df = store.mlflow.get_training_entity_df(run_id="abc123")

with store.mlflow.start_run(run_name="retrain_v2"):
new_df = store.get_historical_features(
features=store.get_feature_service("driver_activity_v1"),
entity_df=entity_df,
).to_df()
model = train(new_df)
store.mlflow.log_model(model, "model")
```

This requires `auto_log_entity_df: true` to have been enabled when the original run was recorded.

## Feast UI integration

The Feast UI server exposes three API endpoints that aggregate data from MLflow:

| Endpoint | Description |
|----------|-------------|
| `/api/mlflow-runs` | All Feast-tagged MLflow runs with linked registered models |
| `/api/mlflow-feature-usage` | Per-feature-view usage stats (run count, last used, associated models) |
| `/api/mlflow-feature-models` | Reverse index of feature refs to registered models |

The feature view detail page in the Feast UI displays:
- **MLflow Training Runs** count and **Last Used** date in the header stats
- An **MLflow Usage** panel showing training run count, relative last-used time, and a table of registered models that depend on the feature view

Start the Feast UI with:

```bash
feast ui --host 127.0.0.1 --port 8888
```
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -132,6 +132,7 @@ snowflake = [
]
sqlite_vec = ["sqlite-vec==v0.1.6"]
mcp = ["fastapi_mcp"]
mlflow = ["mlflow>=2.10.0"]

dbt = ["dbt-artifacts-parser"]

Expand Down
Loading
Loading