Apply UMAP Transform
Projects new data using a pre-trained UMAP model. Appends UMAP embedding columns to the dataset. Ensure input features match the training data.
Category: Processing
This brick takes new data and projects it into a simplified lower-dimensional space using a UMAP model you have already trained.
Think of it like using a saved map to locate new addresses. Instead of learning patterns from scratch, this brick applies the rules from your pre-trained model to calculate coordinates (embeddings) for your new data. It processes your input dataset and appends these new coordinate columns to it, preserving your original data.
For this to work correctly, the new data must have the same columns (features) as the data used to train the model.
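For context, here is a minimal sketch of the underlying pattern with umap-learn (the column names are hypothetical): a model is fitted once on training data, then `transform` projects new rows that share the same feature columns.

```python
import numpy as np
import pandas as pd
import umap

# Hypothetical schema: training data and new data must share these columns.
cols = ["f1", "f2", "f3", "f4"]
train_df = pd.DataFrame(np.random.rand(200, 4), columns=cols)
new_df = pd.DataFrame(np.random.rand(10, 4), columns=cols)

# Fit once on the training data...
model = umap.UMAP(n_components=2, random_state=42).fit(train_df.values)

# ...then reuse the saved "map" to place new points.
embedding = model.transform(new_df.values)  # shape: (10, 2)
```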
Inputs
- data
- The dataset you want to transform. This should contain the same feature columns (e.g., numerical values) that were present in the dataset used to train the UMAP model.
- UMAP
- The pre-trained UMAP model object. This is usually the output from a brick training UMAP. It contains the mathematical "map" required to project the new data.
- Scaler (optional)
- A pre-trained data scaler (e.g., StandardScaler, MinMaxScaler). If you scaled your data before training the UMAP model, you must provide the same scaler here to ensure the new data is processed identically.
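To illustrate why the same scaler must be reused, a minimal sketch (StandardScaler is just one example; `train_df` and `new_df` are the hypothetical frames from the sketch above):

```python
from sklearn.preprocessing import StandardScaler

# Fitted once, before training the UMAP model; reuse the *same* object here.
scaler = StandardScaler().fit(train_df.values)

X_train = scaler.transform(train_df.values)  # what the UMAP model was trained on
X_new = scaler.transform(new_df.values)      # identical preprocessing for new data
```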
Input Types
| Input | Types |
|---|---|
| data | DataFrame, ArrowTable |
| UMAP | Any |
| Scaler | Any |
You can check the list of supported types here: Available Type Hints.
Outputs
- projected_data
- The result of the transformation. This dataset includes all your original rows and columns, plus new columns containing the UMAP embeddings (coordinates).
The projected_data output will generally contain:
- {Original Columns}: All columns present in the input data.
- UMAP1: The first dimension of the UMAP embedding (coordinate X).
- UMAP2: The second dimension of the UMAP embedding (coordinate Y).
- Additional columns (UMAP3, UMAP4, etc.) may appear depending on how many components the model was trained to produce.
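As a quick illustration, for a three-feature input projected to two components, the output would be laid out like this (feature names hypothetical):

```python
print(projected_data.columns.tolist())
# ['f1', 'f2', 'f3', 'UMAP1', 'UMAP2']
```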
Output Types
| Output | Types |
|---|---|
| projected_data | DataFrame |
You can check the list of supported types here: Available Type Hints.
Options
The Apply UMAP Transform brick exposes the following configurable options:
- Component Column Prefix
- Defines the prefix used to name the new embedding columns. With the default "UMAP", the new columns are named UMAP1, UMAP2, etc.; changing the prefix to "Dim" yields Dim1, Dim2 instead.
- Verbose
- Controls the amount of logging information. When enabled (the default), the brick logs progress at each step of the transformation.
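When the brick function is invoked programmatically (see the source below), these options map to a plain dictionary; the keys read by the source are `col_prefix` and `verbose`. A minimal sketch, reusing the hypothetical model and scaler from the earlier examples:

```python
options = {"col_prefix": "Dim", "verbose": False}
projected = apply_umap_transform(data=new_df, UMAP=model, Scaler=scaler, options=options)
# The embedding columns are now named Dim1, Dim2 instead of UMAP1, UMAP2.
```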
import logging
import pandas as pd
import polars as pl
import pyarrow as pa
import numpy as np
from coded_flows.types import Union, DataFrame, ArrowTable, Any, List
from coded_flows.utils import CodedFlowsLogger
logger = CodedFlowsLogger(name="Apply UMAP Transform", level=logging.INFO)
def _convert_to_pandas(data: Any, verbose: bool) -> pd.DataFrame:
"""Helper to safely convert input to Pandas DataFrame."""
df = None
if isinstance(data, pd.DataFrame):
df = data
elif isinstance(data, pl.DataFrame):
verbose and logger.info("Converting Polars DataFrame to Pandas.")
df = data.to_pandas()
    elif isinstance(data, pa.Table):  # pa.lib.Table is an alias of pa.Table
verbose and logger.info("Converting Arrow Table to Pandas.")
df = data.to_pandas()
elif isinstance(data, np.ndarray):
verbose and logger.info("Converting NumPy array to Pandas.")
df = pd.DataFrame(data)
else:
raise ValueError(
f"Unsupported data type: {type(data)}. Expected DataFrame, Table, or Array."
)
return df
def _get_feature_columns(df: pd.DataFrame, model: Any, verbose: bool) -> List[str]:
"""
Helper to determine which columns to use.
Prioritizes model metadata (feature_names_in_), falls back to all numeric.
"""
cols = []
if hasattr(model, "feature_names_in_"):
cols = list(model.feature_names_in_)
verbose and logger.info(
f"Inferred {len(cols)} columns from trained UMAP model metadata."
)
else:
cols = df.select_dtypes(include=["number", "bool"]).columns.tolist()
verbose and logger.warning(
"No model metadata found. Using all numeric columns available in input."
)
missing = [c for c in cols if c not in df.columns]
if missing:
raise ValueError(
f"The following required columns are missing in data: {missing}"
)
return cols
def apply_umap_transform(
data: Union[DataFrame, ArrowTable], UMAP: Any, Scaler: Any = None, options=None
) -> DataFrame:
options = options or {}
verbose = options.get("verbose", True)
col_prefix = options.get("col_prefix", "UMAP")
projected_data = pd.DataFrame()
try:
verbose and logger.info("Starting UMAP Transformation.")
if UMAP is None:
raise ValueError("UMAP Model input is missing (None).")
df = _convert_to_pandas(data, verbose)
if df.empty:
raise ValueError("Input DataFrame is empty.")
feature_cols = _get_feature_columns(df, UMAP, verbose)
if hasattr(UMAP, "n_features_in_"):
if len(feature_cols) != UMAP.n_features_in_:
raise ValueError(
f"Model expects {UMAP.n_features_in_} features, but {len(feature_cols)} features were identified in the input."
)
X = df[feature_cols].values
if Scaler is not None:
verbose and logger.info("Applying provided Scaler to data.")
try:
X = Scaler.transform(X)
except Exception as e:
raise ValueError(f"Failed to apply Scaler: {e}")
else:
verbose and logger.info("No Scaler provided. Proceeding with raw data.")
verbose and logger.info("Projecting data into UMAP space (this may take time).")
try:
X_embedded = UMAP.transform(X)
except Exception as e:
raise ValueError(f"UMAP transform failed: {e}")
n_components = X_embedded.shape[1]
umap_col_names = [f"{col_prefix}{i + 1}" for i in range(n_components)]
verbose and logger.info(
f"Generated {n_components} embedding components. Merging with original data."
)
        embedding_df = pd.DataFrame(
            X_embedded, columns=umap_col_names, index=df.index
        )
        # Append the embedding columns to the original data so the output
        # preserves every input column, as documented above.
        projected_data = pd.concat([df, embedding_df], axis=1)
verbose and logger.info(
f"Transformation complete. Output shape: {projected_data.shape}"
)
except Exception as e:
verbose and logger.error(f"Error during UMAP transformation: {e}")
raise
return projected_data
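For reference, a minimal end-to-end sketch of calling the function above, showing that non-Pandas inputs (here Polars) are converted internally; `new_df`, `model`, and `scaler` are the hypothetical objects from the earlier sketches:

```python
import polars as pl

pl_new = pl.from_pandas(new_df)  # Polars input is converted to Pandas internally
result = apply_umap_transform(data=pl_new, UMAP=model, Scaler=scaler)
print(result.columns.tolist())   # original features plus UMAP1, UMAP2
```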
Brick Info
This brick declares the following package dependencies:
- shap>=0.47.0
- scikit-learn
- numpy
- pandas
- pyarrow
- polars[pyarrow]
- umap-learn
- numba>=0.56.0