Apply PCA Transform
Projects data using a pre-trained PCA model and appends the principal component columns to the dataset. Note: filter your columns before this brick if only specific features should be used.
This brick applies a pre-trained Principal Component Analysis (PCA) model to your dataset. It calculates new "component" scores—which represent patterns in your data—and appends these scores as new columns to your original dataset.
If you provide a Scaler model (such as a Standard Scaler), the brick automatically normalizes the input data before applying the PCA transform, so new data is processed exactly as the training data was. The brick matches columns based on the metadata stored in the trained model; if no metadata is found, it falls back to all numeric columns in your dataset.
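Because the brick reads column metadata from the trained model, it helps to know where that metadata comes from. A minimal sketch, assuming the PCA model is a scikit-learn estimator (consistent with the `feature_names_in_` lookup in the brick code below; the data values are illustrative):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Fitting on a DataFrame stores the column names on the model; fitting on
# a bare NumPy array would not, which triggers the brick's fallback to
# all numeric columns.
train_df = pd.DataFrame(
    {"height": [1.7, 1.8, 1.6], "weight": [65.0, 80.0, 55.0], "age": [30, 42, 25]}
)
pca_model = PCA(n_components=2).fit(train_df)

print(pca_model.feature_names_in_)  # ['height' 'weight' 'age']
print(pca_model.n_features_in_)     # 3
```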
Inputs
- data
- The dataset you want to transform. This should contain the same feature columns (variables) used to train the PCA model originally.
- PCA
- The pre-trained PCA model object. This is usually the output from a "Train PCA" brick or similar training step.
- Scaler
- (Optional) The pre-trained scaler model used to normalize the data during training (e.g., StandardScaler). If your PCA model was trained on scaled data, you must provide the scaler here to ensure the new data is processed identically.
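If you scale during training, keep the fitted scaler and pass it alongside the PCA model. A sketch of an upstream training step that produces a matching pair, assuming scikit-learn (names are illustrative):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

train_df = pd.DataFrame({"height": [1.7, 1.8, 1.6], "weight": [65.0, 80.0, 55.0]})

# Fit the scaler first, then fit PCA on the scaled values. Wrapping the
# scaled array back into a DataFrame preserves the column metadata the
# brick uses to select features at transform time.
scaler = StandardScaler().fit(train_df)
scaled = pd.DataFrame(scaler.transform(train_df), columns=train_df.columns)
pca_model = PCA(n_components=2).fit(scaled)
```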
Input Types
| Input | Types |
|---|---|
| data | DataFrame, ArrowTable |
| PCA | Any |
| Scaler | Any |
You can check the list of supported types here: Available Type Hints.
Outputs
- projected data
- The resulting dataset. It contains all your original data columns plus the new Principal Component columns (e.g., `PC1`, `PC2`) appended to the right side.
Output Types
| Output | Types |
|---|---|
| projected data | DataFrame |
You can check the list of supported types here: Available Type Hints.
Options
The Apply PCA Transform brick exposes the following configurable options:
- Component Column Prefix
- Defines the naming convention for the new columns added to your dataset.
- Default ("PC"): The columns will be named
PC1,PC2,PC3, etc. - Custom: If you change this to "Score", the columns will be named
Score1,Score2, etc.
- Verbose
- Controls the amount of information logged during execution.
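Internally the brick reads these options from a plain dictionary, under the keys `col_prefix` and `verbose` (see the implementation below). A hypothetical call with both options customized:

```python
# Illustrative only: `df`, `pca_model`, and `scaler_model` are assumed to
# come from earlier bricks. Components are renamed Score1, Score2, ... and
# info logging is silenced.
result = apply_pca_transform(
    df,
    PCA=pca_model,
    Scaler=scaler_model,
    options={"col_prefix": "Score", "verbose": False},
)
```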
Brick Code

```python
import logging
import pandas as pd
import polars as pl
import pyarrow as pa
import numpy as np
from coded_flows.types import Union, DataFrame, ArrowTable, Any, List
from coded_flows.utils import CodedFlowsLogger
logger = CodedFlowsLogger(name="Apply PCA Transform", level=logging.INFO)
def _convert_to_pandas(data: Any, verbose: bool) -> pd.DataFrame:
"""Helper to safely convert input to Pandas DataFrame."""
df = None
if isinstance(data, pd.DataFrame):
df = data
elif isinstance(data, pl.DataFrame):
verbose and logger.info("Converting Polars DataFrame to Pandas.")
df = data.to_pandas()
    elif isinstance(data, pa.Table):
verbose and logger.info("Converting Arrow Table to Pandas.")
df = data.to_pandas()
elif isinstance(data, np.ndarray):
verbose and logger.info("Converting NumPy array to Pandas.")
df = pd.DataFrame(data)
else:
raise ValueError(
f"Unsupported data type: {type(data)}. Expected DataFrame, Table, or Array."
)
return df
def _get_feature_columns(df: pd.DataFrame, pca_model: Any, verbose: bool) -> List[str]:
"""
Helper to determine which columns to use.
Prioritizes model metadata, falls back to all numeric.
"""
cols = []
if hasattr(pca_model, "feature_names_in_"):
cols = list(pca_model.feature_names_in_)
verbose and logger.info(
f"Inferred {len(cols)} columns from trained PCA model metadata."
)
else:
cols = df.select_dtypes(include=["number", "bool"]).columns.tolist()
verbose and logger.warning(
"No model metadata found. Using all numeric columns available in input."
)
missing = [c for c in cols if c not in df.columns]
if missing:
raise ValueError(
f"The following required columns are missing in data: {missing}"
)
return cols
def apply_pca_transform(
data: Union[DataFrame, ArrowTable], PCA: Any, Scaler: Any = None, options=None
) -> DataFrame:
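    """
    Project `data` with a pre-trained PCA model (optionally scaling it
    first) and return the original columns plus the component scores.
    """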
options = options or {}
verbose = options.get("verbose", True)
col_prefix = options.get("col_prefix", "PC")
projected_data = pd.DataFrame()
try:
verbose and logger.info("Starting PCA Transformation.")
if PCA is None:
raise ValueError("PCA Model input is missing (None).")
df = _convert_to_pandas(data, verbose)
if df.empty:
raise ValueError("Input DataFrame is empty.")
feature_cols = _get_feature_columns(df, PCA, verbose)
if hasattr(PCA, "n_features_in_"):
if len(feature_cols) != PCA.n_features_in_:
raise ValueError(
f"Model expects {PCA.n_features_in_} features, but {len(feature_cols)} features were identified in the input."
)
X = df[feature_cols].values
if Scaler is not None:
verbose and logger.info("Applying provided Scaler to data.")
try:
X = Scaler.transform(X)
except Exception as e:
raise ValueError(f"Failed to apply Scaler: {e}")
else:
verbose and logger.info("No Scaler provided. Proceeding with raw data.")
verbose and logger.info("Projecting data into PCA space.")
try:
X_pca = PCA.transform(X)
except Exception as e:
raise ValueError(f"PCA transform failed: {e}")
n_components = X_pca.shape[1]
pc_col_names = [f"{col_prefix}{i + 1}" for i in range(n_components)]
verbose and logger.info(
f"Generated {n_components} components. Merging with original data."
)
        # Build the component-score frame and append it to the original
        # columns, as documented (original data + PC columns on the right).
        pc_scores = pd.DataFrame(X_pca, columns=pc_col_names, index=df.index)
        projected_data = pd.concat([df, pc_scores], axis=1)
verbose and logger.info(
f"Transformation complete. Output shape: {projected_data.shape}"
)
except Exception as e:
verbose and logger.error(f"Error during PCA transformation: {e}")
raise
    return projected_data
```
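As a quick end-to-end check, here is a hedged usage sketch. It assumes scikit-learn on the training side; `train_df` and `new_df` are illustrative stand-ins for upstream datasets:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

train_df = pd.DataFrame(
    {"height": [1.7, 1.8, 1.6, 1.9], "weight": [65.0, 80.0, 55.0, 90.0]}
)
new_df = pd.DataFrame({"height": [1.75], "weight": [70.0]})

scaler = StandardScaler().fit(train_df)
scaled = pd.DataFrame(scaler.transform(train_df), columns=train_df.columns)
pca_model = PCA(n_components=2).fit(scaled)

out = apply_pca_transform(new_df, PCA=pca_model, Scaler=scaler)
print(out.columns.tolist())  # ['height', 'weight', 'PC1', 'PC2']
```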
Brick Info
Dependencies:
- shap>=0.47.0
- scikit-learn
- numpy
- pandas
- pyarrow
- polars[pyarrow]
- numba>=0.56.0