Apply Ordinal Encoder

Encodes categorical columns as integers. If a fitted encoder is provided, the target columns are detected automatically from it.

Processing

This brick transforms categorical text data into integer codes. It scans your data, identifies the unique categories in each selected column (like "Low", "Medium", "High"), and assigns a specific number ID to each one (e.g., 0, 1, 2).

This process is essential for preparing data for machine learning models, as most algorithms require numerical input rather than text. The brick is "smart": if you provide a previously used (fitted) encoder, it will apply the exact same numbering rules to new data. If you don't, it will learn new rules based on the current data.
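The core transformation can be sketched with scikit-learn's OrdinalEncoder, which the brick uses internally (the sample data here is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data with one categorical column.
df = pd.DataFrame({"priority": ["Low", "High", "Medium", "Low"]})

enc = OrdinalEncoder()
encoded = enc.fit_transform(df[["priority"]])

# Categories are numbered in sorted order: High=0, Low=1, Medium=2.
print(list(enc.categories_[0]))  # ['High', 'Low', 'Medium']
print(encoded.ravel().tolist())  # [1.0, 0.0, 2.0, 1.0]
```

Note that a freshly fitted encoder numbers categories in sorted order, not order of appearance, so "Low" does not automatically receive 0.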

Inputs

data
The dataset containing the categorical text columns you wish to transform.
columns (optional)
A specific list of column names to encode. If provided, this overrides the Target Columns setting in the options. If you leave this empty, the brick will look at the options settings to decide which columns to process.
encoder (optional)
A pre-trained encoder object (from a previous execution of this brick).
  • If provided: The brick ignores your column selections and strictly processes the columns expected by this encoder to ensure consistency.
  • If empty: The brick creates a new encoder and learns from the current data.
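When neither the columns input nor the Target Columns option is set, the brick auto-detects text-like columns via pandas dtype selection (the same select_dtypes call that appears in the implementation below). A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Oslo"],           # object dtype -> detected
    "tier": pd.Categorical(["A", "B"]),  # category dtype -> detected
    "count": [3, 7],                     # numeric dtype -> ignored
})

detected = df.select_dtypes(include=["object", "category"]).columns.tolist()
print(detected)  # ['city', 'tier']
```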

Input Types

Input Types
data DataFrame, ArrowTable
columns List
encoder Any

You can check the list of supported types here: Available Type Hints.

Outputs

result
The processed dataset where the selected categorical columns have been replaced with their integer representations.
Fitted Encoder
The "brain" of the operation. This object contains the rules (mappings) learned during processing. You can pass this to future "Apply Ordinal Encoder" bricks to ensure new data is encoded exactly the same way (e.g., "Red" is always 0).
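That consistency guarantee can be illustrated with scikit-learn directly, using hypothetical color data (a freshly fitted encoder numbers categories in sorted order):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Fit once on an initial batch; sorted categories: Blue=0, Green=1, Red=2.
train = pd.DataFrame({"color": ["Red", "Blue", "Green"]})
enc = OrdinalEncoder().fit(train)

# A later batch reuses the same fitted encoder, so the mapping is identical
# regardless of the order in which categories appear.
new = pd.DataFrame({"color": ["Green", "Red"]})
print(enc.transform(new).ravel().tolist())  # [1.0, 2.0]
```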

Output Types

Output Types
result DataFrame
Fitted Encoder Any

You can check the list of supported types here: Available Type Hints.

Options

The Apply Ordinal Encoder brick exposes the following configurable options:

Target Columns
The names of the columns you want to encode.
  • If this is empty (and columns input is empty), the brick attempts to auto-detect and encode all text-based columns.
Exclude Target Columns
A toggle to invert your selection. If active, the brick processes every categorical column except the ones listed in "Target Columns".
Handle Unknown
Controls how the encoder reacts when it encounters a new category that wasn't present when it was fitted. Unknown categories can only appear when a previously fitted encoder is applied to new data. Note that this option is read when a new encoder is fitted here; a supplied pre-existing encoder keeps its own original setting.
  • Error: The workflow stops and raises an error alerting you to the new data.
  • Use Encoded Value: The workflow continues, and any unknown categories are assigned a value of -1. This is useful for preventing crashes in production workflows.
Verbose
Toggles whether detailed logs about the encoding process (such as which columns were selected) are written to the logs.
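The "Use Encoded Value" behavior can be sketched as follows (hypothetical size data; the -1 sentinel matches the `unknown_value=-1` used in the implementation below):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"size": ["S", "M", "L"]})
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(train)

# "XL" was never seen during fitting; instead of raising an error,
# it is encoded as the sentinel value -1.
new = pd.DataFrame({"size": ["M", "XL"]})
print(enc.transform(new).ravel().tolist())  # [1.0, -1.0]
```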
import logging
import pandas as pd
import polars as pl
import pyarrow as pa
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from coded_flows.types import Union, DataFrame, ArrowTable, Tuple, List, Any
from coded_flows.utils import CodedFlowsLogger

logger = CodedFlowsLogger(name="Apply Ordinal Encoder", level=logging.INFO)


def _coalesce(*values):
    """Helper to return the first non-None value."""
    return next((v for v in values if v is not None), None)


def ordinal_encoder(
    data: Union[DataFrame, ArrowTable],
    columns: List = None,
    encoder: Any = None,
    options=None,
) -> Tuple[DataFrame, Any]:
    options = options or {}
    verbose = options.get("verbose", True)
    unknown_strategy = (
        options.get("handle_unknown", "use_encoded_value").lower().replace(" ", "_")
    )
    target_list = _coalesce(columns, options.get("columns_target"), [])
    exclude_mode = options.get("exclude_mode", False)
    result = None
    Fitted_Encoder = None
    try:
        verbose and logger.info("Starting Integer Ordinal Encoding.")
        if isinstance(data, pl.DataFrame):
            result = data.to_pandas()
            verbose and logger.info("Converted Polars DataFrame to Pandas.")
        elif isinstance(data, pa.Table):
            result = data.to_pandas()
            verbose and logger.info("Converted Arrow Table to Pandas.")
        elif isinstance(data, pd.DataFrame):
            result = data.copy()
            verbose and logger.info("Using input Pandas DataFrame.")
        else:
            raise ValueError(
                "Input data must be a pandas DataFrame, Polars DataFrame, or Arrow Table"
            )
        cols_to_process = []
        if encoder is not None and hasattr(encoder, "feature_names_in_"):
            expected_cols = list(encoder.feature_names_in_)
            verbose and logger.info(
                f"Encoder provided. Automatically selecting {len(expected_cols)} required columns."
            )
            missing_cols = [c for c in expected_cols if c not in result.columns]
            if missing_cols:
                err_msg = f"The provided encoder expects the following columns which are missing in the data: {missing_cols}"
                verbose and logger.error(err_msg)
                raise ValueError(err_msg)
            cols_to_process = expected_cols
            Fitted_Encoder = encoder
        else:
            all_categorical_cols = result.select_dtypes(
                include=["object", "category"]
            ).columns.tolist()
            if exclude_mode:
                verbose and logger.info(f"Mode: EXCLUDE. Excluding: {target_list}")
                if not target_list:
                    cols_to_process = all_categorical_cols
                else:
                    cols_to_process = [
                        c for c in all_categorical_cols if c not in target_list
                    ]
            else:
                verbose and logger.info(f"Mode: INCLUDE. Including: {target_list}")
                if not target_list:
                    verbose and logger.info(
                        "Target list empty. Auto-detecting categorical columns."
                    )
                    cols_to_process = all_categorical_cols
                else:
                    missing = [c for c in target_list if c not in result.columns]
                    if missing:
                        raise ValueError(
                            f"Columns specified for encoding not found in dataset: {missing}"
                        )
                    cols_to_process = target_list
            if encoder is not None:
                if hasattr(encoder, "n_features_in_") and encoder.n_features_in_ != len(
                    cols_to_process
                ):
                    raise ValueError(
                        f"Encoder mismatch: Encoder expects {encoder.n_features_in_} features, but {len(cols_to_process)} were selected via options."
                    )
                Fitted_Encoder = encoder
            else:
                Fitted_Encoder = None
        verbose and logger.info(f"Columns selected for processing: {cols_to_process}")
        if not cols_to_process:
            verbose and logger.warning(
                "No columns selected for encoding. Returning original data."
            )
            if Fitted_Encoder is None:
                Fitted_Encoder = OrdinalEncoder()
        else:
            subset = result[cols_to_process].astype(str)
            if Fitted_Encoder is not None:
                verbose and logger.info("Transforming data using provided encoder.")
                encoded_matrix = Fitted_Encoder.transform(subset)
            else:
                verbose and logger.info("Fitting and transforming new encoder.")
                # sklearn rejects `unknown_value` unless handle_unknown is
                # "use_encoded_value", so only pass it for that strategy.
                encoder_kwargs = {
                    "handle_unknown": unknown_strategy,
                    "encoded_missing_value": -1,
                    "dtype": np.int32,
                }
                if unknown_strategy == "use_encoded_value":
                    encoder_kwargs["unknown_value"] = -1
                Fitted_Encoder = OrdinalEncoder(**encoder_kwargs)
                encoded_matrix = Fitted_Encoder.fit_transform(subset)
            result[cols_to_process] = encoded_matrix
            verbose and logger.info(
                f"Successfully encoded {len(cols_to_process)} columns."
            )
        verbose and logger.info(f"Process complete. Output shape: {result.shape}")
    except Exception as e:
        verbose and logger.error(f"Error during encoding operation: {e}")
        raise
    return (result, Fitted_Encoder)

Brick Info

version v0.1.4
python 3.11, 3.12, 3.13
requirements
  • shap>=0.47.0
  • scikit-learn
  • pandas
  • numpy
  • polars[pyarrow]
  • numba>=0.56.0