Apply Target Encoder

Applies a pre-fitted Target Encoder to a series, array, or list.

Processing

This brick applies a pre-trained Target Encoder (or a similar scikit-learn compatible encoder) to your dataset. It transforms categorical data, such as city names, product categories, or status labels, into numerical values using the mapping the encoder learned during its training phase.

The brick automatically converts the supported input formats (Python lists, Polars or Pandas Series, PyArrow arrays, and NumPy arrays) into a NumPy array, calls the encoder's `transform` method on it, and returns the result as a Pandas Series named "encoded_target".

Inputs

data: The raw data you want to transform. This is typically a list or column of categorical values (strings or labels) that needs to be converted into numbers.
Encoder: The pre-fitted encoder object containing the logic for the transformation. This object must already have been fitted ("trained") in a previous step and must expose a `transform` method; otherwise the brick raises an error.
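
To make "pre-fitted" concrete, here is a minimal sketch of an object that would satisfy the Encoder input. `MeanTargetEncoder` is a hypothetical stand-in written for illustration; it is not part of the brick or of scikit-learn. It assumes the encoder replaces each category with the mean of the training target for that category and falls back to the global mean for unseen labels. The training labels and target values are made up for the example.

```python
import numpy as np


class MeanTargetEncoder:
    """Hypothetical stand-in for a target encoder (illustration only)."""

    def fit(self, categories, target):
        categories = np.asarray(categories).ravel()
        target = np.asarray(target, dtype=float).ravel()
        # Fallback value for categories never seen during fitting.
        self.global_mean_ = float(target.mean())
        # Learn one number per category: the mean target for that category.
        self.mapping_ = {
            cat: float(target[categories == cat].mean())
            for cat in np.unique(categories)
        }
        return self

    def transform(self, categories):
        flat = np.asarray(categories).ravel()
        encoded = [self.mapping_.get(cat, self.global_mean_) for cat in flat]
        # Return a 2D column, as scikit-learn style transformers usually do.
        return np.array(encoded, dtype=float).reshape(-1, 1)


# "Training" step, done before the brick runs (illustrative values).
train_labels = ["High Risk", "Low Risk", "High Risk", "Medium Risk"]
train_target = [1.0, 0.0, 0.0, 1.0]
encoder = MeanTargetEncoder().fit(train_labels, train_target)

# What the brick does with the fitted object: call transform on new data.
# Low Risk -> 0.0, High Risk -> 0.5, unseen label -> global mean 0.5
encoded = encoder.transform(["Low Risk", "High Risk", "Unknown"])
print(encoded.ravel())
```

In practice you would pass a fitted encoder from your own pipeline; the only thing the brick relies on is the `transform` method.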

Inputs Types

Input	Types
`data`	`DataSeries`, `NDArray`, `List`
`Encoder`	`Any`

You can check the list of supported types here: Available Type Hints.

Outputs

encoded data: The transformed data. This series contains the numerical representations of your input categories based on the encoder's rules.

Outputs Types

Output	Types
`encoded data`	`DataSeries`

You can check the list of supported types here: Available Type Hints.

Options

The Apply Target Encoder brick exposes the following options:

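Reshape Input to 2D: Controls the shape of the data fed into the encoder. Some encoders require a 2D column (e.g., [[1], [2]]), while others expect a flat 1D list (e.g., [1, 2]). When enabled, a 1D input is reshaped into a single 2D column; use this if your encoder expects a matrix. When disabled (the default), input with more than one dimension is flattened to 1D.
Verbose: Controls logging. When enabled (the default), the brick logs each processing step: the detected input type, any reshaping, and the encoder class being applied. When disabled, log messages are suppressed; errors are still raised.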
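
The two shapes can be compared with plain NumPy. This is a small sketch of the reshape and flatten steps found in the brick's source code; the array contents are illustrative.

```python
import numpy as np

flat = np.array(["High Risk", "Low Risk", "High Risk"])  # shape (3,)

# Option enabled: a 1D input becomes a single-column 2D array.
column = flat.reshape(-1, 1)  # shape (3, 1)

# Option disabled: an input that is already 2D is flattened back to 1D.
flattened = column.ravel()  # shape (3,)

print(flat.shape, column.shape, flattened.shape)
```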

Example

Input:

data: ["High Risk", "Low Risk", "High Risk", "Medium Risk"]
Encoder: A Target Encoder object previously trained to associate each label with a numerical value.

Output:

encoded data: [1, 2, 1, 3]

Explanation: The brick uses the logic stored in the Encoder input to swap the text categories ("High Risk") with the specific numerical values established during training (1). Note that the first and third items are identical because they share the same category.
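
The example can be reproduced with a small stand-in. `LookupEncoder` below is hypothetical and exists only to make the snippet runnable; its mapping is copied from the example values. It returns a single-column 2D array, and the final `ravel()` mirrors the way the brick flattens such a result before wrapping it in a series.

```python
import numpy as np


class LookupEncoder:
    """Hypothetical fitted encoder: label -> number learned beforehand."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, values):
        flat = np.asarray(values).ravel()
        # One output row per input value, returned as a 2D column.
        return np.array([self.mapping[str(v)] for v in flat]).reshape(-1, 1)


encoder = LookupEncoder({"High Risk": 1, "Low Risk": 2, "Medium Risk": 3})
data = ["High Risk", "Low Risk", "High Risk", "Medium Risk"]

transformed = encoder.transform(data)  # shape (4, 1)
# A single-column result is flattened back to one value per input item.
encoded = transformed.ravel().tolist()
print(encoded)  # [1, 2, 1, 3]
```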

import logging
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa
from coded_flows.types import Union, DataSeries, NDArray, List, Any
from coded_flows.utils import CodedFlowsLogger

logger = CodedFlowsLogger(name="Apply Target Encoder", level=logging.INFO)


def apply_target_encoder(
    data: Union[DataSeries, NDArray, List], Encoder: Any, options=None
) -> DataSeries:
    options = options or {}
    verbose = options.get("verbose", True)
    reshape_2d = options.get("reshape_2d", False)
    encoded_data = None
    try:
        verbose and logger.info("Starting Target Encoder application process.")
        if Encoder is None:
            raise ValueError("No Encoder object provided.")
        working_data = None
        if isinstance(data, list):
            working_data = np.array(data)
            verbose and logger.info("Detected Input: Python List.")
        elif isinstance(data, (pa.Array, pa.ChunkedArray)):
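            # Array.to_numpy() defaults to zero-copy, which fails for string
            # data; going through pandas works for Array and ChunkedArray alike.
            working_data = data.to_pandas().to_numpy()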
            verbose and logger.info("Detected Input: PyArrow Array.")
        elif isinstance(data, pl.Series):
            working_data = data.to_numpy()
            verbose and logger.info("Detected Input: Polars Series.")
        elif isinstance(data, pd.Series):
            working_data = data.values
            verbose and logger.info("Detected Input: Pandas Series.")
        elif isinstance(data, np.ndarray):
            working_data = data
            verbose and logger.info("Detected Input: NumPy Array.")
        else:
            raise ValueError(
                f"Unsupported input type: {type(data)}. Expected List, Series, or Array."
            )
        if reshape_2d:
            # Option on: give the encoder a single-column 2D array.
            if working_data.ndim == 1:
                working_data = working_data.reshape(-1, 1)
                verbose and logger.info("Reshaped input data to 2D.")
        elif working_data.ndim > 1:
            # Option off: flatten multi-dimensional input to 1D.
            working_data = working_data.ravel()
            verbose and logger.info("Flattened input data to 1D.")
        verbose and logger.info(f"Applying transform using {type(Encoder).__name__}.")
        if not hasattr(Encoder, "transform"):
            raise AttributeError(
                f"The provided Encoder ({type(Encoder).__name__}) does not have a 'transform' method."
            )
        transformed_data = Encoder.transform(working_data)
        if isinstance(transformed_data, np.ndarray) and transformed_data.ndim > 1:
            if transformed_data.shape[1] == 1:
                transformed_data = transformed_data.ravel()
            else:
                transformed_data = list(transformed_data)
        encoded_data = pd.Series(transformed_data, name="encoded_target")
        verbose and logger.info("Encoder transform applied successfully.")
    except Exception as e:
        # Include the exception text so the log shows what actually failed.
        verbose and logger.error(f"Error during encoder application: {e}")
        raise
    return encoded_data

Brick Info

version v0.1.4

python 3.11, 3.12, 3.13

requirements

shap>=0.47.0
scikit-learn
pandas
pyarrow
numpy
numba>=0.56.0
polars