Clu. OPTICS
Performs OPTICS clustering and computes intrinsic metrics.
Processing
This brick organizes your data points into groups (clusters) based on how closely packed they are, using the OPTICS (Ordering Points To Identify the Clustering Structure) algorithm.
Unlike simpler clustering methods that assume all clusters have the same density, OPTICS is designed to find clusters of varying densities and shapes. It works by identifying dense areas of data points and separating them from sparse areas (noise). It is particularly useful when your data contains natural groups that aren't perfectly round or equally spaced.
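As a rough illustration of what the brick runs under the hood, here is a minimal scikit-learn sketch; the two blobs and the distant outlier are invented for this example:

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two blobs of different density plus one far-away outlier.
rng = np.random.default_rng(42)
dense = rng.normal(loc=0.0, scale=0.3, size=(40, 2))
loose = rng.normal(loc=5.0, scale=1.0, size=(40, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([dense, loose, outlier])

labels = OPTICS(min_samples=5).fit_predict(X)
print(sorted(set(labels)))  # cluster IDs; -1 (if present) marks noise
```

OPTICS assigns each row a cluster ID, reserving -1 for points that do not belong to any dense region.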
Inputs
- data
- The dataset containing the numerical values you want to analyze. All columns used for clustering must be numbers (integers or floats). The algorithm uses these values to calculate distances between points.
Input Types
| Input | Types |
|---|---|
| data | DataFrame |
You can check the list of supported types here: Available Type Hints.
Outputs
- Clustered data
- A copy of your original dataset with a new column added. This column contains the ID of the cluster assigned to each row. Points identified as "noise" or outliers are labeled as -1.
- Metrics
- A summary of how well the clustering performed, containing standard statistical scores. Depending on your settings, this is returned as either a small table (DataFrame) or a list of key-value pairs (Dictionary).
The Metrics output contains the following specific data fields (keys):
- Silhouette Score: Ranges from -1 to 1. A high value indicates that items are well matched to their own cluster and poorly matched to neighboring clusters.
- Calinski-Harabasz: A ratio of dispersion. Higher scores generally indicate better-defined clusters.
- Davies-Bouldin: The average similarity measure of each cluster with its most similar cluster. Lower values indicate better clustering (0 is the minimum).
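All three scores come straight from scikit-learn. A small sketch on an invented two-blob dataset shows the expected directions: silhouette near 1, a large Calinski-Harabasz ratio, and Davies-Bouldin near 0 for tight, well-separated clusters:

```python
import numpy as np
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

# Two tight, well-separated clusters with hand-picked labels.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
labels = np.array([0, 0, 0, 1, 1, 1])

s = silhouette_score(X, labels)    # near 1: points match their own cluster
ch = calinski_harabasz_score(X, labels)  # large: well-defined clusters
db = davies_bouldin_score(X, labels)     # near 0: little cluster overlap
print(round(s, 3), round(ch, 1), round(db, 3))
```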
Output Types
| Output | Types |
|---|---|
| Clustered data | DataFrame |
| Metrics | DataFrame, Dict |
You can check the list of supported types here: Available Type Hints.
Options
The Clu. OPTICS brick exposes the following configurable options:
- Min. Samples
- The minimum number of items required to form a dense region (a cluster). A lower value (e.g., 2 or 3) detects smaller, more fragmented clusters. A higher value (e.g., 10+) requires larger groups to form a cluster, smoothing over small variations.
- Max. Epsilon (Distance)
- The maximum distance between two points for them to be considered neighbors.
- High Value (Default): Allows the algorithm to inspect the entire dataset structure.
- Low Value: Limits the search radius, potentially speeding up processing but possibly fragmenting clusters.
- Distance Metric
- The formula used to calculate the distance between two points.
- Euclidean: Standard straight-line distance (like using a ruler).
- Manhattan: Grid-like distance (like walking city blocks). Good for high-dimensional data.
- Cosine: Measures the angle between points.
- Chebyshev: The greatest distance along any single dimension.
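These metrics can give quite different numbers for the same pair of points. A quick SciPy sketch (the points are invented) makes the contrast concrete:

```python
import numpy as np
from scipy.spatial import distance

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])

print(distance.euclidean(a, b))  # 5.0  (straight-line, sqrt(3^2 + 4^2))
print(distance.cityblock(a, b))  # 7.0  (Manhattan: 3 + 4)
print(distance.chebyshev(a, b))  # 4.0  (largest single-axis gap)
# Cosine looks only at direction: orthogonal vectors are maximally distant.
print(distance.cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 1.0
```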
- Cluster Method
- The specific approach used to extract clusters from the calculated reachability plot.
- xi: Identifies clusters by looking for significant steep changes (drops) in density. Generally more flexible for varying densities.
- dbscan: Extracts clusters using a fixed threshold. This mimics the behavior of the DBSCAN algorithm.
- Xi (Steepness Threshold)
- Only used if Cluster Method is set to "xi". It determines how steep the density contrast must be to define a cluster boundary.
- Higher value (near 1.0): Only very sharp changes in density create new clusters.
- Lower value (near 0.0): Subtle changes in density will create separate clusters.
- Search Algorithm
- The underlying algorithm used to compute nearest neighbors.
- Auto: Automatically selects the best method based on your data size and structure.
- KD-Tree: Efficient for lower-dimensional data.
- Ball-Tree: Efficient for higher-dimensional data.
- Brute: Forcefully calculates all distances. Slow for large data but exact.
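These same backends are exposed by scikit-learn's NearestNeighbors. A quick sketch on invented random data shows that they all return the same answer; the choice only affects speed and memory:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(7).normal(size=(200, 3))
nearest = {}
for algo in ("auto", "kd_tree", "ball_tree", "brute"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    _, idx = nn.kneighbors(X[:1])
    nearest[algo] = int(idx[0][0])
    print(algo, nearest[algo])  # the query point is its own nearest neighbor (index 0)
```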
- Leaf Size
- A technical optimization parameter for the Tree algorithms. It affects the speed of the search and memory usage. The default is usually sufficient.
- Cluster Column Name
- The name of the new column created in the Clustered data output. Default is cluster_id.
- Metrics Output Format
- Choose how the Metrics output is structured: a small table (DataFrame) or key-value pairs (Dict).
- Verbose
- If enabled, detailed progress logs will be printed to the console during execution.
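To get a feel for how these options interact, here is a sketch using plain scikit-learn (which the brick wraps); the data, the eps=0.7 threshold, and the min_samples values are invented for illustration:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 2)),  # tight blob
    rng.normal(4.0, 0.8, size=(30, 2)),  # looser blob
])

def n_clusters(labels):
    # Count cluster IDs, excluding the -1 noise label.
    return len(set(labels)) - (1 if -1 in labels else 0)

# "xi" extraction adapts to varying density; "dbscan" uses a fixed eps cut.
xi_labels = OPTICS(min_samples=5, cluster_method="xi", xi=0.05).fit_predict(X)
db_labels = OPTICS(min_samples=5, cluster_method="dbscan", eps=0.7).fit_predict(X)
# A stricter min_samples demands larger dense regions before forming a cluster.
strict_labels = OPTICS(min_samples=20, cluster_method="xi").fit_predict(X)

print(n_clusters(xi_labels), n_clusters(db_labels), n_clusters(strict_labels))
```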
import logging
import pandas as pd
import polars as pl
import numpy as np
from scipy import sparse
from sklearn.cluster import OPTICS
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)
from coded_flows.types import DataFrame, Tuple, Union, Dict, Any
from coded_flows.utils import CodedFlowsLogger
logger = CodedFlowsLogger(name="Clu. OPTICS", level=logging.INFO)
def _validate_numerical_data(data):
    """
    Validates if the input data contains only numerical (integer, float) or boolean values.
    """
    if sparse.issparse(data):
        if not (
            np.issubdtype(data.dtype, np.number) or np.issubdtype(data.dtype, np.bool_)
        ):
            raise TypeError(
                f"Sparse matrix contains unsupported data type: {data.dtype}."
            )
        return
    if isinstance(data, np.ndarray):
        if not (
            np.issubdtype(data.dtype, np.number) or np.issubdtype(data.dtype, np.bool_)
        ):
            raise TypeError(
                f"NumPy array contains unsupported data type: {data.dtype}."
            )
        return
    if isinstance(data, (pd.DataFrame, pd.Series)):
        if isinstance(data, pd.Series):
            if not (
                pd.api.types.is_numeric_dtype(data) or pd.api.types.is_bool_dtype(data)
            ):
                raise TypeError(f"Pandas Series '{data.name}' is not numerical.")
        else:
            numeric_cols = data.select_dtypes(include=["number", "bool"])
            if numeric_cols.shape[1] != data.shape[1]:
                invalid_cols = list(set(data.columns) - set(numeric_cols.columns))
                raise TypeError(
                    f"Pandas DataFrame contains non-numerical columns: {invalid_cols}."
                )
        return
    if isinstance(data, (pl.DataFrame, pl.Series)):
        if isinstance(data, pl.Series):
            if not (data.dtype.is_numeric() or data.dtype == pl.Boolean):
                raise TypeError(f"Polars Series '{data.name}' is not numerical.")
        else:
            for col_name, dtype in zip(data.columns, data.dtypes):
                if not (dtype.is_numeric() or dtype == pl.Boolean):
                    raise TypeError(
                        f"Polars DataFrame contains non-numerical column: '{col_name}'."
                    )
        return
    raise ValueError(f"Unsupported data type: {type(data)}")
def clu_optics(
    data: DataFrame, options=None
) -> Tuple[DataFrame, Union[DataFrame, Dict]]:
    options = options or {}
    verbose = options.get("verbose", True)
    min_samples = options.get("min_samples", 5)
    max_eps = options.get("max_eps", 100000.0)
    metric = options.get("metric", "euclidean").lower()
    cluster_method = options.get("cluster_method", "xi").lower()
    xi = options.get("xi", 0.05)
    algorithm = options.get("algorithm", "auto").lower().replace("-", "_")
    leaf_size = options.get("leaf_size", 30)
    cluster_col_name = options.get("cluster_column", "cluster_id")
    metrics_as = options.get("metrics_as", "Dataframe")
    AUTOMATED_THRESHOLD = 10000  # above this, the silhouette score is sampled
    Clustered_data = None
    Metrics = None
    try:
        verbose and logger.info(
            f"Initializing Clu. OPTICS (method={cluster_method}, min_samples={min_samples})."
        )
        is_pandas = isinstance(data, pd.DataFrame)
        is_polars = isinstance(data, pl.DataFrame)
        if not (is_pandas or is_polars):
            raise ValueError("Input data must be a pandas or polars DataFrame.")
        verbose and logger.info("Validating numerical requirements...")
        _validate_numerical_data(data)
        n_samples = data.shape[0]
        verbose and logger.info(f"Processing {n_samples} samples.")
        # Treat very large max_eps values as an unbounded search radius.
        final_max_eps = np.inf if max_eps >= 100000000.0 else max_eps
        Model = OPTICS(
            min_samples=min_samples,
            max_eps=final_max_eps,
            metric=metric,
            cluster_method=cluster_method,
            xi=xi,
            algorithm=algorithm,
            leaf_size=leaf_size,
            n_jobs=-1,
        )
        verbose and logger.info("Fitting model and predicting labels...")
        labels = Model.fit_predict(data)
        n_clusters_found = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = list(labels).count(-1)
        verbose and logger.info(
            f"Found {n_clusters_found} clusters. Noise points: {n_noise}."
        )
        unique_labels = set(labels)
        if len(unique_labels) < 2:
            verbose and logger.warning(
                "Not enough clusters (or only noise) found to calculate metrics. Returning NaNs."
            )
            (s_score, ch_score, db_score) = (np.nan, np.nan, np.nan)
        else:
            verbose and logger.info("Computing Intrinsic metrics.")
            if n_samples > AUTOMATED_THRESHOLD:
                # Sample the silhouette computation: the full score is O(n^2).
                s_score = silhouette_score(
                    data, labels, sample_size=min(n_samples, 20000), random_state=42
                )
                verbose and logger.info(f"Silhouette Score : {s_score:.4f} (Sampled)")
            else:
                s_score = silhouette_score(data, labels)
                verbose and logger.info(f"Silhouette Score : {s_score:.4f}")
            ch_score = calinski_harabasz_score(data, labels)
            verbose and logger.info(f"Calinski-Harabasz : {ch_score:.4f}")
            db_score = davies_bouldin_score(data, labels)
            verbose and logger.info(f"Davies-Bouldin : {db_score:.4f}")
        metric_names = ["Silhouette Score", "Calinski-Harabasz", "Davies-Bouldin"]
        metric_values = [s_score, ch_score, db_score]
        if metrics_as == "Dataframe":
            Metrics = pd.DataFrame({"Metric": metric_names, "Value": metric_values})
        else:
            Metrics = dict(zip(metric_names, metric_values))
        if is_pandas:
            Clustered_data = data.copy()
            Clustered_data[cluster_col_name] = labels
            verbose and logger.info(
                "Results attached to a copy of Pandas DataFrame (original preserved)."
            )
        elif is_polars:
            Clustered_data = data.with_columns(
                pl.Series(name=cluster_col_name, values=labels)
            )
            verbose and logger.info("Results attached to Polars DataFrame.")
    except Exception as e:
        verbose and logger.error(f"Clu. OPTICS operation failed: {e}")
        raise
    return (Clustered_data, Metrics)
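The core pipeline above (fit OPTICS, attach the label column, compute a score) can be reproduced standalone. This sketch mirrors the brick's pandas path on invented two-blob data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import OPTICS
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "x": np.concatenate([rng.normal(0, 0.3, 30), rng.normal(5, 0.3, 30)]),
    "y": np.concatenate([rng.normal(0, 0.3, 30), rng.normal(5, 0.3, 30)]),
})

# Fit, then attach labels to a copy, as the brick does for pandas input.
labels = OPTICS(min_samples=5).fit_predict(df)
clustered = df.copy()
clustered["cluster_id"] = labels

# Metrics are only defined when at least two distinct labels exist.
if len(set(labels)) >= 2:
    print(round(silhouette_score(df, labels), 3))
```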
Brick Info
- shap>=0.47.0
- scikit-learn
- pandas
- numpy
- scipy
- numba>=0.56.0
- polars