Clu. Gaussian Mixture
Performs clustering using Gaussian Mixture Models (GMM). Assumes data points are generated from a mixture of Gaussian distributions.
Processing
This brick groups your data points into clusters using a Gaussian Mixture Model (GMM). Unlike simpler clustering methods (like K-Means) that assume clusters are circular, GMM assumes data points come from a mixture of different Gaussian distributions. This allows it to find clusters that are elliptical, tilted, or of different sizes.
It calculates the probability that each data point belongs to a specific cluster and assigns it to the most likely one. It is particularly useful for complex datasets where clusters might overlap or have elongated shapes.
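This soft-assignment behavior is easy to see directly in scikit-learn. A minimal sketch on toy data (all values illustrative, not tied to this brick's internals):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two elongated, overlapping blobs (values are illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], [2.0, 0.5], size=(100, 2)),  # wide, flat cluster
    rng.normal([3, 3], [0.5, 2.0], size=(100, 2)),  # narrow, tall cluster
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

proba = gmm.predict_proba(X[:3])  # per-point membership probabilities
labels = gmm.predict(X[:3])       # hard label = most likely component
print(proba)   # e.g. [[0.98, 0.02], ...]
print(labels)  # e.g. [0, 0, 0]
```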
Inputs
- data
- The dataset you want to analyze and cluster. This must contain only numerical (numbers) or boolean (True/False) values. The algorithm relies on mathematical calculations, so text columns must be removed or encoded before using this brick.
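A minimal preparation sketch, assuming pandas and a hypothetical raw table with a text column:

```python
import pandas as pd

# Hypothetical raw data with a text column the brick cannot consume
raw = pd.DataFrame({
    "age": [25, 32, 47],
    "active": [True, False, True],
    "city": ["Paris", "Lyon", "Paris"],  # text: must be removed or encoded
})

# Option 1: keep only numerical and boolean columns
data = raw.select_dtypes(include=["number", "bool"])

# Option 2: one-hot encode the text column instead of dropping it
data = pd.get_dummies(raw, columns=["city"])
```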
Inputs Types
| Input | Types |
|---|---|
| data | DataFrame |
You can check the list of supported types here: Available Type Hints.
Outputs
- Clustered data
- A copy of your original dataset with a new column added. This new column contains the Cluster ID (an integer starting from 0) assigned to each row.
- Metrics
- A report containing scoring metrics that evaluate how well the data was clustered. Higher scores generally indicate better-defined clusters (except for Davies-Bouldin, where lower is better).
The Metrics output contains the following specific data fields (keys or rows):
- Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters. Range: -1 to 1 (Higher is better).
- Calinski-Harabasz: Also known as the Variance Ratio Criterion. Measures the ratio of dispersion between clusters vs. within clusters. (Higher is better).
- Davies-Bouldin: Measures the average similarity between each cluster and its most similar one. (Lower is better, with 0 being the lowest possible score).
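For intuition, the same three scores can be computed on any labeling with scikit-learn; a toy sketch (synthetic blobs, numbers illustrative only):

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

print(silhouette_score(X, labels))         # -1 to 1, higher is better
print(calinski_harabasz_score(X, labels))  # unbounded, higher is better
print(davies_bouldin_score(X, labels))     # >= 0, lower is better
```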
Outputs Types
| Output | Types |
|---|---|
| Clustered data | DataFrame |
| Metrics | DataFrame, Dict |
You can check the list of supported types here: Available Type Hints.
Options
The Clu. Gaussian Mixture brick exposes the following configurable options; a sketch mapping them to scikit-learn parameters follows the list:
- Number of Components (k)
- The number of clusters you want the algorithm to find.
- Covariance Type
- Controls the shape and orientation of the clusters the model is allowed to find.
- Full: Clusters can be any shape, size, or orientation (most flexible, but computationally expensive).
- Tied: All clusters must have the same shape, size, and orientation.
- Diag: Clusters can be different sizes but must be aligned with the axes (ellipses, not tilted).
- Spherical: Clusters must be spherical (circles/spheres) and can vary in size.
- Initialization Method
- How the algorithm selects the starting points for the clusters.
- KMeans: Uses standard K-Means to find starting centers.
- K-Means++: Uses a smarter K-Means initialization (spreads out centers) to speed up convergence.
- Random: Picks random starting parameters.
- Random From Data: Picks random data points as initial centers.
- Number of Initializations
- The number of times the algorithm will run with different starting points. The best result is kept. Increasing this helps prevent getting stuck in a "bad" solution but takes longer.
- Max. Iterations
- The maximum number of times the algorithm will update its calculations while searching for the best fit.
- Convergence Threshold (tol)
- The precision level. If the improvement between updates is smaller than this number, the algorithm stops early because it has "converged."
- Regularization Covariance
- A very small number added to the calculation to ensure mathematical stability (prevents errors if a cluster becomes too small or flat).
- Cluster Column Name
- The name of the new column that will be added to your dataset containing the cluster labels. Defaults to cluster_id.
- Metrics Output Format
- How you want the performance scores returned: as a DataFrame (a small table) or as a Dictionary.
- Random Seed
- A number that ensures the random initialization is the same every time you run the workflow. Useful for reproducibility.
- Verbose
- If enabled, detailed progress logs will be printed during execution.
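The sketch below shows how these option labels translate into scikit-learn's GaussianMixture arguments, mirroring the normalization performed in the brick code that follows (the option keys are the ones that code reads; values are illustrative):

```python
from sklearn.mixture import GaussianMixture

options = {
    "n_clusters": 3,                    # Number of Components (k)
    "covariance_type": "Full",          # Full / Tied / Diag / Spherical
    "init_params": "Random From Data",  # KMeans / K-Means++ / Random / Random From Data
    "n_init": 1,
    "max_iter": 100,
    "tol": 0.001,
    "reg_covar": 1e-06,
    "random_state": 42,
}

# The brick lower-cases the labels and replaces spaces with underscores,
# e.g. "Full" -> "full", "Random From Data" -> "random_from_data"
model = GaussianMixture(
    n_components=options["n_clusters"],
    covariance_type=options["covariance_type"].lower(),
    init_params=options["init_params"].lower().replace(" ", "_"),
    n_init=options["n_init"],
    max_iter=options["max_iter"],
    tol=options["tol"],
    reg_covar=options["reg_covar"],
    random_state=options["random_state"],
)
```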
Brick Code
```python
import logging
import pandas as pd
import polars as pl
import numpy as np
from scipy import sparse
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (
silhouette_score,
calinski_harabasz_score,
davies_bouldin_score,
)
from coded_flows.types import DataFrame, Tuple, Union, Dict, Any
from coded_flows.utils import CodedFlowsLogger
logger = CodedFlowsLogger(name="Clu. GMM", level=logging.INFO)
def _validate_numerical_data(data):
"""
Validates if the input data contains only numerical (integer, float) or boolean values.
"""
if sparse.issparse(data):
if not (
np.issubdtype(data.dtype, np.number) or np.issubdtype(data.dtype, np.bool_)
):
raise TypeError(
f"Sparse matrix contains unsupported data type: {data.dtype}."
)
return
if isinstance(data, np.ndarray):
if not (
np.issubdtype(data.dtype, np.number) or np.issubdtype(data.dtype, np.bool_)
):
raise TypeError(
f"NumPy array contains unsupported data type: {data.dtype}."
)
return
if isinstance(data, (pd.DataFrame, pd.Series)):
if isinstance(data, pd.Series):
if not (
pd.api.types.is_numeric_dtype(data) or pd.api.types.is_bool_dtype(data)
):
raise TypeError(f"Pandas Series '{data.name}' is not numerical.")
else:
numeric_cols = data.select_dtypes(include=["number", "bool"])
if numeric_cols.shape[1] != data.shape[1]:
invalid_cols = list(set(data.columns) - set(numeric_cols.columns))
raise TypeError(
f"Pandas DataFrame contains non-numerical columns: {invalid_cols}."
)
return
if isinstance(data, (pl.DataFrame, pl.Series)):
if isinstance(data, pl.Series):
if not (data.dtype.is_numeric() or data.dtype == pl.Boolean):
raise TypeError(f"Polars Series '{data.name}' is not numerical.")
else:
for col_name, dtype in zip(data.columns, data.dtypes):
if not (dtype.is_numeric() or dtype == pl.Boolean):
raise TypeError(
f"Polars DataFrame contains non-numerical column: '{col_name}'."
)
return
raise ValueError(f"Unsupported data type: {type(data)}")
def clu_gaussian_mixture(
data: DataFrame, options=None
) -> Tuple[DataFrame, Union[DataFrame, Dict]]:
options = options or {}
verbose = options.get("verbose", True)
n_components = options.get("n_clusters", 3)
covariance_type = options.get("covariance_type", "Full").lower()
init_params = options.get("init_params", "KMeans").lower().replace(" ", "_")
n_init = options.get("n_init", 1)
max_iter = options.get("max_iter", 100)
tol = options.get("tol", 0.001)
reg_covar = options.get("reg_covar", 1e-06)
cluster_col_name = options.get("cluster_column", "cluster_id")
metrics_as = options.get("metrics_as", "Dataframe")
random_state = options.get("random_state", 42)
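    # Row-count threshold: above this, the silhouette score is computed on a sample for speed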
AUTOMATED_THRESHOLD = 10000
Clustered_data = None
Metrics = None
try:
verbose and logger.info(
f"Initializing Clu. GMM (components={n_components}, type={covariance_type})."
)
is_pandas = isinstance(data, pd.DataFrame)
is_polars = isinstance(data, pl.DataFrame)
if not (is_pandas or is_polars):
raise ValueError("Input data must be a pandas or polars DataFrame.")
verbose and logger.info("Validating numerical requirements...")
_validate_numerical_data(data)
n_samples = data.shape[0]
verbose and logger.info(f"Processing {n_samples} samples.")
Model = GaussianMixture(
n_components=n_components,
covariance_type=covariance_type,
tol=tol,
reg_covar=reg_covar,
max_iter=max_iter,
n_init=n_init,
init_params=init_params,
random_state=random_state,
)
verbose and logger.info("Fitting model and predicting labels...")
labels = Model.fit_predict(data)
n_clusters_found = len(set(labels))
verbose and logger.info(f"Resulting number of clusters: {n_clusters_found}")
if n_clusters_found < 2:
verbose and logger.warning(
"Less than 2 clusters found. Metrics cannot be calculated. Returning NaNs."
)
(s_score, ch_score, db_score) = (np.nan, np.nan, np.nan)
else:
verbose and logger.info("Computing Intrinsic metrics.")
if n_samples > AUTOMATED_THRESHOLD:
s_score = silhouette_score(
data, labels, sample_size=min(n_samples, 20000), random_state=42
)
verbose and logger.info(f"Silhouette Score : {s_score:.4f} (Sampled)")
else:
s_score = silhouette_score(data, labels)
verbose and logger.info(f"Silhouette Score : {s_score:.4f}")
ch_score = calinski_harabasz_score(data, labels)
verbose and logger.info(f"Calinski-Harabasz : {ch_score:.4f}")
db_score = davies_bouldin_score(data, labels)
verbose and logger.info(f"Davies-Bouldin : {db_score:.4f}")
metric_names = ["Silhouette Score", "Calinski-Harabasz", "Davies-Bouldin"]
metric_values = [s_score, ch_score, db_score]
if metrics_as == "Dataframe":
Metrics = pd.DataFrame({"Metric": metric_names, "Value": metric_values})
else:
Metrics = dict(zip(metric_names, metric_values))
if is_pandas:
Clustered_data = data.copy()
Clustered_data[cluster_col_name] = labels
verbose and logger.info("Results assigned in-place to Pandas DataFrame.")
elif is_polars:
Clustered_data = data.with_columns(
pl.Series(name=cluster_col_name, values=labels)
)
verbose and logger.info("Results attached to Polars DataFrame.")
except Exception as e:
verbose and logger.error(f"Clu. GMM operation failed: {e}")
raise
    return (Clustered_data, Metrics)
```
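A minimal usage sketch, assuming the function above is in scope and a small pandas DataFrame (values illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 1.1, 0.9, 8.0, 8.2, 7.9],
    "y": [0.5, 0.4, 0.6, 5.0, 5.1, 4.9],
})

clustered, metrics = clu_gaussian_mixture(
    df,
    options={"n_clusters": 2, "covariance_type": "Full", "metrics_as": "Dataframe"},
)
print(clustered["cluster_id"])  # 0/1 labels appended as a new column
print(metrics)                  # Silhouette, Calinski-Harabasz, Davies-Bouldin
```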
Brick Info
- shap>=0.47.0
- scikit-learn
- pandas
- numpy
- scipy
- numba>=0.56.0
- polars