
BSMM8740-2-R-2025F [WEEK - 5]
Last time we introduced the Tidymodels framework in R
We showed how we can use the Tidymodels framework to create a workflow for data prep, feature engineering, model fitting and model evaluation.
Today we look at using the Tidymodels framework to build classification and clustering models.
Eager learners are machine learning algorithms that first build a model from the training dataset before making any prediction on future datasets. They spend more time on the training process to better generalize from the data.
They usually require less time to make predictions. Example eager learners are:
Logistic Regression:
Decision Trees:
Random Forests:
Support Vector Machines (SVM):
Artificial Neural Networks:
Gradient Boosting Machines (GBM):
Lazy learners, or instance-based learners, do not build a model from the training data up front, which is where the "lazy" comes from. They simply memorize the training data, and each time a prediction is needed they predict based on the similarity between the query measurement and the stored measurements.
Example lazy learners are:
k-Nearest Neighbors (k-NN):
Case-based Learning:
Logistic regression is a Generalized Linear Model where the dependent (categorical) variable \(y\) is binary, i.e. takes values in \(\{0,1\}\) (e.g., yes/no, success/failure).
This can be interpreted as identifying two classes, and logistic regression provides a prediction for class membership based on a linear combination of the explanatory variables.
Logistic regression is an example of supervised learning.
For the logistic GLM:
It follows that \(\pi = \frac{e^\eta}{1+e^\eta} = \frac{1}{1+e^{-\eta}}\), which is a sigmoid function in the explanatory variables. The equation \(\eta=0\) defines a linear decision boundary or classification threshold.

\[\begin{align*} \pi & =\frac{1}{1+e^{-\eta}}\;\text{(logistic function)}\\ \log\left(\frac{\pi}{1-\pi}\right) & =\log\left(\frac{\frac{1}{1+e^{-\eta}}}{1-\frac{1}{1+e^{-\eta}}}\right)\\ & =\log\left(\frac{\frac{1}{1+e^{-\eta}}}{\frac{e^{-\eta}}{1+e^{-\eta}}}\right)\\ &=\log\left(e^{\eta}\right)=\eta \end{align*}\]
The term \(\frac{\pi}{1-\pi}\) is called the odds. By its definition:
\[ \frac{\pi}{1-\pi}=e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\ldots+\beta_n x_n} \]
So if \(x_1\) increases by one unit (\(x_1\rightarrow x_1+1\)), the odds are multiplied by \(e^{\beta_1}\); this multiplicative factor is the odds ratio associated with \(x_1\).
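As a quick illustration (separate from the workflow code below), the coefficients of a fitted logistic regression can be exponentiated to read them as odds multipliers. A minimal base-R sketch on the ISLR::Default data used later in this lecture:

```r
# Illustrative sketch: odds-ratio interpretation of a logistic regression coefficient
fit <- glm(default ~ balance, data = ISLR::Default, family = binomial)

# exp(beta_1) is the multiplicative change in the odds of default for a
# one-unit (one-dollar) increase in balance (roughly 1.0055 for these data)
exp(coef(fit))["balance"]
```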
The confusion matrix is a 2x2 table summarizing the model's correct and incorrect predictions (which depend on the decision boundary). It is the foundation for the other evaluation metrics.
|  | predict 1 | predict 0 |
|---|---|---|
| data = 1 | true positives (TP) | false negatives (FN) |
| data = 0 | false positives (FP) | true negatives (TN) |
Accuracy measures the percent of correct predictions:
\[ \begin{align*} \frac{\text{TP}+\text{TN}}{\text{observation count}} \end{align*} \]
Accuracy is a useful metric in evaluating classification models under certain conditions:
Balanced Datasets: When the classes in the dataset are roughly equal in number, accuracy can be a reliable indicator of model performance. For example, if you have a dataset where 50% of the samples are class A and 50% are class B, accuracy is a good measure of whether the model correctly predicts the classes.
General Performance: For an overall sense of a model’s performance, accuracy provides a straightforward, easy-to-understand measure, giving the proportion of correct predictions out of all predictions made.
Accuracy has several limitations, especially in the context of imbalanced datasets:
Imbalanced Datasets:
Ignoring Class Importance:
Lack of Insight into Model Behavior:
Fraud Detection:
Spam Detection:
Precision measures the percent of positive predictions that are correct (true positives / all positives predicted):
\[ \frac{\text{TP}}{\text{TP}+\text{FP}} \]
Recall (sensitivity) measures the success at predicting the first class (true positives predicted / actual positives):
\[ \frac{\text{TP}}{\text{TP}+\text{FN}}\qquad\text{(True Positive Rate - TPR)} \]
Specificity measures the success at predicting the second class (true negatives predicted / actual negatives):
\[ \frac{\text{TN}}{\text{TN}+\text{FP}}\qquad\text{(True Negative Rate - TNR)} \]
Receiver Operating Characteristic (ROC) curve & the Area Under the Curve (AUC)
ROC Curve: Plot of the true positive rate (Recall) against the false positive rate (1 - Specificity) at various threshold settings.
AUC: The area under the ROC curve, representing the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance.
Consider plotting the TPR against the FPR (1-TNR) at different classification thresholds.
So, how do we compute these in practice? First, fit a logistic regression model using a Tidymodels workflow:
> data <- ISLR::Default %>% tibble::as_tibble()
> set.seed(8740)
>
> # split data
> data_split <- rsample::initial_split(data)
> default_train <- rsample::training(data_split)
>
> # create a recipe
> default_recipe <- default_train %>%
+ recipes::recipe(formula = default ~ student + balance + income) %>%
+ recipes::step_dummy(recipes::all_nominal_predictors())
>
> # create a logistic regression model
> default_model <- parsnip::logistic_reg() %>%
+ parsnip::set_engine("glm") %>%
+ parsnip::set_mode("classification")
>
> # create a workflow
> default_workflow <- workflows::workflow() %>%
+ workflows::add_recipe(default_recipe) %>%
+ workflows::add_model(default_model)
> # fit the model
> lm_fit <-
+ default_workflow %>%
+ parsnip::fit(default_train)
>
> # augment the data with the predictions using the model fit
> training_results <-
+ broom::augment(lm_fit , default_train)
>
> training_results %>% dplyr::slice_head(n=6)
# A tibble: 6 × 7
.pred_class .pred_No .pred_Yes default student balance income
<fct> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
1 No 0.998 0.00164 No No 759. 45774.
2 No 1.00 0.000145 No Yes 452. 19923.
3 No 0.770 0.230 Yes No 1666. 30070.
4 No 1.00 0.000256 No No 434. 57146.
5 No 1.00 0.000449 No No 536. 32994.
6 No 0.973 0.0269 No No 1252. 32721.
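With the augmented predictions in hand, the confusion matrix, accuracy, and ROC/AUC can be computed with yardstick. A sketch, assuming the training_results tibble above; since "Yes" (default) is the second factor level, it is flagged as the event with event_level = "second":

```r
# confusion matrix and accuracy from the hard class predictions
training_results %>% yardstick::conf_mat(truth = default, estimate = .pred_class)
training_results %>% yardstick::accuracy(truth = default, estimate = .pred_class)

# ROC AUC and ROC curve use the predicted probability of the event class ("Yes")
training_results %>%
  yardstick::roc_auc(truth = default, .pred_Yes, event_level = "second")

training_results %>%
  yardstick::roc_curve(truth = default, .pred_Yes, event_level = "second") %>%
  ggplot2::autoplot()
```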
Classification Threshold
Recall:
Classification Threshold Impact
True Positives (TP) and False Positives (FP):
True Negatives (TN) and False Negatives (FN):
Trade-offs:
Fraud Detection:
- In fraud detection, missing a fraudulent transaction (false negative) might be more costly than flagging a legitimate transaction as fraud (false positive).
- You might choose a lower threshold to ensure higher sensitivity (recall), even if it means a higher false positive rate, thereby catching more fraudulent transactions.
The choice of classification threshold in computing the ROC curve is crucial for balancing the trade-offs between sensitivity and specificity, and ultimately for optimizing the model’s performance in a way that aligns with business goals and context. Understanding and carefully selecting the appropriate threshold ensures that the model’s predictions are most useful and cost-effective for the specific application.
Bayes Rule - using the rules of conditional probability:
\[ \mathbb{P}(A,B) = \mathbb{P}(A|B)\mathbb{P}(B) = \mathbb{P}(B|A)\mathbb{P}(A) \] We can write:
\[ \mathbb{P}(B|A) = \frac{\mathbb{P}(A|B)\mathbb{P}(B)}{\mathbb{P}(A)} \]
This method starts with Bayes rule:
for \(K\) classes (\(C_1,\ldots,C_K\)) and an observation \(x\) consisting of \(N\) features \(\{x_1,\ldots,x_N\}\), since \(\mathbb{P}\left[\left.C_{k}\right|x_{1},\ldots,x_{N}\right]\times\mathbb{P}\left[x_{1},\ldots,x_{N}\right]\) is equal to \(\mathbb{P}\left[\left.x_{1},\ldots,x_{N}\right|C_{k}\right]\times\mathbb{P}\left[C_{k}\right]\), we can write
\[ \mathbb{P}\left[\left.C_{k}\right|x_{1},\ldots,x_{N}\right]=\frac{\mathbb{P}\left[\left.x_{1},\ldots,x_{N}\right|C_{k}\right]\times\mathbb{P}\left[C_{k}\right]}{\mathbb{P}\left[x_{1},\ldots,x_{N}\right]} \]
If we assume that the features are all independent we can write Bayes rule as
\[ \mathbb{P}\left[\left.C_{k}\right|x_{1},\ldots,x_{N}\right]=\frac{\mathbb{P}\left[C_{k}\right]\times\prod_{n=1}^{N}\mathbb{P}\left[\left.x_{n}\right|C_{k}\right]}{\prod_{n=1}^{N}\mathbb{P}\left[x_{n}\right]} \]
and since the denominator is independent of \(C_{k}\), our classifier is
\[ C_{k}=\arg\max_{C_{k}}\mathbb{P}\left[C_{k}\right]\prod_{n=1}^{N}\mathbb{P}\left[\left.x_{n}\right|C_{k}\right] \]
So it remains to calculate the class probability \(\mathbb{P}\left[C_{k}\right]\) and the conditional probabilities \(\mathbb{P}\left[\left.x_{n}\right|C_{k}\right]\)
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the conditional probabilities.
If our features are all nominal (i.e. categorical), then
The class probabilities are simply the frequency of observations that belong to each class divided by the total number of observations.
The conditional probabilities are the frequency of each feature value for a given class value divided by the frequency of measurements with that class value.
If any features are numeric, we can estimate the conditional probabilities by assuming that each numeric feature has a Gaussian distribution within each class, as sketched below.
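A sketch of these calculations done by hand on the default_train data from above (dplyr only; the naive Bayes engine performs the equivalent computations internally):

```r
library(dplyr)

# class probabilities P(C_k): relative frequency of each class
default_train %>% count(default) %>% mutate(p = n / sum(n))

# conditional probabilities P(student | C_k): frequency of each feature value
# within each class
default_train %>%
  count(default, student) %>%
  group_by(default) %>%
  mutate(p = n / sum(n)) %>%
  ungroup()

# Gaussian assumption for a numeric feature: per-class mean and sd of balance
default_train %>%
  group_by(default) %>%
  summarise(mean_balance = mean(balance), sd_balance = sd(balance))
```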
> library(discrim)
> # create a naive bayes classifier
> default_model_nb <- parsnip::naive_Bayes() %>%
+ parsnip::set_engine("klaR") %>%
+ parsnip::set_mode("classification")
>
> # create a workflow
> default_workflow_nb <- workflows::workflow() %>%
+ workflows::add_recipe(default_recipe) %>%
+ workflows::add_model(default_model_nb)
>
> # fit the model
> lm_fit_nb <-
+ default_workflow_nb %>%
+ parsnip::fit(
+ default_train
+ , control =
+ workflows::control_workflow(parsnip::control_parsnip(verbosity = 1L))
+ )
>
> # augment the data with the predictions using the model fit
> training_results_nb <-
+ broom::augment(lm_fit_nb , default_train)
The k-nearest neighbors algorithm, also known as KNN, kNN, or k-NN, is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point.
It is typically used as a classification algorithm, working off the assumption that observations located near one another in the feature space are likely to belong to the same class.
For classification problems, a class label is assigned on the basis of a majority vote—i.e. the label that is most frequently represented around a given data point is used.
Before a classification can be made, the distance between points must be defined. Euclidean distance is most commonly used.
Note that the KNN algorithm is also part of a family of “lazy learning” models, meaning that it only stores a training dataset versus undergoing a training stage. This also means that all the computation occurs when a classification or prediction is being made.
The k value in the k-NN algorithm determines how many neighbors will be checked to determine the classification of a specific query point.
Our classification workflow only differs by the model, e.g.:
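For example, a k-NN classifier can be swapped into the earlier workflow. A sketch, where the kknn engine and k = 5 are arbitrary choices; in practice the recipe should also normalize the numeric predictors (e.g. with recipes::step_normalize()), since k-NN is distance-based:

```r
# k-NN model specification
default_model_knn <- parsnip::nearest_neighbor(neighbors = 5) %>%
  parsnip::set_engine("kknn") %>%
  parsnip::set_mode("classification")

# reuse the existing workflow, replacing only the model
default_workflow_knn <- default_workflow %>%
  workflows::update_model(default_model_knn)

knn_fit <- default_workflow_knn %>% parsnip::fit(default_train)
```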
k-NN regression
To use k-NN for a regression problem, calculate the mean or median (or another aggregate measure) of the dependent variable among the k neighbors.
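A sketch of the corresponding regression specification (assuming a numeric outcome; with the kknn engine the prediction aggregates the outcomes of the k nearest neighbors):

```r
knn_reg_spec <- parsnip::nearest_neighbor(neighbors = 5) %>%
  parsnip::set_engine("kknn") %>%
  parsnip::set_mode("regression")
```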
In the context of k-Nearest Neighbors (kNN) classification, while the general evaluation metrics like accuracy, precision, recall, F1 score, and others are commonly used, there are no unique metrics that are exclusively specific to kNN. However, there are certain considerations and additional analyses that are particularly relevant when evaluating a kNN model.
Choice of k (Number of Neighbors): The value of k affects the performance of a kNN model. Testing the model with various values of k and evaluating the performance using standard metrics (like accuracy, F1 score) can help to select the best k.
Feature Scaling Sensitivity: kNN is sensitive to the scale of the features because it relies on calculating distances. Evaluate the model’s performance before and after feature scaling (like Min-Max scaling or Z-score normalization).
Curse of Dimensionality: kNN can perform poorly with high-dimensional data (many features). Evaluating the model’s performance in relation to the number of features (dimensionality) can be important. Dimensionality reduction techniques like PCA might be used with kNN.
The SVM assumes a training set of the form \((x_1,y_1),\ldots,(x_n,y_n)\) where the \(y_i\) are either \(-1\) or \(1\), indicating the class to which each \(x_i\) belongs.
The SVM algorithm looks to find the maximum-margin hyperplane that divides the group of points \(x_i\) for which \(Y=-1\) from the group for which \(Y=1\), such that the distance between the hyperplane and the nearest point \(x_i\) from either group is maximized.
Illustration of SVM max-margin principle
Support Vector Machines solve a constrained optimization problem to find the optimal separating hyperplane. For a binary classification problem with training data \((x_i, y_i)\) where \(y_i \in \{-1, +1\}\), the hyperplane is defined by:
\[ \begin{equation} w^T x + b = 0 \end{equation} \]
where \(w\) is the weight vector (normal to the hyperplane) and \(b\) is the bias term.
To show that \(w\) is normal to the hyperplane:
pick two points \(x_1,x_2\) on the hyperplane (i.e. both satisfy \(w^T x + b = 0\)). Note that their difference \(v=x_1-x_2\) lies entirely in the hyperplane.
compute \(w^Tv = w^T(x_1-x_2)\). Since both points satisfy \(w^T x + b = 0\), i.e. \(w^T x = -b\), we get \(w^Tv = -b-(-b) = 0\). Thus \(w\) is perpendicular to any vector in the hyperplane.
The margin width between the two classes (that the hyperplane perfectly separates) is proportional to \(\frac{1}{||w||}\), so to maximize the width we want to minimize \(||w||\).
So we solve for
\[ \min_{w,b}\frac{1}{2}||w||^2 \]
with the constraint that all points are beyond the margins.
For real-world data with noise (i.e. not perfectly separable), we introduce slack variables \(\xi_i\ge 0\) and solve
\[ \min_{w,b}\left(\frac{1}{2}||w||^2 + C\sum_{i=1}^n\xi_i\right) \]
with the constraint that all points are within \(\xi_i\) of the margin, for all \(i\).
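Written out with explicit constraints, the standard soft-margin formulation is (the hard-margin case corresponds to forcing \(\xi_i=0\)):

\[\begin{align*} \min_{w,b,\xi}\;&\frac{1}{2}||w||^{2}+C\sum_{i=1}^{n}\xi_{i}\\ \text{subject to}\;&y_{i}\left(w^{T}x_{i}+b\right)\ge1-\xi_{i},\quad\xi_{i}\ge0,\quad i=1,\ldots,n \end{align*}\]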
The parameter \(C\) controls the trade-off between margin maximization and misclassification penalty.
Support vectors are the data points that lie closest to the decision surface (or hyperplane)
They are the data points most difficult to classify
They have direct bearing on the optimum location of the decision surface
Support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed
The linear SVM constructs a linear decision boundary (hyperplane) to separate classes in the feature space. It aims to find the hyperplane that maximizes the margin between the closest points (support vectors) of the classes. The decision function is \(f(x) = w \cdot x + b\), where \(w\) is the weight vector and \(b\) is the bias term.
There are similar SVM methods that are adapted to more complex boundaries.
If the data is not separable:
a transformation of data may make them separable
an embedding in a higher dimensional space might make them separable (the “kernel trick”)

Polynomial SVM uses a polynomial kernel to create a non-linear decision boundary. It transforms the input features into higher-dimensional space where a linear separation is possible.
The polynomial kernel is \(K(x, x') = (\gamma\, x \cdot x' + c)^d\), where \(d\) is the degree of the polynomial, \(\gamma\) is a scale factor, and \(c\) is a constant offset. This kernel is implemented through parsnip::svm_poly and engine kernlab.
RBF SVM uses the Radial Basis Function (Gaussian) kernel to handle non-linear classification problems. It maps the input space into an infinite-dimensional space where a linear separation is possible.
The RBF kernel is \(K(x, x') = \exp\left(-\gamma \left\Vert x - x'\right\Vert^2\right)\), where \(\gamma\) controls the width of the Gaussian function. This kernel is implemented through parsnip::svm_rbf and engine kernlab.
| Feature | Linear SVM | Polynomial SVM | RBF SVM |
|---|---|---|---|
| Kernel Function | Linear | Polynomial | Radial Basis Function |
| Equation | \(w \cdot x + b\) | \((\gamma\, x \cdot x' + c)^d\) | \(\exp\left(-\gamma \left\Vert x - x'\right\Vert^2\right)\) |
| Complexity | Low | Medium to High (depending on \(d\)) | High |
| Interpretability | High | Medium | Low |
| Computational Cost | Low | Medium to High (higher with increasing \(d\)) | High |
| Flexibility | Low | Medium to High | High |
| Risk of Overfitting | Low | Medium to High (higher with increasing \(d\)) | Medium to High (depends on \(\gamma\) and \(C\)) |
| Typical Use Cases | Linearly separable, high-dimensional spaces (e.g., text) | Data with polynomial relationships | Highly non-linear data, complex patterns |
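As a sketch, the three variants compared above can be specified with parsnip (kernlab engine throughout; the parameter values below are placeholders, not tuned):

```r
svm_linear_spec <- parsnip::svm_linear(cost = 1) %>%
  parsnip::set_engine("kernlab") %>%
  parsnip::set_mode("classification")

svm_poly_spec <- parsnip::svm_poly(cost = 1, degree = 2, scale_factor = 1) %>%
  parsnip::set_engine("kernlab") %>%
  parsnip::set_mode("classification")

svm_rbf_spec <- parsnip::svm_rbf(cost = 1, rbf_sigma = 0.1) %>%
  parsnip::set_engine("kernlab") %>%
  parsnip::set_mode("classification")

# any of these can replace the model in the earlier workflow, e.g.
# default_workflow %>% workflows::update_model(svm_rbf_spec)
```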
When classes are imbalanced, common remedies are to weight the classes (e.g. setting a class-weight parameter to 'balanced' or manually specifying weights), or to resample: oversampling the minority class (e.g. themis::step_smote()) or undersampling the majority class to balance the class distribution before training the model.
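A sketch of the resampling approach with themis (assumes the themis package is installed; the recipe mirrors the one used earlier, with normalization added because SMOTE uses nearest neighbours):

```r
# recipe that oversamples the minority class ("Yes") with SMOTE
default_recipe_balanced <- default_train %>%
  recipes::recipe(formula = default ~ student + balance + income) %>%
  recipes::step_dummy(recipes::all_nominal_predictors()) %>%
  recipes::step_normalize(recipes::all_numeric_predictors()) %>%
  themis::step_smote(default)
```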
Classification and clustering serve different purposes in machine learning. Classification is a supervised learning technique used for predicting predefined labels, requiring labeled data and focusing on accuracy and interpretability. Clustering, on the other hand, is an unsupervised learning technique used for discovering natural groupings in data, requiring no labeled data and focusing on exploratory data analysis and pattern discovery. Understanding the strengths and limitations of each method is crucial for applying them effectively to solve real-world problems.
Cluster analysis refers to algorithms that group similar objects into groups called clusters. The endpoint of cluster analysis is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.
The purpose of cluster analysis is to help reveal patterns and structures within a dataset that may provide insights into underlying relationships and associations.
k-means is a method of unsupervised learning that produces a partitioning of observations into k unique clusters.
The goal of k-means is to minimize the sum of squared Euclidean distances between observations in a cluster and the centroid, or geometric mean, of that cluster.
In k-means clustering, observed variables (columns) are considered to be locations on axes in multidimensional space.
The basic k-means algorithm has the following steps:

1. choose \(k\) initial cluster centers (centroids);
2. assign each observation to the cluster with the nearest centroid;
3. recompute each centroid as the mean of the observations assigned to it;
4. repeat steps 2-3 until the cluster assignments no longer change (or a maximum number of iterations is reached).
There are a few common methods for choosing initial centers and updating the clusters:

- the Forgy, Lloyd, and MacQueen methods.
- the Hartigan-Wong method.

Because the initial centers are based on random selection in both approaches, the k-means algorithm is not deterministic.
Running the clustering twice on the same data may not result in the same cluster assignments.
> # create recipe for 2-D clustering
> cluster_recipe <- data |>
+ recipes::recipe(~ x1 + x2, data = _)
>
> # specify the workflows
> all_workflows <-
+ workflowsets::workflow_set(
+ preproc = list(base = cluster_recipe),
+ models = list(tidyclust::k_means( num_clusters = parsnip::tune() ) )
+ )
> # create bootstrap samples
> dat_resamples <- data |> rsample::bootstraps(apparent = TRUE)
>
> tuned_results <-
+ all_workflows |>
+ workflow_map(
+ fn = "tune_cluster"
+ , resamples = dat_resamples
+ , grid = dials::grid_regular(dials::num_clusters(), levels = 10)
+ , metrics = tidyclust::cluster_metric_set(sse_within_total, sse_total, sse_ratio)
+ , control = tune::control_grid(save_pred = TRUE, extract = identity)
+ )
> set.seed(8740)
>
> centers <- tibble::tibble(
+ cluster = factor(1:4),
+ num_points = c(100, 150, 50, 90), # number points in each cluster
+ x1 = c(5, 0, -3, -4), # x1 coordinate of cluster center
+ x2 = c(-1, 1, -2, 1.5), # x2 coordinate of cluster center
+ )
>
> labelled_points <-
+ centers |>
+ dplyr::mutate(
+ x1 = purrr::map2(num_points, x1, rnorm),
+ x2 = purrr::map2(num_points, x2, rnorm)
+ ) |>
+ dplyr::select(-num_points) |>
+ tidyr::unnest(cols = c(x1, x2))
>
> p <- ggplot(labelled_points, aes(x1, x2, color = cluster)) +
+ geom_point(alpha = 0.3) +
+ geom_point(data = centers, size = 10, shape = "o")
> p
> # create recipe
> labelled_points_recipe <- labelled_points |>
+ recipes::recipe(~ x1 + x2, data = _)
> # create model spec
> kmeans_spec <- tidyclust::k_means( num_clusters = 4 )
> # create workflow
> wflow <- workflows::workflow() |>
+ workflows::add_model(kmeans_spec) |>
+ workflows::add_recipe(labelled_points_recipe)
> # fit workflow & extract centroids
> cluster_centers <- wflow |>
+ parsnip::fit(labelled_points) %>% tidyclust::extract_centroids() |>
+ dplyr::mutate( cluster = stringr::str_extract(.cluster,"\\d") )
> # plot
> p + geom_point(data = cluster_centers, size = 10, shape = "x")
Hierarchical Clustering, sometimes called Agglomerative Clustering, is a method of unsupervised learning that produces a dendrogram, which can be used to partition observations into clusters (see the tidyclust package).
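A sketch of hierarchical clustering on the same data with tidyclust (the linkage method and number of clusters are illustrative choices, not recommendations):

```r
# hierarchical (agglomerative) clustering specification
hc_spec <- tidyclust::hier_clust(num_clusters = 4, linkage_method = "complete")

# fit it through a workflow, reusing the recipe defined above
hc_fit <- workflows::workflow() |>
  workflows::add_recipe(labelled_points_recipe) |>
  workflows::add_model(hc_spec) |>
  parsnip::fit(labelled_points)

# per-observation cluster assignments
tidyclust::extract_cluster_assignment(hc_fit)
```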
For other clustering algorithms, see the tidyclust documentation.
Feature Scaling: Most clustering algorithms benefit from feature scaling.
Choosing the Right Algorithm: Depends on the size, dimensionality of data, and the nature of the clusters.
Evaluation: Since clustering is unsupervised, evaluating the results can be subjective and is often based on domain knowledge.
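For k-means specifically, a common heuristic is the "elbow" plot of the within/total SSE ratio against k, which can be built from the tuned_results object created earlier. A sketch: the workflow id is assumed here to be "base_k_means" and should be checked with tuned_results$wflow_id.

```r
library(ggplot2)

tuned_results |>
  workflowsets::extract_workflow_set_result("base_k_means") |>
  tune::collect_metrics() |>
  dplyr::filter(.metric == "sse_ratio") |>
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_line() +
  geom_point() +
  labs(x = "number of clusters k", y = "mean WSS / TSS")
```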
We have looked at several classification algorithms in the context of tidymodels workflows.
We also looked at clustering and several algorithms in the tidyclust package.