
Analyst Handbook

The Analyst Handbook serves as a guide for analysts to perform exploratory data analysis and extract actionable insights using simple models within the xVector Platform. It also provides insights into orchestration and observability within data workflows. The handbook uses four business cases, tied to key modeling approaches (Regression, Classification, Clustering, and Time Series), to contextualize these concepts, with a focus on Marketing Analytics applications.

  • The first business case focuses on marketing campaign and sales data. It uses linear regression to optimize marketing spend across different advertising channels and maximize sales revenue. The company’s historical data on marketing campaigns and sales figures is explored to identify which channels provide the best ROI and determine the expected sales impact.

  • The second business case involves optimizing marketing strategies for a bank. It uses a random forest classification model to analyze bank marketing data to identify factors that drive campaign success and target customer segments that are most likely to respond positively.

  • The third business case aims to identify and understand customer segments based on purchasing behaviors. It uses KMeans clustering to analyze online retail transaction data to improve customer retention and maximize revenue by understanding customer segments and their purchasing behaviors.

  • The fourth business case involves analyzing and forecasting sales trends using store sales data. It uses an ARIMA time series model to identify peak sales periods, understand growth trends, and uncover seasonal patterns to optimize inventory, plan promotions, and enhance revenue predictability.

These business cases, taken from Kaggle, will help you get familiar with the xVector Platform.

The handbook provides information on evaluating metrics and model comparison, and discusses topics such as data exploration and data snooping.

Advanced modeling techniques, evaluations, and ML operations are discussed in the Data Scientist Handbook.

Data operations such as data quality, pipelines, and advanced enrichment functions are covered in the Data Engineering Handbook.

xVector is a unified platform for building data applications and agents powered by a MetaGraph. Users can bring in data from various sources, enrich the data, explore, apply advanced modeling techniques, derive insights, and act on them, all in a single pane, collaboratively.


Business Case 1 (Regression): Marketing Campaign and Sales Data


Consider a business that would like to optimize marketing spend across different advertising channels to maximize sales revenue. This involves determining the effectiveness of TV, social media, radio, and influencer promotions in driving sales and understanding how to allocate budgets for the best return on investment (ROI).

The company has historical data available on the marketing campaigns, including budgets spent on TV, social media, radio, and influencer collaborations, alongside the corresponding sales figures. However, the question remains: how can the company predict sales more accurately, identify which channels provide the best ROI, and determine the expected sales impact per $1,000 spent?

This journey begins by exploring the data, which includes sales figures and promotional budgets across different channels. Raw data is rarely in a usable form right from the start. We address potential biases, handle missing values, and identify outliers that could distort the results. With a clean and well-prepared dataset, the next step is to dive deeper into the data to extract meaningful insights.

To make informed decisions on marketing spend, businesses need to understand how each advertising channel influences sales. The relationship between marketing spend and sales is complex, with many factors at play. By fitting a linear regression model to the data, we can estimate how changes in the marketing budget for each channel influence sales. This helps identify which channels yield the highest sales per dollar spent and provides a framework for making more informed budget allocation decisions. For instance, the model might show that spending on TV ads yields the highest return on investment, while spending on social media or radio could be less effective, guiding future budget allocations.

Having identified effective channels, it is important to ensure the accuracy and reliability of the predictions. R² and Mean Squared Error are measures of the model’s performance. R² score, in particular, indicates how well the model explains the variance in sales based on marketing spend, with a higher score suggesting that the model can predict sales more accurately. On the other hand, the Mean Squared Error (MSE) measures the average squared difference between predicted and actual sales, helping to assess the quality of the predictions — lower MSE values indicate a better fit of the model to the data.
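
As a point of reference before moving to the platform steps, the workflow above can be sketched with scikit-learn. This is a minimal, illustrative baseline, not the platform's implementation; the file name and the exact column names (TV, Radio, Social Media, Influencer, Sales) are assumptions based on the dataset description.

```python
# Minimal sketch only; file and column names are assumptions based on the Kaggle dataset.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("marketing_and_sales.csv").dropna()   # hypothetical local export

# One-hot encode the categorical Influencer column; numeric budgets pass through unchanged.
X = pd.get_dummies(df[["TV", "Radio", "Social Media", "Influencer"]], drop_first=True)
y = df["Sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Coefficients: expected change in sales per unit change in each channel's budget.
print(pd.Series(model.coef_, index=X.columns).sort_values(ascending=False))

pred = model.predict(X_test)
print("R²: ", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
```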

By evaluating these metrics, businesses gain confidence in the model’s ability to make reliable predictions. With these insights, companies can fine-tune their marketing strategies, reallocate budgets to the highest-performing channels, and identify areas where additional investment may not yield optimal results.

Now, let us look at how all this can be achieved in the xVector Platform.

You can download the Marketing Campaign and Sales Data from Kaggle. This dataset contains:

  • TV promotion budget (in millions)
  • Social Media promotion budget (in millions)
  • Radio promotion budget (in millions)
  • Influencer: whether the promotion collaborates with a Mega, Macro, Nano, or Micro influencer
  • Sales (in millions)

Analysis Questions:

  • Which advertising channel provides the best ROI?
  • How accurately can we predict sales from advertising spend?
  • What’s the expected sales impact per $1000 spent on each channel?

xVector has a catalog of connectors. If required, you can build connectors to custom sources and formats.

Below are the steps to implement this in the xVector Platform:

Once the data is imported, create a dataset for enrichment purposes. xVector provides the capability to keep these datasets synchronized with the original data sources, ensuring consistency.

Data exploration (suggested checklist) entails understanding the process that generates the data and the characteristics of the data. In this case, the process refers to the marketing department’s spending on various channels, as captured in its systems. As for the data characteristics, the dataset has very few missing values: the Social Media column has only six missing values out of around 4,570 records. These records can either be removed or populated with, for example, average values or values supplied by the business from another system. Influencer is a categorical column with 4 unique values.
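
For reference, the same missing-value checks can be sketched in pandas; the file name is hypothetical and the Social Media column name follows the dataset description.

```python
import pandas as pd

df = pd.read_csv("marketing_and_sales.csv")   # hypothetical local export

# How many values are missing per column?
print(df.isna().sum())

# Option 1: drop the handful of incomplete records.
cleaned = df.dropna()

# Option 2: impute the missing Social Media budgets with the column mean
# (or with values supplied by the business from another system).
df["Social Media"] = df["Social Media"].fillna(df["Social Media"].mean())
```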

xVector provides out-of-the-box tools to profile the data. To explore the data further, you can create reports manually or by using the GenAI-powered options. Generate Exploratory Report and Generate Report are GenAI reports built on the platform.

To view the data profile page, click on the kebab menu (vertical ellipses) → “View Profile” on the created dataset. Profile view helps identify outliers and correlations.

Kebab menu showing View Profile option

Below is the profile page view of the dataset:

Profile page view of the marketing dataset

Once you perform basic exploration of the data, you can then enrich the data. For example, “dropna” is one such function used to drop records with null values.

Enrichment function: dropna to remove null records

In the current case, the business would like to optimize marketing spend across different advertising channels to maximize sales revenue. Using the regression model, we can predict sales based on various input factors (such as TV, social media, radio, and influencer spend).

Having created the dataset and explored the data, we are now ready to build a linear regression model to analyze and make predictions.

In the world of xVectorlabs, each model is a hub of exploration, where experiments are authored to test various facets of the algorithm. A user can create multiple experiments under a model. An experiment includes one or more runs. Under each experiment, various parameters with available drivers can be tried on different features.

Experiments can have multiple runs with different input parameters and performance metrics as output. Based on the metric, one of these runs can be chosen for the final model.

The platform provides a comprehensive set of model drivers curated based on industry best practices. Advanced users can author their custom drivers, if required.

For the current dataset, we will use an existing model driver, Sklearn-LinearRegression. Follow the steps to build a linear regression model with this driver, using appropriate parameters and evaluation metrics to optimize the solution.

Here is a linear regression model implemented in xVector.

The run below shows the parameters used, along with the metrics and scores for analysis.

Model run parameters and metrics

Which Advertising Channel provides the best ROI?

  • The channel with the highest positive coefficient in the regression model has the greatest impact on sales per dollar spent. In linear regression, coefficients are numerical values that represent the relationship between predictor variables and the response variable. They indicate the strength and direction of the relationship and are multiplied by the predictor values in the regression equation. A positive coefficient means that as the predictor increases, the response variable also increases, while a negative coefficient indicates an inverse relationship.

  • In the above example, TV provides the best ROI, with the maximum coefficient of 3.29.

Regression coefficients showing TV with highest ROI

This can also be inferred from the correlation matrix on the dataset’s profile page. In this case, TV has the highest correlation with sales, at 0.99.

Correlation matrix showing TV at 0.99 correlation
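
For reference, the same check can be reproduced with pandas (illustrative only; the file name and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("marketing_and_sales.csv").dropna()   # hypothetical local export
# Correlation of each numeric budget column with Sales; TV should stand out.
print(df[["TV", "Radio", "Social Media", "Sales"]].corr()["Sales"].sort_values(ascending=False))
```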

How accurately can we predict sales from Advertising Spend?

  • The R² score measures the proportion of variance in sales explained by advertising spend. A value closer to 1 implies higher accuracy.
    • In the above example, the model is quite accurate, with an R² score of 0.98.
  • The Mean Absolute Error (MAE) quantifies the average error between actual and predicted sales. A lower MAE indicates that the model’s predictions are closer to the actual values.
    • The MAE here is 4.18, which is low, implying the predictions are close to the actual values.

Model evaluation metrics: R² and MAE

What’s the Expected Sales Impact per $1000 spent on each Channel?

  • Use the Impact per $1000 column from the coefficients DataFrame.
  • The coefficient of TV is 3.2938. This predicts that spending $1,000 more on TV ads is expected to increase sales by about $3,293.80.

Impact per $1000 spent on each advertising channel
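
Because both the budgets and the sales figures are expressed in millions of dollars, the coefficient converts to an impact per $1,000 directly. A small illustrative calculation, using the coefficient from the run above:

```python
# Budgets and sales are both in millions of dollars, so a coefficient of 3.2938 means
# $3.2938 of additional sales per additional $1 of TV spend.
tv_coefficient = 3.2938            # taken from the model run above
print(tv_coefficient * 1000)       # 3293.8 → roughly $3,293.80 more sales per extra $1,000 on TV
```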

Negative Coefficients

  • A negative coefficient suggests an inverse relationship between the corresponding feature and the outcome variable. Specifically:
    • Influencer (-0.1142): Spending $1000 on Micro-Influencers reduces the outcome by roughly $114.20.

Possible Explanations for Negative Impacts:

  • Diminishing Returns: These marketing channels might already be saturated, leading to diminishing or negative returns on additional investment.
  • Ineffective Strategy: The investment in these areas may not be optimized, or the target audience might not respond well to these channels.
  • Indirect Effects: The spending might be cannibalizing other channels or producing unintended negative outcomes (e.g., customer annoyance, ad fatigue).

Business Case 2 (Classification): Bank Marketing Dataset


The objective of the bank is to maximize term deposits from customers by optimizing marketing strategies. This can be done by identifying the factors that drive campaign success, understanding the overall campaign performance, and targeting customer segments most likely to respond positively.

First, we explore the dataset, which contains customer demographics, past campaign data, and behavioral features such as job, education, and balance. The current dataset has 10 categorical columns, including marital status, education, and job.

To predict whether a customer will subscribe to a term deposit, we use the Random Forest classification model. This model is chosen for its ability to handle complex, non-linear relationships between features and the ability to provide feature importance rankings to identify the most influential predictors.

By continuously validating and refining the model, the bank ensures its marketing campaigns remain data-driven, efficient, and impactful, leading to improved conversion rates and better resource allocation.

You can download the Bank Marketing Dataset from Kaggle.

The bank tracks customer term deposits along with customer attributes in its systems. The dataset has 10 categorical features, including marital status, job, loan, education, and the term deposit outcome. There are 12 distinct job categories and 3 distinct marital status categories.

Analysis Questions:

  • What factors best predict campaign success?
  • What’s the overall campaign success rate? **

** To be done in the Data Scientist Handbook

Once the data is imported, create a dataset for enrichment purposes. xVector provides the capability to keep these datasets synchronized with the original data sources, ensuring consistency.

Data exploration (suggested checklist) entails understanding the process that generates the data and the characteristics of the data. The Bank Marketing dataset includes 17 attributes, with features such as customer demographics (e.g., age, job, marital status, education), financial details (e.g., balance, loan, housing), and engagement data (e.g., previous campaign outcomes, duration of calls, and contact methods). The target variable, deposit, indicates whether the customer subscribed (yes or no). The ages in this dataset range from 18 to 95, and about 57% of the customers are married.

The dataset also has negative balance amounts for a few records. Depending on how the bank tracks balances, these customers may have withdrawn more than was available in their accounts, so these records shouldn’t be dropped without first understanding how the bank tracks balances.

xVector provides out-of-the-box tools to profile the data. Generate Report and Generate Exploratory Report are GenAI reports built on the platform.

Kebab menu showing View Profile option for bank marketing dataset

Profile page view of the bank marketing dataset

Once you perform basic exploration of the data, you can then enrich the data. For example, “dropna” is one such function used to drop records with null values.

Dataset preview showing enrichment options

We use a Random Forest classification model for its ability to handle complex, non-linear relationships between features and to provide feature importance rankings that identify the most influential predictors.

For the current dataset, we will use an existing model driver, RandomForest. Follow the steps to build a classification model with this driver, using appropriate parameters and evaluation metrics to optimize the solution.

Here is a classification model implemented in xVector.

Random forest model run parameters and metrics
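
For comparison, an equivalent baseline can be sketched outside the platform with scikit-learn. The file name and the deposit target column are assumptions based on the dataset description; this is an illustrative sketch, not the platform's implementation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("bank.csv")                                         # hypothetical local export
y = (df["deposit"] == "yes").astype(int)                             # target: subscribed or not
X = pd.get_dummies(df.drop(columns=["deposit"]), drop_first=True)    # encode categorical features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 score:", f1_score(y_test, pred))

# Which factors matter most? Feature importances answer the first analysis question.
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```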

What factors best predict campaign success?

Feature importance from the Random Forest model reveals the most influential factors. In this case, the top 3 features are duration, balance, and age.

Feature importance chart showing duration, balance, and age as top predictors

**What’s the overall campaign success rate? ****

The proportion of records with deposit == ‘yes’ gives the success rate. Here, the overall campaign success rate is 47.38%.
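
As a quick illustration (assuming the same hypothetical CSV as above):

```python
import pandas as pd

df = pd.read_csv("bank.csv")                                  # hypothetical local export
success_rate = (df["deposit"] == "yes").mean()
print(f"Overall campaign success rate: {success_rate:.2%}")
```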

Which customer segments are most likely to respond positively? **

Based on the heatmap below, those with management jobs and tertiary education are most likely to respond positively.

Heatmap showing management jobs and tertiary education as strongest positive responders
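
A heatmap like the one above can be approximated with a pivot of response rates by job and education. Illustrative only; the file and column names are assumptions based on the dataset description.

```python
import pandas as pd

df = pd.read_csv("bank.csv")                                  # hypothetical local export
df["responded"] = (df["deposit"] == "yes").astype(int)

# Share of positive responses for each job/education combination.
rates = df.pivot_table(index="job", columns="education", values="responded", aggfunc="mean")
print(rates.round(2))

# Optional visualization:
# import seaborn as sns; sns.heatmap(rates, annot=True, cmap="Blues")
```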

Business Case 3 (Clustering): Online Retail Transaction Data


An online retail store would like to identify and understand customer segments based on purchasing behaviors to improve customer retention and maximize revenue. By distinguishing the most valuable customers, the company can create targeted marketing strategies, enhance loyalty programs, and optimize resource allocation to increase long-term profitability.

The Online Retail Transaction dataset includes transactional details such as invoice numbers, stock codes, product descriptions, quantities, invoice dates, and customer IDs, along with the country of purchase. The primary goal is to use this information to segment customers based on their purchase behavior and determine which segments represent the most valuable customers.

The analysis begins with data exploration and preparation, a critical step for ensuring accuracy and reliability. Once the data is cleaned and enriched, the focus shifts to determining the optimal number of groups for segmentation using the elbow method. The final step is to analyze the features that are most important for grouping or segmenting the dataset.

You can download the Online Retail Transaction Data Source from Kaggle.

This dataset includes transactional details such as invoice numbers, stock codes, product descriptions, quantities, invoice dates, customer IDs, unit price, and the country of purchase. Of the 38 countries, the UK far exceeds the others in sales of these products.

Analysis Questions:

  • Are there outliers in the data points?
  • How many groups should the data points be segmented into?
  • What are the important features to consider while grouping or segmenting the dataset?

Once the data is imported, create a dataset for enrichment purposes. Data exploration (suggested checklist) entails understanding the process that generates the data and the characteristics of the data. The Online Retail Transaction dataset contains records of customer purchases. There are several records with negative quantities, which are not valid values. The invoices range from December 2010 to December 2011. Around 500K records belong to the United Kingdom.

xVector provides out-of-the-box tools to profile the data. Generate Exploratory Report and Generate Report are GenAI reports built on the platform.

Kebab menu showing View Profile option for retail dataset

Profile page view of the online retail dataset

Once you perform basic exploration of the data, you can then enrich the data. For example, “dropna” is one such function used to drop records with null values.

Enrichment function: dropna to remove null records

Here is a link to the Profile page on an xVector DataApp:

xVector DataApp profile page

The K-Means clustering model is ideal for this dataset as it effectively segments customers into meaningful groups based on their purchasing behavior, uncovering patterns in the data.

For the current dataset, we will use an existing model driver, KMeans-with-Elbow. Follow the steps to build a KMeans clustering model with this driver, using appropriate parameters and evaluation metrics to optimize the solution.

Here is a KMeans Clustering model implemented in xVector.

KMeans model run parameters and metrics

Are there any outliers in the data points?

Using KMeans clustering, we can detect unusual patterns in transaction amounts or quantities. In the current dataset, only a sparse set of data points appear as outliers:

Outlier scatter plot showing Quantity vs StockCode with cluster labels
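
One simple way to surface such outliers outside the platform is to look at each point's distance to its assigned cluster centre. The sketch below is illustrative; the file name, the choice of Quantity and UnitPrice as features, and the 99.5th-percentile threshold are all assumptions, not the platform's logic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("online_retail.csv").dropna(subset=["CustomerID"])   # hypothetical local export
df = df[df["Quantity"] > 0]                     # drop invalid negative quantities

features = StandardScaler().fit_transform(df[["Quantity", "UnitPrice"]])
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(features)

# Distance of each point to its own cluster centre; the largest distances are outlier candidates.
distances = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)
df["outlier"] = distances > np.percentile(distances, 99.5)
print(df["outlier"].sum(), "potential outliers flagged")
```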

What is the optimal number of groups for the data points?

In the current scenario, based on the plot below, we can form 3 groups.

Elbow/inertia plot showing optimal k=3 clusters
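
An elbow curve like the one above can be reproduced by fitting K-Means for a range of k values and plotting the inertia. A minimal sketch under the same assumptions as before (the file name and feature choice are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("online_retail.csv").dropna(subset=["CustomerID"])   # hypothetical local export
X = StandardScaler().fit_transform(df[["Quantity", "UnitPrice"]])

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method: look for the bend (around k = 3 here)")
plt.show()
```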

What are the main features we should consider for the grouping?

Feature importance for clustering showing stockcode and country as main features

The above plot indicates that stockcode and country should be considered the main features when grouping for business decisions.

Both enriched data and customer segmentation information can be sent to target systems to operationalize insights. This data can be saved to a destination, such as an S3 bucket, as a new file for downstream use.


Business Case 4 (Time Series): Store Sales Data


A store would like to analyze and forecast sales trends to improve decision-making for store operations and marketing. Understanding sales dynamics is critical for effective inventory management, planning promotions, and predicting future sales performance. The primary focus of this analysis is to determine whether the data is stationary, identify trends or seasonal patterns, and explore peak sales periods while forecasting future sales.

The ARIMA time series model is employed to forecast future sales while accounting for these trends and patterns. ARIMA is chosen for its ability to handle both autoregressive (AR) and moving average (MA) components while incorporating differencing to make the data stationary.

You can download the Store Sales Time Series Data from Kaggle.

Here, the store captures product sales. The dataset covers sales over a span of 8 years. It is a small dataset with no null values.

Analysis Questions:

  • Is the data stationary?
  • Does it have a trend or seasonality?
  • What are the peak sales periods? ***
  • What’s the overall sales growth trend? ***
  • Are there clear seasonal patterns in sales? ***

*** Note: This entails advanced techniques that will be addressed in the Data Scientist Handbook. For now, the assumption is that these reports are available for analysis.

Once the data is imported, create a dataset for enrichment purposes. Data exploration (suggested checklist) — the Store Sales dataset tracks the number of transactions on a given day. There are no missing values in this dataset.

xVector provides out-of-the-box tools to profile the data. Generate Exploratory Report and Generate Report are GenAI reports built on the platform.

Kebab menu showing View Profile option for store sales dataset

Profile page view of the store sales dataset

Once you perform basic exploration of the data, you can then enrich the data. For example, “filter” is one such function used to filter appropriate records.

Enrichment function: filter to select appropriate records

For the current dataset, we will use an existing model driver, stats_ARIMA. Follow the steps to build a time series model with this driver, using appropriate parameters and evaluation metrics to optimize the solution.

Here is a time series model implemented in xVector.

ARIMA model run parameters and metrics
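
Outside the platform, a comparable baseline can be sketched with statsmodels. The file name, column names, and the (1, 1, 1) order are illustrative starting points, not the tuned configuration used in the run above.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical local export with a date column and a transactions column.
df = pd.read_csv("transactions.csv", parse_dates=["date"], index_col="date")
series = df["transactions"].resample("M").sum()        # aggregate to monthly totals

model = ARIMA(series, order=(1, 1, 1)).fit()           # (p, d, q): untuned example order
print(model.summary())

forecast = model.forecast(steps=12)                    # forecast the next 12 periods
print(forecast)
```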

Is the data stationary?

Stationarity test results showing non-stationary data

Based on the above analysis, the dataset is non-stationary. This implies that the data shows trends, seasonality, or varying variance over time.
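
A common way to check this is the Augmented Dickey-Fuller (ADF) test. Illustrative sketch, using the same assumed file as above:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("transactions.csv", parse_dates=["date"], index_col="date")   # hypothetical file
series = df["transactions"].resample("M").sum()

adf_stat, p_value, *_ = adfuller(series.dropna())
print("ADF statistic:", adf_stat)
print("p-value:", p_value)
# A p-value above 0.05 means we cannot reject the unit root: the series is non-stationary,
# so differencing (the 'd' term in ARIMA) or seasonal adjustment is needed before fitting.
```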

Does it have a trend or seasonality?

Trend and seasonality decomposition visualization

The autocorrelation function (ACF) and partial autocorrelation function (PACF) plots help in understanding the temporal dependencies in the data. If the ACF plot shows a slow decay, this suggests the presence of a trend. If you notice spikes at regular intervals, this indicates the presence of seasonality — suggesting the need for seasonal differencing or a SARIMA model.
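
These plots can be produced with statsmodels as well (illustrative; same assumed file as above):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

df = pd.read_csv("transactions.csv", parse_dates=["date"], index_col="date")   # hypothetical file
series = df["transactions"].resample("M").sum()

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series.dropna(), ax=axes[0])    # slow decay suggests a trend; regular spikes suggest seasonality
plot_pacf(series.dropna(), ax=axes[1])
plt.tight_layout()
plt.show()
```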

What are the peak sales periods? ***

Peak sales chart showing December peaks each year

December of every year has peak sales.

**What’s the overall sales growth trend? *****

Overall sales growth trend line

Are there clear seasonal patterns in sales? ***

  • Holiday Patterns: Sales peak in December (holiday season) and November (Black Friday).
  • Long-Term Trends: Seasonal patterns might change over time, indicating shifting consumer behavior.
  • Overall Growth: The trend component reveals whether the store’s sales are growing, stagnating, or declining.
Dataset visualizations showing sales patterns

Additional dataset visualizations

Sales forecast:

ARIMA sales forecast projection

Typical data analysis process flowchart

Exploring data (often called Exploratory Data Analysis or EDA) is a critical process of examining and understanding a dataset before diving into formal modeling.

Context & Source — What business process/system generated the data? What is the purpose of collecting this data? What is the unit of observation (row = transaction, customer etc.)? What is the time coverage (start & end dates)? Are there known biases?

Data Quality — Are there missing values? Duplicate records? Are numeric values within expected ranges? Are categorical values standardized (e.g., “FB” vs. “Facebook”)? Are timestamps correctly formatted? Are there anomalies (e.g., negative revenue, invalid dates)?

Data Structure — What are the data types (numeric, categorical, text, datetime)? Which fields can serve as keys or identifiers? How many records and features? What’s the distribution of key numeric fields? What’s the frequency of categorical values?

Behavior & Relationships — Are there trends over time? Correlations between key variables? Outliers that may distort the analysis? Have the statistical properties of data changed over time (data drift)?

Visualization Techniques — Histograms for data distribution, box plots for spread and outliers, scatter plots for relationships, correlation matrices for feature interactions, heat maps for complex patterns.

Chart type selection guide

Linear Regression: Parameters and Evaluating Metrics


Parameters

Parameters are the configurable settings and learned values that define how a machine learning model operates and makes predictions. They control the model’s complexity, learning behavior, and decision-making process. Proper parameter selection and tuning are crucial for model performance, as they directly influence the model’s ability to generalize to new, unseen data.

Scikit-learn provides details of the parameters for the Linear Regression model.

| Model | Parameter | Description | Usage |
| --- | --- | --- | --- |
| Linear Regression | fit_intercept | Whether to calculate the intercept for the regression model | Set False if the data is already centered |
| | normalize | Normalizes input features. Deprecated in recent Scikit-learn versions | Helps with features on different scales |
| | test_size | Size of test data | Helps with splitting train and test data |
| Ridge Regression | alpha | L2 regularization strength. Larger values shrink coefficients more | Prevents overfitting by reducing model complexity |
| | solver | Optimization algorithm: auto, saga, etc. | Impacts convergence speed and stability for large datasets |
| Lasso Regression | alpha | L1 regularization strength. Controls sparsity of coefficients | Useful for feature selection |
| | max_iter | Maximum iterations for optimization | Impacts convergence for large or complex datasets |
| XGBoost (Regression) | eta (learning rate) | Step size for updating predictions | Lower values make learning slower but more robust |
| | max_depth | Maximum depth of trees | Higher values can capture complex relationships but risk overfitting |
| | colsample_bytree | Fraction of features sampled for each tree | Introduces randomness, reducing overfitting |

Evaluating Metrics

Regression models predict continuous values, so the metrics focus on measuring the difference between predicted and actual values.

In the formulas below: n = number of observations, yᵢ = actual value, ŷᵢ = predicted value, ȳ = mean of the actual values, p = number of predictors, RSS = residual sum of squares Σ (yᵢ − ŷᵢ)², and TSS = total sum of squares Σ (yᵢ − ȳ)².

| Metric | Description |
| --- | --- |
| Mean Absolute Error (MAE) | Measures the average magnitude of errors without considering their direction. A lower MAE indicates better model performance. It’s easy to interpret but doesn’t penalize large errors as much as MSE. |
| Mean Squared Error (MSE) | Computes the average squared difference between actual and predicted values. Penalizes larger errors more than MAE, making it sensitive to outliers. |
| Root Mean Squared Error (RMSE) | Square root of MSE; represents errors in the same unit as the target variable. Balances interpretability and sensitivity to large errors. |
| R² Score (Coefficient of Determination) | Proportion of variance explained by the model. Values range from 0 to 1, where 1 means perfect prediction. Negative values indicate poor performance. |
| Adjusted R² | Adjusts R² for the number of predictors in the model, by penalizing the addition of irrelevant features. Useful for comparing models with different numbers of predictors. |
| Mean Absolute Percentage Error (MAPE) | Measures error as a percentage of actual values, making it scale-independent. Useful for scale-independent evaluation but struggles with very small actual values. |

Formulas:

MAE:

MAE = (1/n) · Σ |yᵢ − ŷᵢ|

MSE:

MSE = (1/n) · Σ (yᵢ − ŷᵢ)²

RMSE:

RMSE = √MSE = √[ (1/n) · Σ (yᵢ − ŷᵢ)² ]

R² Score:

R² = 1 − RSS / TSS = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²

Adjusted R²:

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

MAPE:

MAPE = (100% / n) · Σ | (yᵢ − ŷᵢ) / yᵢ |

Classification: Parameters and Evaluating Metrics


Parameters

Parameters are the configurable settings and learned values that define how a machine learning model operates and makes predictions. They control the model’s complexity, learning behavior, and decision-making process. Proper parameter selection and tuning are crucial for model performance, as they directly influence the model’s ability to generalize to new, unseen data.

Scikit-learn provides details of the parameters for the Random Forest Classification model.

| Model | Parameter | Description | Usage |
| --- | --- | --- | --- |
| Random Forest Classifier | n_estimators | Number of trees in the forest | Affects accuracy and training speed; larger forests usually perform better |
| | max_features | Number of features to consider when splitting | Reduces overfitting and speeds up training |
| | bootstrap | Whether to sample data with replacement | Improves diversity among trees |
| Logistic Regression | penalty | Type of regularization: l1, l2, elasticnet, or none | Adds constraints to model coefficients to prevent overfitting |
| | solver | Optimization algorithm: liblinear, saga, lbfgs, etc. | Determines how the model is optimized, with some solvers supporting specific penalties |
| | C | Inverse of regularization strength. Smaller values increase regularization | Balances bias and variance |
| | max_iter | Maximum number of iterations for optimization | Ensures convergence for complex problems |
| Support Vector Machine (SVM) | C | Regularization parameter. Smaller values create larger margins but may underfit | Controls the trade-off between misclassification and margin size |
| | kernel | Kernel type: linear, rbf, poly, or sigmoid | Determines how data is transformed into higher dimensions |
| | gamma | Kernel coefficient for non-linear kernels | Impacts the decision boundary for non-linear kernels like rbf or poly |
| Decision Tree Classifier | criterion | Function to measure split quality: gini or entropy | Controls how splits are chosen (impurity vs. information gain) |
| | max_depth | Maximum depth of the tree | Prevents overfitting by restricting the complexity of the tree |
| | min_samples_split | Minimum samples required to split a node | Ensures that nodes are not split with very few samples |
| | min_samples_leaf | Minimum samples required in a leaf node | Prevents overfitting by ensuring leaves have sufficient data |
| K-Nearest Neighbors (KNN) | n_neighbors | Number of neighbors to consider for classification | Affects granularity of classification; smaller values lead to more localized decisions |
| | weights | Weighting function: uniform (equal weight) or distance (closer points have higher weight) | Impacts how neighbors influence the prediction |
| | metric | Distance metric: minkowski, euclidean, manhattan, etc. | Defines how distances between data points are calculated |
| Naive Bayes | var_smoothing | Portion of variance added to stabilize calculations | Prevents division by zero for features with very low variance |
| XGBoost (Classification) | objective | Specifies the learning task: binary:logistic, multi:softprob, etc. | Matches the classification type (binary or multiclass) |
| | scale_pos_weight | Balances positive and negative classes for imbalanced datasets | Essential for tasks like fraud detection where class imbalance is significant |
| | max_depth | Maximum depth of trees | Higher values increase model complexity but risk overfitting |
| | eta (learning rate) | Step size for updating predictions | Smaller values lead to slower, more accurate training |
| | gamma | Minimum loss reduction required for further tree splits | Higher values make the model more conservative |

Evaluating Metrics

Classification models predict discrete labels, so the metrics measure the correctness of those predictions.

In the formulas below: TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.

| Metric | Description |
| --- | --- |
| Accuracy | Ratio of correct predictions to total predictions. Works well for balanced datasets but fails for imbalanced ones. |
| Precision | Fraction of true positive predictions among all positive predictions. High precision minimizes false positives. |
| Recall (Sensitivity) | Fraction of actual positives that were correctly predicted. High recall minimizes false negatives. |
| F1 Score | Harmonic mean of precision and recall. Best suited for imbalanced datasets. |
| Confusion Matrix | Tabular representation of true positives, true negatives, false positives, and false negatives. Helps visualize classification performance. |
| ROC-AUC Score | Measures the trade-off between true positive rate (TPR) and false positive rate (FPR). Evaluates a classifier’s ability to distinguish between classes at various thresholds. Higher AUC indicates better performance. |

Formulas:

Accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision:

Precision = TP / (TP + FP)

Recall:

Recall = TP / (TP + FN)

F1 Score:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Clustering: Parameters and Evaluating Metrics


Parameters

Parameters are the configurable settings and learned values that define how a machine learning model operates and makes predictions. They control the model’s complexity, learning behavior, and decision-making process. Proper parameter selection and tuning are crucial for model performance, as they directly influence the model’s ability to generalize to new, unseen data.

Scikit-learn provides details of the parameters for KMeans clustering.

| Model | Parameter | Description | Usage |
| --- | --- | --- | --- |
| K-Means | n_clusters | Number of clusters to form | Controls the number of groups/clusters in the data |
| | init | Initialization method for centroids: k-means++, random | k-means++ is better for convergence |
| | max_iter | Maximum number of iterations to run the algorithm | Prevents infinite loops and ensures convergence |
| | tol | Tolerance for convergence | Stops the algorithm when the centroids’ movement is smaller than this value |
| | n_init | Number of times the K-Means algorithm will be run with different centroid seeds | Ensures better centroids and better performance |
| DBSCAN | eps | Maximum distance between two points to be considered neighbors | Determines cluster density |
| | min_samples | Minimum number of points required to form a dense region (a cluster) | Larger values lead to fewer but denser clusters |
| | metric | Distance metric used for clustering: euclidean, manhattan, etc. | Affects the way distances are calculated between points |
| Agglomerative Clustering | n_clusters | Number of clusters to form | Specifies the number of clusters to form at the end of the clustering process |
| | linkage | Determines how to merge clusters: ward, complete, average, or single | Affects how clusters are combined (Ward minimizes variance) |
| | affinity | Metric used to compute distances: euclidean, manhattan, cosine, etc. | Affects the distance measure between data points during clustering |
| K-Medoids | n_clusters | Number of clusters to form | Specifies the number of clusters (like K-Means but uses medoids) |
| | metric | Distance metric for pairwise dissimilarity | Defines the method for calculating pairwise distances between points |
| | max_iter | Maximum number of iterations to run the algorithm | Ensures termination after a certain number of iterations |
| Gaussian Mixture Model | n_components | Number of mixture components (clusters) | Determines the number of Gaussian distributions (clusters) |
| | covariance_type | Type of covariance matrix: full, tied, diag, or spherical | Defines how the covariance of the components is calculated |
| | tol | Convergence threshold | Stops iteration if log-likelihood change is smaller than tol |
| | max_iter | Maximum number of iterations for the EM algorithm | Ensures the algorithm stops after a fixed number of iterations |

Evaluating Metrics

Clustering models are unsupervised, so metrics evaluate the quality of the clusters formed.

| Metric | Description |
| --- | --- |
| Silhouette Score | Measures how well clusters are separated and how close points are within a cluster. Ranges from -1 to 1. Higher values indicate well-separated and compact clusters. |
| Davies-Bouldin Index | Measures the average similarity ratio of each cluster with its most similar cluster, i.e., intra-cluster similarity relative to inter-cluster separation. Lower is better. Evaluates compactness and separation of clusters. |
| Calinski-Harabasz Score | Ratio of cluster separation to cluster compactness. Higher values indicate better-defined clusters. |
| Adjusted Rand Index (ARI) | Compares the clustering result to a ground truth (if available). Adjusts for chance clustering. |
| Mutual Information Score | Measures agreement between predicted clusters and ground truth labels. Higher values indicate better alignment. |

Time Series: Parameters and Evaluating Metrics


Parameters

Parameters are the configurable settings and learned values that define how a machine learning model operates and makes predictions. They control the model’s complexity, learning behavior, and decision-making process. Proper parameter selection and tuning are crucial for model performance, as they directly influence the model’s ability to generalize to new, unseen data.

The statsmodels library provides details of the parameters for the ARIMA time series model.

| Model | Parameter | Description | Usage |
| --- | --- | --- | --- |
| ARIMA | p | Number of lag observations (autoregressive part) | Captures dependency on past values |
| | d | Degree of differencing to make the series stationary | Removes trends from the data |
| | q | Number of lagged forecast errors (moving average part) | Models dependency on past prediction errors |
| SARIMA | seasonal_order | Tuple (P, D, Q, m) where m is the season length | Adds seasonal components to ARIMA |
| | trend | Specifies long-term trend behavior: n (none), c (constant), or t (linear) | Helps model global trends in data |
| | weekly_seasonality | Whether to include weekly seasonality (True/False or int for harmonics) | Useful for datasets with strong weekly patterns like retail sales |
| XGBoost (for Time Series) | max_depth | Maximum depth of trees used for feature-based time series modeling | Captures complex temporal relationships |
| | eta (learning rate) | Step size for updating predictions in gradient boosting | Lower values improve robustness but require more iterations |
| | colsample_bytree | Fraction of features sampled for each tree | Reduces overfitting and adds diversity |
| | subsample | Fraction of training instances sampled for each boosting iteration | Introduces randomness to prevent overfitting |
| | objective | Learning task, e.g., reg:squarederror for regression tasks | Matches the regression nature of time series forecasting |
| | lambda | L2 regularization term on weights | Controls overfitting by penalizing large coefficients |
| | alpha | L1 regularization term on weights | Adds sparsity, which is helpful for feature selection |
| | booster | Type of booster: gbtree, gblinear, or dart | Tree-based (gbtree) is most common for time series |
| LSTM | units | Number of neurons in each LSTM layer | Higher values increase model capacity but risk overfitting |
| | input_shape | Shape of input data (timesteps, features) | Specifies the window of historical data and number of features |
| | return_sequences | Whether to return the full sequence (True) or the last output (False) | Use True for stacked LSTMs or sequence outputs |
| | dropout | Fraction of neurons randomly dropped during training (e.g., 0.2) | Prevents overfitting by adding regularization |
| | recurrent_dropout | Fraction of recurrent connections dropped during training | Adds regularization to the temporal dependencies |
| | optimizer | Algorithm for adjusting weights (e.g., adam, sgd) | Controls how the model learns from errors |
| | loss | Loss function (e.g., mse, mae, huber) | Determines how prediction errors are minimized |
| | batch_size | Number of sequences processed together during training | Smaller batches generalize better but take longer to train |
| | epochs | Number of complete passes over the training dataset | Too many epochs may lead to overfitting |
| | timesteps | Number of past observations used to predict future values | Determines the window of historical data analyzed for prediction |
| Orbit | response_col | Name of the column containing the target variable (e.g., sales) | Specifies which variable is being forecasted |
| | date_col | Name of the column containing dates | Identifies the time index for forecasting |
| | seasonality | Seasonal periods (e.g., weekly, monthly, yearly) | Models seasonality explicitly, crucial for periodic patterns in time-series data |
| | seasonality_sm_input | Number of Fourier terms used for seasonality approximation | Controls the smoothness of seasonality; higher values increase granularity |
| | level_sm_input | Smoothing parameter for the level component (between 0 and 1) | Determines how quickly the model adapts to recent changes in level |
| | growth_sm_input | Smoothing parameter for the growth component | Adjusts the sensitivity of the growth trend over time |
| | estimator | Optimizer used for parameter estimation (stan-map, pyro-svi, etc.) | stan-map for faster optimization, pyro-svi for full Bayesian inference |
| | prediction_percentiles | Percentiles for the uncertainty intervals (default: [5, 95]) | Defines the confidence intervals for forecasts |
| | num_warmup | Number of warmup steps in sampling (used in Bayesian methods) | Higher values improve parameter estimation but increase computation time |
| | num_samples | Number of posterior samples drawn (used in Bayesian methods) | Ensures good posterior estimates; higher values yield more robust uncertainty estimates |
| | regressor_col | Name(s) of columns used as regressors | Incorporates additional covariates into the model (e.g., holidays, promotions) |

Evaluating Metrics

Time series models focus on predicting sequential data, so metrics measure the alignment of predicted values with the observed trend. In addition to MAE, MSE, RMSE, and MAPE (see formulas above):

| Metric | Description |
| --- | --- |
| Mean Absolute Error (MAE) | See regression metrics above. |
| Mean Squared Error (MSE) | See regression metrics above. Penalizes larger errors more than MAE, making it sensitive to outliers in time series. |
| Root Mean Squared Error (RMSE) | See regression metrics above. Evaluates prediction accuracy in the original scale of the data. |
| Mean Absolute Percentage Error (MAPE) | See regression metrics above. Useful for scale-independent evaluation but struggles with very small actual values. |
| Symmetric Mean Absolute Percentage Error (sMAPE) | Variant of MAPE; mitigates issues with small denominators. |
| Dynamic Time Warping (DTW) | Measures similarity between two time series, even if they are misaligned. |
| R² Score | Evaluates variance explained by the time series model. |

sMAPE Formula:

sMAPE = (100% / n) · Σ |ŷᵢ − yᵢ| / ( (|yᵢ| + |ŷᵢ|) / 2 )
| Aspect | Regression | Classification | Clustering | Time Series |
| --- | --- | --- | --- | --- |
| Purpose | Predicts continuous values | Assigns to categories | Groups by similarity | Predicts future values |
| Output | Continuous (e.g., prices) | Labels (e.g., spam/not) | Cluster labels | Future predictions |
| Learning | Supervised | Supervised | Unsupervised | Supervised/Unsupervised |
| Algorithms | Linear, Gradient Boosting | Logistic, Random Forest, SVM | K-Means, DBSCAN | ARIMA, LSTM, Prophet |
| Use Cases | Price/sales prediction | Fraud detection, diagnosis | Customer segmentation | Sales/traffic forecasting |

These principles are taken from “Learning from data”, a Caltech course by Yaser Abu-Mostafa: https://work.caltech.edu/telecourse

Occam’s Razor — Prefer simpler models that adequately fit the data to reduce overfitting.

Bias and Variance — Bias is error from oversimplifying (underfitting). Variance is error from overcomplexity (overfitting). A good model balances both. High Variance signs: training MSE low but test MSE much higher. High Bias signs: both training and test MSE are high.

Data Snooping — Avoid tailoring models too closely to specific datasets through repeated testing. Prevent it by separating data properly, using k-fold cross-validation, performing feature engineering only on training data, keeping a final holdout set untouched, and documenting data usage transparently.
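
A minimal sketch of that discipline, reusing the marketing dataset assumptions from Business Case 1: set aside a holdout once, tune with k-fold cross-validation on the rest, and touch the holdout only at the end.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv("marketing_and_sales.csv").dropna()          # hypothetical local export
X = pd.get_dummies(df.drop(columns=["Sales"]), drop_first=True)
y = df["Sales"]

# The holdout set is split off once and never used for tuning or feature engineering.
X_work, X_holdout, y_work, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the working set only.
cv_scores = cross_val_score(LinearRegression(), X_work, y_work, cv=5, scoring="r2")
print("CV R² scores:", cv_scores, "mean:", cv_scores.mean())

# Only after the model is frozen is the holdout used, exactly once, for the final estimate.
final_model = LinearRegression().fit(X_work, y_work)
print("Holdout R²:", final_model.score(X_holdout, y_holdout))
```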