
Analyst Handbook

The Analyst Handbook serves as a guide for analysts to perform exploratory data analysis and extract actionable insights using simple models within the xVector Platform. It also provides insights into orchestration and observability within data workflows. The handbook uses four business cases, tied to key modeling approaches (Regression, Classification, Clustering, and Time Series), to contextualize these concepts, with a focus on Marketing Analytics applications.

  • The first business case focuses on marketing campaign and sales data. It uses linear regression to optimize marketing spend across different advertising channels and maximize sales revenue. The company’s historical data on marketing campaigns and sales figures is explored to identify which channels provide the best ROI and determine the expected sales impact.

  • The second business case involves optimizing marketing strategies for a bank. It uses a random forest classification model to analyze bank marketing data to identify factors that drive campaign success and target customer segments that are most likely to respond positively.

  • The third business case aims to identify and understand customer segments based on purchasing behaviors. It uses KMeans clustering to analyze online retail transaction data to improve customer retention and maximize revenue by understanding customer segments and their purchasing behaviors.

  • The fourth business case involves analyzing and forecasting sales trends using store sales data. It uses an ARIMA time series model to identify peak sales periods, understand growth trends, and uncover seasonal patterns to optimize inventory, plan promotions, and enhance revenue predictability.

These business cases, taken from Kaggle, will help you get familiar with the xVector Platform.

The handbook provides information on evaluating metrics and model comparison, and discusses topics such as data exploration and data snooping.

Advanced modeling techniques, evaluations, and ML operations are discussed in the Data Scientist Handbook.

Data operations such as data quality, pipelines, and advanced enrichment functions are covered in the Data Engineering Handbook.

xVector is a unified platform for building data applications and agents powered by a MetaGraph. Users can bring in data from various sources, enrich the data, explore, apply advanced modeling techniques, derive insights, and act on them, all in a single pane, collaboratively.


Business Case 1 (Regression): Marketing Campaign and Sales Data


Consider a business that would like to optimize marketing spend across different advertising channels to maximize sales revenue. This involves determining the effectiveness of TV, social media, radio, and influencer promotions in driving sales and understanding how to allocate budgets for the best return on investment (ROI).

The company has historical data available on the marketing campaigns, including budgets spent on TV, social media, radio, and influencer collaborations, alongside the corresponding sales figures. However, the question remains: how can the company predict sales more accurately, identify which channels provide the best ROI, and determine the expected sales impact per $1,000 spent?

This journey begins by exploring the data, which includes sales figures and promotional budgets across different channels. Raw data is rarely in a usable form right from the start. We address potential biases, handle missing values, and identify outliers that could distort the results. With a clean and well-prepared dataset, the next step is to dive deeper into the data to extract meaningful insights.

To make informed decisions on marketing spend, businesses need to understand how each advertising channel influences sales. The relationship between marketing spend and sales is complex, with many factors at play. By fitting a linear regression model to the data, we can estimate how changes in the marketing budget for each channel influence sales. This helps identify which channels yield the highest sales per dollar spent and provides a framework for making more informed budget allocation decisions. For instance, the model might show that spending on TV ads yields the highest return on investment, while spending on social media or radio could be less effective, guiding future budget allocations.

Having identified effective channels, it is important to ensure the accuracy and reliability of the predictions. R² and Mean Squared Error are measures of the model’s performance. R² score, in particular, indicates how well the model explains the variance in sales based on marketing spend, with a higher score suggesting that the model can predict sales more accurately. On the other hand, the Mean Squared Error (MSE) measures the average squared difference between predicted and actual sales, helping to assess the quality of the predictions — lower MSE values indicate a better fit of the model to the data.
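
As a point of reference before moving to the platform steps, the workflow above can be sketched with scikit-learn. This is a minimal, illustrative baseline, not the platform's implementation; the file name and the exact column names (TV, Radio, Social Media, Influencer, Sales) are assumptions based on the dataset description.

```python
# Minimal sketch only; file and column names are assumptions based on the Kaggle dataset.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("marketing_and_sales.csv").dropna()   # hypothetical local export

# One-hot encode the categorical Influencer column; numeric budgets pass through unchanged.
X = pd.get_dummies(df[["TV", "Radio", "Social Media", "Influencer"]], drop_first=True)
y = df["Sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Coefficients: expected change in sales per unit change in each channel's budget.
print(pd.Series(model.coef_, index=X.columns).sort_values(ascending=False))

pred = model.predict(X_test)
print("R²: ", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
```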

By evaluating these metrics, businesses gain confidence in the model’s ability to make reliable predictions. With these insights, companies can fine-tune their marketing strategies, reallocate budgets to the highest-performing channels, and identify areas where additional investment may not yield optimal results.

Now, let us look at how all this can be achieved in the xVector Platform.

You can download the Marketing Campaign and Sales Data from Kaggle. This dataset contains:

  • TV promotion budget (in millions)
  • Social Media promotion budget (in millions)
  • Radio promotion budget (in millions)
  • Influencer: whether the promotion collaborates with a Mega, Macro, Nano, or Micro influencer
  • Sales (in millions)

Analysis Questions:

  • Which advertising channel provides the best ROI?
  • How accurately can we predict sales from advertising spend?
  • What’s the expected sales impact per $1000 spent on each channel?

xVector has a catalog of connectors. If required, you can build connectors to custom sources and formats.

Below are the steps to implement this in the xVector Platform:

Once the data is imported, create a dataset for enrichment purposes. xVector provides the capability to keep these datasets synchronized with the original data sources, ensuring consistency.

Data exploration (suggested checklist) entails understanding the process that generates the data and the characteristics of the data. In this case, the process refers to the marketing department’s spending on various channels, as captured in its systems. As for the data characteristics, the dataset has very few missing values: the Social Media column has only six missing values out of around 4,570 records. These records can either be removed or populated with, for example, average values or values supplied by the business from another system. Influencer is a categorical column with 4 unique values.
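
For reference, the same missing-value checks can be sketched in pandas; the file name is hypothetical and the Social Media column name follows the dataset description.

```python
import pandas as pd

df = pd.read_csv("marketing_and_sales.csv")   # hypothetical local export

# How many values are missing per column?
print(df.isna().sum())

# Option 1: drop the handful of incomplete records.
cleaned = df.dropna()

# Option 2: impute the missing Social Media budgets with the column mean
# (or with values supplied by the business from another system).
df["Social Media"] = df["Social Media"].fillna(df["Social Media"].mean())
```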

xVector provides out-of-the-box tools to profile the data. To explore the data further, you can create reports manually or by using the GenAI-powered options. Generate Exploratory Report and Generate Report are GenAI reports built on the platform.

To view the data profile page, click on the kebab menu (vertical ellipses) → “View Profile” on the created dataset. Profile view helps identify outliers and correlations.

Kebab menu showing View Profile option

Below is the profile page view of the dataset:

Profile page view of the marketing dataset

Once you perform basic exploration of the data, you can then enrich the data. For example, “dropna” is one such function used to drop records with null values.

Enrichment function: dropna to remove null records

In the current case, the business would like to optimize marketing spend across different advertising channels to maximize sales revenue. Using the regression model, we can predict sales based on various input factors (such as TV, social media, radio, and influencer spend).

Having created the dataset and explored the data, we are now ready to build a linear regression model to analyze and make predictions.

In the world of xVectorlabs, each model is a hub of exploration, where experiments are authored to test various facets of the algorithm. A user can create multiple experiments under a model. An experiment includes one or more runs. Under each experiment, various parameters with available drivers can be tried on different features.

Experiments can have multiple runs with different input parameters and performance metrics as output. Based on the metric, one of these runs can be chosen for the final model.

The platform provides a comprehensive set of model drivers curated based on industry best practices. Advanced users can author their custom drivers, if required.

For the current dataset, we will use an existing model driver, Sklearn-LinearRegression. Follow the steps to build a linear regression model with this driver, using appropriate parameters and evaluation metrics to optimize the solution.

Here is a linear regression model implemented in xVector.

The run below shows the parameters used, along with the metrics and scores for analysis.

Model run parameters and metrics

Which Advertising Channel provides the best ROI?

  • The channel with the highest positive coefficient in the regression model has the greatest impact on sales per dollar spent. In linear regression, coefficients are numerical values that represent the relationship between predictor variables and the response variable. They indicate the strength and direction of the relationship and are multiplied by the predictor values in the regression equation. A positive coefficient means that as the predictor increases, the response variable also increases, while a negative coefficient indicates an inverse relationship.

  • In the above example, TV provides the best ROI, with the maximum coefficient of 3.29.

Regression coefficients showing TV with highest ROI

This can also be inferred from the correlation matrix on the dataset’s profile page. In this case, TV has the highest correlation with sales, at 0.99.

Correlation matrix showing TV at 0.99 correlation
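
For reference, the same check can be reproduced with pandas (illustrative only; the file name and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("marketing_and_sales.csv").dropna()   # hypothetical local export
# Correlation of each numeric budget column with Sales; TV should stand out.
print(df[["TV", "Radio", "Social Media", "Sales"]].corr()["Sales"].sort_values(ascending=False))
```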

How accurately can we predict sales from Advertising Spend?

  • The R² score measures the proportion of variance in sales explained by advertising spend. A value closer to 1 implies higher accuracy.
    • In the above example, the model is quite accurate, with an R² score of 0.98.
  • The Mean Absolute Error (MAE) quantifies the average error between actual and predicted sales. A lower MAE indicates that the model’s predictions are closer to the actual values.
    • The MAE here is 4.18, which is low, implying the predictions are close to the actual values.

Model evaluation metrics: R² and MAE

What’s the Expected Sales Impact per $1000 spent on each Channel?

  • Use the Impact per $1000 column from the coefficients DataFrame.
  • The coefficient of TV is 3.2938. This predicts that spending $1,000 more on TV ads is expected to increase sales by about $3,293.80.

Impact per $1000 spent on each advertising channel
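
Because both the budgets and the sales figures are expressed in millions of dollars, the coefficient converts to an impact per $1,000 directly. A small illustrative calculation, using the coefficient from the run above:

```python
# Budgets and sales are both in millions of dollars, so a coefficient of 3.2938 means
# $3.2938 of additional sales per additional $1 of TV spend.
tv_coefficient = 3.2938            # taken from the model run above
print(tv_coefficient * 1000)       # 3293.8 → roughly $3,293.80 more sales per extra $1,000 on TV
```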

Negative Coefficients

  • A negative coefficient suggests an inverse relationship between the corresponding feature and the outcome variable. Specifically:
    • Influencer (-0.1142): Spending $1000 on Micro-Influencers reduces the outcome by roughly $114.20.

Possible Explanations for Negative Impacts:

  • Diminishing Returns: These marketing channels might already be saturated, leading to diminishing or negative returns on additional investment.
  • Ineffective Strategy: The investment in these areas may not be optimized, or the target audience might not respond well to these channels.
  • Indirect Effects: The spending might be cannibalizing other channels or producing unintended negative outcomes (e.g., customer annoyance, ad fatigue).

Business Case 2 (Classification): Bank Marketing Dataset


The objective of the bank is to maximize term deposits from customers by optimizing marketing strategies. This can be done by identifying the factors that drive campaign success, understanding the overall campaign performance, and targeting customer segments most likely to respond positively.

First, we explore the dataset, which contains customer demographics, past campaign data, and behavioral features such as job, education, and balance. The current dataset has 10 categorical columns, including marital status, education, and job.

To predict whether a customer will subscribe to a term deposit, we use the Random Forest classification model. This model is chosen for its ability to handle complex, non-linear relationships between features and the ability to provide feature importance rankings to identify the most influential predictors.

By continuously validating and refining the model, the bank ensures its marketing campaigns remain data-driven, efficient, and impactful, leading to improved conversion rates and better resource allocation.

You can download the Bank Marketing Dataset from Kaggle.

The bank tracks customer term deposits along with customer attributes in its systems. The dataset has 10 categorical features, including marital status, job, loan, education, and the term deposit outcome. There are 12 distinct job categories and 3 distinct marital status categories.

Analysis Questions:

  • What factors best predict campaign success?
  • What’s the overall campaign success rate? **

** To be done in the Data Scientist Handbook

Once the data is imported, create a dataset for enrichment purposes. xVector provides the capability to keep these datasets synchronized with the original data sources, ensuring consistency.

Data exploration (suggested checklist) entails understanding the process that generates the data and the characteristics of the data. The Bank Marketing dataset includes 17 attributes, with features such as customer demographics (e.g., age, job, marital status, education), financial details (e.g., balance, loan, housing), and engagement data (e.g., previous campaign outcomes, duration of calls, and contact methods). The target variable, deposit, indicates whether the customer subscribed (yes or no). The ages in this dataset range from 18 to 95, and about 57% of the customers are married.

The dataset also has negative balance amounts for a few records. Depending on how the bank tracks balances, these customers may have withdrawn more than was available in their accounts, so these records shouldn’t be dropped without first understanding how the bank tracks balances.

xVector provides out-of-the-box tools to profile the data. Generate Report and Generate Exploratory Report are GenAI reports built on the platform.

Kebab menu showing View Profile option for bank marketing dataset

Profile page view of the bank marketing dataset

Once you perform basic exploration of the data, you can then enrich the data. For example, “dropna” is one such function used to drop records with null values.

Dataset preview showing enrichment options

We use a Random Forest classification model for its ability to handle complex, non-linear relationships between features and to provide feature importance rankings that identify the most influential predictors.

For the current dataset, we will use an existing model driver, RandomForest. Follow the steps to build a classification model with this driver, using appropriate parameters and evaluation metrics to optimize the solution.

Here is a classification model implemented in xVector.

Random forest model run parameters and metrics
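
For comparison, an equivalent baseline can be sketched outside the platform with scikit-learn. The file name and the deposit target column are assumptions based on the dataset description; this is an illustrative sketch, not the platform's implementation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("bank.csv")                                         # hypothetical local export
y = (df["deposit"] == "yes").astype(int)                             # target: subscribed or not
X = pd.get_dummies(df.drop(columns=["deposit"]), drop_first=True)    # encode categorical features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 score:", f1_score(y_test, pred))

# Which factors matter most? Feature importances answer the first analysis question.
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```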

What factors best predict campaign success?

Feature importance from the Random Forest model reveals the most influential factors. In this case, the top 3 features are duration, balance, and age.

Feature importance chart showing duration, balance, and age as top predictors

**What’s the overall campaign success rate? ****

The proportion of records with deposit == ‘yes’ gives the success rate. Here, the overall campaign success rate is 47.38%.
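
As a quick illustration (assuming the same hypothetical CSV as above):

```python
import pandas as pd

df = pd.read_csv("bank.csv")                                  # hypothetical local export
success_rate = (df["deposit"] == "yes").mean()
print(f"Overall campaign success rate: {success_rate:.2%}")
```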

Which customer segments are most likely to respond positively? **

Based on the heatmap below, those with management jobs and tertiary education are most likely to respond positively.

Heatmap showing management jobs and tertiary education as strongest positive responders
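
A heatmap like the one above can be approximated with a pivot of response rates by job and education. Illustrative only; the file and column names are assumptions based on the dataset description.

```python
import pandas as pd

df = pd.read_csv("bank.csv")                                  # hypothetical local export
df["responded"] = (df["deposit"] == "yes").astype(int)

# Share of positive responses for each job/education combination.
rates = df.pivot_table(index="job", columns="education", values="responded", aggfunc="mean")
print(rates.round(2))

# Optional visualization:
# import seaborn as sns; sns.heatmap(rates, annot=True, cmap="Blues")
```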

Business Case 3 (Clustering): Online Retail Transaction Data


An online retail store would like to identify and understand customer segments based on purchasing behaviors to improve customer retention and maximize revenue. By distinguishing the most valuable customers, the company can create targeted marketing strategies, enhance loyalty programs, and optimize resource allocation to increase long-term profitability.

The Online Retail Transaction dataset includes transactional details such as invoice numbers, stock codes, product descriptions, quantities, invoice dates, and customer IDs, along with the country of purchase. The primary goal is to use this information to segment customers based on their purchase behavior and determine which segments represent the most valuable customers.

The analysis begins with data exploration and preparation, a critical step for ensuring accuracy and reliability. Once the data is cleaned and enriched, the focus shifts to determining the optimal number of groups for segmentation using the elbow method. The final step is to analyze the features that are most important for grouping or segmenting the dataset.

You can download the Online Retail Transaction Data Source from Kaggle.

This dataset includes transactional details such as invoice numbers, stock codes, product descriptions, quantities, invoice dates, customer IDs, unit price, and the country of purchase. Of the 38 countries, the UK far exceeds the others in sales of these products.

Analysis Questions:

  • Are there outliers in the data points?
  • How many groups should the data points be segmented into?
  • What are the important features to consider while grouping or segmenting the dataset?

Once the data is imported, create a dataset for enrichment purposes. Data exploration (suggested checklist) entails understanding the process that generates the data and the characteristics of the data. The Online Retail Transaction dataset contains records of customer purchases. There are several records with negative quantities, which are not valid values. The invoices range from December 2010 to December 2011. Around 500K records belong to the United Kingdom.

xVector provides out-of-the-box tools to profile the data. Generate Exploratory Report and Generate Report are GenAI reports built on the platform.

Kebab menu showing View Profile option for retail dataset

Profile page view of the online retail dataset

Once you perform basic exploration of the data, you can then enrich the data. For example, “dropna” is one such function used to drop records with null values.

Enrichment function: dropna to remove null records

Here is a link to the Profile page on an xVector DataApp:

xVector DataApp profile page

The K-Means clustering model is ideal for this dataset as it effectively segments customers into meaningful groups based on their purchasing behavior, uncovering patterns in the data.

For the current dataset, we will use an existing model driver, KMeans-with-Elbow. Follow the steps to build a KMeans clustering model with this driver, using appropriate parameters and evaluation metrics to optimize the solution.

Here is a KMeans Clustering model implemented in xVector.

KMeans model run parameters and metrics

Are there any outliers in the data points?

Using KMeans clustering, we can detect unusual patterns in transaction amounts or quantities. In the current dataset, only a sparse set of data points appear as outliers:

Outlier scatter plot showing Quantity vs StockCode with cluster labels
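
One simple way to surface such outliers outside the platform is to look at each point's distance to its assigned cluster centre. The sketch below is illustrative; the file name, the choice of Quantity and UnitPrice as features, and the 99.5th-percentile threshold are all assumptions, not the platform's logic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("online_retail.csv").dropna(subset=["CustomerID"])   # hypothetical local export
df = df[df["Quantity"] > 0]                     # drop invalid negative quantities

features = StandardScaler().fit_transform(df[["Quantity", "UnitPrice"]])
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(features)

# Distance of each point to its own cluster centre; the largest distances are outlier candidates.
distances = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)
df["outlier"] = distances > np.percentile(distances, 99.5)
print(df["outlier"].sum(), "potential outliers flagged")
```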

What is the optimal number of groups for the data points?

In the current scenario, based on the plot below, we can form 3 groups.

Elbow/inertia plot showing optimal k=3 clusters
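
An elbow curve like the one above can be reproduced by fitting K-Means for a range of k values and plotting the inertia. A minimal sketch under the same assumptions as before (the file name and feature choice are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("online_retail.csv").dropna(subset=["CustomerID"])   # hypothetical local export
X = StandardScaler().fit_transform(df[["Quantity", "UnitPrice"]])

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method: look for the bend (around k = 3 here)")
plt.show()
```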

What are the main features we should consider for the grouping?

Feature importance for clustering showing stockcode and country as main features

The above plot indicates that stockcode and country should be considered the main features when grouping for business decisions.

Both enriched data and customer segmentation information can be sent to target systems to operationalize insights. This data can be saved to a destination, such as an S3 bucket, as a new file for downstream use.


Business Case 4 (Time Series): Store Sales Data


A store would like to analyze and forecast sales trends to improve decision-making for store operations and marketing. Understanding sales dynamics is critical for effective inventory management, planning promotions, and predicting future sales performance. The primary focus of this analysis is to determine whether the data is stationary, identify trends or seasonal patterns, and explore peak sales periods while forecasting future sales.

The ARIMA time series model is employed to forecast future sales while accounting for these trends and patterns. ARIMA is chosen for its ability to handle both autoregressive (AR) and moving average (MA) components while incorporating differencing to make the data stationary.

You can download the Store Sales Time Series Data from Kaggle.

Here, the store captures product sales. The dataset covers sales over a span of 8 years. It is a small dataset with no null values.

Analysis Questions:

  • Is the data stationary?
  • Does it have a trend or seasonality?
  • What are the peak sales periods? ***
  • What’s the overall sales growth trend? ***
  • Are there clear seasonal patterns in sales? ***

*** Note: This entails advanced techniques that will be addressed in the Data Scientist Handbook. For now, the assumption is that these reports are available for analysis.

Once the data is imported, create a dataset for enrichment purposes. Data exploration (suggested checklist) — the Store Sales dataset tracks the number of transactions on a given day. There are no missing values in this dataset.

xVector provides out-of-the-box tools to profile the data. Generate Exploratory Report and Generate Report are GenAI reports built on the platform.

Kebab menu showing View Profile option for store sales dataset

Profile page view of the store sales dataset

Once you perform basic exploration of the data, you can then enrich the data. For example, “filter” is one such function used to filter appropriate records.

Enrichment function: filter to select appropriate records

For the current dataset, we will use an existing model driver, stats_ARIMA. Follow the steps to build a time series model with this driver, using appropriate parameters and evaluation metrics to optimize the solution.

Here is a time series model implemented in xVector.

ARIMA model run parameters and metrics
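
Outside the platform, a comparable baseline can be sketched with statsmodels. The file name, column names, and the (1, 1, 1) order are illustrative starting points, not the tuned configuration used in the run above.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical local export with a date column and a transactions column.
df = pd.read_csv("transactions.csv", parse_dates=["date"], index_col="date")
series = df["transactions"].resample("M").sum()        # aggregate to monthly totals

model = ARIMA(series, order=(1, 1, 1)).fit()           # (p, d, q): untuned example order
print(model.summary())

forecast = model.forecast(steps=12)                    # forecast the next 12 periods
print(forecast)
```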

Is the data stationary?

Stationarity test results showing non-stationary data

Based on the above analysis, the dataset is non-stationary. This implies that the data shows trends, seasonality, or varying variance over time.
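
A common way to check this is the Augmented Dickey-Fuller (ADF) test. Illustrative sketch, using the same assumed file as above:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("transactions.csv", parse_dates=["date"], index_col="date")   # hypothetical file
series = df["transactions"].resample("M").sum()

adf_stat, p_value, *_ = adfuller(series.dropna())
print("ADF statistic:", adf_stat)
print("p-value:", p_value)
# A p-value above 0.05 means we cannot reject the unit root: the series is non-stationary,
# so differencing (the 'd' term in ARIMA) or seasonal adjustment is needed before fitting.
```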

Does it have a trend or seasonality?

Trend and seasonality decomposition visualization

The autocorrelation function (ACF) and partial autocorrelation function (PACF) plots help in understanding the temporal dependencies in the data. If the ACF plot shows a slow decay, this suggests the presence of a trend. If you notice spikes at regular intervals, this indicates the presence of seasonality — suggesting the need for seasonal differencing or a SARIMA model.
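
These plots can be produced with statsmodels as well (illustrative; same assumed file as above):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

df = pd.read_csv("transactions.csv", parse_dates=["date"], index_col="date")   # hypothetical file
series = df["transactions"].resample("M").sum()

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series.dropna(), ax=axes[0])    # slow decay suggests a trend; regular spikes suggest seasonality
plot_pacf(series.dropna(), ax=axes[1])
plt.tight_layout()
plt.show()
```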

What are the peak sales periods? ***

Peak sales chart showing December peaks each year

December of every year has peak sales.

**What’s the overall sales growth trend? *****

Overall sales growth trend line

Are there clear seasonal patterns in sales? ***

  • Holiday Patterns: Sales peak in December (holiday season) and November (Black Friday).
  • Long-Term Trends: Seasonal patterns might change over time, indicating shifting consumer behavior.
  • Overall Growth: The trend component reveals whether the store’s sales are growing, stagnating, or declining.
Dataset visualizations showing sales patterns

Additional dataset visualizations

Sales forecast:

ARIMA sales forecast projection

Typical data analysis process flowchart

Exploring data (often called Exploratory Data Analysis or EDA) is a critical process of examining and understanding a dataset before diving into formal modeling.

Context & Source — What business process/system generated the data? What is the purpose of collecting this data? What is the unit of observation (row = transaction, customer etc.)? What is the time coverage (start & end dates)? Are there known biases?

Data Quality — Are there missing values? Duplicate records? Are numeric values within expected ranges? Are categorical values standardized (e.g., “FB” vs. “Facebook”)? Are timestamps correctly formatted? Are there anomalies (e.g., negative revenue, invalid dates)?

Data Structure — What are the data types (numeric, categorical, text, datetime)? Which fields can serve as keys or identifiers? How many records and features? What’s the distribution of key numeric fields? What’s the frequency of categorical values?

Behavior & Relationships — Are there trends over time? Correlations between key variables? Outliers that may distort the analysis? Have the statistical properties of data changed over time (data drift)?

Visualization Techniques — Histograms for data distribution, box plots for spread and outliers, scatter plots for relationships, correlation matrices for feature interactions, heat maps for complex patterns.

Chart type selection guide

Linear Regression: Parameters and Evaluating Metrics


Parameters

Parameters are the configurable settings and learned values that define how a machine learning model operates and makes predictions. They control the model’s complexity, learning behavior, and decision-making process. Proper parameter selection and tuning are crucial for model performance, as they directly influence the model’s ability to generalize to new, unseen data.

Scikit-learn provides details of the parameters for the Linear Regression model.

| Model | Parameter | Description | Usage |
| --- | --- | --- | --- |
| Linear Regression | fit_intercept | Whether to calculate the intercept for the regression model | Set False if the data is already centered |
| | normalize | Normalizes input features. Deprecated in recent Scikit-learn versions | Helps with features on different scales |
| | test_size | Size of test data | Helps with splitting train and test data |
| Ridge Regression | alpha | L2 regularization strength. Larger values shrink coefficients more | Prevents overfitting by reducing model complexity |
| | solver | Optimization algorithm: auto, saga, etc. | Impacts convergence speed and stability for large datasets |
| Lasso Regression | alpha | L1 regularization strength. Controls sparsity of coefficients | Useful for feature selection |
| | max_iter | Maximum iterations for optimization | Impacts convergence for large or complex datasets |
| XGBoost (Regression) | eta (learning rate) | Step size for updating predictions | Lower values make learning slower but more robust |
| | max_depth | Maximum depth of trees | Higher values can capture complex relationships but risk overfitting |
| | colsample_bytree | Fraction of features sampled for each tree | Introduces randomness, reducing overfitting |

Evaluating Metrics

Regression models predict continuous values, so the metrics focus on measuring the difference between predicted and actual values.

In the formulas below: n = number of observations, yᵢ = actual value, ŷᵢ = predicted value, ȳ = mean of the actual values, p = number of predictors, RSS = residual sum of squares Σ (yᵢ − ŷᵢ)², and TSS = total sum of squares Σ (yᵢ − ȳ)².

| Metric | Description |
| --- | --- |
| Mean Absolute Error (MAE) | Measures the average magnitude of errors without considering their direction. A lower MAE indicates better model performance. It’s easy to interpret but doesn’t penalize large errors as much as MSE. |
| Mean Squared Error (MSE) | Computes the average squared difference between actual and predicted values. Penalizes larger errors more than MAE, making it sensitive to outliers. |
| Root Mean Squared Error (RMSE) | Square root of MSE; represents errors in the same unit as the target variable. Balances interpretability and sensitivity to large errors. |
| R² Score (Coefficient of Determination) | Proportion of variance explained by the model. Values range from 0 to 1, where 1 means perfect prediction. Negative values indicate poor performance. |
| Adjusted R² | Adjusts R² for the number of predictors in the model, by penalizing the addition of irrelevant features. Useful for comparing models with different numbers of predictors. |
| Mean Absolute Percentage Error (MAPE) | Measures error as a percentage of actual values, making it scale-independent. Useful for scale-independent evaluation but struggles with very small actual values. |

Formulas:

MAE:

MAE = (1/n) · Σ |yᵢ − ŷᵢ|

MSE:

MSE = (1/n) · Σ (yᵢ − ŷᵢ)²

RMSE:

RMSE = √MSE = √[ (1/n) · Σ (yᵢ − ŷᵢ)² ]

R² Score:

R² = 1 − RSS / TSS = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²

Adjusted R²:

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

MAPE:

MAPE = (100% / n) · Σ | (yᵢ − ŷᵢ) / yᵢ |

Classification: Parameters and Evaluating Metrics


Parameters

Parameters are the configurable settings and learned values that define how a machine learning model operates and makes predictions. They control the model’s complexity, learning behavior, and decision-making process. Proper parameter selection and tuning are crucial for model performance, as they directly influence the model’s ability to generalize to new, unseen data.

Scikit-learn provides details of the parameters for the Random Forest Classification model.

| Model | Parameter | Description | Usage |
| --- | --- | --- | --- |
| Random Forest Classifier | n_estimators | Number of trees in the forest | Affects accuracy and training speed; larger forests usually perform better |
| | max_features | Number of features to consider when splitting | Reduces overfitting and speeds up training |
| | bootstrap | Whether to sample data with replacement | Improves diversity among trees |
| Logistic Regression | penalty | Type of regularization: l1, l2, elasticnet, or none | Adds constraints to model coefficients to prevent overfitting |
| | solver | Optimization algorithm: liblinear, saga, lbfgs, etc. | Determines how the model is optimized, with some solvers supporting specific penalties |
| | C | Inverse of regularization strength. Smaller values increase regularization | Balances bias and variance |
| | max_iter | Maximum number of iterations for optimization | Ensures convergence for complex problems |
| Support Vector Machine (SVM) | C | Regularization parameter. Smaller values create larger margins but may underfit | Controls the trade-off between misclassification and margin size |
| | kernel | Kernel type: linear, rbf, poly, or sigmoid | Determines how data is transformed into higher dimensions |
| | gamma | Kernel coefficient for non-linear kernels | Impacts the decision boundary for non-linear kernels like rbf or poly |
| Decision Tree Classifier | criterion | Function to measure split quality: gini or entropy | Controls how splits are chosen (impurity vs. information gain) |
| | max_depth | Maximum depth of the tree | Prevents overfitting by restricting the complexity of the tree |
| | min_samples_split | Minimum samples required to split a node | Ensures that nodes are not split with very few samples |
| | min_samples_leaf | Minimum samples required in a leaf node | Prevents overfitting by ensuring leaves have sufficient data |
| K-Nearest Neighbors (KNN) | n_neighbors | Number of neighbors to consider for classification | Affects granularity of classification; smaller values lead to more localized decisions |
| | weights | Weighting function: uniform (equal weight) or distance (closer points have higher weight) | Impacts how neighbors influence the prediction |
| | metric | Distance metric: minkowski, euclidean, manhattan, etc. | Defines how distances between data points are calculated |
| Naive Bayes | var_smoothing | Portion of variance added to stabilize calculations | Prevents division by zero for features with very low variance |
| XGBoost (Classification) | objective | Specifies the learning task: binary:logistic, multi:softprob, etc. | Matches the classification type (binary or multiclass) |
| | scale_pos_weight | Balances positive and negative classes for imbalanced datasets | Essential for tasks like fraud detection where class imbalance is significant |
| | max_depth | Maximum depth of trees | Higher values increase model complexity but risk overfitting |
| | eta (learning rate) | Step size for updating predictions | Smaller values lead to slower, more accurate training |
| | gamma | Minimum loss reduction required for further tree splits | Higher values make the model more conservative |

Evaluating Metrics

Classification models predict discrete labels, so the metrics measure the correctness of those predictions.

In the formulas below: TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.

| Metric | Description |
| --- | --- |
| Accuracy | Ratio of correct predictions to total predictions. Works well for balanced datasets but fails for imbalanced ones. |
| Precision | Fraction of true positive predictions among all positive predictions. High precision minimizes false positives. |
| Recall (Sensitivity) | Fraction of actual positives that were correctly predicted. High recall minimizes false negatives. |
| F1 Score | Harmonic mean of precision and recall. Best suited for imbalanced datasets. |
| Confusion Matrix | Tabular representation of true positives, true negatives, false positives, and false negatives. Helps visualize classification performance. |
| ROC-AUC Score | Measures the trade-off between true positive rate (TPR) and false positive rate (FPR). Evaluates a classifier’s ability to distinguish between classes at various thresholds. Higher AUC indicates better performance. |

Formulas:

Accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision:

Precision = TP / (TP + FP)

Recall:

Recall = TP / (TP + FN)

F1 Score:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Clustering: Parameters and Evaluating Metrics


Parameters

Parameters are the configurable settings and learned values that define how a machine learning model operates and makes predictions. They control the model’s complexity, learning behavior, and decision-making process. Proper parameter selection and tuning are crucial for model performance, as they directly influence the model’s ability to generalize to new, unseen data.

Scikit-learn provides details of the parameters for KMeans clustering.

| Model | Parameter | Description | Usage |
| --- | --- | --- | --- |
| K-Means | n_clusters | Number of clusters to form | Controls the number of groups/clusters in the data |
| | init | Initialization method for centroids: k-means++, random | k-means++ is better for convergence |
| | max_iter | Maximum number of iterations to run the algorithm | Prevents infinite loops and ensures convergence |
| | tol | Tolerance for convergence | Stops the algorithm when the centroids’ movement is smaller than this value |
| | n_init | Number of times the K-Means algorithm will be run with different centroid seeds | Ensures better centroids and better performance |
| DBSCAN | eps | Maximum distance between two points to be considered neighbors | Determines cluster density |
| | min_samples | Minimum number of points required to form a dense region (a cluster) | Larger values lead to fewer but denser clusters |
| | metric | Distance metric used for clustering: euclidean, manhattan, etc. | Affects the way distances are calculated between points |
| Agglomerative Clustering | n_clusters | Number of clusters to form | Specifies the number of clusters to form at the end of the clustering process |
| | linkage | Determines how to merge clusters: ward, complete, average, or single | Affects how clusters are combined (Ward minimizes variance) |
| | affinity | Metric used to compute distances: euclidean, manhattan, cosine, etc. | Affects the distance measure between data points during clustering |
| K-Medoids | n_clusters | Number of clusters to form | Specifies the number of clusters (like K-Means but uses medoids) |
| | metric | Distance metric for pairwise dissimilarity | Defines the method for calculating pairwise distances between points |
| | max_iter | Maximum number of iterations to run the algorithm | Ensures termination after a certain number of iterations |
| Gaussian Mixture Model | n_components | Number of mixture components (clusters) | Determines the number of Gaussian distributions (clusters) |
| | covariance_type | Type of covariance matrix: full, tied, diag, or spherical | Defines how the covariance of the components is calculated |
| | tol | Convergence threshold | Stops iteration if log-likelihood change is smaller than tol |
| | max_iter | Maximum number of iterations for the EM algorithm | Ensures the algorithm stops after a fixed number of iterations |

Evaluating Metrics

Clustering models are unsupervised, so metrics evaluate the quality of the clusters formed.

| Metric | Description |
| --- | --- |
| Silhouette Score | Measures how well clusters are separated and how close points are within a cluster. Ranges from -1 to 1. Higher values indicate well-separated and compact clusters. |
| Davies-Bouldin Index | Measures the average similarity ratio of each cluster with its most similar cluster, i.e., intra-cluster similarity relative to inter-cluster separation. Lower is better. Evaluates compactness and separation of clusters. |
| Calinski-Harabasz Score | Ratio of cluster separation to cluster compactness. Higher values indicate better-defined clusters. |
| Adjusted Rand Index (ARI) | Compares the clustering result to a ground truth (if available). Adjusts for chance clustering. |
| Mutual Information Score | Measures agreement between predicted clusters and ground truth labels. Higher values indicate better alignment. |

Time Series: Parameters and Evaluating Metrics


Parameters

Parameters are the configurable settings and learned values that define how a machine learning model operates and makes predictions. They control the model’s complexity, learning behavior, and decision-making process. Proper parameter selection and tuning are crucial for model performance, as they directly influence the model’s ability to generalize to new, unseen data.

The statsmodels library provides details of the parameters for the ARIMA time series model.

| Model | Parameter | Description | Usage |
| --- | --- | --- | --- |
| ARIMA | p | Number of lag observations (autoregressive part) | Captures dependency on past values |
| | d | Degree of differencing to make the series stationary | Removes trends from the data |
| | q | Number of lagged forecast errors (moving average part) | Models dependency on past prediction errors |
| SARIMA | seasonal_order | Tuple (P, D, Q, m) where m is the season length | Adds seasonal components to ARIMA |
| | trend | Specifies long-term trend behavior: n (none), c (constant), or t (linear) | Helps model global trends in data |
| | weekly_seasonality | Whether to include weekly seasonality (True/False or int for harmonics) | Useful for datasets with strong weekly patterns like retail sales |
| XGBoost (for Time Series) | max_depth | Maximum depth of trees used for feature-based time series modeling | Captures complex temporal relationships |
| | eta (learning rate) | Step size for updating predictions in gradient boosting | Lower values improve robustness but require more iterations |
| | colsample_bytree | Fraction of features sampled for each tree | Reduces overfitting and adds diversity |
| | subsample | Fraction of training instances sampled for each boosting iteration | Introduces randomness to prevent overfitting |
| | objective | Learning task, e.g., reg:squarederror for regression tasks | Matches the regression nature of time series forecasting |
| | lambda | L2 regularization term on weights | Controls overfitting by penalizing large coefficients |
| | alpha | L1 regularization term on weights | Adds sparsity, which is helpful for feature selection |
| | booster | Type of booster: gbtree, gblinear, or dart | Tree-based (gbtree) is most common for time series |
| LSTM | units | Number of neurons in each LSTM layer | Higher values increase model capacity but risk overfitting |
| | input_shape | Shape of input data (timesteps, features) | Specifies the window of historical data and number of features |
| | return_sequences | Whether to return the full sequence (True) or the last output (False) | Use True for stacked LSTMs or sequence outputs |
| | dropout | Fraction of neurons randomly dropped during training (e.g., 0.2) | Prevents overfitting by adding regularization |
| | recurrent_dropout | Fraction of recurrent connections dropped during training | Adds regularization to the temporal dependencies |
| | optimizer | Algorithm for adjusting weights (e.g., adam, sgd) | Controls how the model learns from errors |
| | loss | Loss function (e.g., mse, mae, huber) | Determines how prediction errors are minimized |
| | batch_size | Number of sequences processed together during training | Smaller batches generalize better but take longer to train |
| | epochs | Number of complete passes over the training dataset | Too many epochs may lead to overfitting |
| | timesteps | Number of past observations used to predict future values | Determines the window of historical data analyzed for prediction |
| Orbit | response_col | Name of the column containing the target variable (e.g., sales) | Specifies which variable is being forecasted |
| | date_col | Name of the column containing dates | Identifies the time index for forecasting |
| | seasonality | Seasonal periods (e.g., weekly, monthly, yearly) | Models seasonality explicitly, crucial for periodic patterns in time-series data |
| | seasonality_sm_input | Number of Fourier terms used for seasonality approximation | Controls the smoothness of seasonality; higher values increase granularity |
| | level_sm_input | Smoothing parameter for the level component (between 0 and 1) | Determines how quickly the model adapts to recent changes in level |
| | growth_sm_input | Smoothing parameter for the growth component | Adjusts the sensitivity of the growth trend over time |
| | estimator | Optimizer used for parameter estimation (stan-map, pyro-svi, etc.) | stan-map for faster optimization, pyro-svi for full Bayesian inference |
| | prediction_percentiles | Percentiles for the uncertainty intervals (default: [5, 95]) | Defines the confidence intervals for forecasts |
| | num_warmup | Number of warmup steps in sampling (used in Bayesian methods) | Higher values improve parameter estimation but increase computation time |
| | num_samples | Number of posterior samples drawn (used in Bayesian methods) | Ensures good posterior estimates; higher values yield more robust uncertainty estimates |
| | regressor_col | Name(s) of columns used as regressors | Incorporates additional covariates into the model (e.g., holidays, promotions) |

Evaluating Metrics

Time series models focus on predicting sequential data, so metrics measure the alignment of predicted values with the observed trend. In addition to MAE, MSE, RMSE, and MAPE (see formulas above):

| Metric | Description |
| --- | --- |
| Mean Absolute Error (MAE) | See regression metrics above. |
| Mean Squared Error (MSE) | See regression metrics above. Penalizes larger errors more than MAE, making it sensitive to outliers in time series. |
| Root Mean Squared Error (RMSE) | See regression metrics above. Evaluates prediction accuracy in the original scale of the data. |
| Mean Absolute Percentage Error (MAPE) | See regression metrics above. Useful for scale-independent evaluation but struggles with very small actual values. |
| Symmetric Mean Absolute Percentage Error (sMAPE) | Variant of MAPE; mitigates issues with small denominators. |
| Dynamic Time Warping (DTW) | Measures similarity between two time series, even if they are misaligned. |
| R² Score | Evaluates variance explained by the time series model. |

sMAPE Formula:

sMAPE = (100% / n) · Σ |ŷᵢ − yᵢ| / ( (|yᵢ| + |ŷᵢ|) / 2 )
| Aspect | Regression | Classification | Clustering | Time Series |
| --- | --- | --- | --- | --- |
| Purpose | Predicts continuous values | Assigns to categories | Groups by similarity | Predicts future values |
| Output | Continuous (e.g., prices) | Labels (e.g., spam/not) | Cluster labels | Future predictions |
| Learning | Supervised | Supervised | Unsupervised | Supervised/Unsupervised |
| Algorithms | Linear, Gradient Boosting | Logistic, Random Forest, SVM | K-Means, DBSCAN | ARIMA, LSTM, Prophet |
| Use Cases | Price/sales prediction | Fraud detection, diagnosis | Customer segmentation | Sales/traffic forecasting |

These principles are taken from “Learning from data”, a Caltech course by Yaser Abu-Mostafa: https://work.caltech.edu/telecourse

Occam’s Razor — Prefer simpler models that adequately fit the data to reduce overfitting.

Bias and Variance — Bias is error from oversimplifying (underfitting). Variance is error from overcomplexity (overfitting). A good model balances both. High Variance signs: training MSE low but test MSE much higher. High Bias signs: both training and test MSE are high.

Data Snooping — Avoid tailoring models too closely to specific datasets through repeated testing. Prevent it by separating data properly, using k-fold cross-validation, performing feature engineering only on training data, keeping a final holdout set untouched, and documenting data usage transparently.
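
A minimal sketch of that discipline, reusing the marketing dataset assumptions from Business Case 1: set aside a holdout once, tune with k-fold cross-validation on the rest, and touch the holdout only at the end.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv("marketing_and_sales.csv").dropna()          # hypothetical local export
X = pd.get_dummies(df.drop(columns=["Sales"]), drop_first=True)
y = df["Sales"]

# The holdout set is split off once and never used for tuning or feature engineering.
X_work, X_holdout, y_work, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the working set only.
cv_scores = cross_val_score(LinearRegression(), X_work, y_work, cv=5, scoring="r2")
print("CV R² scores:", cv_scores, "mean:", cv_scores.mean())

# Only after the model is frozen is the holdout used, exactly once, for the final estimate.
final_model = LinearRegression().fit(X_work, y_work)
print("Holdout R²:", final_model.score(X_holdout, y_holdout))
```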