xVector is a collaborative platform for building data applications. It is powered by MetaGraph, an intelligence engine that keeps track of all resources powering data applications. Businesses can connect, explore, experiment with algorithms, and drive outcomes rapidly. A single pane of glass enables data engineers, data scientists, business analysts, and users to extract value from data collaboratively.
Data Applications comprise all resources and related actions in creating value from data. The actions performed in a Data Application include connecting to various data sources, profiling the datasets for quality issues and anomalies, enriching data for further analysis, exploring the datasets to derive insights, mining patterns with advanced analytics and models, communicating the outputs to drive outcomes, and observing the applications for further enhancements and improvements.
The following sections describe each of the resources that constitute a data application. Each resource performs a specific function, enabling efficient division of labor and collaboration.
Workspace - Workspace provides a convenient way to visualize and organize the interactions across resources such as data sources, datasets, reports, and models that power a DataApp. Business analysts, data scientists, and data engineers have a single pane of glass to collaborate and version their work.
Read more: Workspace
Datasource - Enterprise data is available in files, databases, object stores, and cloud warehouses or is accessible via APIs in various applications. Datasource allows users to bring data from multiple sources for further processing. Data sources can be kept in sync on a scheduled or on-demand basis. In addition to first-party data, enterprises now have access to a large amount of third-party data to enhance the analysis.
Read more: Datasource.
Dataset - Once the data is available, users can transform and refine it using an enrichment process. Enrichment comprises functions to profile, detect anomalies, join other datasets, run a regression, classify or segment based on clustering algorithms, or manually edit numerical, text, or image data. Users can perform all actions that enable the training of models and their applications.
The delineation between a data source and a dataset allows for data traceability.
Read more: Dataset.
Model - Models enable the application of an appropriate analytical lens to the data. Supervised AI models, such as regression and classification, or unsupervised models, such as clustering, allow users to tease out patterns in tabular, image, or textual data. Time series models allow for forecasting. Users can extract entities, identify relevant topics, or understand sentiment in textual data.
Read more: Models
Reports - Reports provide a canvas for visualizing and exploring data. Users can build interactive dashboards, interrogate models, slice and dice the data, and drill down into details, allowing businesses to collaborate and refine their understanding of the underlying data.
Read more: Reports
Data Destination - Users can act on insights by automating the writing of outputs to execution systems. For example, if they identify customers who are likely to churn, they can send this data to a CRM system to deliver discounts to those customers or remediate with other actions.
Read more: Data Destination.
The xVector platform allows rapid prototyping using a draft server in the design phase. Business users and analysts can collaborate to define the data application and hand it off to the data scientists and engineers to further refine and tune for performance during the operational phase.
Users can collaborate on each resource, such as reports, datasets, and models. Users can edit, view, or comment on a resource based on their permissions. Users just need an email to start collaboration.
User groups allow for organizing and easier sharing across users.
Users can control the visibility/scope of a given resource. Making the resources public makes them visible to all users across the cluster. Each resource has a defined URL, enabling ease of sharing.
The user will need to start the draft driver in the workspace before using it. The icon to start the draft driver is on the top right of the workspace, along with other admin icons. Once the draft driver is running, users can join the session and use it. While in session, a dedicated driver for that workspace is provided, enabling users to do rapid data exploration and analysis without waiting for resource provisioning. This allows for rapid prototyping with business users. Resources created in the draft driver are marked as draft and are present only in memory. Once prototyping is done, these resources can be operationalized and materialized. Upon materialization, they persist and become available as regular resources.
DataApps are managed with the same rigor as other software applications. Versioning ensures their stability and manageability. Once a resource such as a workspace, dataset, model, or report is assigned a version, consuming applications can be guaranteed a consistent interface.
Users can experiment endlessly in draft mode. Once they like the output, they can publish the findings/resources with a version. Any further changes result in a newer version.
Versioning allows for experimentation and stability while building data applications.
All the resources used to build a DataApp need updates. Therefore, each resource has an update_policy, which can be OnDemand, OnEvent(*), OnSchedule, or Rules.
These policies allow users to configure a flexible and optimal way to reflect data updates. Resources can use rules to model dependencies across different resources; for example, the user might want to update a dataset only after all the upstream datasets are updated, with each dataset potentially having a different update frequency.
The synchronization process is triggered when a data source is updated. The system notifies all the dependent resources, which take the appropriate action based on the update_policy settings.
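As a purely illustrative sketch of how such policies might be expressed (the field names, values, and rule syntax below are hypothetical and are not the platform's actual configuration schema), a Rules-based policy that waits for upstream datasets and a scheduled policy could look like this:

```python
# Hypothetical sketch only - the keys, values, and rule syntax here are
# illustrative and are not xVector's actual update_policy schema.
dataset_update_policy = {
    "type": "Rules",  # one of OnDemand, OnEvent, OnSchedule, Rules
    "rules": [
        # Update this dataset only after every upstream dataset has finished updating,
        # even though each upstream dataset may refresh on a different frequency.
        {"when": "all_updated", "resources": ["orders_dataset", "inventory_dataset"]},
    ],
}

report_update_policy = {
    "type": "OnSchedule",
    "schedule": "0 6 * * *",  # e.g. refresh every day at 6 AM (cron-style, assumed)
}
```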
As the complexity increases due to the scale and variety of operations, manually reviewing the application for exceptions is unwieldy and potentially error-prone. Observability makes it manageable; the system detects anomalies based on rules and machine learning. Users can define alerts based on data updates, threshold rules, or anomalies.
Users can monitor datasets, models, and reports by authoring alert rules. Alert rules are of the following types:
Threshold-based - for example, if the revenue > a given value, notify the user/user group.
Update-based - if the underlying resource, such as a dataset, is updated, the user/user group subscribing to the alert rule is notified.
Anomaly-based - machine learning algorithms detect anomalies and notify the subscribers.
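To make the three rule types concrete, here is a hypothetical sketch; the structure and field names are illustrative only and do not reflect the platform's actual alert schema:

```python
# Illustrative only - not the platform's actual alert rule format.
threshold_alert = {
    "type": "threshold",
    "resource": "weekly_sales_dataset",  # assumed resource name
    "condition": "revenue > 1_000_000",  # notify when revenue exceeds a value
    "notify": ["finance-analysts"],      # subscribing user group
}

update_alert = {
    "type": "update",
    "resource": "weekly_sales_dataset",  # fires whenever the dataset is updated
    "notify": ["reporting-team"],
}

anomaly_alert = {
    "type": "anomaly",                   # ML-detected anomalies notify subscribers
    "resource": "weekly_sales_dataset",
    "notify": ["data-engineering"],
}
```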
Read more: Observability
Governance involves managing data used on the platform throughout its lifecycle, maintaining its value and integrity. It ensures the data is complete as required, secure, and compliant with the relevant regulations with an audit trail of activities. It provides accurate and timely data for informed decisions.
Read more: Governance (coming soon).
Login
One can log in to the platform by clicking on https://xui.xvectorlabs.com/
Users must enter their email and password on the login page and click the Login button. After logging in, the home page displays a list of resources available by default or shared by other users. For a first-time user, this list will be empty. To start, click the Add button at the top right corner of the page to create a workspace.
It is recommended to go through the documents in the Concepts section to understand the different resources and then build an app in the created workspace.
The App Store contains publicly available workspaces. Users can use these already-created apps to accelerate their process. Users can also publish their workspaces as Apps.
All available apps can be accessed by clicking ‘Apps’ on the home page.
Name - Name of the workspace
Description - A description of the workspace
Once in the Workspace, xVector provides a list of options on the top right of the screen.
In today's data-driven world, enterprise data is scattered across diverse landscapes - files, databases, object stores, cloud warehouses, and APIs embedded within various applications. At xVectorlabs, transforming this fragmented information into actionable insights begins with Data Sources.
Data Sources are the gateway, allowing users to connect to, import, and synchronize data from multiple origins. Whether the data resides in structured files, dynamic APIs, or sophisticated cloud storage systems, users can configure and execute a connector to bring it into xVectorlabs as a data source. A rich catalog of connectors, periodically updated by xVectorlabs, ensures compatibility with an ever-expanding array of systems. Missing a connector? Users can reach out to connectors@xvectorlabs.com, and a new one can be developed quickly to meet their needs.
Once connected, the process doesn’t stop at simply importing data. Updates from source systems are seamlessly UPSERTED, reflecting real-time changes while preserving the historical timeline of values. Bulk data import is supported with the OVERWRITE option. This meticulous synchronization ensures traceability, enabling businesses to trust the integrity and provenance of their data.
xVectorlabs simplifies the journey from raw data to actionable insights, offering users the tools to acquire, refine, and analyze data confidently. With regular updates to its connectors catalog and robust metadata management, the platform ensures that businesses can harness the full potential of their data ecosystem - turning scattered information into cohesive narratives that drive impactful decisions.
Available connectors based on type
Data sources have several settings that are useful and, in some cases, necessary. They are described below.
xVector automatically infers metadata using sampling techniques while creating a datasource. It is recommended that the metadata be reviewed carefully and any corrections made if required. Setting the metadata correctly is a crucial step.
Available settings in metadata:
The name of the column as it appears in the source datasource
The new name that is given to a column in the dataset when it is copied from the datasource
Enter any description for a column. This description will be available to other users with access to the datasource and is copied into datasets created from this data source.
xVector automatically infers the data type of the column using a sampling technique. It is recommended that the data type be reviewed for any potential errors. xVector infers int, float, string, and date-type columns automatically.
The format is applicable to data types such as datetime and currency. For datetime data, one can choose from different format options like ‘YYYY-MM-DD’, ‘DD/MM/YYYY HH:MM:SS’, etc. The data itself is not changed; this setting only affects how values are displayed.
You can choose an appropriate semantic type for the column. For example, if the column's data type is int, then semantic types such as SSN, zip code, etc., will be available. These settings are used in visualization.
Choose from the options provided. This is again used for visualization and modeling purposes. In xVector, metadata is used extensively throughout the platform.
It is recommended that the default setting be kept. If SKIP HISTOGRAM is false (the default), xVector generates a histogram for the column along with the other profile information. You can set it to True in cases where the cardinality of the column is very large and may result in a heavy computation load.
The NULLABLE setting defaults to TRUE. If the column value cannot be null, change the setting to FALSE; turning the slider off sets it to FALSE. xVector will throw a warning if a non-nullable column is found to have null values during the connection process.
Set the column as either a dimension or a measure. Again, this information is used in visualization and modeling.
Profile
Data profiling involves examining data to assess its structure, content, and quality. This process calculates various statistical values, including minimum and maximum values, the presence of missing records, and the frequency of distinct values in categorical columns. It can also identify correlations between different attributes in the data. Data profiling can be automated by setting the profile option to true or performed manually as needed.
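The platform computes these statistics for you, but for readers who want to see the equivalent logic, the following pandas sketch reproduces the same kind of profile outside the platform (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("my_datasource_sample.csv")  # placeholder file name

numeric_stats = df.describe()                      # min, max, mean, quartiles per numeric column
missing_counts = df.isna().sum()                   # number of missing records per column
distinct_counts = df.nunique()                     # cardinality of each column
top_categories = {c: df[c].value_counts().head()   # most frequent values in categorical columns
                  for c in df.select_dtypes("object")}
correlations = df.select_dtypes("number").corr()   # correlations between numeric attributes
```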
A CSV-type datasource is used when one needs to bring CSV data from a local system or machine. Follow the below process for creating this datasource-
2. Choose the CSV data source type from the available list.
3. Go through the steps:
a. Configure
b. Advanced
c. Preview Data - display a sample of data
d. Column Metadata - view the automatically inferred metadata and modify it if needed.
Scrolling to the right gives the following:
e. Table Options -> (Record Key, Partition Key, ...)
4. Click on Save
An S3 datasource is used to bring data from the AWS S3 bucket. One needs to have an AWS access key and a secret key for getting the data from the S3 bucket.
Follow the below process for creating this datasource-
2. Choose the S3 data source type from the available list.
3. Go through the steps:
a. Configure
b. Advanced
c. Preview Data - display a sample of data
d. Metadata - view the automatically inferred metadata and modify it if needed.
Scrolling to the right gives the following screenshot:
e. Table Options - Record Key, Partition Key
4. Click on Save
A status screen will come up. One can track the status of the process. On completion, it will redirect to the workspace page.
Data is the foundation of every insightful analysis, yet its raw form often lacks the structure and clarity needed for effective decision-making. Within xVectorlabs, raw data enters the system from various sources like transactional databases, APIs, logs, or third-party integrations. However, before it can be leveraged for exploratory analysis, machine learning, or operational reporting, it must be refined and structured into a more usable format.
This journey from raw data to a structured dataset involves multiple steps, ensuring accurate and insightful information. The process begins with ingestion, where data is pulled into the system (as seen in the Datasource document), followed by profiling and metadata definition, which help uncover patterns, inconsistencies, and key attributes. Following this, enrichment techniques enhance the data, making it more relevant and applicable for downstream applications.
As data flows into the system, it becomes a Dataset - a structured, enriched version ready for analysis and modeling. This transformation begins with profiling, where metadata is extracted, and each column is classified based on its data type, statistical type (categorical or numerical), and semantic type (e.g., email, URL, or phone number). This metadata provides context, shaping downstream processes like exploratory analysis or machine learning model training.
The enrichment process brings data to life. Users can leverage an extensive toolkit to:
This step bridges the gap between raw data and actionable insights, enabling users to prepare datasets that fuel the development of robust models and their real-world applications.
The clear distinction between a Data Source and a Dataset isn’t just about workflow organization - it’s about traceability and trust. By maintaining the data lineage, users can always trace back to the source, ensuring transparency in processing and confidence in decision-making.
Building and Enriching Datasets
Users can create datasets from the data sources; metadata is copied along with the data. They can apply various transformations to the data, and the resulting data is persisted with an appropriate policy, such as OVERWRITE or UPSERT. Users can define the pipelines based on their requirements. If they need to access the update timeline, UPSERT would be an appropriate update policy. Furthermore, users can set up synchronization properties that suit the use case, such as on-demand, schedule, or event.
A dataset is derived from a data source and transformed to meet the needs of the DataApp. Data sources are immutable, establishing the provenance of various computations to the source systems.
Users enrich the dataset using a series (flow) of functions (actions) such as filters and aggregates. They can also apply trained models to compute new columns. For example, they can apply a churn classifier to the latest order data to identify customers who are likely to churn. Users validate and save the logic. Once saved, they can apply the changes to the original dataset and materialize the new dataset on disk using the Materialize action. Users must be cognizant of actions that change the dataset's structure, such as joining or aggregating; if the actions change the structure, OVERWRITE is the recommended update policy.
While enrichment functions and our generative AI agent can suffice for business analysts, data engineers might prefer to transform data programmatically. Custom functions allow advanced users to author functions to alter the dataset programmatically.
Or choose Create Dataset from the menu options of a Datasource.
2. Configure
Datasets are of the following types,
3. Advanced
4. Select Workspace - select a workspace for the dataset.
Once the dataset is created, there are several options to explore and enrich it. These options can be accessed from two different places: the workspace and the dataset page. Details are below.
Click on the vertical ellipses or kebab menu of the dataset tile in the workspace. Following are the options:
View
To view the data. Takes you to the dataset screen.
Update Profile
To update the profile. This needs to be run each time a dataset is created or updated in order to view the profile.
View Profile
To view the profile of the data. This can be viewed only after the “update profile” is run at least once after creating or updating the dataset.
Create copy
Creates a copy of the data
Generate Report
Generates an AI-powered report dashboard. This can be used as a starting point by the user, and the reports can then be edited.
Generate Exploratory Report
Generates an AI-powered report to explore the dataset.
Generate Model
Generates an AI-powered model by choosing the feature columns and automatically optimizing the training parameters based on your prompt. This can be used as the starting point, and the user can update parameters/metrics to generate more experiments/runs.
Activity Logs
Shows the list of activities that occurred on that dataset
In the Workspace, click on the ellipses on the dataset tile (the tile with the green icon) to see the following options:
Map Data
This maps column metadata from source to target (used in data destinations).
Sync
Synchronizes the dataset with the data source.
Update
Updates the data as per source
Materialize
To persist the dataset in the filesystem
Publish
To publish the data. This assigns a version to it; once published, the dataset has a guaranteed interface.
Export
Writes the data to a target system (points to data destination)
Settings
Opens the settings tab for the dataset
Delete
Option to delete the dataset
Double-click the dataset tile in the workspace, or click the View option in the vertical ellipses menu of the dataset tile, to get to the dataset view page.
Below is the screenshot of the dataset view page and the description of all the features that appear on the top right of the page starting from left to right.
Presence
Shows which user is in the workspace. There could be more than one user at a given time.
GenAI
The ability for users to ask data-related questions in natural language and get an automatic response. For example, the user could ask the question, “How many unique values does the age column have?”. GenAI would respond with the number of unique values for that column.
Driver (play button)
Starts or shuts down the dml driver for the dataset
Edit Table
Ability to search or filter each of the columns in the dataset
Data Enrichment
Option to add an xflow action or view the enrichment history. A more detailed description of each of the enrichment functions is provided below.
Profile and Metadata Report
Data profiling involves examining data to assess its structure, content, and overall quality. This process calculates various statistical values, including minimum and maximum values, the presence of missing records, and the frequency of distinct values in categorical columns. It can also identify correlations between different attributes in the data. Data profiling can be automated by setting the profile option to true or performed manually as needed.
Write back
The user can write the data to a target system
Reviews and Version control
Ability to add reviewers and publish different versions of the resource
Action Logs
Shows the logs of action taken on that dataset
Alerts
Option to create, update, or subscribe to an alert. A more detailed description of alerts can be found here.
Comments
Users can add comments to collaborate with other users.
Settings
The dataset is enriched using a series (flow) of functions such as aggregates and filters. Users can also apply trained models to compute new columns. Advanced users can author custom functions (*) to manipulate the data.
Or
The following sections will describe each of the enrichment functions.
This action performs a calculation on a set of values and returns a single value, such as SUM, AVERAGE, etc.
Follow these steps to perform aggregate:
Example:
We will aggregate German Credit Data. We want to get the sum of credit and total duration grouped by different columns (risk, purpose, housing, job, sex).
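For reference, the equivalent aggregation expressed in pandas might look like the sketch below; the column names (credit_amount, duration) and file name are assumptions about the German Credit Data schema, and the platform performs the same operation through the Aggregate action rather than code.

```python
import pandas as pd

credit = pd.read_csv("german_credit.csv")  # placeholder file name; column names assumed

aggregated = (
    credit.groupby(["risk", "purpose", "housing", "job", "sex"], as_index=False)
          .agg(total_credit=("credit_amount", "sum"),
               total_duration=("duration", "sum"))
)
```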
The “author” function can be used when writing a custom function to be applied to a dataset.
Follow these steps to perform the author function:
Example:
For example, we will use bank marketing campaign data. We want to categorize the individuals into three groups - student, adult, and senior - depending on age. For this, we will author a custom function and run it on the dataset.
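A custom function for this categorization could resemble the pandas sketch below; the exact signature the platform expects from an authored function is not reproduced here, and the age boundaries and file name are assumptions.

```python
import pandas as pd

def categorize_age(df: pd.DataFrame) -> pd.DataFrame:
    """Add an age_group column: student, adult, or senior (boundary handling assumed)."""
    out = df.copy()
    out["age_group"] = pd.cut(
        out["age"],
        bins=[0, 18, 60, 200],                # [0,18) student, [18,60) adult, [60,200) senior
        labels=["student", "adult", "senior"],
        right=False,
    )
    return out

campaign = pd.read_csv("bank_marketing_campaign.csv")  # placeholder file name
campaign = categorize_age(campaign)
```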
This action is used to change the data type of a column in the dataset.
Follow these steps to perform data type change:
Example
For example, we will use autoinsurance_churn_with_demographic data. There is a column “Influencer” stored as a String, and we will update it to an integer.
This action is used to delete rows of a dataset based on some condition.
Follow these steps to delete rows -
Example
We have one dataset - medical_transcription_with_entities with column ‘predicted_value’. We will delete rows having an empty list ([]) in the predicted_value column.
This option is used to remove columns from the dataset.
Follow these steps to delete columns -
Example
We have one dataset - ‘datatype update on autoinsurance_churn with demographic’. We will delete columns - ‘has_children’ and ‘length_of_residence’ from this dataset.
This option is used to remove missing values from the dataset.
Follow these steps to remove missing values -
Example
We have one dataset - ‘datatype update on autoinsurance_churn with demographic’. We will delete rows where the ‘latitude’ and ‘longitude’ column values are not present.
The explode function converts an array/list of items into rows. This function increases the number of rows in the dataset.
Follow these steps to perform explode:
Example
We have one dataset - medical_transcription_with_entities with column ‘medical_speciality’. We will perform an explode action on medical_speciality to extract values from the list. This will increase the number of rows.
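In pandas terms, the explode action is equivalent to the sketch below; the assumption that the list column arrives as text and needs parsing (and the file name) may not hold for your data.

```python
import ast
import pandas as pd

transcripts = pd.read_csv("medical_transcription_with_entities.csv")  # placeholder file name

# If the list column was stored as text (e.g. "['Cardiology', 'Radiology']"), parse it first.
transcripts["medical_speciality"] = transcripts["medical_speciality"].apply(ast.literal_eval)

# One output row per list element; the number of rows increases accordingly.
exploded = transcripts.explode("medical_speciality")
```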
This action is used to replace null values in a dataset with a specified value.
Follow these steps to perform fillna-
Example
We have autoinsurance_churn_data joined with demographic data. This has a City column with null values. Let’s replace the null values with the string ‘Not Available’.
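The equivalent pandas logic is a one-liner; the file name below is a placeholder.

```python
import pandas as pd

churn = pd.read_csv("autoinsurance_churn_with_demographic.csv")  # placeholder file name

# Replace missing City values with a placeholder string.
churn["City"] = churn["City"].fillna("Not Available")
```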
The filter function is used to extract specific data from a dataset based on a set of criteria.
Follow these steps to perform the filter:
Example
We will use Walmart Sales data to perform filter operation. We will filter records for Store 1.
This action is used to join two datasets based on a particular column.
Follow these steps to perform the join:
Example
For example, we will be performing a join operation on the autoinsurance_churn dataset with the individuals_demographic dataset. Here, the left dataset is autoinsurance_churn, the right dataset is individuals_demographic, and the column for joining is individual_id, present in both datasets.
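Expressed in pandas, the same join looks like the sketch below; the left-join choice and file names are assumptions, since the platform lets you pick the join type in the action's options.

```python
import pandas as pd

churn = pd.read_csv("autoinsurance_churn.csv")             # left dataset (placeholder file name)
demographics = pd.read_csv("individuals_demographic.csv")  # right dataset (placeholder file name)

# Join on individual_id, which is present in both datasets; join type assumed to be left.
joined = churn.merge(demographics, on="individual_id", how="left")
```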
This action extracts data from JSON objects into structured columns.
Follow these steps to perform json_normalize:
Example
For example, we will be performing json_normalize on Clothing E-commerce Reviews data. This dataset contains JSON data in the predicted_value column. We will extract the ‘polarity’ and ‘sentiment’ keys from the JSON data to create new columns.
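A pandas sketch of the same extraction is shown below; it assumes the predicted_value column holds JSON text with polarity and sentiment keys, and the file name is a placeholder.

```python
import json
import pandas as pd

reviews = pd.read_csv("clothing_ecommerce_reviews.csv")  # placeholder file name

# Parse the JSON column, then promote the 'polarity' and 'sentiment' keys to new columns.
parsed = reviews["predicted_value"].apply(json.loads)
extracted = pd.json_normalize(parsed.tolist())[["polarity", "sentiment"]]
reviews = pd.concat([reviews, extracted], axis=1)
```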
This action is used to apply a trained model to the dataset. For this function, the trained model needs to be deployed and running. This produces a new column with values predicted by the model.
Follow these steps to perform the model apply:
Example
We will perform model_apply on Clothing e-commerce review data. A model for analyzing sentiments has already been deployed, and we will use this to predict the sentiments on the reviews data.
This action creates a new column in the dataset based on the provided expression. One can create a new column based on the values of other existing columns.
Follow these steps to add a new column:
Example
For example, we are using German_Credit_Dataset. This contains a column ‘Age’ which is numerical. We will create a new column ‘Age_Category’, where we categorize individuals as
Student (age < 18), Adult (18 < age < 60), and Senior (age > 60) using the ‘Age’ column.
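The same new-column expression can be sketched with numpy's select; how ages of exactly 18 or 60 are bucketed is an assumption, since the example leaves those boundaries open, and the file name is a placeholder.

```python
import numpy as np
import pandas as pd

credit = pd.read_csv("german_credit.csv")  # placeholder file name

# Ages of exactly 18 fall into Adult and exactly 60 into Senior here - an assumption.
conditions = [credit["Age"] < 18, credit["Age"] < 60]
credit["Age_Category"] = np.select(conditions, ["Student", "Adult"], default="Senior")
```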
A pivot function is a data transformation tool used to reorganize data in a table from rows to columns. This function requires selecting a pivot column (a categorical column), value columns (numerical columns), and an aggregate function. The pivot column's values become the new columns in the pivoted table.
Follow these steps to perform the pivot-
Example
For example, we will apply pivot action on trend chart data. Here, we want to know the total volume of products in different regions grouped by company and brand.
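As a point of comparison, pandas' pivot_table performs the same reshaping; the column names below (region, company, brand, volume) and the file name are assumptions about the trend chart data.

```python
import pandas as pd

trend = pd.read_csv("trend_chart_data.csv")  # placeholder file name; column names assumed

# Region values become columns, company/brand remain as rows, and volumes are summed.
pivoted = trend.pivot_table(
    index=["company", "brand"],
    columns="region",
    values="volume",
    aggfunc="sum",
).reset_index()
```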
The Split column function can be used to split string-type columns based on a regular expression.
Follow these steps to perform the split:
Example
For example, we will be performing a split action on autoinsurance_churn with demographic data. It contains a column ‘home market value’ which shows a range of values as a string (2000 - 3000). We will split this on ‘-’ to get the lower range and higher range values.
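A pandas sketch of the same split is below; the column name home_market_value and the file name are assumptions about how the data is stored.

```python
import pandas as pd

churn = pd.read_csv("autoinsurance_churn_with_demographic.csv")  # placeholder file name

# Split "2000 - 3000" style strings on '-' into lower and higher range values.
parts = churn["home_market_value"].str.split("-", n=1, expand=True)
churn["home_value_low"] = pd.to_numeric(parts[0].str.strip(), errors="coerce")
churn["home_value_high"] = pd.to_numeric(parts[1].str.strip(), errors="coerce")
```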
Union is used to add rows to the dataset
Follow these steps to perform union-
Example
We are applying union on sales data for store 5 and appending records from sales data for store 1. This will result in a dataset with sales data for stores 1 and 5.
Unpivoting is the process of reversing a pivot operation on data. It takes data that's been summarized into columns and spreads it back out into rows.
Follow these steps to perform unpivot:
Example
For example, we will apply the unpivot action on pivoted trend chart data.
Upsert is used to add rows that are not duplicates to the dataset. The difference between upsert and union is that union appends all the rows from the new dataset to the existing dataset. Upsert, on the other hand, will append only those rows that are not already there in the existing dataset.
Follow these steps to perform upsert-
Example
We are applying upsert on sales data for store 5 and appending records from sales data for store 1. This will result in a dataset with sales data for stores 1 and 5.
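Conceptually, upsert behaves like the pandas sketch below: existing rows win when the record key collides. The record key columns and file names are assumptions.

```python
import pandas as pd

store5 = pd.read_csv("sales_store_5.csv")  # existing dataset (placeholder file name)
store1 = pd.read_csv("sales_store_1.csv")  # new rows to upsert (placeholder file name)

record_key = ["Store", "Date"]  # assumed record key columns

# Append store 1 rows, keeping the existing store 5 row whenever the key already exists.
upserted = (
    pd.concat([store5, store1], ignore_index=True)
      .drop_duplicates(subset=record_key, keep="first")
)
```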
This action is used to perform statistical operations such as rank, row number, etc. on a dataset and returns results for each row individually. This allows a user to perform calculations on a set of rows preceding or following the current row, within a result set. This is in contrast to regular aggregate functions, which operate on entire groups of rows. The window is defined using the over clause (in SQL), which specifies how to partition and order the rows. Partitioning divides the data into sets, while ordering defines the sequence within each partition.
Follow these steps to apply the window function:
Example
We will consider Walmart's Weekly Sales data to perform the window function. Here, we want the running sum of weekly sales for each Holiday Flag value, ordered by Date, with the window spanning from the first row to the current row.
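The same running-window calculation can be sketched in pandas (or as a SQL window function); the column names follow the example and the file name is a placeholder.

```python
import pandas as pd

sales = pd.read_csv("walmart_weekly_sales.csv")  # placeholder file name

# Equivalent to SQL: SUM(Weekly_Sales) OVER (PARTITION BY Holiday_Flag ORDER BY Date
#                                            ROWS UNBOUNDED PRECEDING)
sales = sales.sort_values("Date")
sales["running_weekly_sales"] = sales.groupby("Holiday_Flag")["Weekly_Sales"].cumsum()
```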
SQL Editor: Use this to write your own queries.
The following are common options for different types of enrichment functions:
Users can set up alerts based on rules for thresholds or drifts.
Create an alert with the following steps:
Drifts are calculated in the context of models. They are calculated when the dataset is synchronized with the Data Source. The data source should have indices for synchronization.
This is explained further in the Models section.
Models allow users to find patterns in data and make predictions.
Regression models, a supervised learning technique, allow users to predict a value from data. For example, given inventory, advertising spending, and campaign data, a regression model can predict a lift in sales.
Classifiers can help identify different classes in a dataset, an example being the classification of customers who will churn based on order history and other digital footprints left by the customer.
Clustering models enable users to group/cluster based on different dataset attributes. Businesses use clustering models to understand customer behavior by finding different segments based on purchasing behavior.
Time series analysis allows businesses to forecast time-dependent variables such as sales, which helps manage finance and supply chain functions better.
Natural language processing (NLP) and large language models (LLM) can extract entities, identify relevant topics, or understand sentiment in textual data.
These algorithms are sensitive to the data: the distribution underlying the data can change over time, which might lead to performance deterioration of a specific algorithm, so we need a mechanism that allows for selecting the best algorithm.
In the world of xVectorlabs, building models begins with creating drivers. These drivers are the foundation - libraries such as Scikit-learn and XGBoost power the algorithms. Once the driver is crafted, the next step is to create a model, a framework ready to be brought to life with experimentation.
Each model becomes a hub of exploration, where experiments are authored to test various facets of the algorithm. These experiments are, in turn, populated with multiple runs, each a unique attempt to capture and compare the parameters that drive the algorithm's behavior. Imagine tweaking the settings of a regression model—adjusting its learning rate or altering its input features - and observing how these changes shape its performance.
This structured approach organizes models as a composition of drivers, experiments, and runs, creating a seamless flow for experimentation. It allows users to adapt, learn, and refine their models precisely, uncovering insights and pushing the boundaries of their algorithms' achievements.
Once the user picks a model that fits the data best, the model can be deployed to make predictions. Models in production are then continuously monitored for performance. Anomalous behaviors are quickly identified and notified for further action.
The platform provides a comprehensive set of model drivers curated based on industry best practices. Advanced users can bring their algorithms by authoring custom drivers.
A user can create multiple experiments under a model. An experiment includes one or more runs. Under each experiment, various parameters with available drivers can be tried on different datasets.
On updating any input parameters and triggering “re-train”, a new run under that experiment gets created.
Different runs under an experiment can be compared using selected performance metrics.
One can view a list of all experiments on the experiment view page that opens on viewing a model.
Experiments can create multiple runs with different input parameters and performance metrics as output. Based on the metric, one can be chosen for the final model.
This aims to enable a user to experiment with different model drivers, datasets, and parameters to achieve the expected performance metric for a model before deploying.
One can view a list of all runs on the run view page that opens on viewing an experiment.
Example
There are options at the end of each Run which are described below (icons from left to right):
View
Drifts
Performance
Runs under experiment(s) are powered by underlying libraries and algorithms defined in model drivers. For example, a statistical or machine learning library such as Scikit-learn can be used for a regression model. The platform provides a comprehensive set of model drivers for business analysts and advanced users. In addition, a data scientist can author custom drivers.
Users can author custom drivers for different model types like regression, classification, clustering, time series, etc. If the requirements do not fit into any one type of model, users can choose the ‘app’ type to author their custom driver.
Creating a new driver
Follow these steps to author a new driver:
A task will be triggered to start a Jupyter server for authoring drivers. Once resources are allocated, click on the Launch Jupyter Server button. This will open a Jupyter notebook where users can author drivers.
In the notebook, five files that are mentioned below will be present:
Users need to write the algorithm in the train.ipynb file with the provided format. Also, modify the predict.ipynb file as required. This will be used in getting predictions. To define training or input parameters, open the config.json file in edit mode and write in the format provided.
Options on notebook-
An xVector option is present in the menu bar. Use this for different actions-
Select dataset - click on it and select the dataset for the driver. It will update the config.json file with the metadata of the selected dataset.
Register - Once the driver is authored (the train.ipynb, predict.ipynb, and config.json files have been modified correctly), it must be registered to make the driver available for use in models. Before registering, make sure all five files are present with the same names as provided.
Shutdown - this is to shut down the running Jupyter server.
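The exact notebook format xVector expects is described only as "the provided format" and is not reproduced in this document. Purely as an orientation aid, the training and prediction logic inside a simple regression driver might resemble the scikit-learn sketch below; all file names, config keys, and structure are assumptions.

```python
# Illustrative sketch only - not the actual train.ipynb / predict.ipynb / config.json format.
import json

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# --- training step (what train.ipynb might conceptually do) ---
with open("config.json") as f:
    config = json.load(f)                      # training parameters and dataset metadata (assumed keys)

data = pd.read_csv(config["dataset_path"])     # assumed key
features, target = config["features"], config["target"]

model = LinearRegression(**config.get("params", {}))
model.fit(data[features], data[target])
joblib.dump(model, "model.joblib")

# --- prediction step (what predict.ipynb might conceptually do) ---
model = joblib.load("model.joblib")
new_records = pd.read_csv("new_records.csv")   # placeholder input
predictions = model.predict(new_records[features])
```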
Below are some models that can be created in xVector.
Regression is a set of techniques for uncovering relationships between features (independent variables) and a continuous outcome (dependent variable). It's a supervised learning technique, meaning the algorithm learns from labeled data where the outcome is already known. The goal is to use relationships between features to predict the outcome of new data.
Follow the below steps to create a regression model
This will start the model creation process (allocating resources, training the model, and saving output). This results in a model with experiment(s) and run(s) under it. Users can view model runs and performance metrics once the training is complete.
Example
We will train a model on Weekly Sales Data to understand the relationship between the weekly sales and other columns. This trained model can then be used to predict sales, given the values for input columns on which the model has been trained.
Click on Add and choose Models.
Choose Regression
Model Tab
Configure Tab
Features Tab
Parameters Tab
Advanced Tab
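For readers who want to relate these tabs to familiar code, the Weekly Sales regression example corresponds roughly to the scikit-learn sketch below; the feature columns and file name are assumptions, and on the platform this is all done through the tabs above rather than code.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

sales = pd.read_csv("weekly_sales.csv")  # placeholder file name; feature columns assumed

features = ["Fuel_Price", "CPI", "Holiday_Flag"]
X_train, X_test, y_train, y_test = train_test_split(
    sales[features], sales["Weekly_Sales"], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```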
Classification categorizes data into predefined categories or classes based on features or attributes. This uses labeled data for training. Classification can be used for spam filtering, image recognition, fraud detection, etc.
Follow the below steps to create a classification model
This will start the model creation process (allocating resources, training the model, and saving output). This results in a model with experiment(s) and run(s) under it. One can view model runs and performance metrics once the training is complete.
Example
For example, we will train a classifier model on the auto insurance churn dataset to find whether a given individual with details will churn.
Click on Add and choose Models
Choose Classification
Model details
Configure details
Features Details
Parameters details
Advanced details
Clustering is an unsupervised learning technique that uses unlabeled data. The goal of clustering is to identify groups (or clusters) within the data where the data points in each group are similar and dissimilar to data points in other groups. It can be used for customer segmentation in marketing, anomaly detection in fraud analysis, or image compression.
Follow the below steps to create a clustering model
This will start the model creation process (allocating resources, training the model, and saving output). This results in a model with experiment(s) and run(s) under it. One can view model runs and their performance metrics once the training is complete.
Example
For example, we will train a clustering model on the online retail store dataset to identify and understand customer segments based on purchasing behaviors to improve customer retention and maximize revenue.
Click on Add and choose Models.
Choose Clustering
Model details
Configure details
Features details
Parameters details
Advanced details
Time series analysis is a technique used to analyze data points collected over time. It's specifically designed to understand how things change over time. The core objective of time series analysis is to identify patterns within the data, such as trends (upward or downward movements), seasonality (recurring fluctuations based on time of year, day, etc.), and cycles (long-term, repeating patterns). By understanding these historical patterns, time series analysis can be used to forecast future values. This is helpful in various applications like predicting future sales, stock prices, or energy consumption.
Follow the below steps to create a time series model
This will start the model creation process (allocating resources, training the model, and saving output). This results in a model with experiment(s) and run(s) under it. One can view model runs and performance metrics once the training is complete.
Example
For example, we will train a time series model using Weekly Sales Data for Store_1. This takes the date column as input and forecasts weekly_sales.
Click on Add and choose Models.
Choose Timeseries
Model details
Configure details
Parameters details
Advanced details
Sentiment Analysis is the process of computationally identifying and classifying the emotional tone of a piece of text. It's a subfield of natural language processing (NLP) used to understand the attitude, opinion, or general feeling expressed in a text.
Follow the below steps to create a sentiment analysis model
This will start the model creation process (allocating resources, training the model [if the driver is not pre-trained], and saving output). This results in a model with experiment(s) and run(s) under it. One can view model runs and performance metrics once the training is complete.
Example
We will create a model to find reviews' sentiment in Clothing E-Commerce Reviews data. For this, we are using a pre-trained driver.
Click on Add and choose Models
Choose Classification
Model details
Configure details
Features Details
This step is skipped for sentiment analysis
Parameters Details
Advanced details
Entity Recognition is a sub-task within Natural Language Processing (NLP) that identifies and classifies essential elements in text data. It helps identify key information pieces like names, places, organizations, etc. NER automates this process by finding these entities and assigning them predefined categories.
Follow the below steps to create an entity recognition model
This will start the model creation process (allocating resources, training the model, and saving output). This results in a model with experiment(s) and run(s) under it. One can view model runs and performance metrics once the training is complete.
Example
For example, we will create a model to extract named entities from medical transcription samples.
Click on Add and choose Models.
Choose Entity Recognition
Model details
Configure details
Features Details
This step is skipped for entity recognition
Parameters Details
Advanced details
Topic modeling helps analyze extensive collections of text data to discover hidden thematic patterns. The algorithm scans the text data, looking for frequently occurring words and phrases that appear together. These word clusters suggest thematic connections.
Follow the below steps to create a topic model
This will start the model creation process (allocating resources, training the model, and saving output). This results in a model with experiment(s) and run(s) under it. One can view model runs and performance metrics once the training is complete.
Example
For example, we will use Clothing E-Commerce review data and create a model to extract topics from the reviews.
Click on Add and choose Models.
Choose Topic Modeling
Model details
Configure details
Parameter details
Advanced details
To update the details of a model, one needs to open the settings pane for the model, modify the data, and click ‘Save’. The settings pane can be opened by selecting the ‘Settings’ option available in the menu options of model cards in the workspace.
A model can be deleted by clicking on the ‘Delete’ in the menu options of model cards in the workspace.
Reports transform data into stories. They offer a dynamic canvas where one can visualize and explore information, crafting interactive dashboards that bring data to life. Businesses can collaborate effectively and uncover deeper insights by delving into models, slicing and dicing data, and drilling into details. These reports serve as a bridge for data exploration and communication, helping stakeholders align their understanding of analytical facets.
Users organize their stories in sheets. Sheets comprise various visual and interactive components to enable a rich collaborative experience. In addition, collaboration can be extended within and beyond the organization by embedding the reports in multiple applications.
Users can analyze their data with various key widgets, such as scorecards, charts, tables, and filters.
Exploring data is dynamic, and slicing and dicing are core components. By breaking data into smaller segments, users can uncover patterns, trends, and insights that would otherwise remain hidden. Whether filtering by time, geography, products, or channels, this capability enables users to focus on what truly matters. Filters further enrich this process by providing precise control.
Interactivity empowers users to engage with their data in meaningful ways. Features like Column Strips, Date Aggregator Strips, and Input Values further enhance the interactive experience by introducing flexibility and customization.
Reports become more than just data displays when they provide context-rich narratives. Features like the Link Component ensure seamless navigation between components while preserving context. Drill-down charts empower users to dig deeper into aggregate data, and right-click menus streamline the exploration of related insights, offering intuitive pathways to uncover hidden stories.
In xVectorlabs, reports aren’t just static outputs; they are evolving stories, empowering users to craft compelling narratives, uncover hidden insights, and foster collaboration that drives impactful decisions.
Reports are used for data exploration and creating visualizations.
Reports help get a better understanding of the data and build dashboards to gain insights into business goals and operations.
Users organize their stories in sheets. Sheets comprise various visual and interactive components to enable a rich collaborative experience. Users can create multiple sheets in a report; each sheet can consist of components like line charts, pie charts, data filters, etc. Users can rename and rearrange created sheets.
Drivers are needed to view and interact with live data on report components. All reports use a default driver unless configured to use a dedicated driver.
The layout allows users to choose canvas sizes from available options, or they can also create custom ones. One can set the width and height of the canvas. The layout also provides an option for enabling snapping that helps arrange the components in a sheet.
Themes make it easy to configure several settings for the report at once. One can choose from a list of themes or edit them according to their requirements.
From a workspace, click on Add and choose Reports
Click on the +Component button and choose the component type from the available options.
Available components:
Each component below will use terms like measures and dimensions, which are explained in the Datasource document.
A scorecard is a visual summary of key performance indicators (KPIs) that helps stakeholders quickly assess performance against goals. It provides a way to monitor progress over time, identify trends, and identify areas where adjustments may be needed.
Typically, scorecards have primary and secondary metrics that are tracked, which are described below:
Primary Metrics: The most important metric(s) that directly measure success or failure for a given objective.
Secondary Metrics (Supporting Indicators): Additional metrics that provide context or explain trends in the primary metric. They help understand why performance is changing.
Follow these steps to create a scorecard -
Example
We will create a scorecard on Weekly Sales Data. We want to check the total sales for a given period and compare them against the previous period.
Created Scorecard
Steps to create a scorecard:
Click on Add Component (➕)
Choose Scorecard
Enter Details in the data and format tab
Data tab
Format tab
Format tab configurations
Charts visually represent the data, which helps in understanding trends, patterns, and relationships. They are like mini-reports that use bars, lines, and pies to convey information visually.
Types of Charts
A line chart is used to display trends between continuous numeric values.
Follow these steps to create a line chart -
Example
We are using weekly sales data to create a line chart. This is to analyze changes in weekly sales with changes in fuel prices.
Created Line chart
Click on Add Component (➕) and choose chart.
Data Tab
Format Tab
This is used to represent data by rectangular bars. The length or height of each bar is proportional to a value associated with the category it represents.
Follow these steps to create a bar chart -
Example
We are using weekly sales data to create a bar chart. This is to visualize total sales by different stores.
Created bar chart
Data Tab
Format Tab
Combo charts combine line and bar chart types into a single view. This allows you to display different aspects of your data simultaneously, helping you identify trends, relationships, and insights that might be missed in separate charts.
Follow these steps to create a combo chart -
Example
For example, we are using bank_marketing_campaign_data. We want to visualize how the average balance and the total call duration vary with age.
Created combo chart
Data Tab
Format Tab
Time series charts are a type of line chart designed to visualize trends and patterns over time.
Follow these steps to create a time series chart -
Example
We select Weekly Sales Data to create a time series chart. Here, we want to visualize change in average weekly sales over a given period and compare this with the previous period's sales
Created time series chart
Data Tab
Format Tab
An area chart uses both lines and shaded areas to depict how data points change over time or another numeric variable. Area charts are well-suited for showcasing trends and visualizing the accumulation of values over time.
Follow these steps to create an area chart -
Example
We are using bank marketing campaign data to create an area chart. This is to analyze the total yearly balance distribution by age.
Created an area chart
Data Tab
Format Tab
A bubble chart is used to represent three dimensions of data using circles (bubbles). Two variables determine the bubble's position on the x and y axes, and the third variable specifies the size of the bubble.
Follow these steps to create a bubble chart -
Example
For example, we use the ‘German Credit data categorized by age’ dataset. Here, we will see the duration of the call, which will depend on the purpose and credit amount.
Created Bubble Chart
Data Tab
Format Tab
A pie chart is a circular chart representing portions of a whole. Pie charts work best for categorical data, where the data points fall into distinct groups or slices.
Follow these steps to create a pie chart -
Example
We will create a pie chart using bank marketing campaign data. This is to visualize the number of term subscriptions by job types.
Created pie chart
Data Tab
Format Tab
This presents information in a grid format with rows and columns, making it easy to analyze multiple data points for various categories.
Follow these steps to create a table -
Example
We will create a table from Online Retail Data. This is developed to analyze the average quantity of products and total customers present in different countries.
Created Table
Click on Add Component (➕) and
Choose Table
Data Tab
Format Tab
It is used to summarize and organize the data in a way that is easier to understand. One can define which data goes into rows and which goes into columns in the pivot table. For example, one could see total sales by product (products in rows) or by region (regions in rows).
Follow these steps to create a pivot table -
Example
We will create a table from Online Retail Data. This is designed to analyze how the total price of products has changed over time (InvoiceDate) for different countries.
Created Pivot Table
Click on Add Component (➕) and
Choose Pivot Table
Data Tab
Format Tab
A funnel chart is divided into horizontal sections, with the widest section at the top and the narrowest at the bottom. Each section represents a stage in a process, like steps in a sales funnel or even the application process for a job.
Follow these steps to create a funnel chart-
Example
Using “Online Retail Data”, we want to see how the total quantity of different products is distributed across various countries.
Created funnel
Click on Add Component (➕)
Choose Funnel
Data Tab
Format Tab
A treemap displays hierarchical data using nested rectangles. It is handy for showing part-to-whole relationships and identifying how different categories contribute to a larger whole.
Follow these steps to create a tree map:
Example
For example, we will use the ‘auto insurance churn’ dataset to create a treemap to understand the total income in a hierarchical order of state, county, and city.
Created treemap
Click on Add Component (➕)
Choose Treemap
Data Tab
Format Tab
Sankey depicts flows between different stages or categories. It uses arrows to represent these flows, with the arrow's width corresponding to the flow's magnitude.
Follow these steps to create a Sankey diagram-
Example
Using auto insurance churn data, we created a Sankey diagram to visualize the income flow to different counties based on marital status.
Created Sankey
Click on Add Component (➕)
Choose Sankey
Data Tab
Format Tab
It is used to represent hierarchical data in a circular structure.
Follow these steps to create a sunburst-
Example
We will use the ‘auto insurance churn’ dataset. We want to view how curr_ann_amt is distributed among different counties with churn and marital status.
Created Sunburst
Click on Add Component (➕)
Choose Sunburst
Data Tab
Format Tab
This helps in adding an image to the report sheets.
Follow these steps to create an image-
Example
Created image
It is used to add some text fields in the report sheets.
Follow these steps to add a text component-
Example
Created text
Configuration Tab
Link provides an option to open another sheet or report.
Follow these steps to add a link component-
Example
For example, we create a link component in the report for Clothing E-commerce reviews with sentiment data. The link component shows the model used for extracting topics from the reviews data.
Created Link
Click on Add Component (➕)
Choose Link
Data Tab
Format Tab
This is used to organize different report components into tabs.
Follow these steps to create a tab component-
Example
We will use Weekly Sales Data to create a tab component. We have already made three components - one line chart and two bar charts. Then, make a tab component with three tabs. Drag the already-created charts into the tabs one by one.
Created Tabs
Data Tab
Format Tab
We will now look at a few model visualizations that can be created in xVectorLabs:
Regression:
We will create "Sensitivity Analysis" and "What If Scenarios" within the Regression model.
Also, a business case exploring the Regression model can be found here.
Time Series:
A business case exploring the Time Series model can be found here.
A trained model takes multiple feature values (column values) and produces an output.
Sensitivity analysis enables users to analyze the output of a model by varying values of feature(s).
Users can vary one feature/column over a range and get a list of predicted values, or vary two features/columns over a range and get a table of predicted values.
A model needs to be deployed first to use sensitivity analysis. This model will be used to get the predictions.
Follow these steps to create a sensitivity analysis component
Example
We will use the Sales Prediction model for sensitivity analysis. For this, we vary two numerical features (or columns), Fuel Price and CPI, over 80% to 140% to analyze how the sales are impacted. The selected 80% to 140% will vary by steps of 20% when the selected step is 0.2. This shows how sensitive sales are to fuel price or CPI, which are the input features.
One thing to note is that the model needs to be deployed to create the report.
Created sensitivity analysis component
Click on Add Component (➕)
Choose Model Components -> Sensitivity analysis
Data Tab
Format Tab
A trained model takes multiple feature values (column values) and produces an output. A user can provide a specific value to each feature/column and get the model output using the What-If Scenario.
A model needs to be deployed first to use the What-If Scenario. This model will be used to get the predictions.
Follow these steps to create a What-If Scenario component
Example
In this example, we predict sales price when we change the value of a particular feature like fuel price, holidays, or stores.
The SHAP image shows the importance of the features.
Created What If Scenario component
Click on Add Component (➕)
Choose Model Components -> What If Scenario
Data Tab
Format Tab
This will create a rectangle shape on the sheet.
Follow these steps to create a rectangle shape
Created Rectangle
This will create a circle shape on the sheet.
Follow these steps to create a circle shape
Created Circle
Applies date ranges by specific dates, months, or years to selected report components.
Follow these steps to create a date range filter-
Example
We are creating a date range filter on a line chart component created using Weekly Sales Data. The line chart displays the change in Weekly Sales over Fuel Price. We will apply a date range filter to limit the output to a fixed period.
Created Filter
Data Tab
Format Tab
Filters data by various entities such as products, channels, and geography. Applying entity filters on report components that use entity-type datasets is recommended.
Entity filters can be applied in view mode.
Follow these steps to create an entity filter:
Example
Using the auto insurance churn dataset, we have created a table with columns - state, county, and city (an entity with hierarchy) and selected income and curr_ann_amount as measures. We now want to apply filters to the entities, which are state, county, and city. To apply the entity filter, we created one filter and mapped the entity column to the table component.
Created Entity
Click on Add Component (➕)
Choose Filters and then Entity
Data Tab
Format Tab
A date aggregator helps users apply date aggregate functions such as day/week/month/quarter for time-based datasets and view the reports with the appropriate level of granularity.
Follow these steps to create a date aggregator-
Example
Using the auto insurance churn dataset, we have created a table with the columns state, county, city, and cust_orig_date, and selected income and curr_ann_amount as measures. We now want to apply date aggregation by Month, Quarter, Year, or Year Month.
Created Date aggregator
Click on Add Component (➕)
Choose Filters and then Date Aggregator
Data tab
Format Tab
A filter collection lets users view and manage all applied filter values on the current report sheet.
Follow these steps to create a filter collection-
Example
Using the auto insurance churn dataset, we have created a table with the columns state, county, city, and cust_orig_date, and selected income and curr_ann_amount as measures. We have applied entity, date aggregator, and input slider filters to this table. Creating a filter collection displays all the filter values applied to the table component.
Created filter collection
Format Tab
The ‘Duplicate in sheets’ option carries filters over from one sheet to another. Select the target sheets under the Data tab in the ‘Duplicate in sheets’ section; the filters will then be applied to all selected sheets.
Data tab of filters
Dimension lets you break down the chart values (lines/bars) by each category value present in the selected dimension column. It is a way to further group the data by a category of choice.
Example
For the bank_marketing_campaign dataset, to find the total number of individuals with term deposit subscriptions by job type, create a bar chart with job on the x-axis and y (the flag for term deposit subscription) on the y-axis. To further see how individuals in each job are distributed by marital status, add ‘marital status’ to the dimension field to visualize the required distribution.
Chart without dimension:
After adding dimension
A variable holds a value populated from a dataset or from user input. Variables can be used in custom and advanced expressions. A variable is either predefined, tied to an input slider's value, derived from a dataset using an expression, or user-defined. Variables are scoped to a report and can be used in any sheet of the report in which they were created.
Follow these steps to create a variable-
To use the created variable, add a custom-type measure to the required chart, enter a name for the measure, and write an expression using that variable. Variable values are accessed in expressions using @<variable_name>.
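For instance, assuming the max_sales variable created in the example below, a hypothetical custom measure expression such as-
sum(Weekly_Sales) * 100 / @max_sales
would plot each group's total weekly sales as a percentage of the stored maximum.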
Example-
Here, we will create a variable named ‘max_sales’, which will store the maximum value of Weekly_Sales in Weekly Sales Data.
Creating a variable-
From a report, click on the add variable icon.
Enter details for the variable.
Writing expression for computing variable-
Syntax: COMPUTE(column=column_name, filters=[], order='asc', limit=int)
column_name: a column from the selected dataset. Aggregate functions can also be applied to columns, for example max(column_name), sum(column_name), and avg(column_name). Supported functions: max(), sum(), avg(), min(), count().
filters: a list of filters. Each filter can be a simple expression such as between(), isin(), notin(), or a similarly supported conditional. Filters can be grouped with brackets. Currently, the AND operator is written as "&" and the OR operator as "|".
Example-
COMPUTE(column=colName, filters=[colABC<20, ( between(col123, 20, 50) & isin(colXYZ, options) )])
Aggregate functions take a single parameter, either a column or a case statement.
Example - avg(case(col_name > 20, col_name, None))
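As an illustration, the max_sales variable used later in this section could plausibly be defined with an expression along these lines (a sketch only; the order and limit values are placeholders):
COMPUTE(column=max(Weekly_Sales), filters=[], order='asc', limit=1)
A hypothetical conditional aggregate combining a case statement with a filter might look like this (the 20000 threshold, Store column, and options list are illustrative):
COMPUTE(column=avg(case(Weekly_Sales > 20000, Weekly_Sales, None)), filters=[isin(Store, options)], order='asc', limit=1)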
Using the created variable-
A bar chart displaying the weekly sales of each store is created. Now, we will add one more measure on the y-axis that displays the difference between the maximum weekly sales and the sum of weekly sales for that particular store. For this, we will use a custom measure with the expression-
@max_sales-sum(Weekly_Sales).
Here @max_sales is the created variable and Weekly_Sales is a column in the dataset.
In view mode:
Expressions allow complex formulas to be authored and reused. Formulas can draw on a rich set of functions, adding more dynamism to reports.
Expressions are defined and used much like variables; the difference is that expressions are evaluated only when used. Existing variables can be used in other expressions.
Follow these steps to create an expression-
Example
For example, we will create a line chart with expressions using Weekly Sales Data. Suppose we want to display the date on the x-axis and normalized sales data on the y-axis. Let’s assume we normalize the weekly sales value using the operation below-
( (sum(Weekly_Sales) - min(Weekly_Sales)) / (max(Weekly_Sales) - min(Weekly_Sales)) ) * 100
For this, we will define two variables, max_sales and min_sales, to store the maximum and minimum weekly sales values. Then, using these variables, we will create an expression for the scaling factor part-
100 / (@max_sales - @min_sales).
Created Expression
Now, we can define a custom column using this expression to get the normalized sales data.
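For reference, the fully expanded custom measure for the y-axis, written directly with the two variables, would look like this (an illustrative sketch; in practice the saved scaling-factor expression is reused rather than repeated)-
( sum(Weekly_Sales) - @min_sales ) * ( 100 / (@max_sales - @min_sales) )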
Created line chart
Data Tab
Context Strings provide the mechanism to use values from other sibling components. This applies only to the Scorecard component.
This can be used to create a scorecard with values from other created scorecards.
Example
We created two scorecards using the ‘auto insurance churn’ dataset: one displays the total estimated income of the individuals (income_sc), and the other shows the total amount the customers paid (amount_sc). Now, we will create a third scorecard that shows the difference between the total income and the total amount paid, using a context string that takes values from the other two scorecards.
Context string -> income_sc.primary - amount_sc.primary
Data Tab
Breakdown charts visualize how a larger dataset or distribution can be segmented into smaller, more manageable charts. They are used to analyze the distribution across different categories.
Example
Using bank_marketing_campaign_data, suppose we want to know the number of individuals subscribing to a term deposit by job type, with a separate plot for each education level. We can choose the ‘education’ field under the breakdown dimension field, which creates one plot per education level. The plots can be further formatted using the Format tab.
Format Tab
The drill-down feature allows users to explore the data in greater detail directly from the report component itself. Starting from a high-level overview, users can dive deeper into specific areas of interest. Charts with drill-down enabled allow deep dives into various aggregates.
Available drill-down output options-
To use the ‘Show as table’ option, one needs to define the table columns first in the data tab of the report component.
Example
We have created a bar chart using bank_marketing_campaign data to get the number of term deposit subscribers by different job types.
Now, we want to analyze one of the job categories by marital status. We can do this by right-clicking on the respective bar (blue-collar), choosing the drill-down bar chart, and then selecting the ‘marital’ column.
Output
Users can drill down the generated chart further into multiple levels as required. For example, in the above case, we drilled down from blue-collar (job) to marital status and can continue to the education and then the loan level.
Example:
Show as Table Option
First, define the columns in the data tab. Here, we select marital and education.
Drill down as table output.
One can view all options for the component by clicking on the menu options ( ፧ ) present at the top right corner of each component.