Rapidly build an end-to-end data solution in xVector.
Introduction
We will showcase how to build an end-to-end data analytics solution in xVector. We will leverage the xVector methodology, an iterative and collaborative process, to implement this solution.
We are using a readily available dataset from Kaggle for this blog. The finished solution will be available to users of xVector: when you log in, you can see the collection, copy it, and play around with the datasources, datasets, data enrichment steps, reports, and models. Explore the solution to see the power of the xVector platform.
Business Overview
This example uses data from a Kaggle competition. You can download the data from this location.
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. 'Class' is the response variable, taking value 1 in case of fraud and 0 otherwise.
Problem statement
Build an end-to-end solution covering data enrichment, reports, and machine learning models, ultimately deploying the models, reports, and dashboards for use in enterprise applications.
We will see how the xVector platform enables the data science process typically followed in the industry.
Implementation
We are following the xVector methodology, which emphasizes collaboration and agility, to implement this solution.
Step 1:
Connect to Datasources
First, create a Collection named 'Credit_Card_Fraud_Project'; we will save all the resources in this collection
The data is available in CSV format, so use the CSV connector provided by xVector to connect to it
Note: Depending on the amount of data, the upload and profile processes may take time. We recommend using non-file-based connectors if the data is very large (upwards of 10 million records)
See the demo below on how to connect to the data.
Step 2:
Validate Data Connection
Make sure the data connections are successful and check row counts
Verify sample data in the datasource to confirm that the data upload was correct
If text is present, cross-check that all special characters and formatting were loaded properly
Make sure all the required connectors are set up and their life cycle update rules are configured properly
You can set data refresh rates appropriately
After verification, the data connection is successful.
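Outside xVector, the spirit of these checks can be expressed in a few lines of pandas. This is a minimal sketch, assuming the Kaggle file was downloaded as creditcard.csv:

import pandas as pd

df = pd.read_csv("creditcard.csv")

# Row count should match the source: 284,807 transactions.
assert len(df) == 284_807

# Eyeball a few rows to confirm the file parsed correctly.
print(df.head())

# All columns in this dataset are numeric; any 'object' dtype would
# hint at delimiter or encoding problems.
print(df.dtypes.value_counts())

# This dataset has no missing values; a non-zero count here would
# flag an upload issue.
print(df.isna().sum().sum())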
Step 3:
Ascertain Data Quality
Go through the profile data provided by xVector for all the datasources
It is recommended to go through the profile data of datasets instead of datasources, because datasets allow you to drill down further, especially into outliers.
So make a copy of the datasource to create a dataset.
Review profile information for each column of each datasource
Come up with any data enrichment requirements from the profile information
Understand the outliers
Going forward, xVector will provide more detailed data quality reports that can be used to evaluate data quality; watch out for a notification from xVector regarding this
Understand the data distribution for the columns
If multiple life cycle updates (data refreshes) have happened, monitor the data quality changes with each refresh
Observations:
This data appears rather clean, likely because it comes from a competition. Real-world data will have several issues, and you will have to spend quality time understanding data quality and coming up with data enrichment requirements.
The Class column distribution is very skewed: 99.83% of rows have the value 0 (not fraud) and only 0.17% have the value 1 (fraud), i.e. just 492 records are fraudulent. This is highly imbalanced data from a machine learning perspective.
The Amount column has several outliers: the mean value is 88.35 and the upper quartile (Q3) is 77.17, yet the maximum value is 25,691.16. Let us look at distribution plots for some of these measures using data visualization.
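The same observations can be reproduced outside xVector with a short pandas profile. A minimal sketch against the Kaggle file (the numbers quoted above come from xVector's profile screen):

import pandas as pd

df = pd.read_csv("creditcard.csv")

# Class imbalance: ~99.83% non-fraud (0) vs ~0.17% fraud (1).
print(df["Class"].value_counts(normalize=True))

# Amount is heavily right-skewed: mean ~88.35, Q3 ~77.17, max ~25,691.16.
print(df["Amount"].describe())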
Let us create a simple data understanding report, see the below demo
Step 4:
Data Enrichment
Create required datasets from the datasources
Create required aggregate datasets using 'Aggregate' functions, or
Create running totals and partitioned statistics using 'Window' functions (see the pandas sketch after this list)
Create as many datasets as required
Entity datasets to be used as filters in dashboards
Aggregate datasets to be used in reports for fast response times
Time series based datasets to be used in trend analysis
Datasets for training machine learning models
Use as many data enrichment steps as required
Document each data enrichment step using the 'notes' feature
Re-run 'Profile' from the settings screen of a dataset to re-profile it after data enrichment
Changes to data distribution?
Changes to profile information?
Note that there is no limit on the number of datasets you can create; just make sure you name them appropriately so that you can refer to them later on
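To make the 'Aggregate' and 'Window' ideas above concrete, here is a small pandas sketch of equivalent operations on this dataset. This is an illustration only; in xVector these steps are configured through the UI rather than written as code:

import pandas as pd

df = pd.read_csv("creditcard.csv")

# 'Aggregate'-style dataset: summary statistics of Amount per class.
agg = df.groupby("Class")["Amount"].agg(["count", "mean", "max"])
print(agg)

# 'Window'-style dataset: running total of Amount ordered by Time,
# partitioned by class (analogous to SQL's SUM(...) OVER (PARTITION BY ...)).
df = df.sort_values("Time")
df["running_amount"] = df.groupby("Class")["Amount"].cumsum()
print(df[["Time", "Class", "Amount", "running_amount"]].head())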
Implementation of Data Enrichment Steps:
Since the data is already clean, we will not attempt any data enrichment
We will handle the imbalance problem through the parameters we pass to the machine learning drivers we are going to use
Just for demonstration purposes, we will scale the Amount column using the 'Scale' data enrichment function in xVector
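Conceptually, the 'Scale' step standardizes a column. A rough Python equivalent is shown below; it assumes a standard (z-score) scaling, while xVector's 'Scale' function may offer other variants:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard.csv")

# Standardize Amount to zero mean and unit variance. Tree-based models are
# largely insensitive to scaling, but it helps scale-sensitive models and
# makes columns comparable in plots.
scaler = StandardScaler()
df["Amount_scaled"] = scaler.fit_transform(df[["Amount"]])
print(df["Amount_scaled"].describe())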
Note: After adding data enrichment steps to datasets, you need to 'materialize' the datasets to make them available to reports. We plan to automate this step soon. Materializing a dataset is simple: click on Settings of a dataset > click the 'Materialize' button. See the screen below.
See the demo below on how to do this data enrichment step.
Step 5:
Data Visualization
Use data visualization to understand the data
Use pivot tables, charts, and sankey diagrams (a plain-Python charting sketch follows this list)
Use entity filters and expression filters
Use all the formatting settings to make slick dashboards
If required, re-visit steps 3 and 4 based on the data understanding
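For reference, the heavy tail of the Amount column noted in step 3 can be visualized with a few lines of matplotlib; a hedged stand-in for the charts built inside xVector:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("creditcard.csv")

# A log-scaled histogram makes the long right tail of Amount visible
# (mean ~88, max ~25,691).
plt.hist(df["Amount"], bins=100, log=True)
plt.xlabel("Transaction amount")
plt.ylabel("Count (log scale)")
plt.title("Distribution of transaction amounts")
plt.show()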
Implementation:
More charts are added to the data understanding report created earlier
This solution will be available to xVector users as a reference when they sign up
Step 6:
Machine Learning Models
Create advanced machine learning models based on your business requirements
Explore and experiment with multiple drivers such as XGBoost, CatBoost, and TensorFlow
Experiment with multiple features (datasets) and parameters
Compare performance across drivers, parameters, and features
If required re-visit step 4 to enrich and create new features
Use the trained models as a data enrichment step in step 4 and build new visualizations in step 5 using the model outputs
Implementation:
Let us build a simple machine learning model to identify fraud transactions
See the demo below on how to build such a machine learning model, followed by a sketch of an equivalent model in plain Python.
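For readers who want a feel for what such a model looks like outside the platform, here is a hedged sketch using XGBoost, one of the drivers mentioned above. The scale_pos_weight parameter is a common way to handle the class imbalance flagged in step 3; this is an illustration, not xVector's internal implementation:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from xgboost import XGBClassifier

df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"])
y = df["Class"]

# A stratified split preserves the ~0.17% fraud rate in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Weight positives by the negative-to-positive ratio to counter the imbalance.
spw = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    scale_pos_weight=spw,
    eval_metric="aucpr",  # area under the precision-recall curve suits imbalance
)
model.fit(X_train, y_train)

# With ~99.8% negatives, accuracy is misleading; precision-recall metrics
# are far more informative.
probs = model.predict_proba(X_test)[:, 1]
print("Average precision:", average_precision_score(y_test, probs))

# Top features by importance, comparable to the feature importance
# artifacts shown in xVector.
top = sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1])[:5]
print(top)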
Once model training is done, the estimator status will change from 'Training' to 'Trained', and a new notification will appear in the notifications panel
After the model is created, see the demo below to view the metrics, feature importance, and other artifacts
Step 7:
Share With Teams
Finally, share all the resources, such as datasources, datasets, reports, and models, across your organization
Collaborate with your teams to build, explore, and enhance the solution
Step 8:
Deploy Models, Visualizations
To deploy a model, simply 'select' the estimator that you found to be the best; xVector automatically creates a REST API for that estimator
See the demo below to deploy an estimator as a REST API
A sample REST API URL looks like this: http://xmodel.xvectorlabs.com/xmodel/f4b3641e60484719b247143f3e7f8feb/predict
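Calling the deployed endpoint from an application might look like the sketch below. The payload shown is hypothetical: the actual request schema, feature list, and authentication are not covered here, so confirm them with xVector Support:

import requests

url = ("http://xmodel.xvectorlabs.com/xmodel/"
       "f4b3641e60484719b247143f3e7f8feb/predict")

# Hypothetical payload: one transaction's feature values. The real schema
# (field names, batching, auth headers) may differ.
payload = {"Time": 406, "V1": -2.31, "V2": 1.95, "Amount": 0.0}

response = requests.post(url, json=payload, timeout=10)
response.raise_for_status()
print(response.json())  # e.g. a fraud probability or predicted class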
Get in touch with xVector Support for more details.
To embed a dashboard or a report in your own enterprise application, just follow the demo below