Rapidly build an end-to-end data solution in xVector.
Introduction

We will showcase how to build an end-to-end data analytics solution in xVector, following the xVector methodology, an iterative and collaborative process.

We are using a readily available dataset from Kaggle for this blog. The finished solution is available to all xVector users: you can see the collection when you log in to xVector, copy it, and play around with the datasources, datasets, data enrichment steps, reports, and models. Explore the solution to see the power of the xVector platform.

Business Overview

This example uses data from a Kaggle competition. You can download the data from this location:

https://www.kaggle.com/mlg-ulb/creditcardfraud

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced; the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable, and it takes value 1 in case of fraud and 0 otherwise.
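To get a feel for these numbers before loading anything into xVector, you can sanity-check the file locally. A minimal sketch using pandas, assuming the Kaggle CSV has been downloaded as creditcard.csv (the file name and path are assumptions):

```python
import pandas as pd

# Load the Kaggle credit card fraud CSV (assumed local path).
df = pd.read_csv("creditcard.csv")

# Expect 284,807 rows and 31 columns (Time, V1..V28, Amount, Class).
print(df.shape)

# Expect 492 frauds (Class == 1), roughly 0.172% of all transactions.
fraud_count = int(df["Class"].sum())
print(fraud_count, f"{100 * fraud_count / len(df):.3f}%")
```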

Interested in a demo of xVector?

If you are interested in seeing how xVector can help your enterprise do rapid data exploration and advanced analytics with ease, then request a demo.

Request a demo
Problem statement

Build an end-to-end solution covering data enrichment, reports, and machine learning models, ultimately deploying the models, reports, and dashboards for use in enterprise applications.

We will see how the xVector platform enables the data science process typically followed in the industry.

[Image: the data science process]
Implementation

We are following xVector methodology, which emphasizes collaboration and agility, to implement this solution.

Step 1:

Connect to Datasources

  • First, create a Collection named 'Credit_Card_Fraud_Project'; we will save all the resources in this collection
[Screenshot: the 'Credit_Card_Fraud_Project' collection screen]

  • The data is available in CSV format; use the CSV connector provided by xVector to connect to it
[Screenshot: the datasource profile status screen]

Note: Depending on the amount of data, the upload and profile processes may take time. We recommend using non-file-based connectors if the data is very large (upwards of 10 million records).

See the demo below on how to connect to the data.

Step 2:

Validate Data Connection

  • Make sure the data connections are successful and check row counts
  • Verify sample data in the datasource to check that the data upload was correct
    • If text is present, cross-check that all special characters and formatting were loaded properly
  • Make sure all the required connectors are set up and their life cycle update rules are configured properly
    • You can set data refresh rates appropriately

After verification, the data connection appears successful.
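If you want an independent cross-check of the upload, the same counts and samples can be inspected straight from the raw file. A minimal sketch with pandas (creditcard.csv is an assumed local copy of the Kaggle file):

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")

# Row count to compare against the datasource's reported count in xVector.
print(len(df))

# Sample rows and column types help spot parsing or encoding problems.
print(df.head())
print(df.dtypes)

# This dataset is numeric-only, so a missing-value check is quick.
print(df.isna().sum().sum())
```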

Step 3:

Ascertain Data Quality

  • Go through the profile data provided by xVector for all the datasources

We recommend going through the profile data of datasets instead of datasources, because you can drill down further in datasets, especially for outliers.

So copy the datasource to create a dataset.

  • Review profile information for each column of each datasource
  • Come up with any data enrichment requirements from the profile information
  • Understand the outliers
  • Going forward, xVector will provide more detailed data quality reports that can be used to evaluate data quality; watch for a notification from xVector regarding this
  • Understand the data distribution for the columns
  • If multiple life cycle updates (data refreshes) have happened, monitor how data quality changes with each refresh

Observations:

  • This data appears rather clean, likely because it comes from a competition. Real-world data will have several issues, and you will have to spend quality time understanding data quality and coming up with data enrichment requirements.
  • The Class column distribution is very skewed: about 99.83% of rows have the value 0 (not fraud), and only 0.17% have the value 1 (fraud), i.e. just 492 records are fraudulent. This is highly imbalanced data from a machine learning perspective.
  • The Amount column has several outliers: the mean value is 88.35, but the maximum is 25,691.16, while the upper quartile (Q3) is only 77.17. Let us see a distribution plot for some of these measures using data visualization.
  • Let us create a simple data understanding report; see the demo below.
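These observations can also be reproduced with a few lines of pandas; a minimal sketch (again assuming a local creditcard.csv), with the statistics quoted above as the expected output:

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")

# Class distribution: expect roughly 99.83% zeros and 0.17% ones.
print(df["Class"].value_counts(normalize=True) * 100)

# Amount statistics: mean ~88.35, Q3 ~77.17, max ~25691.16,
# which is why the column looks so outlier-heavy.
print(df["Amount"].describe())

# A quick histogram of Amount; the log scale makes the skew visible.
ax = df["Amount"].plot.hist(bins=100, logy=True, title="Amount distribution")
ax.figure.savefig("amount_distribution.png")
```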

Step 4:

Data Enrichment

  • Create required datasets from the datasources
  • Create required aggregate datasets using 'Aggregate' functions
  • Create running totals and partitioned statistics using 'Window' functions (a pandas analogy of both follows after this list)
  • Create as many datasets as required
    • Entity datasets to be used as filters in dashboards
    • Aggregate datasets to be used in reports for fast response times
    • Time series based datasets to be used in trend analysis
    • Datasets for training machine learning models
  • Use as many data enrichment steps as required
  • Document each data enrichment step using the 'notes' feature
  • Re-run 'Profile' from the settings screen of a dataset to re-profile it after data enrichment
    • Changes to data distribution?
    • Changes to profile information?
  • Note that there is no limit on the number of datasets you can create; just make sure you name them appropriately so that you can refer to them later on
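As a mental model for the 'Aggregate' and 'Window' dataset types above, here is a rough pandas analogy. The platform configures these functions in the UI, so this sketch only illustrates the shape of the resulting datasets; the specific groupings are illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")

# 'Aggregate'-style dataset: per-class summary for fast report response times.
agg = df.groupby("Class")["Amount"].agg(["count", "mean", "max"]).reset_index()
print(agg)

# 'Window'-style dataset: running total of Amount partitioned by Class and
# ordered by Time, analogous to SUM(...) OVER (PARTITION BY ... ORDER BY ...).
df = df.sort_values("Time")
df["running_amount"] = df.groupby("Class")["Amount"].cumsum()
print(df[["Time", "Class", "Amount", "running_amount"]].head())
```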

Implementation of Data Enrichment Steps:

  • Since the data is already clean, we will not do any substantial data enrichment
  • We will handle the imbalance problem using the parameters we pass to the machine learning drivers we are going to use
  • Just for demonstration purposes, we will scale the Amount column using the 'Scale' data enrichment function in xVector (an equivalent standalone sketch follows below)
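For intuition, the 'Scale' step corresponds to a standard feature-scaling transform. A sketch with scikit-learn; whether xVector's 'Scale' function uses standardization, min-max scaling, or something else is an assumption here, so treat this as illustrative only:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard.csv")

# Standardize Amount to zero mean and unit variance, analogous to applying
# the 'Scale' enrichment step to the dataset in xVector.
scaler = StandardScaler()
df["Amount_scaled"] = scaler.fit_transform(df[["Amount"]])

print(df[["Amount", "Amount_scaled"]].describe())
```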

Note: After adding data enrichment steps to datasets, you need to 'materialize' the datasets to make them available to reports. We plan to automate this step soon. Materializing a dataset is simple: open the dataset's Settings and click the 'Materialize' button. See the screen below.

See the demo below on how to do this data enrichment step.

Step 5:

Data Visualization

  • Use data visualization to understand the data
  • Use pivot tables, charts, and Sankey diagrams
  • Use entity filters, expression filters
  • Use all the formatting settings to make slick dashboards
  • If required, revisit steps 3 and 4 based on the data understanding

Implementation:

  • More charts are added to the data understanding report created earlier
  • This solution will be available to xVector users as a reference when they sign up

Step 6:

Machine Learning Models

  • Create advanced machine learning models based on your business requirements
  • Explore and experiment with multiple drivers such as XGBoost, CatBoost, and TensorFlow
  • Experiment with multiple features (datasets) and parameters
  • Compare performance across drivers, parameters, and features
  • If required, revisit step 4 to enrich the data and create new features
  • Use the trained models as a data enrichment step in step 4 and build new visualizations in step 5 using the model outputs

Implementation:

  • Let us build a simple machine learning model to identify fraudulent transactions

See the demo below on how to build such a machine learning model.

  • The estimator status will change from 'Training' to 'Trained', and a new notification will appear once the model training is done
  • After the model is created, see the demo below to view the metrics, feature importance, and other artifacts
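For reference, 'handling imbalance via driver parameters' usually means weighting the minority class. Below is a minimal sketch with one of the drivers mentioned above, XGBoost; the parameter values are illustrative choices, not xVector defaults:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from xgboost import XGBClassifier

df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Weight the positive (fraud) class by the negative/positive ratio (~578)
# so the driver does not ignore the 0.17% minority class.
spw = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1,
    scale_pos_weight=spw, eval_metric="aucpr",
)
model.fit(X_train, y_train)

# Area under the precision-recall curve is a better yardstick than
# accuracy for data this imbalanced.
proba = model.predict_proba(X_test)[:, 1]
print("AUPRC:", average_precision_score(y_test, proba))
```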

Step 7:

Share With Teams

  • Finally, share all the resources, such as datasources, datasets, reports, and models, across your organization
  • Collaborate with your teams to build, explore, and enhance the solution

Step 8:

Deploy Models, Visualizations

  • To deploy a model, simply 'select' the estimator that you found to be the best; xVector automatically creates a REST API for that estimator

See the demo below to deploy an estimator as a REST API.

A sample REST API URL looks like this: http://xmodel.xvectorlabs.com/xmodel/f4b3641e60484719b247143f3e7f8feb/predict
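Once deployed, the endpoint can be called like any other REST API. A hedged sketch using Python's requests library; the request and response payload shapes shown here are assumptions rather than the documented contract:

```python
import requests

URL = "http://xmodel.xvectorlabs.com/xmodel/f4b3641e60484719b247143f3e7f8feb/predict"

# Hypothetical payload: one transaction with the model's input features
# (feature list truncated here for brevity).
payload = {"records": [{"Time": 406, "V1": -2.31, "V2": 1.95, "Amount": 0.0}]}

resp = requests.post(URL, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # e.g. a fraud probability or predicted class per record
```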

Get in touch with xVector Support for more details.

  • To embed a dashboard or a report in your own enterprise application, just follow the demo below
