Dataset

Data is the foundation of every insightful analysis, yet its raw form often lacks the structure and clarity needed for effective decision-making. Within xVector, raw data enters the system from various data sources like transactional databases, APIs, logs, or third-party integrations. Before it can be leveraged for exploratory analysis, machine learning, or operational reporting, it must be refined and structured into a more usable format.

As data flows into the system, it becomes a Dataset — a structured, enriched version ready for analysis and modeling. This transformation begins with profiling, where metadata is extracted and each column is classified based on its data type, statistical type (categorical or numerical), and semantic type (e.g., email, URL, or phone number). This metadata provides context, shaping downstream processes like exploratory analysis or machine learning model training.

The enrichment process brings data to life. Users can leverage an extensive toolkit to profile data, detect anomalies, join datasets, apply entity extraction/NLP and other models, and manually edit numerical, text, or image data for precise refinements. This step bridges the gap between raw data and actionable insights.

The clear distinction between a Data Source and a Dataset isn’t just about workflow organization — it’s about traceability and trust. By maintaining the data lineage, users can always trace back to the source, ensuring transparency in processing and confidence in decision-making.

  1. From a workspace, click on Add and choose the Datasets option.
Workspace Add menu showing the Datasets option

Or choose Create Dataset from the menu options of a Datasource.

Datasource context menu showing Create Dataset option
  2. Configure — Select a datasource from the available list and choose the type of dataset:
    • Fact (default) — a collection of facts such as orders, shipments, or inventories
    • Entity — a collection of entities such as suppliers, customers, partners, channels, and other core entities around which the data app creates analytical attributes (e.g., customer segment). Used for filtering and lightweight mastering.
    • Entity Hierarchy — captures hierarchical relationships among entities, useful for filtering reports
Dataset creation dialog with datasource selection and type configuration
  3. Advanced — Set the dataset name, machine specifications, and table options (Record Key, Partition Key).

  4. Select Workspace — Choose a workspace for the dataset.

Once the dataset is created, there are several ways to explore and enrich it.

From the workspace, click the ellipsis (⋯) on the dataset tile (tile with the green icon) to see the available options:

Dataset tile context menu showing all available options
  • View — Takes you to the dataset screen
  • Update Profile — Run each time a dataset is created or updated to view the profile
  • View Profile — View the profile (must run Update Profile first)
  • Create Copy — Creates a copy of the data
  • Generate Report — Generates an AI-powered report dashboard as a starting point
  • Generate Exploratory Report — Generates an AI-powered report to explore the dataset
  • Generate Model — Generates an AI-powered model by choosing feature columns and automatically optimizing training parameters based on your prompt
Dataset explore view showing available actions
  • Activity Logs — Shows the list of activities on that dataset
  • Map Data — Column metadata mappings from source to target (used in data destination)
  • Sync — Synchronizes the dataset with the data source
  • Update — Refreshes the data from the source
  • Materialize — Persists the dataset in the filesystem
  • Publish — Publishes the data, assigning a version with a guaranteed interface
  • Export — Writes the data to a target system (points to Data Destination)
  • Settings — Opens the settings tab for the dataset
  • Delete — Deletes the dataset

The dataset page view provides a toolbar with the following features, from left to right:

Dataset view page toolbar with all feature icons

Presence — Shows which users are in the workspace. There can be more than one user at a time.

Presence indicator showing active users

GenAI — Ask data-related questions in natural language and get an automatic response. For example: “How many unique values does the age column have?”

Dataset table view with GenAI interaction

Driver (play button) — Starts or shuts down the DML driver for the dataset.

Edit Table — Search or filter each of the columns in the dataset.

Dataset edit table view with column search and filter

Data Enrichment (Σ) — Option to add xflow actions or view enrichment history. See the Enrichment Functions section below.

Profile and Metadata Report:

  • Column Stats — shows histogram and statistical information
  • Correlation — shows the correlation matrix
  • Column Metadata — metadata of the dataset
Column metadata view showing data types, statistical types, and profiling results

Write Back — Write the data to a target system.

Reviews and Version Control — Add reviewers and publish different versions.

Reviews and version control panel

Action Logs — Shows the logs of actions taken on the dataset.

Action logs showing dataset activity history

Alerts — Create, update, or subscribe to alerts. See the Alerts section below.

Comments — Add comments to collaborate with other users.

Settings:

Dataset settings panel with basic and advanced configuration
  • Basic — Name, Description, Workspace, Type (entity or fact)
  • Advanced — Spark parameters, Synchronization settings (policy type: on demand / on schedule / rule-based; write mode: upsert / insert / bulk insert; update profile; anomaly detection; alerts)
  • Share — Share the dataset with users or user groups
  • View — View the data

The dataset is enriched using a series (flow) of functions such as aggregates and filters. Users can also apply trained models to compute new columns. Advanced users can author custom functions to manipulate data.

  1. From a Workspace, click the ellipsis on the dataset (tile with green icon) and click View, or double-click the dataset.
  2. The DML driver must be up and running, indicated by a green dot on its icon. Start it using the play-button icon.
  3. Click on the sigma icon (Σ) to open data enrichment.
  4. Click Add an action and choose the function from the dropdown list.
Enrichment function dropdown showing all available actions

Aggregate — Performs a calculation on a set of values and returns a single value, such as SUM or AVERAGE. Select the group-by columns and the aggregation functions to apply.

Aggregate enrichment function configuration
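As an illustration only, here is a pandas sketch of the kind of result an Aggregate action produces; the `orders` data and column names are hypothetical, and xVector executes this on its own engine rather than pandas:

```python
import pandas as pd

# Hypothetical fact data standing in for a dataset
orders = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "amount": [100, 200, 50, 150],
})

# Group-by column plus the aggregation functions to apply
summary = orders.groupby("region", as_index=False).agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
)
# One row per region, with SUM and AVERAGE of amount
```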

Author — Lets advanced users write a custom function to be applied to the dataset, for transformations not available in the standard function list.

Author custom function configuration

Update the custom_function in the notebook editor:

Author function notebook editor for writing custom code
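The exact signature the notebook template expects is not documented here; as a sketch, assume `custom_function` receives a DataFrame and returns the transformed DataFrame:

```python
import pandas as pd

def custom_function(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical custom transformation: trim whitespace in text
    columns and add a derived column. Match the signature your
    notebook template actually provides."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].str.strip()
    out["amount_with_tax"] = out["amount"] * 1.1  # assumed tax rate
    return out

sample = pd.DataFrame({"name": [" Acme ", "Globex"], "amount": [100.0, 200.0]})
result = custom_function(sample)
```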

After writing the function, run Materialize or Materialize Overwrite as needed to persist the updated dataset.

Custom SQL — Write custom SQL queries against the dataset using the SQL editor.

Custom SQL function with SQL editor
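The query below is plain SQL of the sort you might enter in this action; the in-memory SQLite table is only a stand-in for the dataset so the example is runnable:

```python
import sqlite3
import pandas as pd

# Hypothetical dataset loaded into a throwaway SQLite table
orders = pd.DataFrame({"status": ["open", "closed", "open"],
                       "amount": [10, 20, 30]})
con = sqlite3.connect(":memory:")
orders.to_sql("orders", con, index=False)

query = """
SELECT status, SUM(amount) AS total
FROM orders
GROUP BY status
ORDER BY status
"""
result = pd.read_sql_query(query, con)
```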

Datatype — Changes the data type of a column in the dataset. Select the column and the target data type.

Datatype enrichment function for changing column types

Delete Rows — Removes specific rows from the dataset based on filter criteria.

Delete rows enrichment function

Drop Columns — Removes columns from the dataset. Select the columns to be dropped.

Drop columns enrichment function

Dropna — Removes rows with missing values (null/NaN) from the dataset. Configure which columns to check and the threshold for dropping rows.

Dropna enrichment function for handling missing values
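A pandas sketch of the check-columns-plus-threshold behavior, with hypothetical column names; xVector's own configuration UI drives the equivalent operation:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [None, None, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Keep only rows with at least 2 non-null values among the checked columns
cleaned = df.dropna(subset=["a", "b", "c"], thresh=2)
```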

Explode — Converts an array/list of items into separate rows. Useful when a column contains lists that need to be flattened into individual records.

Explode enrichment function for flattening arrays
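For illustration, the pandas equivalent of flattening a list column, using made-up order data:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2],
    "items": [["apple", "pear"], ["milk"]],
})

# Each list element becomes its own row; order_id is repeated
flat = df.explode("items", ignore_index=True)
```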

Fillna — Replaces null values with a specified value. Configure the fill value per column or for the entire dataset.

Fillna enrichment function for replacing null values
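A small pandas sketch of per-column fill values; the column names are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"qty": [1.0, None], "note": ["ok", None]})

# Per-column fill values, mirroring the per-column configuration
filled = df.fillna({"qty": 0.0, "note": "unknown"})
```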

Filter — Extracts specific data from the dataset based on conditions. Define the filter expression using column names, operators, and values.

Filter enrichment function configuration
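The shape of a filter expression, shown as a runnable pandas sketch over hypothetical data (xVector's expression syntax may differ):

```python
import pandas as pd

orders = pd.DataFrame({"status": ["open", "closed", "open"],
                       "amount": [10, 200, 300]})

# Column name, operator, value: the same building blocks as a filter expression
subset = orders.query("status == 'open' and amount > 100")
```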

Join — Joins two datasets on a particular column. Select the join type (inner, left, right, outer), the target dataset, and the join keys.

Join enrichment function for combining datasets
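An illustrative pandas left join with made-up orders and customers tables, to show what the join type and key selection produce:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Acme", "Globex"]})

# Left join on the selected key; unmatched rows keep a null name
joined = orders.merge(customers, on="customer_id", how="left")
```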

JSON Normalize — Normalizes semi-structured JSON data into a flat table format.

JSON normalize enrichment function
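A pandas sketch of flattening nested JSON; the record shape is hypothetical:

```python
import pandas as pd

records = [
    {"id": 1, "customer": {"name": "Acme", "city": "Austin"}},
    {"id": 2, "customer": {"name": "Globex", "city": "Boston"}},
]

# Nested keys flatten into dotted column names
flat = pd.json_normalize(records)
```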

Model Apply — Applies a trained model to the dataset to compute new columns. For example, apply a classifier to flag customers likely to churn based on the latest order data.

Model apply enrichment function

New Column — Creates a new column based on a provided expression. Define the column name, data type, and the expression to compute its values.

New column enrichment function
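For illustration, the pandas equivalent of a computed column; the column names and expression are invented:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "qty": [2, 3]})

# Column name plus an expression over existing columns
df["revenue"] = df.eval("price * qty")
```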

Pivot — Reorganizes data from a long format to a wide format. Select the index columns, pivot columns, and aggregation functions.

Pivot enrichment function configuration
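A pandas sketch of long-to-wide pivoting with hypothetical sales data; missing index/column combinations come back as nulls:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West"],
    "quarter": ["Q1", "Q2", "Q1"],
    "amount": [100, 150, 80],
})

# Index column, pivot column, and aggregation function
wide = sales.pivot_table(index="region", columns="quarter",
                         values="amount", aggfunc="sum")
```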

Split Column — Splits string-type columns into separate columns based on a delimiter.

Split column enrichment function
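An illustrative pandas split on a space delimiter; the column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"full_name": ["Ada Lovelace", "Alan Turing"]})

# Split on the delimiter into two new columns
df[["first", "last"]] = df["full_name"].str.split(" ", expand=True)
```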

Union — Appends rows from another dataset to the current dataset. The schemas must be compatible.

Union enrichment function for combining row sets

Unpivot — Reorganizes data from a wide format to a long format (the reverse of Pivot).

Unpivot enrichment function
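For illustration, the pandas equivalent of unpivoting, undoing the wide layout from the Pivot example above (data is hypothetical):

```python
import pandas as pd

wide = pd.DataFrame({"region": ["East", "West"],
                     "Q1": [100, 80], "Q2": [150, 90]})

# Wide quarter columns become (quarter, amount) rows
long = wide.melt(id_vars="region", var_name="quarter", value_name="amount")
```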

Upsert — Inserts rows that do not already exist and updates existing rows that match on the key columns.

Upsert enrichment function
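One way to sketch upsert semantics in pandas, assuming `id` is the key column and incoming rows win on conflict:

```python
import pandas as pd

current = pd.DataFrame({"id": [1, 2], "amount": [10, 20]})
incoming = pd.DataFrame({"id": [2, 3], "amount": [25, 30]})

# Incoming rows override matches on the key; unmatched rows are appended
merged = (pd.concat([current, incoming])
            .drop_duplicates(subset="id", keep="last")
            .sort_values("id", ignore_index=True))
```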

Window — Performs statistical operations such as rank, row number, and cumulative aggregations over a window of rows defined by partition and ordering columns.

Window enrichment function for statistical operations
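A pandas sketch of row number and running total partitioned by region and ordered by month, with invented data:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "month": [1, 2, 1, 2],
    "amount": [100, 150, 80, 90],
})

# Partition by region, order by month, then compute windowed values
sales = sales.sort_values(["region", "month"], ignore_index=True)
sales["row_number"] = sales.groupby("region").cumcount() + 1
sales["running_total"] = sales.groupby("region")["amount"].cumsum()
```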

SQL — Use the SQL editor to write your own queries against the dataset.

SQL editor for writing custom queries

The following options are common across enrichment functions:

  • Validate — Validates the provided input expression and configuration. It is recommended to run validate first, verify the inputs, and update if required.
  • Save — Runs the selected actions and saves the configuration. Saving does not result in actual data being created; it only saves the logic of applied transformations. To create actual data, you must materialize the dataset.
  • Materialize — Persists the data. Available at the bottom of the xflow-action tab (accessible via the sigma icon Σ).
  • Materialize Overwrite — Recreates and persists the dataset, deleting the older version. Recommended when schema changes occur.
Xflow actions panel showing materialize options

Users can set up alerts based on rules for thresholds or drifts.

Alerts configuration panel

Create a threshold alert with the following steps:

  1. Alert Name — Provide a name for the alert
  2. Alert Description — Provide a description
  3. Scope — Choose the alert to be either public or private
  4. Type — Choose Threshold
  5. Add Preprocessing Step — Add a preprocessing step (uses the same enrichment functions described above)
  6. Expression — Provide a custom expression for the alert to run
  7. Validate — Best practice to validate input before saving
  8. Save — Save and run the alert on the dataset
Threshold alert save result Threshold alert preview Threshold alert completion view

Drift is calculated in the context of models, at the time the dataset is synchronized with the Data Source. The data source must have indices defined for synchronization.

Dataset drift detection integrated with models