Dataset
Data is the foundation of every insightful analysis, yet its raw form often lacks the structure and clarity needed for effective decision-making. Within xVector, raw data enters the system from various data sources like transactional databases, APIs, logs, or third-party integrations. Before it can be leveraged for exploratory analysis, machine learning, or operational reporting, it must be refined and structured into a more usable format.
Refining the Raw: Datasets
As data flows into the system, it becomes a Dataset — a structured, enriched version ready for analysis and modeling. This transformation begins with profiling, where metadata is extracted and each column is classified based on its data type, statistical type (categorical or numerical), and semantic type (e.g., email, URL, or phone number). This metadata provides context, shaping downstream processes like exploratory analysis or machine learning model training.
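As a rough sketch of the kind of classification profiling performs (illustrative only, not xVector's actual profiler), a pandas snippet might assign each column a data type and a coarse statistical type:

```python
import pandas as pd

# Hypothetical sample columns standing in for an incoming dataset
df = pd.DataFrame({
    "age": [34, 29, 41],
    "segment": ["gold", "silver", "gold"],
    "email": ["a@x.com", "b@y.com", "c@z.com"],
})

# Classify each column: data type plus a coarse statistical type.
# Semantic types (email, URL, phone) would additionally require
# pattern-based detection, omitted here for brevity.
for col in df.columns:
    stat_type = "numerical" if pd.api.types.is_numeric_dtype(df[col]) else "categorical"
    print(f"{col}: dtype={df[col].dtype}, statistical type={stat_type}")
```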
The enrichment process brings data to life. Users can leverage an extensive toolkit to profile data, detect anomalies, join datasets, apply entity extraction/NLP and other models, and manually edit numerical, text, or image data for precise refinements. This step bridges the gap between raw data and actionable insights.
The Power of Traceability
The clear distinction between a Data Source and a Dataset isn’t just about workflow organization — it’s about traceability and trust. By maintaining the data lineage, users can always trace back to the source, ensuring transparency in processing and confidence in decision-making.
Creating a Dataset
- From a workspace, click Add and choose the Datasets option, or choose Create Dataset from the menu options of a Datasource.
- Configure — Select a datasource from the available list and choose the type of dataset:
- Fact (default) — a collection of facts such as orders, shipments, inventories
- Entity — a collection of entities such as suppliers, customers, partners, channels, and other core entities around which the data app creates analytical attributes (e.g., customer segment). Used for filtering and lightweight mastering.
- Entity Hierarchy — captures hierarchical relationships among entities, useful for filtering reports
- Advanced — Set the dataset name, machine specifications, and table options (Record Key, Partition Key).
- Select Workspace — Choose a workspace for the dataset.
Exploring and Enriching a Dataset
Once the dataset is created, there are several ways to explore and enrich it.
Workspace View Options
From the workspace, click the ellipsis on the dataset tile (the tile with the green icon) to see the available options:
- View — Takes you to the dataset screen
- Update Profile — Run this each time a dataset is created or updated so the profile can be viewed
- View Profile — View the profile (must run Update Profile first)
- Create Copy — Creates a copy of the data
- Generate Report — Generates an AI-powered report dashboard as a starting point
- Generate Exploratory Report — Generates an AI-powered report to explore the dataset
- Generate Model — Generates an AI-powered model by choosing feature columns and automatically optimizing training parameters based on your prompt
- Activity Logs — Shows the list of activities on that dataset
- Map Data — Column metadata mappings from source to target (used in data destination)
- Sync — Synchronizes the dataset with the data source
- Update — Updates the data to match the source
- Materialize — Persists the dataset in the filesystem
- Publish — Publishes the data, assigning a version with a guaranteed interface
- Export — Writes the data to a target system (points to Data Destination)
- Settings — Opens the settings tab for the dataset
- Delete — Deletes the dataset
Dataset View Page
The dataset view page provides a toolbar with the following features, from left to right:
Presence — Shows which users are currently in the workspace; more than one user can be present at a time.
GenAI — Ask data-related questions in natural language and get an automatic response. For example: “How many unique values does the age column have?”
Driver (play button) — Starts or shuts down the DML driver for the dataset.
Edit Table — Search or filter each of the columns in the dataset.
Data Enrichment (Σ) — Option to add xflow actions or view enrichment history. See the Enrichment Functions section below.
Profile and Metadata Report:
- Column Stats — shows histogram and statistical information
- Correlation — shows the correlation matrix
- Column Metadata — metadata of the dataset
Write Back — Write the data to a target system.
Reviews and Version Control — Add reviewers and publish different versions.
Action Logs — Shows the logs of actions taken on the dataset.
Alerts — Create, update, or subscribe to alerts. See the Alerts section below.
Comments — Add comments to collaborate with other users.
Settings:
- Basic — Name, Description, Workspace, Type (entity or fact)
- Advanced — Spark parameters, Synchronization settings (policy type: on demand / on schedule / rule-based; write mode: upsert / insert / bulk insert; update profile; anomaly detection; alerts)
- Share — Share the dataset with users or user groups
- View — View the data
Enrichment Functions
The dataset is enriched using a series (flow) of functions such as aggregates and filters. Users can also apply trained models to compute new columns. Advanced users can author custom functions to manipulate data.
Getting to Enrichment Functions
Section titled “Getting to Enrichment Functions”- From a Workspace, click the ellipses on the dataset (tile with green icon) and click View, or double-click the dataset.
- The DML driver needs to be up and running (green dot in the icons). Start it using the play-button icon.
- Click on the sigma icon (Σ) to open data enrichment.
- Click Add an action and choose the function from the dropdown list.
Aggregate
Performs a calculation on a set of values and returns a single value, such as SUM or AVERAGE. Select the group-by columns and the aggregation functions to apply.
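For intuition, here is what the equivalent logic looks like in PySpark; the dataset and column names are illustrative, not part of the platform:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative orders data
orders = spark.createDataFrame(
    [("east", 120.0), ("east", 80.0), ("west", 200.0)],
    ["region", "amount"],
)

# Group by the chosen columns and apply the aggregation functions
orders.groupBy("region").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount"),
).show()
```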
Author
The Author function is used for writing a custom function to be applied to the dataset. This is useful for advanced users who need to implement transformations not available in the standard function list.
Update the custom_function in the notebook editor:
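The exact signature the notebook editor expects is platform-specific; as a minimal sketch, assuming the function receives and returns a Spark DataFrame:

```python
from pyspark.sql import DataFrame, functions as F

def custom_function(df: DataFrame) -> DataFrame:
    # Illustrative transformation: normalize an email column by
    # trimming whitespace and lowercasing (the column name is hypothetical)
    return df.withColumn("email", F.lower(F.trim(F.col("email"))))
```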
After writing the function, run Materialize or Materialize Overwrite as needed to persist the updated dataset.
Custom SQL
Write custom SQL queries against the dataset using the SQL editor.
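For example, a query of this shape could be run against the dataset. The sketch below uses PySpark with an illustrative table registered as a view; in the SQL editor you would reference the dataset's own table name instead:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data registered as a temporary view
spark.createDataFrame(
    [(1, "open"), (2, "closed"), (3, "open")],
    ["order_id", "status"],
).createOrReplaceTempView("orders")

spark.sql("""
    SELECT status, COUNT(*) AS order_count
    FROM orders
    GROUP BY status
""").show()
```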
Datatype
Changes the data type of a column in the dataset. Select the column and the target data type.
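Conceptually this is a cast; a PySpark sketch with an illustrative column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("42",), ("17",)], ["age"])

# Cast the string column to an integer type
df = df.withColumn("age", F.col("age").cast("int"))
df.printSchema()
```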
Delete Rows
Removes specific rows from the dataset based on filter criteria.
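Conceptually this keeps everything that does not match the criteria; a PySpark sketch with illustrative data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "test"), (2, "prod"), (3, "test")],
    ["id", "env"],
)

# Remove rows matching the criteria (here: test records)
df.filter(F.col("env") != "test").show()
```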
Drop Columns
Removes columns from the dataset. Select the columns to be dropped.
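A PySpark sketch of the same operation (column names illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Ana", "internal-note")],
    ["id", "name", "scratch"],
)

# Drop the selected column
df.drop("scratch").show()
```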
Dropna
Removes missing values (null/NaN) from the dataset. Configure which columns to check and the threshold for dropping rows.
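A PySpark sketch with an illustrative email column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a@x.com"), (2, None), (3, "c@z.com")],
    ["id", "email"],
)

# Drop rows where any of the checked columns is null
df.dropna(subset=["email"]).show()
```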
Explode
Converts an array/list of items into separate rows. Useful when a column contains lists that need to be flattened into individual records.
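A PySpark sketch, flattening an illustrative tags column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("order-1", ["red", "blue"]), ("order-2", ["green"])],
    ["order_id", "tags"],
)

# Each list element becomes its own row
df.withColumn("tag", F.explode("tags")).drop("tags").show()
```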
Fillna
Replaces null values in a dataset with a specified value. Configure the fill value per column or for the entire dataset.
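A PySpark sketch, filling illustrative age and city columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, None, "NY"), (2, 30, None)],
    ["id", "age", "city"],
)

# Fill nulls per column with the configured values
df.fillna({"age": 0, "city": "unknown"}).show()
```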
Filter
Extracts specific data from a dataset based on conditions. Define the filter expression using column names, operators, and values.
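A PySpark sketch of a compound filter condition (names illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 120.0, "open"), (2, 80.0, "open"), (3, 200.0, "closed")],
    ["order_id", "amount", "status"],
)

# Keep only rows matching the condition
df.filter((F.col("amount") > 100) & (F.col("status") == "open")).show()
```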
Join
Joins two datasets based on a particular column. Select the join type (inner, left, right, outer), the target dataset, and the join keys.
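A PySpark sketch of a left join between illustrative orders and customers datasets:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "c-1", 120.0), (2, "c-2", 80.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("c-1", "Ana"), ("c-2", "Bo")],
    ["customer_id", "name"],
)

# Left join on the shared key column
orders.join(customers, on="customer_id", how="left").show()
```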
JSON Normalize
Normalizes semi-structured JSON data into a flat table format.
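The pandas analogue of this operation is json_normalize, shown here with illustrative records:

```python
import pandas as pd

# Illustrative semi-structured records, e.g. from an API response
records = [
    {"id": 1, "user": {"name": "Ana", "address": {"city": "Lisbon"}}},
    {"id": 2, "user": {"name": "Bo", "address": {"city": "Oslo"}}},
]

# Nested fields are flattened into dotted column names:
# id, user.name, user.address.city
flat = pd.json_normalize(records)
print(flat)
```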
Model Apply
Applies a trained model to the dataset to compute new columns. For example, apply a classifier to identify customers likely to churn from the latest order data.
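In the platform this is configured through the UI; conceptually the effect is like this scikit-learn sketch, where a toy trained classifier's output becomes a new column (all names and data are illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a previously trained churn model
train = pd.DataFrame({"orders_90d": [1, 12, 2, 9], "churned": [1, 0, 1, 0]})
model = LogisticRegression().fit(train[["orders_90d"]], train["churned"])

latest = pd.DataFrame({"customer_id": ["c-1", "c-2"], "orders_90d": [2, 10]})

# The model output is computed as a new column on the dataset
latest["churn_risk"] = model.predict_proba(latest[["orders_90d"]])[:, 1]
print(latest)
```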
New Column
Creates a new column in the dataset based on a provided expression. Define the column name, data type, and the expression to compute its values.
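A PySpark sketch, deriving an illustrative tax-inclusive amount:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(100.0,), (250.0,)], ["amount"])

# New column computed from an expression over existing columns
df.withColumn("amount_with_tax", F.col("amount") * 1.08).show()
```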
Pivot
A data transformation tool used to reorganize data from a long format to a wide format. Select the index columns, pivot columns, and aggregation functions.
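A PySpark sketch turning illustrative quarterly sales from long to wide:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("east", "Q1", 100.0), ("east", "Q2", 150.0), ("west", "Q1", 90.0)],
    ["region", "quarter", "amount"],
)

# Long to wide: one column per quarter, aggregated by sum
sales.groupBy("region").pivot("quarter").sum("amount").show()
```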
Split
Splits string-type columns based on a delimiter into separate columns.
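A PySpark sketch splitting an illustrative "last,first" name column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Doe,John",), ("Lee,Ana",)], ["full_name"])

# Split on the delimiter and spread the parts into separate columns
parts = F.split(F.col("full_name"), ",")
df = df.withColumn("last_name", parts.getItem(0))
df = df.withColumn("first_name", parts.getItem(1))
df.show()
```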
Union
Adds rows from another dataset to the current dataset. The schemas must be compatible.
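A PySpark sketch appending illustrative monthly batches by column name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jan = spark.createDataFrame([(1, 120.0)], ["order_id", "amount"])
feb = spark.createDataFrame([(2, 80.0)], ["order_id", "amount"])

# Append rows by matching column names; schemas must be compatible
jan.unionByName(feb).show()
```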
Unpivot
A data transformation tool used to reorganize data from a wide format to a long format — the reverse of pivot.
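A PySpark sketch using stack to reverse the pivot shown earlier (names illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

wide = spark.createDataFrame(
    [("east", 100.0, 150.0), ("west", 90.0, 110.0)],
    ["region", "q1", "q2"],
)

# Wide to long: stack the quarter columns into (quarter, amount) rows
wide.selectExpr(
    "region",
    "stack(2, 'Q1', q1, 'Q2', q2) as (quarter, amount)",
).show()
```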
Upsert
Adds rows that are not duplicates to the dataset. Existing rows matching on the key columns are updated with the new values.
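A PySpark sketch of the same semantics using an anti-join followed by a union (key and data illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

existing = spark.createDataFrame(
    [(1, "pending"), (2, "shipped")], ["order_id", "status"]
)
incoming = spark.createDataFrame(
    [(2, "delivered"), (3, "pending")], ["order_id", "status"]
)

key = "order_id"

# Keep existing rows whose key has no match in the incoming batch, then
# append all incoming rows: matching keys are updated, new keys inserted
upserted = (
    existing.join(incoming.select(key), on=key, how="left_anti")
    .unionByName(incoming)
)
upserted.orderBy(key).show()
```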
Window
Performs statistical operations such as rank, row number, and cumulative aggregations over a window of rows defined by partition and ordering columns.
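A PySpark sketch computing a row number and a running total per region (names illustrative):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("east", "2024-01", 100.0), ("east", "2024-02", 150.0),
     ("west", "2024-01", 90.0)],
    ["region", "month", "amount"],
)

# Window defined by the partition (region) and ordering (month) columns
w = Window.partitionBy("region").orderBy("month")

sales.select(
    "region", "month", "amount",
    F.row_number().over(w).alias("row_num"),
    F.sum("amount").over(w).alias("running_total"),
).show()
```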
SQL Editor
Use the SQL editor to write your own queries against the dataset.
Common Options
The following options are common across enrichment functions:
- Validate — Validates the provided input expression and configuration. It is recommended to run validate first, verify the inputs, and update if required.
- Save — Runs the selected actions and saves the configuration. Saving does not result in actual data being created; it only saves the logic of applied transformations. To create actual data, you must materialize the dataset.
- Materialize — Persists the data. Available at the bottom of the xflow-action tab (accessible via the sigma icon Σ).
- Materialize Overwrite — Recreates the dataset and persists it, overwriting the data and deleting the older version. Recommended when schema changes occur.
Alerts
Users can set up alerts based on rules for thresholds or drifts.
Threshold
Create a threshold alert with the following steps:
- Alert Name — Provide a name for the alert
- Alert Description — Provide a description
- Scope — Choose the alert to be either public or private
- Type — Choose Threshold
- Add Preprocessing Step — Add a preprocessing step (uses the same enrichment functions described above)
- Expression — Provide a custom expression for the alert to run (see the illustration after this list)
- Validate — Best practice to validate input before saving
- Save — Save and run the alert on the dataset
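As a purely hypothetical illustration, a threshold expression might look like `avg(order_amount) > 500`, firing when the average order value exceeds 500. The exact expression syntax is defined by the platform, so treat this only as an illustration of the idea.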
Drift
Drifts are calculated in the context of models, and are computed when the dataset is synchronized with the Data Source. The data source should have indices configured for synchronization.