Databricks

Overview

Databricks is a unified data analytics platform powered by Apache Spark. SaddleData connects to Databricks as a Destination, allowing you to sync data from various sources directly into your Delta Lake tables.

Prerequisites

To connect to Databricks, you will need:

A Databricks Workspace.
A SQL Warehouse or an All-Purpose Cluster to execute queries.
A Personal Access Token (PAT) for authentication.

Configuration

When creating a Databricks Integration, you will need to provide the following information:

Host: The Server Hostname of your cluster or warehouse (e.g., adb-123456789.1.azuredatabricks.net).
Port: The port to connect to (typically 443).
HTTP Path: The HTTP Path for your cluster or warehouse (for a SQL Warehouse, this has the form /sql/1.0/warehouses/<warehouse-id>).
Personal Access Token: A generated token to authenticate the connection.
Catalog: (Optional) The Unity Catalog name to use. If omitted, the default catalog (usually hive_metastore or main) is used.
Schema: (Optional) The default schema (database) to write to. If omitted, the schema named default is used unless the flow provides fully qualified table names.
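
To make the defaults concrete, here is a minimal Python sketch of how the fields above fit together. The DatabricksConfig class and its qualify helper are hypothetical names used only for illustration and are not part of SaddleData; the token and HTTP Path values are placeholders. The sketch assumes the fallback behavior described in the field list: port 443, the schema named default, and the workspace's default catalog when no catalog is set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatabricksConfig:
    """Hypothetical container for the connection fields (not a SaddleData API)."""
    host: str
    http_path: str
    token: str
    port: int = 443                 # typical port, per the field list
    catalog: Optional[str] = None   # omitted: the workspace default catalog applies
    schema: Optional[str] = None    # omitted: falls back to the schema named "default"

    def qualify(self, table: str) -> str:
        """Build the table name a write would target under these settings."""
        if "." in table:
            # Assumption: names that are already qualified in the flow are used as given.
            return table
        parts = [p for p in (self.catalog, self.schema or "default", table) if p]
        return ".".join(f"`{p}`" for p in parts)


cfg = DatabricksConfig(
    host="adb-123456789.1.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abc123",  # placeholder path
    token="dapi-example",                    # placeholder token
    schema="analytics",
)
print(cfg.qualify("orders"))             # unqualified name: schema is prepended
print(cfg.qualify("main.sales.orders"))  # qualified name: passed through unchanged
```

Leaving the catalog out of the qualified name when it is not configured lets Databricks resolve the workspace default, which matches the optional behavior of the Catalog field.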

Sync Modes

When using Databricks as a destination, you can choose from the following sync modes:

Full Refresh - Overwrite: Replaces all data in the destination table. SaddleData uses TRUNCATE TABLE or drops and recreates the table.
Incremental - Append: Appends new records to the destination table.
Incremental - Deduped (Upsert): Updates existing records and inserts new records based on a primary key using the MERGE INTO command.
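
The three modes map onto different SQL statements. The Python sketch below generates the kind of statement each mode implies. It is an illustration under stated assumptions, not the SQL SaddleData actually emits: the staging table, the function names, and the t/s aliases are all hypothetical, and it assumes incoming records have already been loaded into a staging table with the same columns as the target.

```python
from typing import List


def overwrite_sql(target: str, staging: str) -> List[str]:
    """Full Refresh - Overwrite: empty the target, then reload it from staging."""
    return [
        f"TRUNCATE TABLE {target}",
        f"INSERT INTO {target} SELECT * FROM {staging}",
    ]


def append_sql(target: str, staging: str) -> List[str]:
    """Incremental - Append: add the staged rows without touching existing ones."""
    return [f"INSERT INTO {target} SELECT * FROM {staging}"]


def upsert_sql(target: str, staging: str, primary_key: List[str]) -> List[str]:
    """Incremental - Deduped: match rows on the primary key with MERGE INTO."""
    if not primary_key:
        raise ValueError("upsert requires at least one primary key column")
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in primary_key)
    return [
        f"MERGE INTO {target} AS t USING {staging} AS s ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    ]


for statement in upsert_sql("main.sales.orders", "main.sales.orders_staging", ["order_id"]):
    print(statement)
```

UPDATE SET * and INSERT * are Delta Lake shorthand for copying every column from the source row; they require the staging and target tables to share column names.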

Schema Evolution

SaddleData takes advantage of Delta Lake capabilities for schema evolution:

Automatic Table Creation: If the destination table does not exist, SaddleData creates it automatically using USING DELTA.
Schema Drift: SaddleData can automatically add new columns to the destination table as they appear in the source data (ALTER TABLE).
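
As a rough illustration of both behaviors, the Python sketch below builds a CREATE TABLE ... USING DELTA statement and an ALTER TABLE ... ADD COLUMNS statement for columns that appear in the source but not in the destination. The function names, the column-to-type dictionaries, and the case-insensitive name comparison are assumptions made for this example; SaddleData's actual type mapping and drift detection are not documented here.

```python
from typing import Dict, Optional


def create_table_sql(table: str, columns: Dict[str, str]) -> str:
    """Build a CREATE TABLE statement for a Delta table from name -> type pairs."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols}) USING DELTA"


def add_columns_sql(
    table: str, existing: Dict[str, str], incoming: Dict[str, str]
) -> Optional[str]:
    """Return an ALTER TABLE statement for columns missing from the destination.

    Returns None when the source introduces no new columns.
    """
    # Assumption: column names are compared case-insensitively.
    known = {name.lower() for name in existing}
    new_cols = {n: t for n, t in incoming.items() if n.lower() not in known}
    if not new_cols:
        return None
    cols = ", ".join(f"{name} {dtype}" for name, dtype in new_cols.items())
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


destination = {"order_id": "BIGINT", "amount": "DOUBLE"}
source = {"order_id": "BIGINT", "amount": "DOUBLE", "currency": "STRING"}

print(create_table_sql("main.sales.orders", destination))
print(add_columns_sql("main.sales.orders", destination, source))
```

This sketch only handles added columns. Type changes and dropped columns are out of scope, since the text above describes only column addition.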

Declarative Configuration

apiVersion: v1
kind: Connection
metadata:
  name: databricks-connection
spec:
  connectorId: databricks
  configuration:
    host: adb-123456789.1.azuredatabricks.net
    port: 443
    http_path: my-http-path
    token: '********'
    catalog: my-catalog
    schema: my-schema
    capability: destination
