Databricks

Overview

Databricks is a unified data analytics platform powered by Apache Spark. SaddleData connects to Databricks as a Destination, allowing you to sync data from various sources directly into your Delta Lake tables.

Prerequisites

To connect to Databricks, you will need:

  • A Databricks Workspace.
  • A SQL Warehouse or an All-Purpose Cluster to execute queries.
  • A Personal Access Token (PAT) for authentication.

Configuration

When creating a Databricks Integration, you will need to provide the following information:

  • Host: The Server Hostname of your cluster or warehouse (e.g., adb-123456789.1.azuredatabricks.net).
  • Port: The port to connect to (typically 443).
  • HTTP Path: The HTTP Path for your cluster or warehouse.
  • Personal Access Token: A generated token to authenticate the connection.
  • Catalog: (Optional) The Unity Catalog name to use. If omitted, the default catalog (usually hive_metastore or main) is used.
  • Schema: (Optional) The default schema (database) to write to. If omitted, the schema named default is used unless the flow provides fully qualified table names.
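Before entering these values, it can help to confirm that they work outside SaddleData. The following is a minimal sketch, assuming the open-source databricks-sql-connector package is installed; it is not how SaddleData itself connects. The helper names build_connect_kwargs and check_connection are hypothetical. The sketch leaves Port out and assumes the connector's default HTTPS port applies.

```python
from typing import Any, Dict, Optional


def build_connect_kwargs(
    host: str,
    http_path: str,
    token: str,
    catalog: Optional[str] = None,
    schema: Optional[str] = None,
) -> Dict[str, Any]:
    """Map the integration fields onto databricks-sql-connector keyword arguments."""
    # The Host field is a bare hostname, so strip a pasted URL scheme or trailing slash.
    hostname = host.strip()
    for prefix in ("https://", "http://"):
        if hostname.startswith(prefix):
            hostname = hostname[len(prefix):]
    hostname = hostname.rstrip("/")

    if not hostname or not http_path or not token:
        raise ValueError("host, http_path and token are required")

    kwargs: Dict[str, Any] = {
        "server_hostname": hostname,
        "http_path": http_path,
        "access_token": token,
    }
    # Catalog and Schema are optional; omit them to fall back to the defaults.
    if catalog:
        kwargs["catalog"] = catalog
    if schema:
        kwargs["schema"] = schema
    return kwargs


def check_connection(**fields: Any) -> bool:
    """Run SELECT 1 against the cluster or warehouse (hypothetical helper)."""
    # Imported lazily so build_connect_kwargs works without the package installed.
    from databricks import sql  # pip install databricks-sql-connector

    with sql.connect(**build_connect_kwargs(**fields)) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
            return cursor.fetchone()[0] == 1
```

Calling check_connection(host=..., http_path=..., token=...) with your own values should return True when the hostname, HTTP path, and token are accepted.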

Sync Modes

When using Databricks as a destination, you can choose from the following sync modes:

  • Full Refresh - Overwrite: Replaces all data in the destination table. To do this, SaddleData either runs TRUNCATE TABLE or drops and recreates the table.
  • Incremental - Append: Appends new records to the destination table.
  • Incremental - Deduped (Upsert): Updates existing records and inserts new records based on a primary key using the MERGE INTO command.
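The three modes above can be sketched as the SQL each one might issue. This is an illustration only: the staging table, the function names, and the exact statements are assumptions, not SaddleData's actual implementation, and identifiers are interpolated without escaping to keep the example short.

```python
from typing import List, Sequence


def overwrite_statements(table: str, staging: str) -> List[str]:
    """Full Refresh - Overwrite: empty the table, then reload it from staging."""
    return [
        f"TRUNCATE TABLE {table}",
        f"INSERT INTO {table} SELECT * FROM {staging}",
    ]


def append_statement(table: str, staging: str) -> str:
    """Incremental - Append: add the staged rows without touching existing ones."""
    return f"INSERT INTO {table} SELECT * FROM {staging}"


def upsert_statement(table: str, staging: str, primary_key: Sequence[str]) -> str:
    """Incremental - Deduped (Upsert): MERGE INTO keyed on the primary key columns."""
    if not primary_key:
        raise ValueError("an upsert needs at least one primary key column")
    # Match target (t) and source (s) rows on every primary key column.
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in primary_key)
    return (
        f"MERGE INTO {table} AS t\n"
        f"USING {staging} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    print(upsert_statement("main.sales.orders", "main.sales.orders_staging", ["id"]))
```

Delta Lake's MERGE fails when several source rows match the same target row, so in a sketch like this the staged rows would need to be deduplicated by key first.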

Schema Evolution

SaddleData takes advantage of Delta Lake capabilities for schema evolution:

  • Automatic Table Creation: If the destination table does not exist, SaddleData creates it automatically using USING DELTA.
  • Schema Drift: SaddleData can automatically add new columns to the destination table as they appear in the source data (ALTER TABLE).
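As a rough illustration of the two behaviors above, the sketch below builds a CREATE TABLE ... USING DELTA statement and an ALTER TABLE ... ADD COLUMNS statement for columns that exist in the source but not in the destination. The function names and the column-to-type dictionaries are hypothetical, and how SaddleData actually detects drift or maps types is not covered here.

```python
from typing import Dict, Optional


def create_table_statement(table: str, columns: Dict[str, str]) -> str:
    """Build the statement used when the destination table does not exist yet."""
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({column_list}) USING DELTA"


def add_columns_statement(
    table: str,
    source_columns: Dict[str, str],
    destination_columns: Dict[str, str],
) -> Optional[str]:
    """Return an ALTER TABLE for columns new in the source, or None if none are new."""
    # Spark SQL compares column names case-insensitively by default,
    # so the comparison here lowercases both sides.
    existing = {name.lower() for name in destination_columns}
    new_columns = [
        (name, dtype)
        for name, dtype in source_columns.items()
        if name.lower() not in existing
    ]
    if not new_columns:
        return None
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in new_columns)
    return f"ALTER TABLE {table} ADD COLUMNS ({column_list})"


if __name__ == "__main__":
    source = {"id": "BIGINT", "total": "DOUBLE", "coupon": "STRING"}
    destination = {"id": "BIGINT", "total": "DOUBLE"}
    print(add_columns_statement("main.sales.orders", source, destination))
```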

Declarative Configuration

apiVersion: v1
kind: Connection
metadata:
  name: databricks-connection
spec:
  connectorId: databricks
  configuration:
    host: adb-123456789.1.azuredatabricks.net
    port: 443
    http_path: my-http-path
    token: '********'
    catalog: my-catalog
    schema: my-schema
  capability: destination
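A manifest like this can be checked for obvious mistakes before it is applied. The sketch below validates an already-parsed manifest (a plain dictionary, for example the result of loading the YAML with a parser of your choice). It assumes the keys nest as metadata.name, spec.connectorId, and spec.configuration.*, and that every field not marked optional in the Configuration section is required. The function name is hypothetical and this is not SaddleData's own validation.

```python
from typing import Any, Dict, List

# Assumption: fields not marked "(Optional)" in the Configuration section are required.
REQUIRED_CONFIG_KEYS = ("host", "port", "http_path", "token")


def validate_connection_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems: List[str] = []

    if manifest.get("kind") != "Connection":
        problems.append("kind must be 'Connection'")
    if not (manifest.get("metadata") or {}).get("name"):
        problems.append("metadata.name is required")

    spec = manifest.get("spec") or {}
    if spec.get("connectorId") != "databricks":
        problems.append("spec.connectorId must be 'databricks'")

    config = spec.get("configuration") or {}
    for key in REQUIRED_CONFIG_KEYS:
        if config.get(key) in (None, ""):
            problems.append(f"spec.configuration.{key} is required")

    # When a port is present it must be a real integer port number (bool is excluded).
    port = config.get("port")
    if port is not None and (
        isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535
    ):
        problems.append("spec.configuration.port must be an integer between 1 and 65535")

    return problems


if __name__ == "__main__":
    example = {
        "apiVersion": "v1",
        "kind": "Connection",
        "metadata": {"name": "databricks-connection"},
        "spec": {
            "connectorId": "databricks",
            "configuration": {"host": "example-host", "port": 443,
                              "http_path": "my-http-path", "token": "placeholder"},
        },
    }
    print(validate_connection_manifest(example))
```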