Databricks
Overview
Databricks is a unified data analytics platform powered by Apache Spark. SaddleData connects to Databricks as a Destination, allowing you to sync data from various sources directly into your Delta Lake tables.
Prerequisites
To connect to Databricks, you will need:
- A Databricks Workspace.
- A SQL Warehouse or an All-Purpose Cluster to execute queries.
- A Personal Access Token (PAT) for authentication.
Configuration
When creating a Databricks Integration, you will need to provide the following information:
- Host: The Server Hostname of your cluster or warehouse (e.g., adb-123456789.1.azuredatabricks.net).
- Port: The port to connect to (typically 443).
- HTTP Path: The HTTP Path for your cluster or warehouse.
- Personal Access Token: A generated token to authenticate the connection.
- Catalog: (Optional) The Unity Catalog name to use. If omitted, the default catalog (usually hive_metastore or main) is used.
- Schema: (Optional) The default schema (database) to write to. If omitted, the schema named default is used unless fully qualified table names are provided in the flow.
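The catalog and schema fallbacks can be sketched as a small name-resolution helper. This is an illustrative sketch, not SaddleData's implementation: the function name qualify_table is hypothetical, and hive_metastore is assumed as the fallback catalog (your workspace default may be main instead).

```python
def qualify_table(table, catalog=None, schema=None,
                  default_catalog="hive_metastore", default_schema="default"):
    """Return a three-part table name (catalog.schema.table).

    Assumption: identifiers contain no dots of their own, so a plain
    split on "." is enough to count the parts.
    """
    parts = table.split(".")
    if len(parts) > 3:
        raise ValueError(f"too many name parts: {table!r}")
    if len(parts) == 3:
        # Fully qualified in the flow: configured defaults are ignored.
        return table
    if len(parts) == 2:
        # Schema given in the flow: only the catalog falls back.
        return f"{catalog or default_catalog}.{table}"
    # Bare table name: both catalog and schema fall back.
    return f"{catalog or default_catalog}.{schema or default_schema}.{table}"


print(qualify_table("orders"))
print(qualify_table("orders", catalog="main", schema="sales"))
```

With these assumptions, a bare table name lands in the default catalog and schema, while a fully qualified name in the flow bypasses both settings.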
Sync Modes
When using Databricks as a destination, you can choose from the following sync modes:
- Full Refresh - Overwrite: Replaces all data in the destination table. SaddleData uses TRUNCATE TABLE or drops and recreates the table.
- Incremental - Append: Appends new records to the destination table.
- Incremental - Deduped (Upsert): Updates existing records and inserts new records based on a primary key using the MERGE INTO command.
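To make the three modes concrete, the sketch below builds the kind of SQL each one implies. It is not the SQL SaddleData actually issues: the mode names, the build_sync_sql function, and the idea of loading from a staging table are all assumptions made for illustration. Identifiers are interpolated without quoting, so this is only suitable for trusted names.

```python
def build_sync_sql(mode, target, staging, primary_key=None):
    """Return the list of SQL statements a sync mode might run.

    mode: "overwrite", "append", or "upsert" (hypothetical names).
    staging: a table assumed to hold the newly extracted records.
    primary_key: list of key columns, required for "upsert".
    """
    load = f"INSERT INTO {target} SELECT * FROM {staging}"
    if mode == "overwrite":
        # Full Refresh: empty the table, then load everything again.
        return [f"TRUNCATE TABLE {target}", load]
    if mode == "append":
        # Incremental Append: add new rows, never touch existing ones.
        return [load]
    if mode == "upsert":
        if not primary_key:
            raise ValueError("upsert requires at least one primary key column")
        on = " AND ".join(f"t.{k} = s.{k}" for k in primary_key)
        # Delta Lake MERGE: update matching rows, insert the rest.
        return [
            f"MERGE INTO {target} AS t USING {staging} AS s ON {on} "
            "WHEN MATCHED THEN UPDATE SET * "
            "WHEN NOT MATCHED THEN INSERT *"
        ]
    raise ValueError(f"unknown sync mode: {mode!r}")


for statement in build_sync_sql("upsert", "main.sales.orders",
                                "main.sales.orders_staging", ["id"]):
    print(statement)
```

The upsert branch shows why a primary key is required: without it there is no ON condition to decide whether a record is an update or an insert.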
Schema Evolution
SaddleData takes advantage of Delta Lake capabilities for schema evolution:
- Automatic Table Creation: If the destination table does not exist, SaddleData creates it automatically using USING DELTA.
- Schema Drift: SaddleData can automatically add new columns to the destination table as they appear in the source data (ALTER TABLE).
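The two behaviors above roughly correspond to the statements generated in the following sketch. The function names and the column-to-type mapping are hypothetical, and the exact statements SaddleData runs may differ. Column names are compared case-insensitively, which matches Spark's default setting.

```python
def create_table_sql(table, columns):
    """Build a CREATE TABLE statement for a Delta table.

    columns: dict mapping column name to a Databricks SQL type.
    """
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols}) USING DELTA"


def schema_drift_sql(table, existing_columns, incoming_columns):
    """Build an ALTER TABLE statement for columns missing in the destination.

    existing_columns: iterable of column names already in the table.
    incoming_columns: dict mapping source column name to SQL type.
    Returns None when the destination already has every column.
    """
    existing = {name.lower() for name in existing_columns}
    new = [(name, sql_type) for name, sql_type in incoming_columns.items()
           if name.lower() not in existing]
    if not new:
        return None
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in new)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


print(create_table_sql("main.sales.orders", {"id": "BIGINT", "total": "DOUBLE"}))
print(schema_drift_sql("main.sales.orders", ["id", "total"],
                       {"id": "BIGINT", "total": "DOUBLE", "email": "STRING"}))
```

Note that this sketch only adds columns. Type changes and dropped columns are outside what the section describes, so they are left out.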
Declarative Configuration
apiVersion: v1
kind: Connection
metadata:
  name: databricks-connection
spec:
  connectorId: databricks
  configuration:
    host: adb-123456789.1.azuredatabricks.net
    port: 443
    http_path: my-http-path
    token: '********'
    catalog: my-catalog
    schema: my-schema
  capability: destination
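Before applying a manifest like the one above, it can help to check the configuration block locally. The following is a minimal sketch under stated assumptions: the key names come from the manifest, the required/optional split follows the Configuration section, and validate_configuration is a hypothetical helper rather than part of SaddleData.

```python
REQUIRED_KEYS = ("host", "port", "http_path", "token")  # catalog and schema are optional


def validate_configuration(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [
        f"missing required key: {key}"
        for key in REQUIRED_KEYS
        if config.get(key) in (None, "")
    ]
    port = config.get("port")
    if port is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(port, int) and not isinstance(port, bool)
        if not (is_int and 0 < port < 65536):
            problems.append("port must be an integer between 1 and 65535")
    return problems


config = {
    "host": "adb-123456789.1.azuredatabricks.net",
    "port": 443,
    "http_path": "my-http-path",
    "token": "********",
}
print(validate_configuration(config))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.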