
DataHub Integration Guide

This guide covers the installation and configuration of DataHub as part of your Oversight platform.

Overview

DataHub is a modern data catalog that helps you discover, understand, and govern your data. As part of Oversight, it provides the metadata layer that connects all your data sources.

Installation

Prerequisites

  • Python 3.8 or later
  • Docker and Docker Compose
  • At least 8GB RAM
  • 10GB free disk space

Quick Start Installation

The easiest way to get started with DataHub is using the CLI:

# Install Python dependencies
python3 -m pip install --upgrade pip wheel setuptools
 
# Install DataHub CLI
python3 -m pip install --upgrade acryl-datahub
 
# Quick start with Docker
datahub docker quickstart

This command will:

  1. Pull all required Docker images
  2. Start DataHub services (GMS, Frontend, Elasticsearch, PostgreSQL, Kafka)
  3. Initialize the database
  4. Load sample metadata

Access DataHub

Once installation completes, access DataHub at: http://localhost:9002

Default credentials:

  • Username: datahub
  • Password: datahub

Architecture

DataHub consists of several components:

  • DataHub Frontend: React-based web UI
  • DataHub GMS (Graph Metadata Service): Core metadata API
  • Elasticsearch: Search and indexing
  • PostgreSQL: Primary metadata store
  • Kafka: Event streaming
  • Schema Registry: Avro schema management

Configuration

Basic Configuration

DataHub configuration is managed through environment variables and configuration files.

Configuration File

Create a .datahub/datahub.properties file:

# DataHub host
datahub.host=localhost:9002
 
# Authentication
auth.enabled=true
auth.type=oidc
 
# Search
elasticsearch.host=elasticsearch:9200
 
# Database
db.host=postgres:5432
db.name=datahub

Authentication Integration

Keycloak Integration

Configure DataHub to use Keycloak for SSO:

# datahub-frontend/env.properties
auth.oidc.enabled=true
auth.oidc.clientId=oversight-app
auth.oidc.clientSecret=your-keycloak-secret
auth.oidc.discoveryUri=http://localhost:8080/realms/oversight/.well-known/openid-configuration
auth.oidc.userNameClaim=preferred_username
auth.oidc.userNameClaimRegex=.*

Restart the DataHub services so the new settings take effect (re-running the quickstart command restarts the containers):

datahub docker quickstart

Data Ingestion

DataHub supports ingestion from 50+ data sources.

UI-Based Ingestion

  1. Navigate to Ingestion → Sources
  2. Click Create new source
  3. Select your data source type
  4. Configure connection details
  5. Test connection
  6. Set ingestion schedule
  7. Click Save & Run

CLI-Based Ingestion

Install Source Connector

# Example: Install MySQL connector
pip install 'acryl-datahub[mysql]'
 
# Example: Install Snowflake connector
pip install 'acryl-datahub[snowflake]'

Create Recipe File

Create a recipe.yml file:

# MySQL example
source:
  type: mysql
  config:
    host_port: localhost:3306
    database: mydb
    username: ${MYSQL_USER}
    password: ${MYSQL_PASSWORD}
    
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
    token: ${DATAHUB_TOKEN}
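Recipes reference environment variables (the ${MYSQL_USER} placeholders above) so credentials stay out of version control; the CLI resolves them when the recipe is loaded. The expansion behaves roughly like Python's standard os.path.expandvars, sketched here as an illustration rather than DataHub's actual loader:

```python
import os

# Stand-in for the ${VAR} expansion the DataHub CLI performs on recipe
# values; the real resolution happens when `datahub ingest` reads the file.
os.environ["MYSQL_USER"] = "etl_reader"
os.environ["MYSQL_PASSWORD"] = "s3cret"

raw = {"username": "${MYSQL_USER}", "password": "${MYSQL_PASSWORD}"}
resolved = {k: os.path.expandvars(v) for k, v in raw.items()}
print(resolved["username"])  # etl_reader
```

Unset variables are left as literal ${...} text by expandvars, which is a useful signal that a credential was never exported.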

Run Ingestion

datahub ingest -c recipe.yml

Supported Data Sources

Databases

  • MySQL, PostgreSQL, Oracle, SQL Server
  • MongoDB, Cassandra, DynamoDB
  • MariaDB, Teradata, Vertica

Data Warehouses

  • Snowflake
  • BigQuery
  • Redshift
  • Azure Synapse
  • Databricks

Data Lakes

  • S3
  • Azure Data Lake Storage (ADLS)
  • Google Cloud Storage (GCS)
  • HDFS

BI & Visualization

  • Tableau
  • Looker
  • PowerBI
  • Superset
  • Metabase

Data Processing

  • Apache Spark
  • Apache Airflow
  • dbt
  • Apache Flink
  • Kafka

ML Platforms

  • SageMaker
  • MLflow
  • Kubeflow

Common Ingestion Examples

Snowflake Ingestion

source:
  type: snowflake
  config:
    account_id: abc123.us-east-1
    warehouse: COMPUTE_WH
    username: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    role: DATAHUB_ROLE
    database_pattern:
      allow:
        - "PROD_.*"
    schema_pattern:
      deny:
        - ".*_TEMP"

PostgreSQL Ingestion

source:
  type: postgres
  config:
    host_port: localhost:5432
    database: production
    username: ${POSTGRES_USER}
    password: ${POSTGRES_PASSWORD}
    include_tables: true
    include_views: true
    profiling:
      enabled: true

S3 Data Lake Ingestion

source:
  type: s3
  config:
    aws_access_key_id: ${AWS_ACCESS_KEY}
    aws_secret_access_key: ${AWS_SECRET_KEY}
    aws_region: us-east-1
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
        table_name: my_dataset

dbt Integration

source:
  type: dbt
  config:
    manifest_path: /path/to/target/manifest.json
    catalog_path: /path/to/target/catalog.json
    sources_path: /path/to/target/sources.json
    target_platform: snowflake

Features & Usage

Data Discovery

Use the search bar to find:

  • Datasets
  • Dashboards
  • Data pipelines
  • ML models
  • Users

Search syntax:

  • name:customer* - Find by name
  • platform:snowflake - Filter by platform
  • tag:pii - Filter by tag
  • owner:john.doe - Filter by owner
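Field filters combine freely with free text in a single query string. A tiny helper of our own, purely to illustrate the syntax above:

```python
def build_search_query(text="", **filters):
    """Compose a DataHub-style search string, e.g. "customer* tag:pii".
    Filters are sorted only to make the output deterministic."""
    parts = [text] if text else []
    parts += [f"{field}:{value}" for field, value in sorted(filters.items())]
    return " ".join(parts)

print(build_search_query("customer*", platform="snowflake", tag="pii"))
# customer* platform:snowflake tag:pii
```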

Data Lineage

View how data flows through your organization:

  1. Navigate to a dataset
  2. Click on Lineage tab
  3. View upstream and downstream dependencies
  4. Click on nodes to explore related assets

Glossary Terms

Create a business glossary:

  1. Navigate to Govern → Glossary
  2. Click Create Term
  3. Define:
    • Term name
    • Definition
    • Related terms
    • Owners
  4. Link to datasets

Domains

Organize data by business domains:

  1. Navigate to Govern → Domains
  2. Click Create Domain
  3. Set domain name (e.g., “Marketing”, “Finance”)
  4. Assign datasets to domains
  5. Set domain owners

Tags

Add metadata tags:

  1. Navigate to dataset
  2. Click Add Tags
  3. Create or select tags (e.g., “PII”, “Critical”)
  4. Tags appear in search and can trigger policies

Data Quality

Monitor data quality:

  1. Navigate to Govern → Quality
  2. Create assertions:
    • Freshness checks
    • Volume checks
    • Schema validation
    • Custom SQL checks
  3. Set up alerts

Data Contracts

Define data expectations:

# Example data contract
dataset: urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.sales.orders,PROD)
contract:
  freshness:
    maxAge: 24h
  schema:
    fields:
      - name: order_id
        type: integer
        nullable: false
      - name: customer_id
        type: integer
        nullable: false
  quality:
    - type: sql
      statement: "SELECT COUNT(*) FROM orders WHERE amount < 0"
      operator: EQUALS
      value: 0
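The quality rule above passes when the SQL statement's result satisfies the operator. The logic can be exercised end to end against SQLite (our own evaluator sketch; DataHub's assertion engine is more general):

```python
import sqlite3

def evaluate(actual, operator, expected):
    # Only the operator used in the contract above is sketched here.
    if operator == "EQUALS":
        return actual == expected
    raise NotImplementedError(operator)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 25.0), (2, 11, 99.9)])

# The contract's statement: count orders with a negative amount.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount < 0"
).fetchone()
print(evaluate(count, "EQUALS", 0))  # True -- the contract holds
```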

GraphQL API

DataHub provides a powerful GraphQL API:

# Query datasets
query {
  search(
    input: {
      type: DATASET
      query: "customers"
      start: 0
      count: 10
    }
  ) {
    start
    count
    total
    searchResults {
      entity {
        urn
        type
        ... on Dataset {
          name
          description
          platform {
            name
          }
        }
      }
    }
  }
}

Access GraphiQL interface at: http://localhost:9002/api/graphiql
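The same query can be issued from code: the frontend serves the GraphQL endpoint at /api/graphql, accepting a JSON body with a query field and, when authentication is enabled, a bearer token. A stdlib-only sketch (the endpoint URL follows the quickstart defaults; nothing is sent until urlopen is called):

```python
import json
import urllib.request

def build_graphql_request(endpoint, query, token=None):
    """Build a POST request carrying a GraphQL query as a JSON body."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"query": query}).encode()
    return urllib.request.Request(endpoint, data=body,
                                  headers=headers, method="POST")

req = build_graphql_request(
    "http://localhost:9002/api/graphql",
    'query { search(input: {type: DATASET, query: "customers", '
    'start: 0, count: 10}) { total } }',
)
# urllib.request.urlopen(req) would execute it against a running instance.
```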

Python SDK

Programmatic access with Python:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass
 
# Create emitter
emitter = DatahubRestEmitter("http://localhost:8080")
 
# Create dataset URN
dataset_urn = make_dataset_urn(
    platform="snowflake",
    name="prod.sales.orders"
)
 
# Update dataset properties
properties = DatasetPropertiesClass(
    description="Customer orders table",
    customProperties={
        "owner": "sales-team",
        "criticality": "high"
    }
)
 
# Emit metadata
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=properties
    )
)

Integration with Oversight

With MinIO

Store large metadata artifacts in MinIO:

# Configure DataHub to use MinIO
storage:
  type: s3
  s3:
    endpoint: http://localhost:9090
    accessKey: minio
    secretKey: miniosecret
    bucket: datahub-artifacts

With Langfuse

Cross-reference ML models:

  • Track which datasets are used by LLM applications
  • Link Langfuse traces to DataHub datasets
  • Understand data lineage for AI models

With Keycloak

Keycloak SSO setup is covered in the Authentication Integration section above.

Monitoring

Health Check

curl http://localhost:8080/health

Metrics

DataHub exposes metrics at: http://localhost:8080/metrics

Logs

View logs:

docker logs datahub-gms-1
docker logs datahub-frontend-react-1

Troubleshooting

Issue: Ingestion Fails

Solution:

  1. Check source connectivity
  2. Verify credentials
  3. Review ingestion logs: datahub --debug ingest -c recipe.yml

Issue: Search Not Working

Solution:

  1. Check Elasticsearch health
  2. Rebuild the search indices by running DataHub's restore-indices upgrade job

Issue: Lineage Not Showing

Solution:

  1. Ensure upstream ingestion includes lineage
  2. Check if lineage data was ingested
  3. Verify graph service is running

Production Deployment

Kubernetes Deployment

# Add DataHub Helm repo
helm repo add datahub https://helm.datahubproject.io/
 
# Install DataHub
helm install datahub datahub/datahub \
  --set global.datahub.gms.image.tag=latest \
  --set global.elasticsearch.host=elasticsearch-master

Configuration Best Practices

  1. Use external databases (PostgreSQL, Elasticsearch)
  2. Enable authentication (OIDC with Keycloak)
  3. Set up SSL/TLS
  4. Configure backups
  5. Monitor performance
  6. Scale horizontally (multiple GMS instances)

Resources

Next Steps