DataHub Integration Guide
This guide covers the installation and configuration of DataHub as part of your Oversight platform.
Overview
DataHub is a modern data catalog that helps you discover, understand, and govern your data. As part of Oversight, it provides the metadata layer that connects all your data sources.
Installation
Prerequisites
- Python 3.8 or later
- Docker and Docker Compose
- At least 8GB RAM
- 10GB free disk space
Quick Start Installation
The easiest way to get started with DataHub is using the CLI:
```shell
# Install Python dependencies
python3 -m pip install --upgrade pip wheel setuptools

# Install DataHub CLI
python3 -m pip install --upgrade acryl-datahub

# Quick start with Docker
datahub docker quickstart
```

This command will:
- Pull all required Docker images
- Start DataHub services (GMS, Frontend, Elasticsearch, PostgreSQL, Kafka)
- Initialize the database
- Load sample metadata
Access DataHub
Once installation completes, access DataHub at: http://localhost:9002
Default credentials:
- Username: datahub
- Password: datahub
Architecture
DataHub consists of several components:
- DataHub Frontend: React-based web UI
- DataHub GMS (Graph Metadata Service): Core metadata API
- Elasticsearch: Search and indexing
- PostgreSQL: Primary metadata store
- Kafka: Event streaming
- Schema Registry: Avro schema management
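Once these containers are up, each component listens on its own port. The sketch below checks reachability of each service; the port numbers are the usual quickstart defaults (an assumption — adjust them to match your deployment):

```python
import socket

# Default ports for the quickstart services (assumed; adjust as needed).
SERVICES = {
    "frontend": 9002,
    "gms": 8080,
    "elasticsearch": 9200,
    "postgres": 5432,
    "kafka": 9092,
}

def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

def check_services(host: str = "localhost") -> dict:
    """Map each service name to whether its default port is reachable."""
    return {name: is_listening(host, port) for name, port in SERVICES.items()}

if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'down'}")
```

This only confirms that something is listening on each port, not that the service is healthy; see the Monitoring section for the GMS health endpoint.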
Configuration
Basic Configuration
DataHub configuration is managed through environment variables and configuration files.
Environment Variables
Create a .datahub/datahub.properties file:
```properties
# DataHub host
datahub.host=localhost:9002

# Authentication
auth.enabled=true
auth.type=oidc

# Search
elasticsearch.host=elasticsearch:9200

# Database
db.host=postgres:5432
db.name=datahub
```

Authentication Integration
Keycloak Integration
Configure DataHub to use Keycloak for SSO:
```properties
# datahub-gms/env.properties
auth.oidc.enabled=true
auth.oidc.clientId=oversight-app
auth.oidc.clientSecret=your-keycloak-secret
auth.oidc.discoveryUri=http://localhost:8080/realms/oversight/.well-known/openid-configuration
auth.oidc.userNameClaim=preferred_username
auth.oidc.userNameClaimRegex=.*
```

Restart DataHub services:

```shell
datahub docker restart
```

Data Ingestion
DataHub supports ingestion from 50+ data sources.
UI-Based Ingestion
- Navigate to Ingestion → Sources
- Click Create new source
- Select your data source type
- Configure connection details
- Test connection
- Set ingestion schedule
- Click Save & Run
CLI-Based Ingestion
Install Source Connector
```shell
# Example: Install MySQL connector
pip install 'acryl-datahub[mysql]'

# Example: Install Snowflake connector
pip install 'acryl-datahub[snowflake]'
```

Create Recipe File
Create a recipe.yml file:
```yaml
# MySQL example
source:
  type: mysql
  config:
    host_port: localhost:3306
    database: mydb
    username: ${MYSQL_USER}
    password: ${MYSQL_PASSWORD}
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
    token: ${DATAHUB_TOKEN}
```

Run Ingestion

```shell
datahub ingest -c recipe.yml
```

Supported Data Sources
Databases
- MySQL, PostgreSQL, Oracle, SQL Server
- MongoDB, Cassandra, DynamoDB
- MariaDB, Teradata, Vertica
Data Warehouses
- Snowflake
- BigQuery
- Redshift
- Azure Synapse
- Databricks
Data Lakes
- S3
- Azure Data Lake Storage (ADLS)
- Google Cloud Storage (GCS)
- HDFS
BI & Visualization
- Tableau
- Looker
- PowerBI
- Superset
- Metabase
Data Processing
- Apache Spark
- Apache Airflow
- dbt
- Apache Flink
- Kafka
ML Platforms
- SageMaker
- MLflow
- Kubeflow
Common Ingestion Examples
Snowflake Ingestion
```yaml
source:
  type: snowflake
  config:
    account_id: abc123.us-east-1
    warehouse: COMPUTE_WH
    username: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    role: DATAHUB_ROLE
    database_pattern:
      allow:
        - "PROD_.*"
    schema_pattern:
      deny:
        - ".*_TEMP"
```

PostgreSQL Ingestion
```yaml
source:
  type: postgres
  config:
    host_port: localhost:5432
    database: production
    username: ${POSTGRES_USER}
    password: ${POSTGRES_PASSWORD}
    include_tables: true
    include_views: true
    profiling:
      enabled: true
```

S3 Data Lake Ingestion
```yaml
source:
  type: s3
  config:
    aws_access_key_id: ${AWS_ACCESS_KEY}
    aws_secret_access_key: ${AWS_SECRET_KEY}
    aws_region: us-east-1
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
        table_name: my_dataset
```

dbt Integration
```yaml
source:
  type: dbt
  config:
    manifest_path: /path/to/target/manifest.json
    catalog_path: /path/to/target/catalog.json
    sources_path: /path/to/target/sources.json
    target_platform: snowflake
```

Features & Usage
Data Discovery
Use the search bar to find:
- Datasets
- Dashboards
- Data pipelines
- ML models
- Users
Search syntax:
- name:customer* - Find by name
- platform:snowflake - Filter by platform
- tag:pii - Filter by tag
- owner:john.doe - Filter by owner
Data Lineage
View how data flows through your organization:
- Navigate to a dataset
- Click on Lineage tab
- View upstream and downstream dependencies
- Click on nodes to explore related assets
Glossary Terms
Create business glossary:
- Navigate to Govern → Glossary
- Click Create Term
- Define:
- Term name
- Definition
- Related terms
- Owners
- Link to datasets
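Glossaries can also be maintained as code and ingested with DataHub's business-glossary source. A minimal recipe sketch (the file path and token are placeholders; see the connector's documentation for the full glossary YAML format):

```yaml
# glossary-recipe.yml: ingest a glossary defined in YAML
source:
  type: datahub-business-glossary
  config:
    file: ./glossary.yml
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
    token: ${DATAHUB_TOKEN}
```

Keeping glossary terms in version control makes term changes reviewable, while terms created in the UI remain editable there.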
Domains
Organize data by business domains:
- Navigate to Govern → Domains
- Click Create Domain
- Set domain name (e.g., “Marketing”, “Finance”)
- Assign datasets to domains
- Set domain owners
Tags
Add metadata tags:
- Navigate to dataset
- Click Add Tags
- Create or select tags (e.g., “PII”, “Critical”)
- Tags appear in search and can trigger policies
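Tags can also be applied in bulk at ingestion time via a recipe transformer, which is useful when every dataset from a source should carry the same tag. A sketch (the transformer name comes from DataHub's ingestion transformers; verify it against your CLI version):

```yaml
# Append to an ingestion recipe to tag everything the source emits
transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:PII"
```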
Data Quality
Monitor data quality:
- Navigate to Govern → Quality
- Create assertions:
- Freshness checks
- Volume checks
- Schema validation
- Custom SQL checks
- Set up alerts
Data Contracts
Define data expectations:
```yaml
# Example data contract
dataset: urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.sales.orders,PROD)
contract:
  freshness:
    maxAge: 24h
  schema:
    fields:
      - name: order_id
        type: integer
        nullable: false
      - name: customer_id
        type: integer
        nullable: false
  quality:
    - type: sql
      statement: "SELECT COUNT(*) FROM orders WHERE amount < 0"
      operator: EQUALS
      value: 0
```

GraphQL API
DataHub provides a powerful GraphQL API:
```graphql
# Query datasets
query {
  search(
    input: {
      type: DATASET
      query: "customers"
      start: 0
      count: 10
    }
  ) {
    start
    count
    total
    searchResults {
      entity {
        urn
        type
        ... on Dataset {
          name
          description
          platform {
            name
          }
        }
      }
    }
  }
}
```

Access the GraphiQL interface at: http://localhost:9002/api/graphiql
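The same search can be issued over plain HTTP from a script. A minimal stdlib sketch; the endpoint path (/api/v2/graphql on the frontend) and the Bearer-token header are assumptions based on the quickstart defaults, so verify them against your deployment:

```python
import json
import urllib.request

# Assumed endpoint: the frontend proxies GraphQL at /api/v2/graphql.
GRAPHQL_URL = "http://localhost:9002/api/v2/graphql"

SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""

def build_request(query_text: str, count: int = 10) -> dict:
    """Build the JSON body for a dataset search request."""
    return {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {
                "type": "DATASET",
                "query": query_text,
                "start": 0,
                "count": count,
            }
        },
    }

def run_search(query_text: str, token: str) -> dict:
    """POST the search to DataHub, authenticating with a personal access token."""
    req = urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(build_request(query_text)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Personal access tokens can be generated from the DataHub UI when token-based authentication is enabled.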
Python SDK
Programmatic access with Python:
```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Create emitter
emitter = DatahubRestEmitter("http://localhost:8080")

# Create dataset URN
dataset_urn = make_dataset_urn(
    platform="snowflake",
    name="prod.sales.orders",
)

# Update dataset properties
properties = DatasetPropertiesClass(
    description="Customer orders table",
    customProperties={
        "owner": "sales-team",
        "criticality": "high",
    },
)

# Emit metadata
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=properties,
    )
)
```

Integration with Oversight
With MinIO
Store large metadata artifacts in MinIO:
```yaml
# Configure DataHub to use MinIO
storage:
  type: s3
  s3:
    endpoint: http://localhost:9090
    accessKey: minio
    secretKey: miniosecret
    bucket: datahub-artifacts
```

With Langfuse
Cross-reference ML models:
- Track which datasets are used by LLM applications
- Link Langfuse traces to DataHub datasets
- Understand data lineage for AI models
With Keycloak
Keycloak SSO is covered in the Authentication Integration section above.
Monitoring
Health Check
```shell
curl http://localhost:8080/health
```

Metrics
DataHub exposes metrics at: http://localhost:8080/metrics
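In scripts, for example when gating an ingestion run until the stack is ready, the health endpoint can be polled. A small sketch assuming the GMS default port:

```python
import time
import urllib.error
import urllib.request

def wait_for_healthy(url: str = "http://localhost:8080/health",
                     timeout: float = 120.0,
                     interval: float = 5.0) -> bool:
    """Poll the GMS health endpoint until it returns HTTP 200 or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Service not up yet; retry after a pause.
        time.sleep(interval)
    return False
```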
Logs
View logs:
```shell
docker logs datahub-gms-1
docker logs datahub-frontend-react-1
```

Troubleshooting
Issue: Ingestion Fails
Solution:
- Check source connectivity
- Verify credentials
- Review ingestion logs:
```shell
datahub ingest -c recipe.yml --debug
```
Issue: Search Not Working
Solution:
- Check Elasticsearch health
- Reindex:
```shell
datahub index rebuild
```
Issue: Lineage Not Showing
Solution:
- Ensure upstream ingestion includes lineage
- Check if lineage data was ingested
- Verify graph service is running
Production Deployment
Kubernetes Deployment
```shell
# Add DataHub Helm repo
helm repo add datahub https://helm.datahubproject.io/

# Install DataHub
helm install datahub datahub/datahub \
  --set global.datahub.gms.image.tag=latest \
  --set global.elasticsearch.host=elasticsearch-master
```

Configuration Best Practices
- Use external databases (PostgreSQL, Elasticsearch)
- Enable authentication (OIDC with Keycloak)
- Set up SSL/TLS
- Configure backups
- Monitor performance
- Scale horizontally (multiple GMS instances)
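Pointing the chart at external dependencies is done through its values file. The keys below are illustrative only; verify the exact names against the chart's values.yaml for your chart version:

```yaml
# values.yaml sketch: external dependencies (key names assumed; check the chart)
global:
  sql:
    datasource:
      host: "postgres.example.internal:5432"
  elasticsearch:
    host: "elasticsearch-master"
    port: "9200"
  kafka:
    bootstrap:
      server: "kafka-broker:9092"
```

Apply it with `helm install datahub datahub/datahub -f values.yaml` instead of passing individual `--set` flags.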