Databricks-Style Column Profiling in the Data Explorer

Inline column statistics with distribution sparklines in the table schema view, computed from real SQL queries rather than only from Iceberg manifest metadata.

The feature

When you select a table in the Data Explorer and click Profile, every column gets inline statistics:

Column	Type	Nulls %	Distinct	Min	Max	Distribution
id	long	0.0%	3	1	3	bar chart
name	string	0.0%	3	Alice	Charlie	bar chart
email	string	0.0%	3	alice@…	charlie@…	bar chart

Nulls % is color-coded: green under 10%, amber 10-50%, red above 50%
Distinct uses compact number formatting (1.2K, 3.5M)
Min/Max come from SQL MIN(CAST(col AS VARCHAR)) / MAX(CAST(col AS VARCHAR)), so values are compared as strings (for numeric columns, "10" sorts before "9")
Distribution is a Recharts sparkline showing top-N value frequencies with hover tooltips
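
The formatting rules above can be sketched as two small helpers. This is an illustration only: nullSeverity and formatCompact are hypothetical names, the treatment of exactly 10% and 50% as amber is read from the "10-50%" wording, and the hand-rolled compact formatter stands in for whatever the frontend actually uses.

```typescript
// Hypothetical helpers illustrating the display rules; not the actual frontend code.
type NullSeverity = "green" | "amber" | "red";

// green under 10%, amber from 10% to 50% inclusive (assumed), red above 50%.
function nullSeverity(nullPct: number): NullSeverity {
  if (nullPct < 10) return "green";
  if (nullPct <= 50) return "amber";
  return "red";
}

// Compact number formatting: 1200 -> "1.2K", 3500000 -> "3.5M".
function formatCompact(n: number): string {
  const units = [
    { limit: 1e9, suffix: "B" },
    { limit: 1e6, suffix: "M" },
    { limit: 1e3, suffix: "K" },
  ];
  for (const { limit, suffix } of units) {
    if (Math.abs(n) >= limit) {
      // One decimal place, with a trailing ".0" dropped (1000 -> "1K").
      return (n / limit).toFixed(1).replace(/\.0$/, "") + suffix;
    }
  }
  return String(n);
}
```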

Two-phase profiling

Phase 1: Single aggregation query (fast)

One SQL query computes stats for ALL columns at once:

SELECT COUNT(*) AS "_total",
  COUNT("id") AS "id_non_null",
  COUNT(DISTINCT "id") AS "id_distinct",
  MIN(CAST("id" AS VARCHAR)) AS "id_min",
  MAX(CAST("id" AS VARCHAR)) AS "id_max",
  COUNT("name") AS "name_non_null",
  -- ... for all columns
FROM "main_warehouse"."analytics_db"."user_summary"

This gives null counts, distinct counts, and min/max for every column in a single round-trip. The frontend derives the null percentage as (total - non_null) / total.
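
A query of this shape could be generated along the following lines. This is a minimal sketch, not the actual implementation: buildStatsQuery and nullPercent are hypothetical names, tableRef is assumed to be already escaped, and escapeIdentifier is the helper from the security section, repeated so the snippet stands alone.

```typescript
// Helper from the security section, repeated for self-containment.
function escapeIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Hypothetical builder: one SELECT with four aggregates per column.
// tableRef is assumed to be an already-escaped, fully-qualified name.
function buildStatsQuery(tableRef: string, columns: string[]): string {
  const parts: string[] = ['COUNT(*) AS "_total"'];
  for (const col of columns) {
    const c = escapeIdentifier(col);
    parts.push(
      `COUNT(${c}) AS ${escapeIdentifier(col + "_non_null")}`,
      `COUNT(DISTINCT ${c}) AS ${escapeIdentifier(col + "_distinct")}`,
      `MIN(CAST(${c} AS VARCHAR)) AS ${escapeIdentifier(col + "_min")}`,
      `MAX(CAST(${c} AS VARCHAR)) AS ${escapeIdentifier(col + "_max")}`
    );
  }
  return `SELECT ${parts.join(",\n  ")}\nFROM ${tableRef}`;
}

// Null percentage from the two counts; an empty table is reported as 0%.
function nullPercent(total: number, nonNull: number): number {
  if (total === 0) return 0;
  return ((total - nonNull) / total) * 100;
}
```

Guarding the total === 0 case avoids a NaN in the Nulls % column for empty tables; reporting 0% there is a choice made for this sketch.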

Phase 2: Distribution queries (parallel)

For each column (up to a maximum of 10 columns), a separate top-N query is issued, and all of them run in parallel:

SELECT CAST("name" AS VARCHAR) AS val, COUNT(*) AS cnt
FROM "main_warehouse"."analytics_db"."user_summary"
WHERE "name" IS NOT NULL
GROUP BY "name"
ORDER BY cnt DESC
LIMIT 8

The results feed into Recharts BarChart sparklines: tiny 60x20px charts with opacity encoding (more frequent values are more opaque) and hover tooltips showing value: count.
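
The parallel phase might look roughly like this. It is a sketch under stated assumptions: RunQuery stands in for whatever function the frontend uses to execute SQL, fetchDistributions and barOpacity are hypothetical names, and the 0.3 opacity floor is invented for illustration. Only the 10-column cap and LIMIT 8 come from the text above.

```typescript
// Sketch only: RunQuery, fetchDistributions and barOpacity are hypothetical names.
type Bucket = { val: string; cnt: number };
type RunQuery = (sql: string) => Promise<Bucket[]>;

const MAX_PROFILED_COLUMNS = 10; // cap described in the text
const TOP_N = 8;                 // LIMIT used in the example query
const MIN_OPACITY = 0.3;         // assumed floor so rare values stay visible

function escapeIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

function buildDistributionQuery(tableRef: string, column: string): string {
  const col = escapeIdentifier(column);
  return [
    `SELECT CAST(${col} AS VARCHAR) AS val, COUNT(*) AS cnt`,
    `FROM ${tableRef}`,
    `WHERE ${col} IS NOT NULL`,
    `GROUP BY ${col}`,
    `ORDER BY cnt DESC`,
    `LIMIT ${TOP_N}`,
  ].join("\n");
}

// Issue one top-N query per column (first 10 only) and await them together.
async function fetchDistributions(
  runQuery: RunQuery,
  tableRef: string,
  columns: string[]
): Promise<Map<string, Bucket[]>> {
  const selected = columns.slice(0, MAX_PROFILED_COLUMNS);
  const results = await Promise.all(
    selected.map((col) => runQuery(buildDistributionQuery(tableRef, col)))
  );
  const out = new Map<string, Bucket[]>();
  selected.forEach((col, i) => out.set(col, results[i] ?? []));
  return out;
}

// Opacity encoding: the most frequent value is fully opaque, others fade toward the floor.
function barOpacity(cnt: number, maxCnt: number): number {
  if (maxCnt <= 0) return 1;
  const ratio = Math.min(Math.max(cnt / maxCnt, 0), 1);
  return 1 - (1 - MIN_OPACITY) * (1 - ratio);
}
```

Promise.all rejects as soon as any single query fails. If one slow or failing column should not blank out the others, Promise.allSettled is a reasonable alternative.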

Zero-cost manifest stats

Before clicking Profile, the schema table tries to extract stats from the Iceberg manifest metadata: null_count, distinct_count, lower_bound, upper_bound per column. These are free (no query needed) but depend on whether the Iceberg writer collected column statistics. Many writers don’t, so these are often empty.

The Profile button fills the gaps with actual SQL.
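
One way to picture the gap-filling is a per-column merge of the two sources. This is a hypothetical sketch: the ColumnStats shape and mergeStats name are invented here, and letting SQL-derived values take precedence over manifest values when both exist is an assumption, not something the feature description specifies.

```typescript
// Hypothetical shape: every field is optional because either source may lack it.
interface ColumnStats {
  nullCount?: number;
  distinctCount?: number;
  min?: string;
  max?: string;
}

// Prefer profiled (SQL) values; fall back to manifest values where profiling has none.
function mergeStats(manifest: ColumnStats, profiled: ColumnStats): ColumnStats {
  return {
    nullCount: profiled.nullCount ?? manifest.nullCount,
    distinctCount: profiled.distinctCount ?? manifest.distinctCount,
    min: profiled.min ?? manifest.min,
    max: profiled.max ?? manifest.max,
  };
}
```

Using ?? rather than || matters here: a null count of 0 is a real value and must not be treated as missing.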

Security: SQL identifier escaping

Column names and table identifiers are user-derived (they come from catalog metadata), so they are escaped before being interpolated into SQL:

function escapeIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

This prevents SQL injection from column names containing quotes, semicolons, or other special characters. The same escaping applies to catalog, namespace, and table names in the fully-qualified table reference.
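
As an illustration of the escaping in use, the sketch below builds a fully-qualified reference and shows what happens to a hostile column name. qualifiedTableName is a hypothetical helper, and it assumes a single-level namespace.

```typescript
function escapeIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Hypothetical helper: escape each part separately, then join with dots.
// Assumes a single-level namespace.
function qualifiedTableName(catalog: string, namespace: string, table: string): string {
  return [catalog, namespace, table].map(escapeIdentifier).join(".");
}

console.log(qualifiedTableName("main_warehouse", "analytics_db", "user_summary"));
// prints: "main_warehouse"."analytics_db"."user_summary"

// The embedded quote is doubled, so the whole string stays one quoted identifier.
const hostile = 'name"; DROP TABLE users; --';
console.log(escapeIdentifier(hostile));
// prints: "name""; DROP TABLE users; --"
```

Each part is escaped on its own before joining, so a dot inside a table name remains part of that name and is not read as a separator.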

Stats cards

Above the schema table, four summary cards show:

Columns: total column count
Rows: from manifest total-records
Data Files: from manifest total-data-files
Format Version: the Iceberg format version

These come from the manifest metadata (zero cost) and appear immediately without clicking Profile.

Column profiling was added in April 2026 as part of the Data Explorer enhancements.