Databricks-Style Column Profiling in the Data Explorer

Inline column statistics with distribution sparklines in the table schema view, computed from real SQL queries rather than only Iceberg manifest metadata.


The feature

When you select a table in the Data Explorer and click Profile, every column gets inline statistics:

Column  Type    Nulls %  Distinct  Min      Max        Distribution
id      long    0.0%     3         1        3          bar chart
name    string  0.0%     3         Alice    Charlie    bar chart
email   string  0.0%     3         alice@…  charlie@…  bar chart

  • Nulls % is color-coded: green under 10%, amber 10-50%, red above 50%
  • Distinct uses compact number formatting (1.2K, 3.5M)
  • Min/Max come from SQL MIN(CAST(col AS VARCHAR)) / MAX(...), so values are compared as strings (a numeric 10 sorts before 9)
  • Distribution is a Recharts sparkline showing top-N value frequencies with hover tooltips
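
The color thresholds and compact formatting described above can be sketched as two small helpers. This is a minimal illustration, not the Data Explorer's actual code: the names nullSeverity and formatCompact, the treatment of exactly 10% and 50% as amber, and the one-decimal rounding are all assumptions.

```typescript
// Hypothetical helpers; names, boundary handling, and rounding are assumptions.
type NullSeverity = "green" | "amber" | "red";

// Thresholds from the list above: under 10% green, 10-50% amber, above 50% red.
function nullSeverity(nullPercent: number): NullSeverity {
  if (nullPercent < 10) return "green";
  if (nullPercent <= 50) return "amber";
  return "red";
}

// Compact formatting with one decimal place, dropping a trailing ".0".
function formatCompact(n: number): string {
  const units: Array<[number, string]> = [
    [1e9, "B"],
    [1e6, "M"],
    [1e3, "K"],
  ];
  for (const [threshold, suffix] of units) {
    if (Math.abs(n) >= threshold) {
      return (n / threshold).toFixed(1).replace(/\.0$/, "") + suffix;
    }
  }
  return String(n);
}
```

With these definitions, formatCompact(1200) yields "1.2K" and formatCompact(3500000) yields "3.5M", matching the examples in the list.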

Two-phase profiling

Phase 1: Single aggregation query (fast)

One SQL query computes stats for ALL columns at once:

SELECT COUNT(*) AS "_total",
       COUNT("id") AS "id_non_null",
       COUNT(DISTINCT "id") AS "id_distinct",
       MIN(CAST("id" AS VARCHAR)) AS "id_min",
       MAX(CAST("id" AS VARCHAR)) AS "id_max",
       COUNT("name") AS "name_non_null",
       -- ... for all columns
FROM "main_warehouse"."analytics_db"."user_summary"

This gives null counts, distinct counts, and min/max for every column in a single round-trip. The frontend calculates the null percentage as (total - non_null) / total.
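
To make the shape of the Phase 1 query concrete, here is a rough sketch of how it could be assembled from a column list, along with the null percentage calculation. The function names buildStatsQuery and nullPercent are hypothetical, the alias scheme simply follows the example query above, and returning 0 for an empty table is an assumption.

```typescript
// Same quoting rule as the post's escapeIdentifier: double any embedded quote.
function escapeIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Builds one aggregation query covering every column (hypothetical helper).
// `qualifiedTable` is expected to be escaped already.
function buildStatsQuery(qualifiedTable: string, columns: string[]): string {
  const parts: string[] = ['COUNT(*) AS "_total"'];
  for (const col of columns) {
    const c = escapeIdentifier(col);
    parts.push(
      `COUNT(${c}) AS ${escapeIdentifier(col + "_non_null")}`,
      `COUNT(DISTINCT ${c}) AS ${escapeIdentifier(col + "_distinct")}`,
      `MIN(CAST(${c} AS VARCHAR)) AS ${escapeIdentifier(col + "_min")}`,
      `MAX(CAST(${c} AS VARCHAR)) AS ${escapeIdentifier(col + "_max")}`,
    );
  }
  return `SELECT ${parts.join(",\n       ")}\nFROM ${qualifiedTable}`;
}

// Null percentage from the total and non-null counts.
// Returning 0 for an empty table is an assumption to avoid dividing by zero.
function nullPercent(total: number, nonNull: number): number {
  return total === 0 ? 0 : ((total - nonNull) / total) * 100;
}
```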

Phase 2: Distribution queries (parallel)

For each column (max 10), a separate top-N query runs in parallel:

SELECT CAST("name" AS VARCHAR) AS val, COUNT(*) AS cnt
FROM "main_warehouse"."analytics_db"."user_summary"
WHERE "name" IS NOT NULL
GROUP BY "name"
ORDER BY cnt DESC
LIMIT 8
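
One way the parallel fan-out could look is sketched below. The runQuery callback stands in for whatever SQL client the frontend uses and is purely hypothetical, as are the function names. Taking the first 10 columns in schema order is also an assumption, since the post only states the cap.

```typescript
// Same quoting rule as the post's escapeIdentifier.
function escapeIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

const MAX_PROFILED_COLUMNS = 10; // the "max 10" cap from the text
const TOP_N = 8; // LIMIT used in the example query

interface ValueCount {
  val: string;
  cnt: number;
}

// Hypothetical query runner; the real client API is not shown in the post.
type RunQuery = (sql: string) => Promise<ValueCount[]>;

function buildDistributionQuery(qualifiedTable: string, column: string): string {
  const c = escapeIdentifier(column);
  return [
    `SELECT CAST(${c} AS VARCHAR) AS val, COUNT(*) AS cnt`,
    `FROM ${qualifiedTable}`,
    `WHERE ${c} IS NOT NULL`,
    `GROUP BY ${c}`,
    `ORDER BY cnt DESC`,
    `LIMIT ${TOP_N}`,
  ].join("\n");
}

// Starts all queries at once and waits for every result.
async function profileDistributions(
  qualifiedTable: string,
  columns: string[],
  runQuery: RunQuery,
): Promise<Record<string, ValueCount[]>> {
  const selected = columns.slice(0, MAX_PROFILED_COLUMNS);
  const results = await Promise.all(
    selected.map((col) => runQuery(buildDistributionQuery(qualifiedTable, col))),
  );
  const byColumn: Record<string, ValueCount[]> = {};
  selected.forEach((col, i) => {
    byColumn[col] = results[i];
  });
  return byColumn;
}
```

Note that Promise.all rejects as soon as any single query fails, so a production version would probably want per-column error handling.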

The results feed into Recharts BarChart sparklines: tiny 60x20px charts with opacity encoding (more frequent = more opaque) and hover tooltips showing value: count.
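
The opacity encoding can be sketched as a pure mapping from counts to bar data. The helper name toSparkBars, the SparkBar shape, and the 0.3 opacity floor are assumptions for illustration; the post does not say how opacity is actually scaled. The resulting opacity value could then be passed to each bar when rendering, for example as a fill opacity.

```typescript
interface ValueCount {
  val: string;
  cnt: number;
}

// Hypothetical shape for one sparkline bar.
interface SparkBar extends ValueCount {
  opacity: number; // between MIN_OPACITY and 1
  label: string; // tooltip text in "value: count" form
}

const MIN_OPACITY = 0.3; // assumed floor so rare values stay visible

// Scales opacity linearly with frequency relative to the most common value.
function toSparkBars(rows: ValueCount[]): SparkBar[] {
  const max = rows.reduce((m, r) => Math.max(m, r.cnt), 0);
  return rows.map((r) => ({
    val: r.val,
    cnt: r.cnt,
    opacity:
      max === 0 ? MIN_OPACITY : MIN_OPACITY + (1 - MIN_OPACITY) * (r.cnt / max),
    label: `${r.val}: ${r.cnt}`,
  }));
}
```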

Zero-cost manifest stats

Before clicking Profile, the schema table tries to extract stats from the Iceberg manifest metadata: null_count, distinct_count, lower_bound, upper_bound per column. These are free (no query needed) but depend on whether the Iceberg writer collected column statistics. Many writers don’t, so these are often empty.

The Profile button fills the gaps with actual SQL.
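
Filling the gaps can be pictured as a per-column merge of the two sources. The sketch below reuses the field names from the paragraph above, but the ColumnStats type and mergeStats helper are hypothetical. It also assumes that a SQL-profiled value wins when both sources have one, which the post does not state.

```typescript
// Field names follow the manifest stats listed above; every field is optional
// because a writer may not have collected it.
interface ColumnStats {
  null_count?: number;
  distinct_count?: number;
  lower_bound?: string;
  upper_bound?: string;
}

// Hypothetical merge: prefer the SQL-profiled value, fall back to the manifest.
function mergeStats(manifest: ColumnStats, profiled: ColumnStats): ColumnStats {
  return {
    null_count: profiled.null_count ?? manifest.null_count,
    distinct_count: profiled.distinct_count ?? manifest.distinct_count,
    lower_bound: profiled.lower_bound ?? manifest.lower_bound,
    upper_bound: profiled.upper_bound ?? manifest.upper_bound,
  };
}
```

Using ?? rather than || matters here, because a null_count of 0 is a real value and should not be treated as missing.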

Security: SQL identifier escaping

Column names and table identifiers are user-derived (from catalog metadata). We escape them properly:

function escapeIdentifier(name: string): string {
  // Wrap in double quotes and double any embedded double quote
  return '"' + name.replace(/"/g, '""') + '"';
}

This prevents SQL injection from column names containing quotes, semicolons, or other special characters. The same escaping applies to catalog, namespace, and table names in the fully-qualified table reference.
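
As a quick illustration of what the escaping does to a hostile name, and how the fully-qualified reference might be assembled, consider the sketch below. The qualifiedTableName helper is a hypothetical name; only escapeIdentifier comes from the post.

```typescript
function escapeIdentifier(name: string): string {
  // Wrap in double quotes and double any embedded double quote
  return '"' + name.replace(/"/g, '""') + '"';
}

// Hypothetical helper: escape each part separately, then join with dots.
function qualifiedTableName(catalog: string, namespace: string, table: string): string {
  return [catalog, namespace, table].map(escapeIdentifier).join(".");
}

// A column name that tries to break out of the identifier stays one identifier.
const hostile = 'id"; DROP TABLE users; --';
const escaped = escapeIdentifier(hostile);
// escaped is: "id""; DROP TABLE users; --"

const tableRef = qualifiedTableName("main_warehouse", "analytics_db", "user_summary");
// tableRef is: "main_warehouse"."analytics_db"."user_summary"
```

Because the embedded quote is doubled, the SQL parser reads the whole string as a single (odd-looking) identifier instead of ending the identifier early.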

Stats cards

Above the schema table, four summary cards show:

  • Columns: total count
  • Rows: from manifest total-records
  • Data Files: from manifest total-data-files
  • Format Version: the Iceberg format version

These come from the manifest metadata (zero cost) and appear immediately without clicking Profile.
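
A rough sketch of how the four cards could be derived from metadata is shown below. The buildSummaryCards helper, the Card shape, and the "n/a" fallback are assumptions. The total-records and total-data-files keys are the ones named above, and summary values are treated as strings.

```typescript
// Hypothetical card shape for the summary row.
interface Card {
  label: string;
  value: string;
}

// `summary` holds the manifest summary properties named in the list above.
function buildSummaryCards(
  columnCount: number,
  summary: { [key: string]: string | undefined },
  formatVersion: number,
): Card[] {
  // Assumed fallback when a writer did not record the property.
  const orNA = (v: string | undefined): string => (v === undefined ? "n/a" : v);
  return [
    { label: "Columns", value: String(columnCount) },
    { label: "Rows", value: orNA(summary["total-records"]) },
    { label: "Data Files", value: orNA(summary["total-data-files"]) },
    { label: "Format Version", value: String(formatVersion) },
  ];
}
```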


Column profiling was added in April 2026 as part of the Data Explorer enhancements.
