Databricks-Style Column Profiling in the Data Explorer
Inline column statistics with distribution sparklines in the table schema view. Computed from real SQL queries, not just Iceberg manifest metadata.
The feature
When you select a table in the Data Explorer and click Profile, every column gets inline statistics:
| Column | Type | Nulls % | Distinct | Min | Max | Distribution |
|---|---|---|---|---|---|---|
| id | long | 0.0% | 3 | 1 | 3 | bar chart |
| name | string | 0.0% | 3 | Alice | Charlie | bar chart |
| | string | 0.0% | 3 | alice@… | charlie@… | bar chart |
- Nulls % is color-coded: green under 10%, amber 10-50%, red above 50%
- Distinct uses compact number formatting (1.2K, 3.5M)
- Min/Max come from SQL `MIN(CAST(col AS VARCHAR))` / `MAX(...)`
- Distribution is a Recharts sparkline showing top-N value frequencies with hover tooltips
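The color bands and compact number formatting described above could be sketched roughly as follows. The function names `nullSeverity` and `formatCompact` are hypothetical, and the handling of the exact 10% and 50% boundaries is an assumption, since the text only gives the ranges.

```typescript
// Hypothetical helpers; names and boundary handling are illustrative assumptions.
type NullSeverity = "green" | "amber" | "red";

// Map a null percentage (0-100) to the color bands: under 10, 10-50, above 50.
function nullSeverity(nullPct: number): NullSeverity {
  if (nullPct < 10) return "green";
  if (nullPct <= 50) return "amber";
  return "red";
}

// Compact formatting for distinct counts, e.g. 1200 -> "1.2K", 3500000 -> "3.5M".
function formatCompact(n: number): string {
  const units: Array<[number, string]> = [
    [1e9, "B"],
    [1e6, "M"],
    [1e3, "K"],
  ];
  for (const [threshold, suffix] of units) {
    if (Math.abs(n) >= threshold) {
      // One decimal place, dropping a trailing ".0" (1000 -> "1K").
      return (n / threshold).toFixed(1).replace(/\.0$/, "") + suffix;
    }
  }
  return String(n);
}
```

A hand-rolled formatter is used here only to keep the sketch dependency-free; `Intl.NumberFormat` with `notation: "compact"` is a reasonable alternative.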
Two-phase profiling
Phase 1: Single aggregation query (fast)
One SQL query computes stats for ALL columns at once:

    SELECT
      COUNT(*) AS "_total",
      COUNT("id") AS "id_non_null",
      COUNT(DISTINCT "id") AS "id_distinct",
      MIN(CAST("id" AS VARCHAR)) AS "id_min",
      MAX(CAST("id" AS VARCHAR)) AS "id_max",
      COUNT("name") AS "name_non_null",
      -- ... for all columns
    FROM "main_warehouse"."analytics_db"."user_summary"

This gives null counts, distinct counts, and min/max for every column in a single round-trip. The frontend derives the null percentage as (total - non_null) / total.
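As a rough illustration of how such a query might be assembled, the sketch below builds the Phase 1 SELECT from a list of column names and computes the null percentage from the returned counts. `buildStatsQuery` and `nullPercent` are hypothetical names, not taken from the actual implementation; the alias scheme (`<column>_non_null` and so on) follows the query shown above.

```typescript
// Same escaping rule as the article's escapeIdentifier helper.
function escapeIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Hypothetical builder: one aggregation query covering every column.
// tableRef is assumed to be an already-escaped, fully-qualified name.
function buildStatsQuery(tableRef: string, columns: string[]): string {
  const selects = ['COUNT(*) AS "_total"'];
  for (const col of columns) {
    const c = escapeIdentifier(col);
    selects.push(
      `COUNT(${c}) AS ${escapeIdentifier(col + "_non_null")}`,
      `COUNT(DISTINCT ${c}) AS ${escapeIdentifier(col + "_distinct")}`,
      `MIN(CAST(${c} AS VARCHAR)) AS ${escapeIdentifier(col + "_min")}`,
      `MAX(CAST(${c} AS VARCHAR)) AS ${escapeIdentifier(col + "_max")}`
    );
  }
  return `SELECT ${selects.join(", ")} FROM ${tableRef}`;
}

// Null percentage from the aggregate counts; an empty table reports 0.
function nullPercent(total: number, nonNull: number): number {
  if (total === 0) return 0;
  return ((total - nonNull) / total) * 100;
}
```

One thing to keep in mind with this query shape: because values are cast to VARCHAR before MIN/MAX, the comparison is a string comparison, so numeric columns are ordered lexicographically ("10" sorts before "9").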
Phase 2: Distribution queries (parallel)
For each column (max 10), a separate top-N query runs in parallel:

    SELECT CAST("name" AS VARCHAR) AS val, COUNT(*) AS cnt
    FROM "main_warehouse"."analytics_db"."user_summary"
    WHERE "name" IS NOT NULL
    GROUP BY "name"
    ORDER BY cnt DESC
    LIMIT 8

The results feed into Recharts BarChart sparklines: tiny 60x20px charts with opacity encoding (more frequent = more opaque) and hover tooltips showing value: count.
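The parallel fan-out could look something like the sketch below. It assumes a `runQuery` function that executes SQL and resolves to rows of `{ val, cnt }`; that function, along with `buildTopNQuery`, `profileDistributions`, and `withOpacity`, is a hypothetical name introduced for illustration. The opacity rule (count divided by the largest count) is one plausible reading of "more frequent = more opaque", not a confirmed detail.

```typescript
interface ValueCount {
  val: string;
  cnt: number;
}

// Assumed query runner; the real client API is not shown in the article.
type RunQuery = (sql: string) => Promise<ValueCount[]>;

function escapeIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Top-N value frequencies for one column (N = 8 in the article's query).
function buildTopNQuery(tableRef: string, column: string, limit = 8): string {
  const c = escapeIdentifier(column);
  return [
    `SELECT CAST(${c} AS VARCHAR) AS val, COUNT(*) AS cnt`,
    `FROM ${tableRef}`,
    `WHERE ${c} IS NOT NULL`,
    `GROUP BY ${c}`,
    `ORDER BY cnt DESC`,
    `LIMIT ${limit}`,
  ].join("\n");
}

// Run one top-N query per column in parallel, capped at maxColumns.
async function profileDistributions(
  tableRef: string,
  columns: string[],
  runQuery: RunQuery,
  maxColumns = 10
): Promise<Record<string, ValueCount[]>> {
  const selected = columns.slice(0, maxColumns);
  const results = await Promise.all(
    selected.map((col) => runQuery(buildTopNQuery(tableRef, col)))
  );
  const out: Record<string, ValueCount[]> = {};
  selected.forEach((col, i) => {
    out[col] = results[i];
  });
  return out;
}

// Opacity encoding: the most frequent value gets 1, others scale down.
function withOpacity(rows: ValueCount[]): Array<ValueCount & { opacity: number }> {
  const max = Math.max(...rows.map((r) => r.cnt), 1);
  return rows.map((r) => ({ ...r, opacity: r.cnt / max }));
}
```

`Promise.all` rejects as soon as any single query fails; if one bad column should not blank out every sparkline, `Promise.allSettled` would be the more forgiving choice.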
Zero-cost manifest stats
Before clicking Profile, the schema table tries to extract stats from the Iceberg manifest metadata: null_count, distinct_count, lower_bound, upper_bound per column. These are free (no query needed) but depend on whether the Iceberg writer collected column statistics. Many writers don’t, so these are often empty.
The Profile button fills the gaps with actual SQL.
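One way to picture the relationship between the two sources is a simple merge of per-column stats. The `ColumnStats` shape and `mergeStats` function below are hypothetical, and the precedence rule (a profiled value wins when present, otherwise the manifest value is kept) is an assumption rather than documented behavior.

```typescript
// Hypothetical per-column stats shape; every field may be missing.
interface ColumnStats {
  nullCount?: number;
  distinctCount?: number;
  min?: string;
  max?: string;
}

// Assumed rule: prefer the SQL-profiled value, fall back to the manifest value.
function mergeStats(manifest: ColumnStats, profiled: ColumnStats): ColumnStats {
  return {
    nullCount: profiled.nullCount ?? manifest.nullCount,
    distinctCount: profiled.distinctCount ?? manifest.distinctCount,
    min: profiled.min ?? manifest.min,
    max: profiled.max ?? manifest.max,
  };
}
```

Using `??` rather than `||` matters here: a null count of 0 is a real value and should not be treated as missing.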
Security: SQL identifier escaping
Column names and table identifiers are user-derived (from catalog metadata). We escape them properly:

    function escapeIdentifier(name: string): string {
      return '"' + name.replace(/"/g, '""') + '"';
    }

This prevents SQL injection from column names containing quotes, semicolons, or other special characters. The same escaping applies to catalog, namespace, and table names in the fully-qualified table reference.
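To make the effect concrete, the sketch below applies the same escaping to a hostile column name and to a three-part table reference. `qualifiedTableName` is a hypothetical helper name, and it assumes a single-level namespace.

```typescript
function escapeIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Hypothetical helper: escape each part, then join with dots.
function qualifiedTableName(catalog: string, namespace: string, table: string): string {
  return [catalog, namespace, table].map(escapeIdentifier).join(".");
}

// A column name that tries to break out of the quoted identifier.
const hostile = 'x"; DROP TABLE users; --';
const escaped = escapeIdentifier(hostile);
// escaped is now: "x""; DROP TABLE users; --"
// The embedded quote is doubled, so the whole string stays one identifier.

const tableRef = qualifiedTableName("main_warehouse", "analytics_db", "user_summary");
// tableRef is now: "main_warehouse"."analytics_db"."user_summary"
```

Semicolons and comment markers need no special handling: once the name sits inside a double-quoted identifier, only an unescaped double quote could end it.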
Stats cards
Above the schema table, four summary cards show:
- Columns: total count
- Rows: from manifest `total-records`
- Data Files: from manifest `total-data-files`
- Format Version: for the Iceberg format
These come from the manifest metadata (zero cost) and appear immediately without clicking Profile.
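A minimal sketch of how the four cards might be derived from already-loaded metadata follows. The `TableMetadataSketch` shape, the `buildStatCards` name, and the "n/a" placeholder are all assumptions for illustration; the summary values are modeled as strings keyed by `total-records` and `total-data-files`, as named above.

```typescript
// Hypothetical view of the metadata the Data Explorer already has in hand.
interface TableMetadataSketch {
  columnCount: number;
  formatVersion: number;
  // e.g. { "total-records": "3", "total-data-files": "1" }
  summary: Record<string, string | undefined>;
}

interface StatCard {
  label: string;
  value: string;
}

// Build the four summary cards without issuing any query.
function buildStatCards(meta: TableMetadataSketch): StatCard[] {
  // Missing summary keys render as a placeholder instead of failing.
  const read = (key: string): string => meta.summary[key] ?? "n/a";
  return [
    { label: "Columns", value: String(meta.columnCount) },
    { label: "Rows", value: read("total-records") },
    { label: "Data Files", value: read("total-data-files") },
    { label: "Format Version", value: String(meta.formatVersion) },
  ];
}
```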
Column profiling was added in April 2026 as part of the Data Explorer enhancements.