Inspector Query Costs¶
This document explains what database queries the inspector runs and how to estimate the cost based on your table structure.
Configuration¶
Control which queries run via dataface.yml or default_config.yml:
# In your project's dataface.yml inspector: collect_top_values: true # Default: true (smart filtering) top_values_limit: 10 # Default: 10 top_values_max_cardinality: 1000 # Skip columns with more distinct values # Types to SKIP for top_values (numeric, date, uuid, etc.) top_values_skip_types: - INTEGER - BIGINT - FLOAT - DOUBLE - DATE - TIMESTAMP - UUID # ... see default_config.yml for full list collect_enum_values: true # Default: true enum_cardinality_threshold: 20 # Default: 20 sample_size: 100 # Default: 100
Smart top_values filtering: By default, top_values is enabled but only runs for string-like columns (VARCHAR, TEXT, CHAR) with reasonable cardinality (< 1000 distinct values). Numeric types, dates, UUIDs, and high-cardinality columns are automatically skipped since value distributions aren't useful.
Or access programmatically:
from dataface.inspect import get_inspector_config
config = get_inspector_config()
print(config.collect_top_values) # True
print(config.top_values_skip_types) # ['INTEGER', 'BIGINT', ...]
Query Summary¶
| Query Type | Count | Default | Config Setting |
|---|---|---|---|
| Schema | 1 | ✅ On | Always runs |
| Stats scan | 1-N | ✅ On | Always runs |
| Sample data | 1 | ✅ On | sample_size |
| Top values | C | ❌ Off | collect_top_values |
| Enum values | E | ✅ On | collect_enum_values |
With defaults: Total queries = 2 + ceil(C / max_cols) + E
With top_values: Total queries = 2 + ceil(C / max_cols) + C + E
Where: - C = total number of columns - E = number of enum-like columns (distinct count ≤ 20) - max_cols = column limit per stats query (varies by database)
Detailed Breakdown¶
1. Schema Query (1 query)¶
-- Get column names and types
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'your_table'
Cost: Very fast, reads metadata only.
2. Stats Scan (1+ queries)¶
Single table scan computing aggregate stats for all columns:
SELECT
COUNT(*) as total_rows,
-- Per column (repeated for each column):
COUNT(col1) as col1__non_null_count,
COUNT(DISTINCT col1) as col1__distinct_count,
MIN(col1) as col1__min,
MAX(col1) as col1__max,
AVG(col1) as col1__mean, -- numeric only
STDDEV(col1) as col1__stddev, -- numeric only
-- ... more columns
FROM your_table
Cost: Full table scan. For wide tables, may be split into multiple queries.
Column limits per query (to avoid SELECT column limits):
| Database | Max Columns per Query |
|---|---|
| DuckDB | 1400 |
| BigQuery | 1400 |
| SQLite | 1400 |
| MySQL | 500 |
| SQL Server | 500 |
| PostgreSQL | 200 |
| Redshift | 200 |
| Snowflake | 140 |
| Oracle | 140 |
Example: A PostgreSQL table with 450 columns = ceil(450/200) = 3 stats queries
3. Sample Data Query (1 query)¶
Cost: Very fast, reads 100 rows for semantic type detection.
4. Top Values Queries (C queries)¶
One query per column to get value frequencies:
Cost: Each requires a GROUP BY scan. This is the most expensive part for tables with many columns.
5. Enum Values Queries (E queries)¶
Only runs for columns detected as enum-like (distinct count ≤ 20):
Cost: Fast since these columns have low cardinality.
Cost Calculator¶
Formula¶
Total Queries = 1 (schema)
+ ceil(C / max_cols) (stats scans)
+ 1 (sample)
+ C (top values)
+ E (enum values)
Examples¶
Small Table: 10 columns, 2 enums, PostgreSQL¶
Medium Table: 50 columns, 5 enums, DuckDB¶
Wide Table: 500 columns, 20 enums, PostgreSQL¶
Performance Tips¶
-
Wide tables are expensive - The top_values queries (one per column) dominate cost for tables with many columns.
-
DuckDB is most efficient - Higher column limits mean fewer stats queries, plus fast in-memory execution.
-
Enum detection adds queries - If many columns have low cardinality, more enum queries run.
-
Consider column filtering - Future: ability to profile only specific columns.
Future Additions¶
Issue #281 will add:
| Query Type | Count | Description |
|---|---|---|
| Histograms | N_numeric | Binned distribution for numeric columns |
| Date distributions | N_date | Monthly/weekly counts for date columns |
This will increase queries by approximately: - N_numeric = number of numeric columns (INTEGER, FLOAT, DOUBLE, etc.) - N_date = number of date/timestamp columns