
Inspector Query Costs

This document explains what database queries the inspector runs and how to estimate the cost based on your table structure.

Configuration

Control which queries run via dataface.yml or default_config.yml:

# In your project's dataface.yml
inspector:
  collect_top_values: true       # Default: true (smart filtering)
  top_values_limit: 10           # Default: 10
  top_values_max_cardinality: 1000  # Skip columns with more distinct values

  # Types to SKIP for top_values (numeric, date, uuid, etc.)
  top_values_skip_types:
    - INTEGER
    - BIGINT
    - FLOAT
    - DOUBLE
    - DATE
    - TIMESTAMP
    - UUID
    # ... see default_config.yml for full list

  collect_enum_values: true      # Default: true
  enum_cardinality_threshold: 20 # Default: 20
  sample_size: 100               # Default: 100

Smart top_values filtering: By default, top_values is enabled but only runs for string-like columns (VARCHAR, TEXT, CHAR) with reasonable cardinality (< 1000 distinct values). Numeric types, dates, UUIDs, and high-cardinality columns are automatically skipped since value distributions aren't useful.
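As an illustration of that filtering rule, here is a minimal sketch; the function and its signature are hypothetical, not part of the dataface API:

```python
# Minimal sketch of the smart filtering described above. The function
# name and arguments are illustrative, not the actual dataface API.
def is_top_values_eligible(data_type: str, distinct_count: int,
                           skip_types: set[str],
                           max_cardinality: int = 1000) -> bool:
    """Return True if a column should receive a top_values query."""
    if data_type.upper() in skip_types:
        return False  # numeric/date/uuid: value distributions aren't useful
    return distinct_count < max_cardinality  # skip high-cardinality columns


skip = {"INTEGER", "BIGINT", "FLOAT", "DOUBLE", "DATE", "TIMESTAMP", "UUID"}
print(is_top_values_eligible("VARCHAR", 42, skip))    # True
print(is_top_values_eligible("UUID", 42, skip))       # False (skipped type)
print(is_top_values_eligible("VARCHAR", 5000, skip))  # False (too many values)
```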

Or access the configuration programmatically:

from dataface.inspect import get_inspector_config

config = get_inspector_config()
print(config.collect_top_values)  # True
print(config.top_values_skip_types)  # ['INTEGER', 'BIGINT', ...]

Query Summary

| Query Type  | Count | Default | Config Setting      |
|-------------|-------|---------|---------------------|
| Schema      | 1     | ✅ On   | Always runs         |
| Stats scan  | 1-N   | ✅ On   | Always runs         |
| Sample data | 1     | ✅ On   | sample_size         |
| Top values  | ≤ C   | ✅ On   | collect_top_values  |
| Enum values | E     | ✅ On   | collect_enum_values |

With top_values enabled (the default): Total queries = 2 + ceil(C / max_cols) + C_tv + E

With top_values disabled: Total queries = 2 + ceil(C / max_cols) + E

Where:

- C = total number of columns
- C_tv = columns eligible for top_values after smart filtering (string-like, under the cardinality threshold); worst case C_tv = C
- E = number of enum-like columns (distinct count ≤ enum_cardinality_threshold, default 20)
- max_cols = column limit per stats query (varies by database)

Detailed Breakdown

1. Schema Query (1 query)

-- Get column names and types
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'your_table'

Cost: Very fast, reads metadata only.
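For a runnable illustration, here is the same query against an in-memory DuckDB database (any engine exposing information_schema behaves similarly):

```python
import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE your_table (id INTEGER, status VARCHAR)")

rows = con.execute("""
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'your_table'
""").fetchall()
print(rows)  # e.g. [('id', 'INTEGER'), ('status', 'VARCHAR')]
```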

2. Stats Scan (1+ queries)

Single table scan computing aggregate stats for all columns:

SELECT
  COUNT(*) as total_rows,
  -- Per column (repeated for each column):
  COUNT(col1) as col1__non_null_count,
  COUNT(DISTINCT col1) as col1__distinct_count,
  MIN(col1) as col1__min,
  MAX(col1) as col1__max,
  AVG(col1) as col1__mean,        -- numeric only
  STDDEV(col1) as col1__stddev,   -- numeric only
  -- ... more columns
FROM your_table

Cost: Full table scan. For wide tables, may be split into multiple queries.

Column limits per query (to avoid SELECT column limits):

| Database   | Max Columns per Query |
|------------|-----------------------|
| DuckDB     | 1400                  |
| BigQuery   | 1400                  |
| SQLite     | 1400                  |
| MySQL      | 500                   |
| SQL Server | 500                   |
| PostgreSQL | 200                   |
| Redshift   | 200                   |
| Snowflake  | 140                   |
| Oracle     | 140                   |

Example: A PostgreSQL table with 450 columns needs ceil(450/200) = 3 stats queries.
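The splitting itself is simple batching; a minimal sketch (the helper name is illustrative, not the actual dataface internals):

```python
import math

def chunk_columns(columns: list[str], max_cols: int) -> list[list[str]]:
    """Split a column list into ceil(C / max_cols) batches, one per stats query."""
    return [columns[i:i + max_cols] for i in range(0, len(columns), max_cols)]

columns = [f"col{i}" for i in range(450)]       # the PostgreSQL example above
batches = chunk_columns(columns, max_cols=200)
assert len(batches) == math.ceil(450 / 200) == 3
```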

3. Sample Data Query (1 query)

SELECT * FROM your_table LIMIT 100

Cost: Very fast; reads sample_size rows (default 100) for semantic type detection.

4. Top Values Queries (up to C queries)

One query per eligible column (see smart filtering above) to get value frequencies:

SELECT col1, COUNT(*) as frequency
FROM your_table
GROUP BY col1
ORDER BY frequency DESC
LIMIT 10

Cost: Each requires a GROUP BY scan. This is the most expensive part for tables with many columns.
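As a sketch of how these statements could be generated (the helper below is illustrative, not the actual dataface internals, and takes only the columns that survived smart filtering):

```python
def build_top_values_queries(table: str, eligible_columns: list[str],
                             limit: int = 10) -> list[str]:
    """One GROUP BY query per eligible column."""
    return [
        f"SELECT {col}, COUNT(*) AS frequency\n"
        f"FROM {table}\n"
        f"GROUP BY {col}\n"
        f"ORDER BY frequency DESC\n"
        f"LIMIT {limit}"
        for col in eligible_columns
    ]

for sql in build_top_values_queries("your_table", ["status", "country"]):
    print(sql, end="\n\n")  # one GROUP BY scan per eligible column
```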

5. Enum Values Queries (E queries)

Only runs for columns detected as enum-like (distinct count ≤ 20):

SELECT DISTINCT col1 FROM your_table

Cost: Fast since these columns have low cardinality.

Cost Calculator

Formula

Total Queries = 1 (schema)
              + ceil(C / max_cols) (stats scans)
              + 1 (sample)
              + C_tv (top values; worst case C_tv = C)
              + E (enum values)

Examples

All examples assume the worst case where every column is eligible for top_values (C_tv = C).

Small Table: 10 columns, 2 enums, PostgreSQL

Queries = 1 + ceil(10/200) + 1 + 10 + 2
        = 1 + 1 + 1 + 10 + 2
        = 15 queries

Medium Table: 50 columns, 5 enums, DuckDB

Queries = 1 + ceil(50/1400) + 1 + 50 + 5
        = 1 + 1 + 1 + 50 + 5
        = 58 queries

Wide Table: 500 columns, 20 enums, PostgreSQL

Queries = 1 + ceil(500/200) + 1 + 500 + 20
        = 1 + 3 + 1 + 500 + 20
        = 525 queries
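For convenience, the formula can be wrapped in a small helper; the function below is illustrative rather than part of dataface, and reproduces the example numbers above:

```python
import math

def total_queries(columns: int, enums: int, max_cols: int,
                  top_values_cols: int | None = None) -> int:
    """Apply the formula above; top_values_cols defaults to the worst
    case where every column is eligible."""
    if top_values_cols is None:
        top_values_cols = columns
    return (1                                # schema
            + math.ceil(columns / max_cols)  # stats scans
            + 1                              # sample
            + top_values_cols                # top values
            + enums)                         # enum values

print(total_queries(10, 2, max_cols=200))    # 15  (small PostgreSQL table)
print(total_queries(500, 20, max_cols=200))  # 525 (wide PostgreSQL table)
```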

Performance Tips

  1. Wide tables are expensive - The top_values queries (one per eligible column) dominate cost for tables with many columns.

  2. DuckDB is most efficient - Higher column limits mean fewer stats queries, plus fast in-memory execution.

  3. Enum detection adds queries - If many columns have low cardinality, more enum queries run.

  4. Consider column filtering - A planned feature will allow profiling only specific columns.

Future Additions

Issue #281 will add:

| Query Type         | Count     | Description                              |
|--------------------|-----------|------------------------------------------|
| Histograms         | N_numeric | Binned distribution for numeric columns  |
| Date distributions | N_date    | Monthly/weekly counts for date columns   |

This will increase queries by approximately:

- N_numeric = number of numeric columns (INTEGER, FLOAT, DOUBLE, etc.)
- N_date = number of date/timestamp columns
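To make the histogram cost concrete, here is a hypothetical sketch of what a binned-distribution query might look like; issue #281 is not implemented yet, so the helper name, bucketing scheme, and parameters are all illustrative:

```python
# Hypothetical histogram query sketch; bucket boundaries would come
# from the min/max the stats scan already collects. A real
# implementation would clamp the top edge so col_max lands in the
# last bucket rather than one past it.
def histogram_sql(table: str, col: str, col_min: float, col_max: float,
                  bins: int = 10) -> str:
    width = (col_max - col_min) / bins
    return (
        f"SELECT CAST(({col} - {col_min}) / {width} AS INTEGER) AS bucket,\n"
        f"       COUNT(*) AS frequency\n"
        f"FROM {table}\n"
        f"WHERE {col} IS NOT NULL\n"
        f"GROUP BY bucket\n"
        f"ORDER BY bucket"
    )

print(histogram_sql("your_table", "price", 0.0, 100.0))
```

Each such query is another GROUP BY scan over the table, so the cost profile would resemble the top_values queries.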