# Inspector Context Stack
This page explains how inspector data moves from raw profiling into the higher-level context that AI and catalog surfaces consume.
## The short version
The inspector does not produce a single "context" object. It builds context in layers:
- live schema and profiling data
- persisted table profiles in `target/inspect.json`
- cross-table enrichments such as relationships and fanout risk
- description enrichments from dbt metadata
- AI-focused formatting into compact schema context text
Each layer serves a different consumer and intentionally keeps the boundaries clean.
## Stack overview
```mermaid
flowchart TD
    A["Live database schema<br/>InspectConnection"] --> B["Raw table profile<br/>TableInspector / TableInspection"]
    B --> C["Artifact cache<br/>target/inspect.json"]
    C --> D["Catalog enrichments<br/>relationships, join profiles, fanout risk"]
    C --> E["Description enrichments<br/>dbt_descriptions"]
    D --> F["AI formatting<br/>format_table_context()"]
    E --> F
    C --> F
    F --> G["Schema context string<br/>get_schema_context()"]
    C --> H["MCP catalog() and inspect dashboards"]
```
## Layer 1: Live schema and profiling
The lowest layer is direct database access:
- `InspectConnection` handles dialect-specific metadata queries and table access.
- `TableInspector.inspect_table()` runs the profiling pipeline for one table.
This is where Dataface computes the raw per-column facts:
- nulls
- distinct counts
- min/max
- numeric stats
- top values
- enum values
- semantic type detection
- quality flags
- primary date column
- grain candidate
This layer is table-local. It knows a lot about one table, but not yet about the wider catalog.
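The per-column computation can be sketched as follows. The function and field names here are illustrative assumptions, not the real `TableInspector` internals:

```python
# Hypothetical sketch of the per-column facts Layer 1 computes.
# Field names are illustrative, not the real TableInspector output.
def profile_column(values):
    """Compute raw, table-local facts for one column of values."""
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    counts = {}
    for v in non_null:
        counts[v] = counts.get(v, 0) + 1
    top = sorted(counts.items(), key=lambda kv: -kv[1])[:3]
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(distinct),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top_values": top,
        # a column with few distinct values is a candidate enum
        "is_enum_candidate": 0 < len(distinct) <= 10,
    }

facts = profile_column(["a", "b", "a", None, "a"])
```

Everything here is computable from one table's own rows, which is exactly why this layer stays table-local.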
## Layer 2: Raw profile contract
The output of a profiling run is a `TableInspection`. That object serializes into the stable profiler contract via `to_dict()` / `to_json_dict()`.
Important characteristics of this layer:
- it is the canonical raw profile representation
- it is designed to be reused across CLI, IDE, and API surfaces
- it remains close to the underlying facts rather than AI prompt formatting
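A minimal sketch of what such a contract object might look like, assuming a dataclass-style `TableInspection`; the real fields and method names may differ:

```python
# Minimal sketch of a stable raw-profile contract. This is an
# assumption-laden stand-in for TableInspection, not its real shape.
from dataclasses import dataclass, field, asdict


@dataclass
class TableInspectionSketch:
    table: str
    row_count: int
    columns: list = field(default_factory=list)
    contract_version: int = 1  # bump only on breaking changes

    def to_json_dict(self):
        # Stay close to the raw facts: no prompt formatting here.
        return asdict(self)


snapshot = TableInspectionSketch("orders", 1200, [{"name": "id"}]).to_json_dict()
```

Keeping the contract a plain serializable dict is what lets CLI, IDE, and API surfaces all reuse it without caring how it was produced.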
The main contract reference lives in:
The broader artifact shape and field semantics are documented in:
## Layer 3: Persisted artifact in `target/inspect.json`
`InspectionStorage` writes table profiles into a single artifact: `target/inspect.json`.
This is the shared cache for inspector-driven features. It gives Dataface a stable, queryable representation of the catalog without having to re-profile on every request.
Key properties:
- one file for the whole catalog
- one entry per table
- merges new profiles into existing data
- acts as the handoff point to downstream consumers
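The merge behavior can be sketched as a simple upsert; `merge_profile` is a hypothetical helper, not the real `InspectionStorage` API:

```python
# Sketch of the cache-merge semantics described above: one artifact,
# one entry per table, new profiles merged into existing data.
def merge_profile(artifact: dict, table: str, profile: dict) -> dict:
    """Upsert one table's profile into the single shared artifact."""
    tables = artifact.setdefault("tables", {})
    tables[table] = {**tables.get(table, {}), **profile}  # new facts win
    return artifact


cache = {"tables": {"orders": {"row_count": 100}}}
merge_profile(cache, "orders", {"row_count": 120})
merge_profile(cache, "users", {"row_count": 7})
# cache now holds one entry per table, with "orders" refreshed
```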
## Layer 4: Cross-table enrichment
Some context only makes sense once multiple tables have been profiled. That work happens after the raw table profiles are saved.
### Relationships
`InspectionStorage.update_relationships()` reconstructs table profiles from the cache and runs cross-table relationship inference.
That pipeline is intentionally cache-based:
- it does not hit the database again
- it reasons over already-profiled metadata
- it produces deterministic catalog-level edges
The main signals are:
- foreign-key naming conventions such as `customer_id`
- `key_role` classifications
- uniqueness ratios
- FK range containment within PK range
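A hedged sketch of how these signals might combine; the `looks_like_fk` helper, its thresholds, and its field names are all assumptions rather than the real inference code:

```python
# Sketch of cache-based FK inference: combine a naming-convention
# signal, PK uniqueness, and FK-range containment within the PK range.
# All names and rules here are illustrative assumptions.
def looks_like_fk(col: dict, pk: dict) -> bool:
    name_signal = col["name"].endswith("_id")            # customer_id style
    unique_pk = pk["distinct_count"] == pk["row_count"]  # PK fully unique
    contained = pk["min"] <= col["min"] and col["max"] <= pk["max"]
    return name_signal and unique_pk and contained


customer_pk = {"name": "id", "row_count": 3, "distinct_count": 3, "min": 1, "max": 3}
order_col = {"name": "customer_id", "min": 1, "max": 3}
```

Note that every input comes from already-profiled metadata, which is why this step never has to touch the database again.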
### Join multiplicity and fanout risk
Detected relationships are then enriched with:
- `join_profile`
- `fanout_risk`
This is what allows downstream consumers to distinguish:
- safe dimension lookups
- one-to-many joins
- risky many-to-many patterns
That enrichment is important because it turns a guessed relationship into a usable modeling hint.
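The classification can be sketched from key uniqueness on each side of the join; the labels and the helper function are illustrative, not the real enrichment code:

```python
# Sketch of fanout-risk classification from key uniqueness on each
# side of a detected relationship (labels are assumptions).
def fanout_risk(left_unique: bool, right_unique: bool) -> str:
    if left_unique and right_unique:
        return "one_to_one"    # safe dimension lookup
    if left_unique or right_unique:
        return "one_to_many"   # rows multiply on one side only
    return "many_to_many"      # risky: joins can explode row counts
```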
## Layer 5: Description enrichment
Descriptions are currently baked from dbt metadata into cached profiles via `InspectionStorage.update_descriptions()`.
That step parses dbt metadata and stores matched descriptions under `dbt_descriptions`.
This is deliberately provenance-preserving:
- dbt descriptions are stored with source metadata
- they do not overwrite profiler facts
- higher layers can choose how to merge or prioritize descriptions
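A sketch of what provenance-preserving storage means in practice; the exact shape and helper name are hypothetical:

```python
# Sketch of provenance-preserving description storage: descriptions
# live beside profiler facts, never over them. Shape is an assumption.
def attach_dbt_description(profile: dict, text: str, source_file: str) -> dict:
    profile.setdefault("dbt_descriptions", {})["table"] = {
        "text": text,
        "source": source_file,  # keep where it came from
    }
    return profile  # profiler facts remain untouched


profile = {"row_count": 42}
attach_dbt_description(profile, "All customer orders", "models/schema.yml")
```

Because the source file rides along with the text, higher layers can later decide how much to trust a dbt description versus other candidates.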
## Layer 6: AI context shaping
The AI-facing layer lives in `dataface/ai/schema_context.py`.
This layer is not about discovering new facts. It is about taking the cached and enriched facts and turning them into something an LLM can consume efficiently.
There are two important functions:
- `format_table_context(table)` produces a structured AI payload for one table
- `get_schema_context()` produces a compact multi-table schema summary string
### What `format_table_context()` adds
It takes a table profile and returns:
- `ai_context_version`
- `formatted`
- `selected_description`
- `selected_source`
- `description_candidates`
- `column_descriptions`
This is a different contract from the raw profiler contract. The profiler contract is about profiling output. The AI context contract is about prompt-safe consumption.
The AI contract reference lives in:
### Description merging
`format_table_context()` supports a generalized `description_candidates` model and resolves it through the description merge engine.
The current priority stack is:
1. `dbt_schema_yml`
2. `database_comment`
3. `curated`
4. `inferred`
Today, much of the persisted inspector enrichment is still stored as `dbt_descriptions` in the cache artifact, while the AI layer is already built to support the more general `description_candidates` contract. That is worth knowing when reasoning about where a specific description came from.
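The merge resolution can be sketched as a first-match walk over that priority stack; `resolve_description` is a hypothetical name for illustration only:

```python
# Sketch of the priority-stack merge: pick the first source with a
# non-empty candidate. Function name and shapes are assumptions.
PRIORITY = ["dbt_schema_yml", "database_comment", "curated", "inferred"]


def resolve_description(candidates: dict):
    """Return (text, source) for the highest-priority non-empty candidate."""
    for source in PRIORITY:
        text = candidates.get(source)
        if text:
            return text, source
    return None, None
```

A first-match walk keeps resolution deterministic: a dbt description always beats a database comment, which always beats an inferred guess.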
## Layer 7: Consumer-specific views
Once the stack above exists, different surfaces take different slices of it.
### `catalog()`

`catalog()` exposes an AI-friendly browsing surface:
- table listing uses cached profiles when available
- cache misses fall back to live schema introspection
- single-table deep profiling is opt-in via `force_refresh=True`
The list response is intentionally slimmer than the raw profile contract. It keeps only the fields that help with exploration.
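The slimming step can be sketched as plain field selection; the fields kept here are assumptions about what "helps with exploration", not the real list-response schema:

```python
# Sketch of slimming a raw profile into a browse-friendly listing
# entry. The kept field names are illustrative assumptions.
LIST_FIELDS = ("name", "row_count", "description")


def to_list_entry(profile: dict) -> dict:
    """Project a full profile down to exploration-relevant fields."""
    return {k: profile[k] for k in LIST_FIELDS if k in profile}
```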
### Inspect dashboards
Inspect dashboards use the cached artifact and inspect templates to render a UI over the same underlying profile data.
### AI prompts and playground
`get_schema_context()` produces a compact text summary of the schema. It prefers cached profiles, but if a table has never been profiled it can still include the live column list so AI flows do not fail hard on a cold start.
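The cold-start fallback can be sketched as follows; the function names, cache shape, and output format are all assumptions, not the real `get_schema_context()` implementation:

```python
# Sketch of the cold-start fallback: prefer the cached profile,
# otherwise fall back to a live column list so AI flows still get
# something usable. Names and shapes here are assumptions.
def schema_context_for(table: str, cache: dict, live_columns) -> str:
    profile = cache.get(table)
    if profile:
        cols = ", ".join(c["name"] for c in profile["columns"])
        return f"{table} ({profile['row_count']} rows): {cols}"
    # never profiled: still emit the live column list
    return f"{table} (unprofiled): {', '.join(live_columns(table))}"


cache = {"orders": {"row_count": 3, "columns": [{"name": "id"}]}}
```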
## Boundary rules
The main design rule is that each layer owns a different job:
- inspector/profile layer owns factual table metadata
- storage layer owns persistence and catalog-wide baking
- relationship layer owns cross-table reasoning
- description layer owns provenance
- AI context layer owns compact formatting and merge presentation
That separation is what keeps the system extensible. If a new consumer needs the raw contract, it can stop at the artifact layer. If it needs LLM-friendly text, it can consume the AI context layer instead.