Document Metadata
| Field | Value |
|---|---|
| Project Name | AI-Powered Real Estate Search Platform |
| Doc Type | Technical Design Document (TDD) |
| Audience | Hiring Manager, Senior Engineer, Architect |
| Status | Active |
| Last Updated | 2026-01-13 |
| Owner | Jordan Allen |
| Scope | Covers conversational search agent, preference wizard, LLM-planned search pipeline, multi-modal embeddings, taste learning, match scoring, collaboration. Excludes: Agent-facing CRM, listing management, MLS data ingestion pipeline. |
| Codebase | Key paths: /app/lib/search, /app/lib/scout, /app/lib/matching, /app/lib/wizard, /src/mastra |
1. Problem Statement
Home buyers spend months browsing listings through filter interfaces, yet can't articulate what they want in checkbox form. The core problem: discovery is mismatched to how preferences actually work.
Specific Problems
- Filter explosion: Stack enough constraints and you get zero results; relax them and you're overwhelmed
- Preferences emerge through exposure: A buyer thinks they need 4 bedrooms until they see a brilliantly designed 3-bedroom
- No learning: Viewing 200 listings teaches the system nothing—each session starts fresh
- Listings don't answer real questions: "Will this work for remote work?" → "4 bed / 3 bath"
- Visual preferences are inexpressible: "Modern but warm, not sterile" has no filter
Before / After
| Before | After |
|---|---|
| 47 listings viewed before shortlist | 12 listings viewed (74% reduction) |
| 4.2 min to first relevant result | 1.3 min (3.2x faster) |
| Every session starts from scratch | System learns and improves with each interaction |
| "Modern home" returns random results | 89% semantic query accuracy |
2. Goals and Non-Goals
2.1 Goals
- Enable natural language property search with semantic understanding
- Learn buyer preferences from both explicit feedback and implicit behavior
- Deliver explainable recommendations that users can interrogate
- Provide sub-second search latency for interactive refinement
- Support multi-stakeholder collaboration (couples, families, agents)
- Scale to full MLS inventory (millions of listings)
2.2 Non-Goals
- Not a CRM for agents — focused on buyer experience, not lead management
- Not a listing platform — consumes MLS data, doesn't manage listings
- Not a transaction system — stops at discovery, no offers/contracts
- Not a mortgage calculator — basic affordability only, no loan origination
- Not optimizing for UI polish in v1 — function over form initially
Phase Scope
| Phase | Included | Excluded |
|---|---|---|
| v1 | Search, wizard, matching, Scout agent | Voice input, real-time streaming |
| v1.5 | Collaboration, comparison sessions | Agent marketplace |
| v2 | MLS integrations, alerts | Offer management |
3. System Overview
The system comprises four layers with distinct responsibilities:
3.1 High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ Next.js 15 + React 19 + Tailwind + Radix UI │
│ Scout Chat • Wizard • Property Cards • Comparison Trays │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ AGENT LAYER (Mastra + LLM) │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Scout │ │ Planner │ │ Judge │ │
│ │ Agent │ │ Agent │ │ Agent │ │
│ │ (12 tools) │ │ (QueryIR) │ │ (QA loop) │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Match │ │ Property │ │
│ │ Scorer │ │ Explainer │ │
│ └───────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ RETRIEVAL LAYER │
│ Elasticsearch 9.x: BM25 + kNN + Script Scoring │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Description │ │ Amenity │ │ Location │ │
│ │ Vectors │ │ Vectors │ │ Vectors │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Image │ │ RRF │ │
│ │ Vectors │ │ Fusion │ │
│ └───────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ PostgreSQL (Drizzle ORM) • Redis (sessions) • GCS (images) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Properties │ │ User │ │ Scout │ │
│ │ + Vectors │ │ Profiles │ │ Memory │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
3.2 Core Subsystems
| Subsystem | Responsibility | Key Files |
|---|---|---|
| Preference Wizard | Captures 100+ profile fields via guided flow | /app/lib/wizard/ |
| LLM-Planned Search | Query → Classification → Planning → Execution → QA | /app/lib/search/ |
| Scout Agent | Conversational interface with 12 tools | /app/lib/scout/ |
| Match Scorer | 8-category weighted scoring (0-100) | /app/lib/matching/ |
| Taste Learning | Event capture + feature aggregation + MMR | /app/lib/scout/tools/personalization/ |
3.3 Request Flow
1. User completes wizard → Profile stored with computed weights
2. User queries Scout: "Modern homes with good light under $800K"
3. Heuristic check: Simple query? Skip LLM classification (saves 500ms)
4. Planner Agent: Generates QueryChips + vector weights
5. Vector generation: 4 embeddings (description, amenity, location, image)
6. ES hybrid search: BM25 + kNN + script scoring
7. Judge Agent: Evaluates top 3 results; if quality < 0.6, revise query
8. Taste blending: MMR re-rank with user preference vector
9. Match scoring: 8-category breakdown per property
10. Results returned with explanations
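The taste-blending step (MMR re-rank) can be sketched as follows. This is a minimal illustration, not the production `personalization/rank.ts` implementation; the `Candidate` shape and the `lambda` / `tasteWeight` defaults are assumptions.

```typescript
// Illustrative MMR re-rank blended with a user taste vector.
// Candidate, lambda, and tasteWeight are assumptions, not the real rank.ts API.

interface Candidate {
  id: string;
  relevance: number; // normalized search score, 0..1
  vector: number[];  // property embedding
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Blend relevance with taste affinity, then apply MMR so the top of the
// list is not dominated by near-duplicate properties.
function mmrRerank(
  candidates: Candidate[],
  taste: number[],
  k: number,
  lambda = 0.7,      // relevance-vs-diversity tradeoff
  tasteWeight = 0.3, // how strongly taste shifts the ranking
): Candidate[] {
  const pool = candidates.map((c) => ({
    ...c,
    blended: (1 - tasteWeight) * c.relevance + tasteWeight * cosine(c.vector, taste),
  }));
  const selected: typeof pool = [];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestVal = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const maxSim = selected.length
        ? Math.max(...selected.map((s) => cosine(pool[i].vector, s.vector)))
        : 0;
      const val = lambda * pool[i].blended - (1 - lambda) * maxSim;
      if (val > bestVal) {
        bestVal = val;
        bestIdx = i;
      }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

The diversity term is what keeps the top of the list from being five nearly identical listings even when all five score well against the taste vector.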
4. Architecture Overview
4.1 Major Components
| Component | Responsibility | Inputs | Outputs | Owns |
|---|---|---|---|---|
| Preference Wizard | Structured preference capture | User answers | Profile + weights | Question flow, validation |
| Query Classifier | Complexity + intent detection | Raw query | Complexity score, intent | Classification heuristics |
| Planner Agent | Query → structured IR | Query + context | QueryChips + weights | Chip schema |
| Vector Generator | Text/image → embeddings | Text, image URLs | 1024-dim vectors | Embedding API calls |
| ES Query Builder | IR → Elasticsearch DSL | QueryIR + vectors | ES query | Query construction |
| Judge Agent | Result quality evaluation | Query + top results | Quality score, revisions | Evaluation criteria |
| Scout Agent | Conversational interface | User message + scope | Response + actions | Tool orchestration, memory |
| Match Scorer | Property-profile alignment | Property + profile | Score 0-100 + breakdown | 8 sub-scorers |
| Taste Engine | Preference learning | User events | Taste vector | Event aggregation, decay |
4.2 Multi-Modal Embedding Architecture
Four distinct embedding spaces capture different property aspects:
| Space | Dimensions | Source | Captures |
|---|---|---|---|
| Description | 1024 | Listing text | Style, condition, lifestyle fit |
| Amenity | 1024 | Feature list | Kitchen quality, garage, pool |
| Location | 1024 | Address + enrichment | Walkability, schools, commute |
| Image | 1024 | Property photos | Visual style, light, condition |
Fusion: RRF (Reciprocal Rank Fusion) with 60% text weight, 40% image weight.
4.3 Communication Patterns
Sync: API → Agent → Tools → Database (request-response within 2s target)
Async: Scout memory persistence, taste event logging (fire-and-forget)
Retry: LLM calls have 3 retries with exponential backoff; ES queries timeout at 5s
4.4 LLM-Planned Search Sequence
┌──────┐ ┌───────────┐ ┌─────────┐ ┌───────────┐
│ User │────▶│ Classifier│────▶│ Planner │────▶│ Validator │
└──────┘ └───────────┘ └─────────┘ └───────────┘
│
┌──────┐ ┌───────────┐ ┌─────────┐ ┌────▼──────┐
│Result│◀────│ Judge │◀────│ ES │◀────│ Vectors │
└──────┘ │ (QA loop) │ │ Search │ │ Generator │
└───────────┘ └─────────┘ └───────────┘
If the Judge scores the results below 0.6, the revision handler adjusts the query and retries (max 2 iterations).
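The QA loop above can be sketched as plain control flow. This is an illustrative skeleton, not the actual `revision-handler.ts`; the `Plan`/`Listing` types and the callback signatures are stand-ins for the real agent calls.

```typescript
// Illustrative Judge QA loop: execute, judge the top 3, revise if quality
// is below 0.6, give up after two revisions with a best-effort warning.
// Plan, Listing, and the callbacks are stand-ins, not the real API.

type Plan = { text: string };
type Listing = { id: string };
interface JudgeVerdict { quality: number; suggestions: string[] }

async function searchWithQA(
  plan: Plan,
  execute: (p: Plan) => Promise<Listing[]>,
  judge: (p: Plan, top: Listing[]) => Promise<JudgeVerdict>,
  revise: (p: Plan, v: JudgeVerdict) => Promise<Plan>,
  maxRevisions = 2,
): Promise<{ results: Listing[]; warning?: string }> {
  let current = plan;
  let results = await execute(current);
  for (let attempt = 0; ; attempt++) {
    const verdict = await judge(current, results.slice(0, 3));
    if (verdict.quality >= 0.6) return { results };
    if (attempt >= maxRevisions) {
      // Max revisions reached: best-effort results plus a warning (see §7.1).
      return { results, warning: "May not match intent" };
    }
    current = await revise(current, verdict);
    results = await execute(current);
  }
}
```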
5. Key Design Decisions
Decision Index
| ID | Decision | Area | Status |
|---|---|---|---|
| D1 | RRF over linear fusion | Search | Adopted |
| D2 | 4 embedding spaces | Retrieval | Adopted |
| D3 | Heuristic-first classification | Latency | Adopted |
| D4 | 8-category match scoring | Explainability | Adopted |
| D5 | Event-based taste learning | Personalization | Adopted |
| D6 | Scope-based agent memory | Context | Adopted |
D1: RRF over Linear Fusion
- Context: Need to combine BM25 lexical scores with kNN vector scores
- Alternatives: Linear weighted sum, learned fusion weights, interleaving
- Chosen: Reciprocal Rank Fusion (RRF) with k=60
- Why: RRF is position-based and robust to score distribution variance. Doesn't require normalization. Well-tested in production search systems.
- Tradeoffs: Can't tune importance as precisely as learned weights; ignores score magnitude
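A minimal weighted-RRF sketch under the k=60 choice above. The function shape is illustrative, not the actual `fusion.ts` API.

```typescript
// Minimal weighted Reciprocal Rank Fusion sketch (k = 60). Each ranked
// list contributes weight / (k + rank) per document; documents missing
// from a list simply get no contribution from it.
// Illustrative only - not the actual fusion.ts implementation.

function rrfFuse(
  rankedLists: { ids: string[]; weight: number }[],
  k = 60,
): string[] {
  const scores = new Map<string, number>();
  for (const { ids, weight } of rankedLists) {
    ids.forEach((id, rank) => {
      // rank is 0-based here, so the best hit contributes weight / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

With the 60/40 split from §4.2 this would be called as `rrfFuse([{ ids: textHits, weight: 0.6 }, { ids: imageHits, weight: 0.4 }])`. Because only ranks matter, a retriever whose raw scores are wildly scaled cannot dominate the fusion, which is the robustness property D1 cites.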
D2: Four Embedding Spaces
- Context: Properties have multiple semantic axes (text, visuals, location, features)
- Alternatives: Single unified embedding, late fusion only
- Chosen: 4 separate 1024-dim embeddings + RRF fusion
- Why: Different embedding models excel at different domains. Allows per-axis weighting based on query type.
- Tradeoffs: 4x embedding cost; more complex indexing; harder to debug
D3: Heuristic-First Classification
- Context: Most queries are simple ("homes under 500k in Seattle") but LLM classification adds 500ms
- Alternatives: Always classify via LLM, rule-based only
- Chosen: Heuristic check first; skip LLM if confidence > 90%
- Why: 70% of queries are simple. Saves 500ms latency for majority case.
- Tradeoffs: May misclassify edge cases; heuristics need maintenance
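A toy version of the heuristic gate shows the shape of the check. The patterns below are examples invented for illustration, not the production rule set.

```typescript
// Illustrative heuristic gate for D3: if the query is covered by simple
// price / bed / location patterns and has no semantic markers, skip LLM
// classification entirely. Patterns are examples, not the real rules.

const SIMPLE_PATTERNS = [
  /\bunder\s*\$?\d+[km]?\b/i,      // "under 500k", "under $800K"
  /\b\d+\s*(bed|bath|br|ba)s?\b/i, // "3 bed", "2 baths"
  /\bin\s+[A-Z][a-z]+/,            // "in Seattle"
];

const COMPLEX_MARKERS = /\b(like|feels?|vibe|style|modern|cozy|similar)\b/i;

function needsLlmClassification(query: string): boolean {
  if (COMPLEX_MARKERS.test(query)) return true; // clearly semantic intent
  const hits = SIMPLE_PATTERNS.filter((p) => p.test(query)).length;
  return hits === 0; // nothing recognizably structured: let the LLM decide
}
```

Misses are cheap in one direction only: a simple query wrongly sent to the LLM costs 500ms, while a semantic query wrongly kept on the fast path degrades results, so the markers err toward escalation.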
D4: 8-Category Match Scoring
- Context: Users need to understand why a property matches (or doesn't)
- Alternatives: Single score, 3-tier (good/okay/poor), vector similarity only
- Chosen: 8 weighted categories: Budget, Structure, Location, Schools, Lifestyle, Visual, Investment, Policy
- Why: Maps to how buyers actually think. Enables filtering by category. Supports partial matches.
- Tradeoffs: Complex weight tuning; users may disagree with category importance
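The overall 0-100 number can be read as a weighted mean of the eight category scores. A sketch, with an illustrative weighting scheme; the real sub-scorers live in `sub-score-calculators.ts`.

```typescript
// Sketch of combining the 8 category scores into a 0-100 overall.
// Category names come from D4; the weighting scheme is illustrative.

type Category =
  | "budget" | "structure" | "location" | "schools"
  | "lifestyle" | "visual" | "investment" | "policy";

interface CategoryScore {
  category: Category;
  score: number;  // 0-100 within the category
  weight: number; // relative importance, e.g. from the user profile
}

// Overall score = weight-normalized mean of the category scores,
// rounded to two decimals.
function overallScore(breakdown: CategoryScore[]): number {
  const totalWeight = breakdown.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = breakdown.reduce((sum, c) => sum + c.score * c.weight, 0);
  return Math.round((weighted / totalWeight) * 100) / 100;
}
```

Keeping per-category scores alongside the overall is what makes the "87% match because..." explanations in §8.2 possible: the breakdown is the explanation.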
D5: Event-Based Taste Learning
- Context: Preferences should improve without forcing explicit feedback
- Alternatives: Explicit ratings only, collaborative filtering
- Chosen: Capture all events (save=1.0, hide=-0.5, view=0.1) + recency decay + feature aggregation
- Why: Rich signal without friction. Adapts to evolving taste.
- Tradeoffs: Cold start problem; noisy signals from accidental clicks
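The event weights above, combined with recency decay, can be sketched as a per-feature aggregation. The 30-day half-life and string-keyed features are assumptions made for illustration.

```typescript
// Sketch of D5: each event contributes its action weight (save = 1.0,
// hide = -0.5, view = 0.1), decayed by age, summed per extracted feature.
// The 30-day half-life and feature representation are assumptions.

interface TasteEvent {
  kind: "view" | "save" | "hide";
  features: string[]; // e.g. ["modern", "pool"], extracted from the listing
  createdAt: Date;
}

const ACTION_WEIGHT = { view: 0.1, save: 1.0, hide: -0.5 } as const;
const HALF_LIFE_DAYS = 30; // assumed; old signals fade rather than vanish

function tasteProfile(events: TasteEvent[], now: Date): Map<string, number> {
  const profile = new Map<string, number>();
  for (const e of events) {
    const ageDays = (now.getTime() - e.createdAt.getTime()) / 86_400_000;
    const decay = Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
    for (const feature of e.features) {
      profile.set(feature, (profile.get(feature) ?? 0) + ACTION_WEIGHT[e.kind] * decay);
    }
  }
  return profile;
}
```

The low view weight (0.1) is what makes the noisy-click tradeoff tolerable: one accidental click is a tenth of a deliberate save.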
D6: Scope-Based Agent Memory
- Context: Scout needs different context when discussing a specific property vs. general search
- Alternatives: Single global thread, ephemeral memory, topic detection
- Chosen: 6 scopes (global, property, collection, area, compare, tour) with separate threads
- Why: Clean context separation. No cross-contamination. Enables scope-specific prompts.
- Tradeoffs: Can't reference across scopes; more threads to manage
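One way to read "separate threads per scope" is a composite thread key. The key format below is a guess at the shape, not the actual Scout memory layout.

```typescript
// Illustrative scope-keyed thread lookup for D6. The key format is an
// assumption; the point is that (user, scope, subject) isolates context.

type Scope = "global" | "property" | "collection" | "area" | "compare" | "tour";

function threadKey(userId: string, scope: Scope, subjectId?: string): string {
  // Global scope has no subject; every other scope is keyed by one
  // (a property id, collection id, area slug, and so on).
  return scope === "global"
    ? `${userId}:global`
    : `${userId}:${scope}:${subjectId ?? "unscoped"}`;
}
```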
6. Code-Level Mapping
6.1 Directory Structure
/app
  /lib
    /search                        # LLM-Planned Search V2
      llm-planned-orchestrator.ts  # Main pipeline
      query-classifier.ts          # Complexity detection
      planner-agent.ts             # QueryIR generation
      chip-validator.ts            # Chip validation
      es-query-builder.ts          # ES DSL construction
      judge-agent.ts               # Result QA
      revision-handler.ts          # Query revision loop
      vector-service.ts            # Embedding generation
      fusion.ts                    # RRF implementation
      visual-cues.ts               # Image query detection
    /scout                         # Conversational Agent
      agent.ts                     # Mastra agent definition
      tools/                       # 12 tool implementations
        actions/properties.ts      # save/hide/note
        personalization/taste.ts   # taste vector
        personalization/rank.ts    # MMR blending
    /matching                      # Match Scoring
      match-scorer.ts              # 8-category scorer
      sub-score-calculators.ts     # Category implementations
    /wizard                        # Preference Wizard
      types.ts                     # Question schemas
      database-sync.ts             # Profile persistence
  /db
    schema/                        # Drizzle schemas
      profiles.ts                  # User profile (100+ fields)
      properties.ts                # Property + vectors
      scout.ts                     # Scout threads/messages
      collaboration.ts             # Comments/sessions
/src
  /mastra
    index.ts                       # Mastra configuration
    tools.ts                       # Tool definitions
6.2 Key File Responsibilities
| File | Lines | Responsibility |
|---|---|---|
| llm-planned-orchestrator.ts | ~800 | Orchestrates 8-stage search pipeline |
| agent.ts | ~300 | Scout agent with Mastra memory + tools |
| match-scorer.ts | ~600 | Computes 0-100 match with breakdown |
| taste.ts | ~200 | Event aggregation + feature extraction |
| vector-service.ts | ~400 | Text + multimodal embedding API calls |
| es-query-builder.ts | ~500 | Builds ES DSL from QueryIR |
6.3 Key Interfaces
// QueryIR - Intermediate Representation
interface QueryIR {
  chips: QueryChip[];          // Extracted search parameters
  vectors: QueryVectors;       // 4 embedding types
  weights: WeightProfile;      // Per-vector importance
  hardFilters: ESFilter[];     // Must-match constraints
  softPreferences: string[];   // Nice-to-have features
}

// ScoutTasteEvent - Preference Signal
interface ScoutTasteEvent {
  kind: 'view' | 'save' | 'hide' | 'shortlist' | 'note';
  propertyId: string;
  weight: number;              // Action importance
  meta: PropertyMeta;          // Extracted features
  createdAt: Date;
}

// MatchResult - Scoring Output
interface MatchResult {
  score: number;               // 0-100 overall
  breakdown: CategoryScore[];  // 8 categories
  reasons: string[];           // Human-readable explanations
}
7. Failure Modes & Edge Cases
7.1 Search Failures
| Failure | Detection | Mitigation | User Impact |
|---|---|---|---|
| Empty results with filters | Result count = 0 | Progressive filter relaxation | Wider but relevant results |
| Visual query for vacant land | Property type = land | 95% image weight penalty | Correct ranking |
| Negated queries ("no pool") | NOT chip detected | Convert to dealBreaker | Hard exclusion |
| LLM planning timeout | >5s response | Fall back to keyword search | Degraded but working |
| Judge rejects all results | Max revisions reached | Return best-effort + warning | "May not match intent" |
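Progressive filter relaxation from the table above can be sketched as an ordered retry loop. The relaxation order (beds first, then price, then geography) and the step sizes are assumptions; the production order may differ.

```typescript
// Sketch of progressive filter relaxation: when a query returns zero
// hits, constraints are loosened in priority order until results appear.
// The Filters shape and relaxation steps are illustrative.

interface Filters {
  maxPrice?: number;
  minBeds?: number;
  city?: string;
}

const RELAXATION_STEPS: ((f: Filters) => Filters)[] = [
  (f) => ({ ...f, minBeds: f.minBeds ? f.minBeds - 1 : undefined }),
  (f) => ({ ...f, maxPrice: f.maxPrice ? f.maxPrice * 1.1 : undefined }),
  (f) => ({ ...f, city: undefined }), // widen geography last
];

async function searchWithRelaxation(
  filters: Filters,
  run: (f: Filters) => Promise<number>, // returns hit count
): Promise<{ filters: Filters; hits: number; relaxed: number }> {
  let current = filters;
  let hits = await run(current);
  let relaxed = 0;
  for (const step of RELAXATION_STEPS) {
    if (hits > 0) break;
    current = step(current);
    relaxed++;
    hits = await run(current);
  }
  return { filters: current, hits, relaxed };
}
```

Returning the relaxation count lets the UI say which constraints were loosened, which keeps "wider but relevant results" from feeling like a silent mismatch.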
7.2 Agent Failures
| Failure | Detection | Mitigation | User Impact |
|---|---|---|---|
| Tool execution timeout | >10s per tool | Return partial + error | "Action incomplete" |
| Memory context overflow | >127k tokens | TokenLimiter processor | Older context trimmed |
| LLM rate limit | 429 response | Backoff + fallback model | Slower but functional |
| Idempotency violation | Duplicate cmdId | Return cached result | No double-save |
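The idempotency row above can be sketched as a cmdId-keyed result cache; the in-memory Map stands in for whatever store the real system uses.

```typescript
// Illustrative idempotency guard: a repeated cmdId returns the cached
// result instead of re-executing the action (no double-save).
// The in-memory Map is a stand-in for the real store.

const resultCache = new Map<string, unknown>();

async function runOnce<T>(cmdId: string, action: () => Promise<T>): Promise<T> {
  if (resultCache.has(cmdId)) return resultCache.get(cmdId) as T;
  const result = await action();
  resultCache.set(cmdId, result);
  return result;
}
```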
7.3 Data Integrity
- Orphaned events: Property deleted but taste events remain → Ignored in aggregation
- Stale embeddings: Property updated but vectors unchanged → Nightly re-embedding
- Profile mismatch: Profile updated mid-session → Re-fetch on next search
8. Tradeoffs & Constraints
8.1 Speed vs. Accuracy
The system is optimized for perceived responsiveness over theoretical optimality:
| Decision | Rationale |
|---|---|
| Heuristic-first classification | Most queries are simple ("homes in Greenville under 400k"). Skip LLM classification when regex + keyword detection suffices. |
| Single-pass planning | Revision loops only trigger when Judge detects poor results. Most queries succeed on first pass. |
| RRF over learned fusion | RRF with k=60 is robust across query types without training data. Learned fusion would require labeled datasets we don't have. |
| Pre-computed image embeddings | Multimodal embedding at ingest time, not query time. Trades freshness for latency. |
8.2 UX vs. Engineering Cost
| Feature | UX Win | Cost | Decision |
|---|---|---|---|
| 8-category scoring | Explainable matches ("87% match because...") | 8 separate computations per property | Worth it |
| Streaming responses | Perceived speed during agent responses | SSE infrastructure | Worth it |
| Undo capability | User confidence to experiment | Event sourcing for reversibility | Worth it |
| Real-time collaboration | Couples search together | WebSocket infrastructure | Deferred (polling for now) |
8.3 Constraints Accepted
- Cold start: New users get generic results until wizard completion or interaction history builds. The wizard mitigates this by front-loading preference capture.
- Photo quality bias: Professional photography scores higher in visual matching. This reflects buyer perception reality.
- Context window ceiling: Agent memory uses Mastra's 127k token limit. Very old conversation context gets trimmed by the TokenLimiter processor.
- Online-only: All search and scoring requires network. Acceptable for the target use case (active home search).
8.4 Technical Debt
- Hardcoded RRF weights (60/40): Works well empirically but not A/B tested. Abstraction exists for future tuning.
- Single embedding provider: Voyage AI only. Interface abstraction exists for provider swapping.
- Polling-based collaboration: Comments and sessions use polling. WebSocket upgrade planned for real-time sync.
9. Security, Safety & Misuse
9.1 Authentication & Authorization
| Layer | Mechanism |
|---|---|
| Identity provider | Stytch B2B (magic links + OAuth) |
| Session management | JWT access tokens with refresh rotation |
| API authorization | Bearer token validation on all routes |
| Data isolation | PostgreSQL row-level security policies per user |
9.2 Data Boundaries
- Agent memory isolation: Each user's conversation history and preferences are scoped by userId. The agent cannot access other users' data.
- MLS compliance: Listing data is cached locally per MLS terms. No redistribution or scraping.
- Profile data: Wizard answers and taste events stored in user-scoped tables with RLS.
9.3 Agent Safety
- Tool scope: Agent tools can only modify the authenticated user's data (saves, hides, notes).
- No external actions: Agent cannot send emails, make API calls to external services, or access filesystem.
- Conversation context: System prompt is fixed; user messages cannot override agent instructions.
9.4 Fair Housing Compliance
- No protected class filtering: Search does not filter by race, religion, familial status, or other protected classes.
- School data disclosure: School ratings are shown for transparency but not used for algorithmic steering.
- Budget-based pricing: Price recommendations based on stated budget range, not demographic inference.
10. Observability & Debug
10.1 What Gets Logged
| Data | Where | Why |
|---|---|---|
| API requests/errors | Vercel logs | Error tracking and debugging |
| Search pipeline traces | Console (dev) | See each stage: classify → plan → validate → execute → judge |
| Agent conversation turns | PostgreSQL (messages table) | Conversation replay for debugging and memory |
10.2 Debug Playbook
| Symptom | First Check | Common Cause |
|---|---|---|
| Empty search results | ES query structure in logs | Over-constrained filters from planner |
| Slow agent responses | Tool execution duration | Database N+1 or slow ES query |
| Inconsistent match scores | Profile freshness | Stale cached profile data |
| Agent confusion | Context window usage | Old context trimmed, missing relevant history |
10.3 Search Trace Structure
Each search request logs a trace showing pipeline stage durations:
{
  traceId: "abc123",
  pipeline: "llm-planned-v2",
  stages: [
    { name: "classify", result: "complex" },
    { name: "plan", filters: 3, semantic: true },
    { name: "validate", passed: true },
    { name: "execute", hits: 47 },
    { name: "judge", revision: false }
  ]
}
When revision: true, the Judge triggered a re-plan, and stages repeat.
11. Evolution & Open Questions
11.1 Planned Improvements
- Real-time collaboration: WebSocket upgrade to replace polling for comments and comparison sessions. Currently functional but not real-time.
- Learned fusion weights: A/B test infrastructure to tune RRF weights per query type instead of fixed 60/40.
- Multi-agent routing: Specialist agents for specific tasks (tour planning, investment analysis) with automatic routing.
11.2 Open Technical Questions
- Cross-user taste transfer: Can users with similar profiles bootstrap cold-start faster? Requires privacy-preserving similarity computation.
- Explanation fidelity: Match score reasons are generated post-hoc. How do we verify they reflect actual model behavior?
- Preference stability: When should the system resist preference drift vs. adapt? A user viewing 10 ranches doesn't necessarily want ranches.
- Late vs. early fusion: RRF fuses text and image results post-retrieval. Would jointly-trained embeddings perform better?
11.3 Known Limitations
- US-only: MLS data access is US-specific. International would require different data sources.
- Residential focus: Commercial properties would need different embedding strategy and scoring dimensions.
- English only: LLM prompts and agent instructions are English. Internationalization is possible but not implemented.
12. Appendix
A. Search Evaluation Progression
Automated eval suite tracks search quality across 5 persona-based test cases (Family, Investor, Luxury, First-Time, Remote Worker). Results across one development cycle of four iterations:
| Iteration | Pass Rate | MedianOverall@10 | Constraint Violations | Key Fix |
|---|---|---|---|---|
| 1 | 0% | 0.0 | 744 | Baseline - planner generating invalid ES queries |
| 2 | 0% | 0.0 | 0 | Fixed constraint violations (valid queries, no results) |
| 3 | 20% | 58.0 | 0 | Added semantic boosting, Luxury case passes |
| 4 | 100% | 94.4 | 0 | Tuned filter relaxation, all cases pass |
Metrics explained:
- MedianOverall@10: Median match score of top 10 results (0-100 scale from 8-category scorer)
- Constraint Violations: Properties returned that violate hard constraints (e.g., over budget, wrong city)
- Pass threshold: MedianOverall@10 ≥ 70, zero constraint violations
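The pass criterion can be expressed directly; a sketch assuming the top-10 scores arrive as a plain array.

```typescript
// Sketch of the Appendix A pass criterion: median of the top-10 match
// scores must be >= 70 with zero constraint violations.

function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function evalPasses(top10Scores: number[], constraintViolations: number): boolean {
  return constraintViolations === 0 && median(top10Scores) >= 70;
}
```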
B. Test Cases
| Persona | Query | Final MedianOverall |
|---|---|---|
| Family Modern | "modern family home with updated kitchen near good schools" | 93.1 |
| Investor | "rental property with good cash flow potential" | 93.1 |
| Luxury | "luxury home with high-end finishes and gourmet kitchen" | 99.2 |
| First-Time | "move-in ready starter home good value" | 93.1 |
| Remote Worker | "quiet home with dedicated office space and good internet" | 93.1 |
C. Glossary
| Term | Definition |
|---|---|
| RRF | Reciprocal Rank Fusion - method to combine ranked lists from different retrievers |
| MedianOverall@10 | Median match score of top 10 search results |
| SearchPlan | Structured representation compiled from user profile + wizard answers |
| TasteEvent | User action (save, hide, view, dwell) that signals preference |
| Scope | Agent conversation context (global, property, collection, area, compare, tour) |
| Judge | LLM that evaluates search results and decides if revision is needed |