Document Metadata
| Field | Value |
|---|---|
| Project Name | AI-Powered Real Estate Search Platform |
| Doc Type | Technical Design Document (TDD) |
| Audience | Hiring Manager, Senior Engineer, Architect |
| Status | Active |
| Last Updated | 2026-01-13 |
| Owner | Jordan Allen |
| Scope | Covers conversational search agent, preference wizard, LLM-planned search pipeline, multi-modal embeddings, taste learning, match scoring, collaboration. Excludes: Agent-facing CRM, listing management, MLS data ingestion pipeline. |
| Codebase | Key paths: /app/lib/search, /app/lib/scout, /app/lib/matching, /app/lib/wizard, /src/mastra |
1. Problem Statement
Home buyers spend months browsing listings through filter interfaces, yet can't articulate what they want in checkbox form. The core problem: discovery is mismatched to how preferences actually work.
Specific Problems
- Filter explosion: Stack enough constraints and you get zero results; relax them and you're overwhelmed
- Preferences emerge through exposure: A buyer thinks they need 4 bedrooms until they see a brilliantly designed 3-bedroom
- No learning: Viewing 200 listings teaches the system nothing—each session starts fresh
- Listings don't answer real questions: "Will this work for remote work?" → "4 bed / 3 bath"
- Visual preferences are inexpressible: "Modern but warm, not sterile" has no filter
Before / After
| Before | After |
|---|---|
| 47 listings viewed before shortlist | 12 listings viewed (74% reduction) |
| 4.2 min to first relevant result | 1.3 min (3.2x faster) |
| Every session starts from scratch | System learns and improves with each interaction |
| "Modern home" returns random results | 89% semantic query accuracy |
2. Goals and Non-Goals
2.1 Goals
- Enable natural language property search with semantic understanding
- Learn buyer preferences from both explicit feedback and implicit behavior
- Deliver explainable recommendations that users can interrogate
- Provide sub-second search latency for interactive refinement
- Support multi-stakeholder collaboration (couples, families, agents)
- Scale to full MLS inventory (millions of listings)
2.2 Non-Goals
- Not a CRM for agents — focused on buyer experience, not lead management
- Not a listing platform — consumes MLS data, doesn't manage listings
- Not a transaction system — stops at discovery, no offers/contracts
- Not a mortgage calculator — basic affordability only, no loan origination
- Not optimizing for UI polish in v1 — function over form initially
Phase Scope
| Phase | Included | Excluded |
|---|---|---|
| v1 | Search, wizard, matching, Scout agent | Voice input, real-time streaming |
| v1.5 | Collaboration, comparison sessions | Agent marketplace |
| v2 | MLS integrations, alerts | Offer management |
3. System Overview
The system comprises four layers with distinct responsibilities:
3.1 High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ Next.js 15 + React 19 + Tailwind + Radix UI │
│ Scout Chat • Wizard • Property Cards • Comparison Trays │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ AGENT LAYER (Mastra + LLM) │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Scout │ │ Planner │ │ Judge │ │
│ │ Agent │ │ Agent │ │ Agent │ │
│ │ (12 tools) │ │ (QueryIR) │ │ (QA loop) │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Match │ │ Property │ │
│ │ Scorer │ │ Explainer │ │
│ └───────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ RETRIEVAL LAYER │
│ Elasticsearch 9.x: BM25 + kNN + Script Scoring │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Description │ │ Amenity │ │ Location │ │
│ │ Vectors │ │ Vectors │ │ Vectors │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Image │ │ RRF │ │
│ │ Vectors │ │ Fusion │ │
│ └───────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ PostgreSQL (Drizzle ORM) • Redis (sessions) • GCS (images) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Properties │ │ User │ │ Scout │ │
│ │ + Vectors │ │ Profiles │ │ Memory │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
3.2 Core Subsystems
| Subsystem | Responsibility | Key Files |
|---|---|---|
| Preference Wizard | Captures 100+ profile fields via guided flow | /app/lib/wizard/ |
| LLM-Planned Search | Query → Classification → Planning → Execution → QA | /app/lib/search/ |
| Scout Agent | Conversational interface with 12 tools | /app/lib/scout/ |
| Match Scorer | 8-category weighted scoring (0-100) | /app/lib/matching/ |
| Taste Learning | Event capture + feature aggregation + MMR | /app/lib/scout/tools/personalization/ |
3.3 Request Flow
1. User completes wizard → Profile stored with computed weights
2. User queries Scout: "Modern homes with good light under $800K"
3. Heuristic check: Simple query? Skip LLM classification (saves 500ms)
4. Planner Agent: Generates QueryChips + vector weights
5. Vector generation: 4 embeddings (description, amenity, location, image)
6. ES hybrid search: BM25 + kNN + script scoring
7. Judge Agent: Evaluates top 3 results; if quality < 0.6, revise query
8. Taste blending: MMR re-rank with user preference vector
9. Match scoring: 8-category breakdown per property
10. Results returned with explanations
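The taste-blending step (MMR re-rank) can be sketched as follows. This is a minimal illustration, not the production `personalization/rank.ts` implementation; the `Candidate` shape and the `lambda` / `tasteWeight` defaults are assumptions.

```typescript
// Illustrative MMR re-rank blended with a user taste vector.
// Candidate, lambda, and tasteWeight are assumptions, not the real rank.ts API.

interface Candidate {
  id: string;
  relevance: number; // normalized search score, 0..1
  vector: number[];  // property embedding
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Blend relevance with taste affinity, then apply MMR so the top of the
// list is not dominated by near-duplicate properties.
function mmrRerank(
  candidates: Candidate[],
  taste: number[],
  k: number,
  lambda = 0.7,      // relevance-vs-diversity tradeoff
  tasteWeight = 0.3, // how strongly taste shifts the ranking
): Candidate[] {
  const pool = candidates.map((c) => ({
    ...c,
    blended: (1 - tasteWeight) * c.relevance + tasteWeight * cosine(c.vector, taste),
  }));
  const selected: typeof pool = [];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestVal = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const maxSim = selected.length
        ? Math.max(...selected.map((s) => cosine(pool[i].vector, s.vector)))
        : 0;
      const val = lambda * pool[i].blended - (1 - lambda) * maxSim;
      if (val > bestVal) {
        bestVal = val;
        bestIdx = i;
      }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

The diversity term is what keeps the top of the list from being five nearly identical listings even when all five score well against the taste vector.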
4. Architecture Overview
4.1 Major Components
| Component | Responsibility | Inputs | Outputs | Owns |
|---|---|---|---|---|
| Preference Wizard | Structured preference capture | User answers | Profile + weights | Question flow, validation |
| Query Classifier | Complexity + intent detection | Raw query | Complexity score, intent | Classification heuristics |
| Planner Agent | Query → structured IR | Query + context | QueryChips + weights | Chip schema |
| Vector Generator | Text/image → embeddings | Text, image URLs | 1024-dim vectors | Embedding API calls |
| ES Query Builder | IR → Elasticsearch DSL | QueryIR + vectors | ES query | Query construction |
| Judge Agent | Result quality evaluation | Query + top results | Quality score, revisions | Evaluation criteria |
| Scout Agent | Conversational interface | User message + scope | Response + actions | Tool orchestration, memory |
| Match Scorer | Property-profile alignment | Property + profile | Score 0-100 + breakdown | 8 sub-scorers |
| Taste Engine | Preference learning | User events | Taste vector | Event aggregation, decay |
4.2 Multi-Modal Embedding Architecture
Four distinct embedding spaces capture different property aspects:
| Space | Dimensions | Source | Captures |
|---|---|---|---|
| Description | 1024 | Listing text | Style, condition, lifestyle fit |
| Amenity | 1024 | Feature list | Kitchen quality, garage, pool |
| Location | 1024 | Address + enrichment | Walkability, schools, commute |
| Image | 1024 | Property photos | Visual style, light, condition |
Fusion: RRF (Reciprocal Rank Fusion) with 60% text weight, 40% image weight.
4.3 Communication Patterns
Sync: API → Agent → Tools → Database (request-response within 2s target)
Async: Scout memory persistence, taste event logging (fire-and-forget)
Retry: LLM calls have 3 retries with exponential backoff; ES queries timeout at 5s
4.4 LLM-Planned Search Sequence
┌──────┐ ┌───────────┐ ┌─────────┐ ┌───────────┐
│ User │────▶│ Classifier│────▶│ Planner │────▶│ Validator │
└──────┘ └───────────┘ └─────────┘ └───────────┘
│
┌──────┐ ┌───────────┐ ┌─────────┐ ┌────▼──────┐
│Result│◀────│ Judge │◀────│ ES │◀────│ Vectors │
└──────┘ │ (QA loop) │ │ Search │ │ Generator │
└───────────┘ └─────────┘ └───────────┘
If the Judge scores the results below 0.6, the revision handler adjusts the query and retries (max 2 iterations).
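The QA loop above can be sketched as plain control flow. This is an illustrative skeleton, not the actual `revision-handler.ts`; the `Plan`/`Listing` types and the callback signatures are stand-ins for the real agent calls.

```typescript
// Illustrative Judge QA loop: execute, judge the top 3, revise if quality
// is below 0.6, give up after two revisions with a best-effort warning.
// Plan, Listing, and the callbacks are stand-ins, not the real API.

type Plan = { text: string };
type Listing = { id: string };
interface JudgeVerdict { quality: number; suggestions: string[] }

async function searchWithQA(
  plan: Plan,
  execute: (p: Plan) => Promise<Listing[]>,
  judge: (p: Plan, top: Listing[]) => Promise<JudgeVerdict>,
  revise: (p: Plan, v: JudgeVerdict) => Promise<Plan>,
  maxRevisions = 2,
): Promise<{ results: Listing[]; warning?: string }> {
  let current = plan;
  let results = await execute(current);
  for (let attempt = 0; ; attempt++) {
    const verdict = await judge(current, results.slice(0, 3));
    if (verdict.quality >= 0.6) return { results };
    if (attempt >= maxRevisions) {
      // Max revisions reached: best-effort results plus a warning (see §7.1).
      return { results, warning: "May not match intent" };
    }
    current = await revise(current, verdict);
    results = await execute(current);
  }
}
```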
5. Key Design Decisions
Decision Index
| ID | Decision | Area | Status |
|---|---|---|---|
| D1 | RRF over linear fusion | Search | Adopted |
| D2 | 4 embedding spaces | Retrieval | Adopted |
| D3 | Heuristic-first classification | Latency | Adopted |
| D4 | 8-category match scoring | Explainability | Adopted |
| D5 | Event-based taste learning | Personalization | Adopted |
| D6 | Scope-based agent memory | Context | Adopted |
D1: RRF over Linear Fusion
- Context: Need to combine BM25 lexical scores with kNN vector scores
- Alternatives: Linear weighted sum, learned fusion weights, interleaving
- Chosen: Reciprocal Rank Fusion (RRF) with k=60
- Why: RRF is position-based and robust to score distribution variance. Doesn't require normalization. Well-tested in production search systems.
- Tradeoffs: Can't tune importance as precisely as learned weights; ignores score magnitude
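A minimal weighted-RRF sketch under the k=60 choice above. The function shape is illustrative, not the actual `fusion.ts` API.

```typescript
// Minimal weighted Reciprocal Rank Fusion sketch (k = 60). Each ranked
// list contributes weight / (k + rank) per document; documents missing
// from a list simply get no contribution from it.
// Illustrative only - not the actual fusion.ts implementation.

function rrfFuse(
  rankedLists: { ids: string[]; weight: number }[],
  k = 60,
): string[] {
  const scores = new Map<string, number>();
  for (const { ids, weight } of rankedLists) {
    ids.forEach((id, rank) => {
      // rank is 0-based here, so the best hit contributes weight / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

With the 60/40 split from §4.2 this would be called as `rrfFuse([{ ids: textHits, weight: 0.6 }, { ids: imageHits, weight: 0.4 }])`. Because only ranks matter, a retriever whose raw scores are wildly scaled cannot dominate the fusion, which is the robustness property D1 cites.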
D2: Four Embedding Spaces
- Context: Properties have multiple semantic axes (text, visuals, location, features)
- Alternatives: Single unified embedding, late fusion only
- Chosen: 4 separate 1024-dim embeddings + RRF fusion
- Why: Different embedding models excel at different domains. Allows per-axis weighting based on query type.
- Tradeoffs: 4x embedding cost; more complex indexing; harder to debug
D3: Heuristic-First Classification
- Context: Most queries are simple ("homes under 500k in Seattle") but LLM classification adds 500ms
- Alternatives: Always classify via LLM, rule-based only
- Chosen: Heuristic check first; skip LLM if confidence > 90%
- Why: 70% of queries are simple. Saves 500ms latency for majority case.
- Tradeoffs: May misclassify edge cases; heuristics need maintenance
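A toy version of the heuristic gate shows the shape of the check. The patterns below are examples invented for illustration, not the production rule set.

```typescript
// Illustrative heuristic gate for D3: if the query is covered by simple
// price / bed / location patterns and has no semantic markers, skip LLM
// classification entirely. Patterns are examples, not the real rules.

const SIMPLE_PATTERNS = [
  /\bunder\s*\$?\d+[km]?\b/i,      // "under 500k", "under $800K"
  /\b\d+\s*(bed|bath|br|ba)s?\b/i, // "3 bed", "2 baths"
  /\bin\s+[A-Z][a-z]+/,            // "in Seattle"
];

const COMPLEX_MARKERS = /\b(like|feels?|vibe|style|modern|cozy|similar)\b/i;

function needsLlmClassification(query: string): boolean {
  if (COMPLEX_MARKERS.test(query)) return true; // clearly semantic intent
  const hits = SIMPLE_PATTERNS.filter((p) => p.test(query)).length;
  return hits === 0; // nothing recognizably structured: let the LLM decide
}
```

Misses are cheap in one direction only: a simple query wrongly sent to the LLM costs 500ms, while a semantic query wrongly kept on the fast path degrades results, so the markers err toward escalation.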
D4: 8-Category Match Scoring
- Context: Users need to understand why a property matches (or doesn't)
- Alternatives: Single score, 3-tier (good/okay/poor), vector similarity only
- Chosen: 8 weighted categories: Budget, Structure, Location, Schools, Lifestyle, Visual, Investment, Policy
- Why: Maps to how buyers actually think. Enables filtering by category. Supports partial matches.
- Tradeoffs: Complex weight tuning; users may disagree with category importance
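The overall 0-100 number can be read as a weighted mean of the eight category scores. A sketch, with an illustrative weighting scheme; the real sub-scorers live in `sub-score-calculators.ts`.

```typescript
// Sketch of combining the 8 category scores into a 0-100 overall.
// Category names come from D4; the weighting scheme is illustrative.

type Category =
  | "budget" | "structure" | "location" | "schools"
  | "lifestyle" | "visual" | "investment" | "policy";

interface CategoryScore {
  category: Category;
  score: number;  // 0-100 within the category
  weight: number; // relative importance, e.g. from the user profile
}

// Overall score = weight-normalized mean of the category scores,
// rounded to two decimals.
function overallScore(breakdown: CategoryScore[]): number {
  const totalWeight = breakdown.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = breakdown.reduce((sum, c) => sum + c.score * c.weight, 0);
  return Math.round((weighted / totalWeight) * 100) / 100;
}
```

Keeping per-category scores alongside the overall is what makes the "87% match because..." explanations in §8.2 possible: the breakdown is the explanation.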
D5: Event-Based Taste Learning
- Context: Preferences should improve without forcing explicit feedback
- Alternatives: Explicit ratings only, collaborative filtering
- Chosen: Capture all events (save=1.0, hide=-0.5, view=0.1) + recency decay + feature aggregation
- Why: Rich signal without friction. Adapts to evolving taste.
- Tradeoffs: Cold start problem; noisy signals from accidental clicks
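The event weights above, combined with recency decay, can be sketched as a per-feature aggregation. The 30-day half-life and string-keyed features are assumptions made for illustration.

```typescript
// Sketch of D5: each event contributes its action weight (save = 1.0,
// hide = -0.5, view = 0.1), decayed by age, summed per extracted feature.
// The 30-day half-life and feature representation are assumptions.

interface TasteEvent {
  kind: "view" | "save" | "hide";
  features: string[]; // e.g. ["modern", "pool"], extracted from the listing
  createdAt: Date;
}

const ACTION_WEIGHT = { view: 0.1, save: 1.0, hide: -0.5 } as const;
const HALF_LIFE_DAYS = 30; // assumed; old signals fade rather than vanish

function tasteProfile(events: TasteEvent[], now: Date): Map<string, number> {
  const profile = new Map<string, number>();
  for (const e of events) {
    const ageDays = (now.getTime() - e.createdAt.getTime()) / 86_400_000;
    const decay = Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
    for (const feature of e.features) {
      profile.set(feature, (profile.get(feature) ?? 0) + ACTION_WEIGHT[e.kind] * decay);
    }
  }
  return profile;
}
```

The low view weight (0.1) is what makes the noisy-click tradeoff tolerable: one accidental click is a tenth of a deliberate save.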
D6: Scope-Based Agent Memory
- Context: Scout needs different context when discussing a specific property vs. general search
- Alternatives: Single global thread, ephemeral memory, topic detection
- Chosen: 6 scopes (global, property, collection, area, compare, tour) with separate threads
- Why: Clean context separation. No cross-contamination. Enables scope-specific prompts.
- Tradeoffs: Can't reference across scopes; more threads to manage
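One way to read "separate threads per scope" is a composite thread key. The key format below is a guess at the shape, not the actual Scout memory layout.

```typescript
// Illustrative scope-keyed thread lookup for D6. The key format is an
// assumption; the point is that (user, scope, subject) isolates context.

type Scope = "global" | "property" | "collection" | "area" | "compare" | "tour";

function threadKey(userId: string, scope: Scope, subjectId?: string): string {
  // Global scope has no subject; every other scope is keyed by one
  // (a property id, collection id, area slug, and so on).
  return scope === "global"
    ? `${userId}:global`
    : `${userId}:${scope}:${subjectId ?? "unscoped"}`;
}
```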
6. Code-Level Mapping
6.1 Directory Structure
/app
  /lib
    /search                        # LLM-Planned Search V2
      llm-planned-orchestrator.ts  # Main pipeline
      query-classifier.ts          # Complexity detection
      planner-agent.ts             # QueryIR generation
      chip-validator.ts            # Chip validation
      es-query-builder.ts          # ES DSL construction
      judge-agent.ts               # Result QA
      revision-handler.ts          # Query revision loop
      vector-service.ts            # Embedding generation
      fusion.ts                    # RRF implementation
      visual-cues.ts               # Image query detection
    /scout                         # Conversational Agent
      agent.ts                     # Mastra agent definition
      tools/                       # 12 tool implementations
        actions/properties.ts      # save/hide/note
        personalization/taste.ts   # taste vector
        personalization/rank.ts    # MMR blending
    /matching                      # Match Scoring
      match-scorer.ts              # 8-category scorer
      sub-score-calculators.ts     # Category implementations
    /wizard                        # Preference Wizard
      types.ts                     # Question schemas
      database-sync.ts             # Profile persistence
  /db
    schema/                        # Drizzle schemas
      profiles.ts                  # User profile (100+ fields)
      properties.ts                # Property + vectors
      scout.ts                     # Scout threads/messages
      collaboration.ts             # Comments/sessions
/src
  /mastra
    index.ts                       # Mastra configuration
    tools.ts                       # Tool definitions
6.2 Key File Responsibilities
| File | Lines | Responsibility |
|---|---|---|
| llm-planned-orchestrator.ts | ~800 | Orchestrates 8-stage search pipeline |
| agent.ts | ~300 | Scout agent with Mastra memory + tools |
| match-scorer.ts | ~600 | Computes 0-100 match with breakdown |
| taste.ts | ~200 | Event aggregation + feature extraction |
| vector-service.ts | ~400 | Text + multimodal embedding API calls |
| es-query-builder.ts | ~500 | Builds ES DSL from QueryIR |
6.3 Key Interfaces
// QueryIR - Intermediate Representation
interface QueryIR {
  chips: QueryChip[];          // Extracted search parameters
  vectors: QueryVectors;       // 4 embedding types
  weights: WeightProfile;      // Per-vector importance
  hardFilters: ESFilter[];     // Must-match constraints
  softPreferences: string[];   // Nice-to-have features
}

// ScoutTasteEvent - Preference Signal
interface ScoutTasteEvent {
  kind: 'view' | 'save' | 'hide' | 'shortlist' | 'note';
  propertyId: string;
  weight: number;              // Action importance
  meta: PropertyMeta;          // Extracted features
  createdAt: Date;
}

// MatchResult - Scoring Output
interface MatchResult {
  score: number;               // 0-100 overall
  breakdown: CategoryScore[];  // 8 categories
  reasons: string[];           // Human-readable explanations
}
7. Failure Modes & Edge Cases
7.1 Search Failures
| Failure | Detection | Mitigation | User Impact |
|---|---|---|---|
| Empty results with filters | Result count = 0 | Progressive filter relaxation | Wider but relevant results |
| Visual query for vacant land | Property type = land | 95% image weight penalty | Correct ranking |
| Negated queries ("no pool") | NOT chip detected | Convert to dealBreaker | Hard exclusion |
| LLM planning timeout | >5s response | Fall back to keyword search | Degraded but working |
| Judge rejects all results | Max revisions reached | Return best-effort + warning | "May not match intent" |
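Progressive filter relaxation from the table above can be sketched as an ordered retry loop. The relaxation order (beds first, then price, then geography) and the step sizes are assumptions; the production order may differ.

```typescript
// Sketch of progressive filter relaxation: when a query returns zero
// hits, constraints are loosened in priority order until results appear.
// The Filters shape and relaxation steps are illustrative.

interface Filters {
  maxPrice?: number;
  minBeds?: number;
  city?: string;
}

const RELAXATION_STEPS: ((f: Filters) => Filters)[] = [
  (f) => ({ ...f, minBeds: f.minBeds ? f.minBeds - 1 : undefined }),
  (f) => ({ ...f, maxPrice: f.maxPrice ? f.maxPrice * 1.1 : undefined }),
  (f) => ({ ...f, city: undefined }), // widen geography last
];

async function searchWithRelaxation(
  filters: Filters,
  run: (f: Filters) => Promise<number>, // returns hit count
): Promise<{ filters: Filters; hits: number; relaxed: number }> {
  let current = filters;
  let hits = await run(current);
  let relaxed = 0;
  for (const step of RELAXATION_STEPS) {
    if (hits > 0) break;
    current = step(current);
    relaxed++;
    hits = await run(current);
  }
  return { filters: current, hits, relaxed };
}
```

Returning the relaxation count lets the UI say which constraints were loosened, which keeps "wider but relevant results" from feeling like a silent mismatch.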
7.2 Agent Failures
| Failure | Detection | Mitigation | User Impact |
|---|---|---|---|
| Tool execution timeout | >10s per tool | Return partial + error | "Action incomplete" |
| Memory context overflow | >127k tokens | TokenLimiter processor | Older context trimmed |
| LLM rate limit | 429 response | Backoff + fallback model | Slower but functional |
| Idempotency violation | Duplicate cmdId | Return cached result | No double-save |
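The idempotency row above can be sketched as a cmdId-keyed result cache; the in-memory Map stands in for whatever store the real system uses.

```typescript
// Illustrative idempotency guard: a repeated cmdId returns the cached
// result instead of re-executing the action (no double-save).
// The in-memory Map is a stand-in for the real store.

const resultCache = new Map<string, unknown>();

async function runOnce<T>(cmdId: string, action: () => Promise<T>): Promise<T> {
  if (resultCache.has(cmdId)) return resultCache.get(cmdId) as T;
  const result = await action();
  resultCache.set(cmdId, result);
  return result;
}
```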
7.3 Data Integrity
- Orphaned events: Property deleted but taste events remain → Ignored in aggregation
- Stale embeddings: Property updated but vectors unchanged → Nightly re-embedding
- Profile mismatch: Profile updated mid-session → Re-fetch on next search
8. Tradeoffs & Constraints
8.1 Speed vs. Accuracy
The system is optimized for perceived responsiveness over theoretical optimality:
| Decision | Rationale |
|---|---|
| Heuristic-first classification | Most queries are simple ("homes in Greenville under 400k"). Skip LLM classification when regex + keyword detection suffices. |
| Single-pass planning | Revision loops only trigger when Judge detects poor results. Most queries succeed on first pass. |
| RRF over learned fusion | RRF with k=60 is robust across query types without training data. Learned fusion would require labeled datasets we don't have. |
| Pre-computed image embeddings | Multimodal embedding at ingest time, not query time. Trades freshness for latency. |
8.2 UX vs. Engineering Cost
| Feature | UX Win | Cost | Decision |
|---|---|---|---|
| 8-category scoring | Explainable matches ("87% match because...") | 8 separate computations per property | Worth it |
| Streaming responses | Perceived speed during agent responses | SSE infrastructure | Worth it |
| Undo capability | User confidence to experiment | Event sourcing for reversibility | Worth it |
| Real-time collaboration | Couples search together | WebSocket infrastructure | Deferred (polling for now) |
8.3 Constraints Accepted
- Cold start: New users get generic results until wizard completion or interaction history builds. The wizard mitigates this by front-loading preference capture.
- Photo quality bias: Professional photography scores higher in visual matching. This reflects buyer perception reality.
- Context window ceiling: Agent memory uses Mastra's 127k token limit. Very old conversation context gets trimmed by the TokenLimiter processor.
- Online-only: All search and scoring requires network. Acceptable for the target use case (active home search).
8.4 Technical Debt
- Hardcoded RRF weights (60/40): Works well empirically but not A/B tested. Abstraction exists for future tuning.
- Single embedding provider: Voyage AI only. Interface abstraction exists for provider swapping.
- Polling-based collaboration: Comments and sessions use polling. WebSocket upgrade planned for real-time sync.
9. Security, Safety & Misuse
9.1 Authentication & Authorization
| Layer | Mechanism |
|---|---|
| Identity provider | Stytch B2B (magic links + OAuth) |
| Session management | JWT access tokens with refresh rotation |
| API authorization | Bearer token validation on all routes |
| Data isolation | PostgreSQL row-level security policies per user |
9.2 Data Boundaries
- Agent memory isolation: Each user's conversation history and preferences are scoped by userId. The agent cannot access other users' data.
- MLS compliance: Listing data is cached locally per MLS terms. No redistribution or scraping.
- Profile data: Wizard answers and taste events stored in user-scoped tables with RLS.
9.3 Agent Safety
- Tool scope: Agent tools can only modify the authenticated user's data (saves, hides, notes).
- No external actions: Agent cannot send emails, make API calls to external services, or access filesystem.
- Conversation context: System prompt is fixed; user messages cannot override agent instructions.
9.4 Fair Housing Compliance
- No protected class filtering: Search does not filter by race, religion, familial status, or other protected classes.
- School data disclosure: School ratings are shown for transparency but not used for algorithmic steering.
- Budget-based pricing: Price recommendations based on stated budget range, not demographic inference.
10. Observability & Debug
10.1 What Gets Logged
| Data | Where | Why |
|---|---|---|
| API requests/errors | Vercel logs | Error tracking and debugging |
| Search pipeline traces | Console (dev) | See each stage: classify → plan → validate → execute → judge |
| Agent conversation turns | PostgreSQL (messages table) | Conversation replay for debugging and memory |
10.2 Debug Playbook
| Symptom | First Check | Common Cause |
|---|---|---|
| Empty search results | ES query structure in logs | Over-constrained filters from planner |
| Slow agent responses | Tool execution duration | Database N+1 or slow ES query |
| Inconsistent match scores | Profile freshness | Stale cached profile data |
| Agent confusion | Context window usage | Old context trimmed, missing relevant history |
10.3 Search Trace Structure
Each search request logs a trace showing pipeline stage durations:
{
  traceId: "abc123",
  pipeline: "llm-planned-v2",
  stages: [
    { name: "classify", result: "complex" },
    { name: "plan", filters: 3, semantic: true },
    { name: "validate", passed: true },
    { name: "execute", hits: 47 },
    { name: "judge", revision: false }
  ]
}
When revision: true, the Judge triggered a re-plan, and stages repeat.
11. Evolution & Open Questions
11.1 Planned Improvements
- Real-time collaboration: WebSocket upgrade to replace polling for comments and comparison sessions. Currently functional but not real-time.
- Learned fusion weights: A/B test infrastructure to tune RRF weights per query type instead of fixed 60/40.
- Multi-agent routing: Specialist agents for specific tasks (tour planning, investment analysis) with automatic routing.
11.2 Open Technical Questions
- Cross-user taste transfer: Can users with similar profiles bootstrap cold-start faster? Requires privacy-preserving similarity computation.
- Explanation fidelity: Match score reasons are generated post-hoc. How do we verify they reflect actual model behavior?
- Preference stability: When should the system resist preference drift vs. adapt? A user viewing 10 ranches doesn't necessarily want ranches.
- Late vs. early fusion: RRF fuses text and image results post-retrieval. Would jointly-trained embeddings perform better?
11.3 Known Limitations
- US-only: MLS data access is US-specific. International would require different data sources.
- Residential focus: Commercial properties would need different embedding strategy and scoring dimensions.
- English only: LLM prompts and agent instructions are English. Internationalization is possible but not implemented.
12. Appendix
A. Search Evaluation Progression
Automated eval suite tracks search quality across 5 persona-based test cases (Family, Investor, Luxury, First-Time, Remote Worker). Results across one development cycle of four iterations:
| Iteration | Pass Rate | MedianOverall@10 | Constraint Violations | Key Fix |
|---|---|---|---|---|
| 1 | 0% | 0.0 | 744 | Baseline - planner generating invalid ES queries |
| 2 | 0% | 0.0 | 0 | Fixed constraint violations (valid queries, no results) |
| 3 | 20% | 58.0 | 0 | Added semantic boosting, Luxury case passes |
| 4 | 100% | 94.4 | 0 | Tuned filter relaxation, all cases pass |
Metrics explained:
- MedianOverall@10: Median match score of top 10 results (0-100 scale from 8-category scorer)
- Constraint Violations: Properties returned that violate hard constraints (e.g., over budget, wrong city)
- Pass threshold: MedianOverall@10 ≥ 70, zero constraint violations
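The pass criterion can be expressed directly; a sketch assuming the top-10 scores arrive as a plain array.

```typescript
// Sketch of the Appendix A pass criterion: median of the top-10 match
// scores must be >= 70 with zero constraint violations.

function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function evalPasses(top10Scores: number[], constraintViolations: number): boolean {
  return constraintViolations === 0 && median(top10Scores) >= 70;
}
```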
B. Test Cases
| Persona | Query | Final MedianOverall |
|---|---|---|
| Family Modern | "modern family home with updated kitchen near good schools" | 93.1 |
| Investor | "rental property with good cash flow potential" | 93.1 |
| Luxury | "luxury home with high-end finishes and gourmet kitchen" | 99.2 |
| First-Time | "move-in ready starter home good value" | 93.1 |
| Remote Worker | "quiet home with dedicated office space and good internet" | 93.1 |
C. Glossary
| Term | Definition |
|---|---|
| RRF | Reciprocal Rank Fusion - method to combine ranked lists from different retrievers |
| MedianOverall@10 | Median match score of top 10 search results |
| SearchPlan | Structured representation compiled from user profile + wizard answers |
| TasteEvent | User action (save, hide, view, dwell) that signals preference |
| Scope | Agent conversation context (global, property, collection, area, compare, tour) |
| Judge | LLM that evaluates search results and decides if revision is needed |