This week, we're doing something a little different on the Zuma blog. 👀
Instead of our property management insights, we wanted to give you a peek under the hood at the technology that powers our AI leasing agent, Kelsey.
A few weeks ago, our Head of Machine Learning, Sid Ramesh, published an article on Medium about the challenges and lessons learned while deploying large language models (LLMs) in an enterprise setting.
We thought this would be a perfect opportunity to share his expertise with our community and provide a deeper understanding of how Zuma's technology works!
In "The Reality of Deploying LLMs in Enterprise: 3 Lessons from the Field," Sid breaks down the critical insights we've gained about transparency, evaluation metrics, and retrieval strategies.
These lessons have shaped how we build trustworthy, efficient AI systems that deliver real value to property management teams.
What makes Zuma unique is our Human + AI Hybrid Approach. We're the only platform combining AI and human expertise for 24/7 lead conversion, which helps keep every interaction with Kelsey high quality and reduces handoffs to property management teams.
This hybrid approach is built on the technological foundation that Sid discusses in his article.
–
Introduction
At Zuma, our mission is to revolutionize property management by automating mundane tasks in leasing operations, enabling teams to focus on building meaningful in-person relationships with residents. Our AI leasing agent, Kelsey, interacts with prospective renters — answering questions, providing quotes, booking tours, and generally streamlining the leasing process. As Head of ML, my role is to ensure Kelsey is not only intelligent but also trustworthy, efficient, and scalable.
In late 2023, we re-architected Kelsey with a large language model (LLM) architecture, aiming to improve recall and deliver more natural, human-like interactions. Within weeks of launching to real customers in early 2024, we encountered a wave of escalations, exposing several hard truths about running LLMs in production. Our ability to move this fast wasn’t just by design — it was the advantage of being a lean, agile startup, where time to production is measured in days, not months. This rapid deployment forced us to rethink our approaches to transparency, evaluation, and retrieval strategies. Through this journey, we uncovered three pivotal lessons that reshaped how we build AI-powered workflows.
- Transparency Matters — In enterprise AI, trust is paramount. Every piece of information must be traceable to its source, necessitating a robust data infrastructure that inherently supports transparency.
- LLM Evaluations Are Domain-Specific — Conventional NLP metrics like precision and recall fail to capture business risks adequately. While misidentifying flooring is one concern, quoting an incorrect rent price is a far graver issue. Evaluations must reflect real-world stakes.
- RAG Is Overrated — Although retrieval-augmented generation (RAG) assists with long-tail queries, we found that structured data and clean APIs were far more effective for common questions. Implementing small, dynamic prompts outperformed more complex retrieval systems.
These insights have fundamentally reshaped our approach to building AI-powered workflows at Zuma. In this post, I will delve into each lesson — exploring its significance and offering guidance to teams deploying LLMs in enterprise settings to avoid similar challenges.
Let’s dive in.
Lesson 1: Transparency Matters
When deploying an LLM in an enterprise setting, transparency isn’t just a nice-to-have — it’s a requirement, and the bar for accuracy is 95%. Trust is everything, and customers expect AI-driven workflows to provide clear, verifiable answers, not just confident-sounding responses. In property management, where lease terms, pricing, and policies impact real financial decisions, even a slight ambiguity can lead to frustration, lost deals, or legal liability.
The Challenge: LLMs as Black Boxes
One of the first issues we encountered after launching our LLM-powered Kelsey was that it wasn’t always clear where the AI was pulling its information from. If a customer asked, “What’s the pet policy?” or “What’s the rent for a 2-bedroom?”, the AI would generate a response — but how could we guarantee it was accurate? More importantly, how could we prove to our customers where that answer came from?
LLMs are notorious for hallucinating information, which meant that even when Kelsey gave a confident response, we needed a way to trace every piece of information back to its source. Without this, leasing teams couldn’t trust the AI, and customers had no reason to trust it either.
The Solution: Data Infrastructure Built for Traceability
To solve this, we redesigned our data pipeline to prioritize transparent sourcing at every step. This meant:
- Structured Data First — Instead of relying on the LLM to generate free-form answers, we structured as much of our data as possible through clean APIs that pulled directly from property management systems. This ensured that key details — like pricing, availability, and policies — were always grounded in authoritative sources. This came with its own challenges: the systems we pulled from didn’t always have clean data themselves, so we invested in platform tooling that makes it easy to curate clean data for our AI to use.
- Citations in Responses — Whenever Kelsey provides an answer, we include a reference to where that information came from. This could be a specific unit, or the internal tool the LLM mapped the question to (e.g., pet-policy-question). A minimal sketch of this pattern follows this list.
- Hybrid Retrieval Approaches — While Retrieval-Augmented Generation (RAG) can be useful, we found that simpler approaches — like structured lookups and well-defined data schemas — outperformed RAG for transparency. When structured data wasn’t enough, we leveraged RAG only for long-tail, unstructured queries, keeping the core experience as deterministic as possible. We then worked on a clean UX to continually add to the long tail of knowledge.
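To make the citation idea concrete, here is a minimal sketch of what a traceable answer can look like. The names here (`TracedAnswer`, `pms_client.get_policy`) are illustrative rather than our actual internal API; the point is simply that every answer carries its source alongside the text, instead of relying on free-form generation.

```python
from dataclasses import dataclass

@dataclass
class TracedAnswer:
    answer: str
    source: str        # e.g. "pms:205-birch/policies/pets" or "tool:pet-policy-question"
    last_updated: str  # timestamp of the underlying record

def answer_pet_policy(property_id: str, pms_client) -> TracedAnswer:
    """Answer a structured question from the system of record, not from free-form generation."""
    record = pms_client.get_policy(property_id, "pets")  # authoritative lookup (hypothetical client)
    return TracedAnswer(
        answer=f"Pets: {record['summary']}",
        source=f"pms:{property_id}/policies/pets",
        last_updated=record["updated_at"],
    )
```

Because the source travels with the answer, leasing teams can verify any response in one click rather than taking the model's word for it.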
The Takeaway: Trust Comes From Traceability
Transparency isn’t just about what the AI says — it’s about being able to prove why it said it. In enterprise AI deployments, trust is earned, not assumed, and that trust is built on clear, verifiable data pipelines.
For any team deploying LLMs in high-stakes business settings, ask yourself:
✅ Can I trace every AI-generated response back to a source?
✅ Does my data infrastructure support transparency by design?
✅ Am I using LLMs where they make sense, or am I overcomplicating retrieval?
By designing AI workflows that prioritize traceability, structured data, and clear sourcing, we turned Kelsey from an AI that sounded smart into one that leasing teams could actually trust.
Lesson 2: LLM Evaluations Are Domain-Specific
One of the biggest misconceptions about deploying LLMs in enterprise settings is that standard NLP evaluation metrics — like precision, recall, and BLEU scores — are enough to measure performance. The reality? These metrics fail to capture real-world business risks.
In property management, not all mistakes carry the same weight. If Kelsey misidentifies the type of flooring in a unit, it’s an inconvenience. But if Kelsey quotes the wrong rent price, it’s a potential legal and financial disaster.
This meant that evaluating our LLM’s performance wasn’t just about accuracy — it was about impact. We had to rethink how we measured success in a way that aligned with business priorities.
The Challenge: Standard NLP Metrics Don’t Capture Business Risk
Initially, we evaluated Kelsey’s performance using traditional NLP benchmarks — precision, recall, F1 score, and even embedding similarity for retrieval-based answers. However, we quickly ran into a major issue:
🔹 A 90% accuracy score meant nothing if the 10% error rate included critical mistakes.
🔹 A perfect response in an academic NLP benchmark could still be useless to a leasing agent.
🔹 Some mistakes were negligible, while others had serious consequences — but our metrics treated them the same.
For example, Kelsey might:
✔️ Correctly retrieve a unit’s square footage (low business risk)
✔️ Correctly retrieve the building’s pet policy (moderate business risk)
❌ Quote the wrong rent price (high business risk)
From a model evaluation perspective, all three responses were just “correct” or “incorrect.” But from a business perspective, one of those mistakes was exponentially worse than the others.
That’s when we realized: our evaluation metrics needed to reflect the real-world stakes of each mistake.
The Solution: Risk-Weighted Evaluations
To build an evaluation system that aligned with business risk, we designed a custom risk-weighted framework that categorized model outputs into three tiers:
🔹 Low-Risk Mistakes (Minor inaccuracies, non-critical details)
- Example: Misstating flooring type
- Impact: Mild inconvenience
- Evaluation Handling: Standard NLP metrics were fine here.
🔸 Medium-Risk Mistakes (Important but non-fatal errors)
- Example: Incorrectly stating pet fees or whether a unit has an in-unit washer-dryer
- Impact: Potential friction but fixable
- Evaluation Handling: Flagged for human review in post-processing.
🔴 High-Risk Mistakes (Critical errors that affect pricing, lease terms, or legal policies)
- Example: Quoting an incorrect rent price
- Impact: Financial or legal liability
- Evaluation Handling: Automated hard constraints — Kelsey could NOT answer without structured data.
By assigning different weights to different mistake types, we ensured that our evaluation metrics didn’t just measure correctness — they measured business-critical correctness.
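As a rough illustration of what risk weighting means in practice, the sketch below scores a batch of labeled responses so that a single high-risk miss dominates many low-risk wins. The weights and field names are hypothetical; the real framework is tied to our own intent taxonomy.

```python
# Hypothetical risk weights per tier; in practice these come from the business, not the model.
RISK_WEIGHTS = {"low": 1.0, "medium": 5.0, "high": 25.0}

def risk_weighted_accuracy(results):
    """results: list of dicts like {"risk": "high", "correct": False}."""
    total = sum(RISK_WEIGHTS[r["risk"]] for r in results)
    earned = sum(RISK_WEIGHTS[r["risk"]] for r in results if r["correct"])
    return earned / total if total else 0.0

# Ten correct low-risk answers cannot hide one wrong rent quote:
results = [{"risk": "low", "correct": True}] * 10 + [{"risk": "high", "correct": False}]
print(risk_weighted_accuracy(results))  # ~0.29, versus ~0.91 under plain accuracy
```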
Key Adjustments We Made
1️⃣ Confidence Scoring for Every Response
Rather than blindly trusting the model’s output, we had Kelsey generate a confidence score for each response. This allowed us to dynamically assess whether the AI was certain enough to respond autonomously or if additional validation was needed.
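A minimal sketch of what attaching a confidence score can look like is below. How the score is actually produced (model self-assessment, retrieval match strength, log-probabilities) is an implementation detail, and the blend shown here is made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class ScoredResponse:
    text: str
    confidence: float  # 0.0-1.0, attached to every outgoing answer

def score_response(text: str, model_self_score: float, retrieval_score: float) -> ScoredResponse:
    """Blend the model's self-reported certainty with retrieval match strength.
    The weights are illustrative; any calibrated signal could be used instead."""
    confidence = 0.6 * model_self_score + 0.4 * retrieval_score
    return ScoredResponse(text=text, confidence=round(confidence, 2))
```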
2️⃣ Requiring the Model to Output Reasoning
Every response wasn’t just an answer — it included an explanation of how the model arrived at that conclusion. This reasoning exposed which data sources the model used and the logical steps it took, making it easier to debug errors and ensure alignment with real-world policies.
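A rough sketch of this output contract, assuming the model is prompted to return JSON with its reasoning and sources. The field names are illustrative, not our exact schema.

```python
import json

# Illustrative output contract: the model must explain itself, not just answer.
REQUIRED_FIELDS = ("answer", "reasoning", "sources")

def parse_model_output(raw: str) -> dict:
    """Reject any response that doesn't include its reasoning and the sources it used."""
    parsed = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in parsed]
    if missing:
        raise ValueError(f"Model output missing required fields: {missing}")
    return parsed

# Example of a well-formed output:
# {"answer": "The pet fee is $300.", "reasoning": "Pulled from the pet policy record.",
#  "sources": ["pms:205-birch/policies/pets"]}
```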
3️⃣ Post-Processing Validation for Critical Responses
For high-risk outputs (like rent prices and lease terms), we didn’t rely solely on the LLM’s response. Before sending an answer, we cross-checked it against our existing property data. If the generated answer didn’t match the authoritative source, it was either corrected or discarded.
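Here is a minimal sketch of that cross-check for rent prices. The `pricing_table` here simply stands in for whatever system of record holds pricing; it is not a real API.

```python
def validate_rent_quote(generated_price: float, unit_id: str, pricing_table: dict):
    """Cross-check a high-risk value against the authoritative record before sending.
    `pricing_table` stands in for the system of record that holds rents."""
    true_price = pricing_table[unit_id]["rent"]
    if abs(generated_price - true_price) > 0.01:
        # Never send a mismatched price: replace it with the source of truth (or discard the reply).
        return true_price, "corrected_from_source"
    return generated_price, "verified"
```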
4️⃣ Human-in-the-Loop for Uncertain Responses
When the AI determined that an answer wasn’t reliable enough to send, it didn’t guess. Instead, it flagged the response for human review, ensuring that uncertain or high-impact queries were handled correctly without unnecessary automation risks.
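A sketch of that escalation logic, combining the risk tiers from above with the per-response confidence score. The thresholds are illustrative, not our production values.

```python
def dispatch(response, risk_tier: str, confidence: float, review_queue: list) -> str:
    """Escalate rather than guess: uncertain or high-impact answers go to a human first."""
    threshold = {"low": 0.5, "medium": 0.75, "high": 0.95}[risk_tier]  # illustrative thresholds
    if confidence < threshold:
        review_queue.append(response)  # a leasing specialist reviews before anything is sent
        return "flagged_for_review"
    return "sent"
```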
The Takeaway: Evaluations Must Reflect Real-World Impact
If you’re deploying an LLM in high-stakes enterprise applications, ask yourself:
✅ Are you treating all errors equally, or are you weighing high-risk mistakes more heavily?
✅ Do your evaluation metrics align with actual business impact?
✅ Are you using traditional NLP benchmarks without considering real-world consequences?
At Zuma, redefining our evaluation strategy wasn’t just about improving accuracy — it was about minimizing business risk.
By building a risk-aware evaluation framework, we ensured that Kelsey doesn’t just perform well in benchmarks — it performs well where it actually matters.
Lesson 3: RAG Is Overrated — Until It’s Not
When we first deployed Kelsey, we leaned heavily on Retrieval-Augmented Generation (RAG) for everything except pricing. Our assumption was that RAG would allow Kelsey to dynamically pull from leasing policies, FAQs, internal documentation, and web sources, ensuring accurate and flexible responses.
But in practice, over-relying on RAG introduced more problems than it solved:
1️⃣ Conflicting sources created inconsistent answers.
- A manually entered document might list one pet policy, while a web-scraped version said something slightly different.
- If a leasing team uploaded a document yesterday, but the website hadn’t been updated for months, which one should Kelsey trust?
2️⃣ Thresholding answers wasn’t universal.
- We initially tried to filter out low-confidence RAG responses, but confidence scores weren’t normalized across different customer datasets.
- Some customers had clean, structured knowledge bases, while others had incomplete or messy data — making confidence scores unreliable.
3️⃣ Retrieval lacked interpretability, making debugging harder.
- When Kelsey fetched structured data, we could immediately see what went wrong in a lookup.
- But when Kelsey retrieved a chunk of text from RAG, debugging was more difficult — was the problem in retrieval, document quality, or LLM interpretation?
We realized that RAG, while useful, was not the best tool for handling frequent, structured questions — we needed a hybrid approach.
The Solution: Hybrid Retrieval with Structured Lookups
To solve these challenges, we re-architected our retrieval strategy to prioritize accuracy, consistency, and interpretability by separating structured lookups and RAG.
✅ Structured Lookups as the Primary Source of Truth
For high-frequency, structured questions (e.g., pet policies, parking, lease terms), we stopped using RAG and built structured API lookups instead. This improved:
✔️ Speed — Lookups are direct, eliminating unnecessary retrieval steps.
✔️ Accuracy — Structured queries pull from a single authoritative source instead of guessing.
✔️ Consistency — Responses don’t vary depending on retrieval success.
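A rough sketch of the split: known, structured intents go straight to an API lookup, and only what falls outside that set reaches RAG. The intent names, `pms_client`, and `rag_index` are hypothetical stand-ins for our internal services.

```python
STRUCTURED_INTENTS = {"pet_policy", "parking", "lease_terms", "pricing"}

def answer(intent: str, property_id: str, pms_client, rag_index):
    if intent in STRUCTURED_INTENTS:
        # Deterministic path: one authoritative source, no retrieval step.
        return pms_client.lookup(property_id, intent)
    # Long-tail path: retrieval over manually uploaded documents only (next section).
    return rag_index.query(property_id, intent)
```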
✅ RAG Only Uses Manual Sources
To prevent conflicts and unpredictability, we limited RAG to manual sources only.
- RAG does not retrieve from automated sources (PMS integration, web scrapes, etc.).
- Customers must upload documents manually for them to be used in RAG retrieval.
- No reconciliation is performed on RAG outputs — it pulls from a single manual document.
This prevented RAG from returning inconsistent or outdated data and ensured that manual retrieval remained clear, predictable, and controllable.
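In practice this is little more than a filter at indexing time. A minimal sketch, assuming each document carries a `source_type` tag:

```python
def build_rag_corpus(documents: list[dict]) -> list[dict]:
    """Only manually uploaded documents enter the RAG index; automated sources
    (PMS syncs, web scrapes) are excluded so retrieval can't surface conflicting or stale text."""
    return [doc for doc in documents if doc.get("source_type") == "manual"]
```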
✅ Exposing Metadata for Leasing Teams
While prospects only see the final answer, leasing teams can view:
- Which source was used (manual document, PMS, or web scrape).
- Last update timestamp of the data.
- Confidence score and retrieval details.
This helps teams debug responses and manually override them when needed.
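An example of what that metadata might look like when attached to an answer; the field names and values are purely illustrative.

```python
# Metadata attached to each answer, visible only in the leasing team's view:
answer_metadata = {
    "source": "manual_document:pet-addendum-2024.pdf",  # or "pms:..." / "web_scrape:..."
    "last_updated": "2024-03-02T14:05:00Z",
    "confidence": 0.87,
    "retrieved_chunks": ["Pet addendum, section 2"],
}
```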
✅ Making Retrieval More Interpretable
✔️ Structured lookups are easier to debug — Failed API queries are binary (success/fail).
✔️ RAG responses remain transparent — Leasing teams can see what documents were retrieved to understand why the AI responded a certain way.
Why This Works
🔹 Lookups are deterministic — If conflicts exist, we handle them systematically.
🔹 RAG remains predictable — Since it only retrieves from a single manual source, it never pulls contradictory data.
🔹 Customers control automation — They can override automated responses or manually select trusted sources.
Conclusion: Building Practical, Trustworthy AI for Enterprise
Deploying LLMs in enterprise settings is not just a technical challenge — it’s a trust challenge. At Zuma, our journey of re-architecting Kelsey taught us that accuracy alone is not enough. AI must be transparent, aligned with business risk, and backed by a solid retrieval strategy.
Our key takeaways:
✅ Transparency is essential — AI-generated responses must be traceable, ensuring that both customers and internal teams understand the source of truth.
✅ Evaluations must reflect business risk — Not all mistakes are equal. High-impact errors require domain-specific evaluation metrics, not just standard NLP benchmarks.
✅ RAG is a tool, not a default — For high-frequency, structured queries, structured lookups outperform RAG in speed, reliability, and interpretability.
Looking ahead, we believe that enterprise AI must be built with flexibility — allowing teams to balance automation with human oversight, structured data with retrieval, and scalability with trust.
If you’re deploying LLM-powered systems in an enterprise setting, we’d love to hear your experiences. How are you tackling transparency, evaluation, and retrieval in your AI workflows?