RAG engineer at a B2B SaaS
A mid-level RAG engineer who shipped a retrieval system over internal support tickets. The bullets that land share a pattern: every retrieval metric has a baseline, every model choice has a reason, every latency number names the p95. Where they slip is the framework-list bullet, which gets rewritten in front of you.
Maya Rao
Education
Experience
- Shipped the company's first production retrieval system over 220k internal support tickets; hybrid retrieval (BM25 + BGE-large) with a Cohere reranker on the top 50; Recall@10 improved from a 0.61 BM25-only baseline to 0.87 on a 500-query labeled eval; p95 query latency under 400ms with HNSW indexing in pgvector.
- Authored the team's first labeled retrieval eval set: 500 queries with relevance graded by 3 senior support engineers on a 4-point scale; harness runs nightly and gates 100% of retriever changes.
- Ran a 2,000-query embedding-model bake-off across BGE-large, text-embedding-3-large, and a quantized BGE-small; kept BGE-large for the high-recall path and routed long-tail queries to BGE-small, cutting monthly embedding cost by 34% with no measurable Recall@10 regression.
- Used vector databases and LLMs to build search.
- Cut p95 retrieval latency from 920ms to 380ms by switching from a flat IVF index to HNSW and moving rerank batching off the hot path; held a Recall@10 regression budget of <1.5% and shipped at 0.4%.
- Built the chunk-strategy A/B harness: recursive 512-token with 64-token overlap beat fixed 256-token and full-doc on Recall@5 by 11 points; documented the test and removed three competing chunkers from the codebase.
- Owned the retrieval on-call rotation for 2 quarters across 38 incidents; median time to detect 4 minutes, median time to mitigate 22 minutes; wrote the retrieval-quality runbook now used by 2 adjacent teams.
- Built the first ranking model for the support search surface: gradient-boosted ranker on 14M historical click pairs; nDCG@10 from 0.42 to 0.58 on a 1,200-query holdout; served at 6K req/s on a single instance with 110ms p95.
- Migrated the embedding pipeline from a managed API to a self-hosted Sentence-Transformers serving stack on Triton; cut monthly inference cost from $11.4k to $1.8k with no measurable Recall@5 regression on the production eval.
- Worked closely with the product team on improving search relevance.
- Designed the feature store for the ranking model on top of a Postgres + Redis split; cut feature-fetch p95 from 78ms to 14ms and removed 4 hot-path API calls from the search path.
- Authored the team's offline-vs-online eval harness; surfaced a 9% offline-online gap caused by stale features and shipped a 30-minute refresh cadence that closed the gap to 1.2%.
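The hybrid-retrieval bullet doesn't name its fusion step, so as a minimal sketch, assuming reciprocal-rank fusion over the two ranked lists before the reranker sees the top 50 (function and argument names are hypothetical):

```python
def rrf_fuse(bm25_ids, dense_ids, k=60, top_n=50):
    """Reciprocal-rank fusion: merge two ranked doc-id lists by
    summing 1/(k + rank) across rankings. k=60 is a common default."""
    scores = {}
    for ranking in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; this list would then go to the reranker.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

RRF needs no score calibration between BM25 and cosine similarity, which is one reason it is a common choice for hybrid retrieval; the system's actual fusion step may differ.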
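Recall@k against a labeled eval set like the 500-query one reduces to a few lines; a sketch with hypothetical names (a real nightly harness also handles grade thresholds and scheduling — on a 4-point scale you'd first binarize, e.g. grade >= 2 counts as relevant, which is an assumption here):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of labeled-relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall(runs, k=10):
    """runs: iterable of (retrieved_ids, relevant_ids), one per query."""
    runs = list(runs)
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
```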
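Latency claims like "p95 from 920ms to 380ms" are only comparable when everyone computes the percentile the same way; a nearest-rank p95 in integer arithmetic, avoiding float rounding at the rank boundary (helper name hypothetical):

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    s = sorted(latencies_ms)
    rank = (95 * len(s) + 99) // 100  # integer ceiling of 0.95 * n
    return s[rank - 1]
```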
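The winning chunk strategy (512 tokens, 64-token overlap) has the shape of a sliding window; a simplified sketch — the bullet's splitter is recursive and boundary-aware, while this one just windows an already-tokenized list:

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into `size`-token chunks, with `overlap`
    tokens shared between consecutive chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```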
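The ranking bullet's nDCG@10 numbers, for reference, come from graded labels; a standard log2-discount formulation (the production metric may use a different gain scheme):

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal > 0 else 0.0
```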
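The Postgres + Redis split in the feature-store bullet is a classic read-through cache; a sketch with injected client callables, all hypothetical stand-ins for real Redis/Postgres clients:

```python
def get_features(entity_id, cache_get, db_fetch, cache_set, ttl_s=1800):
    """Read-through cache: try the cache first, fall back to the
    database, then backfill the cache so the next fetch for this
    entity stays off the hot path."""
    cached = cache_get(entity_id)
    if cached is not None:
        return cached
    row = db_fetch(entity_id)
    cache_set(entity_id, row, ttl_s)
    return row
```

The 30-minute TTL default mirrors the refresh cadence in the eval-harness bullet, but that pairing is an assumption.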
Projects
- 1,100 queries with 4-point relevance labels over a public support corpus; used by 6 third-party teams as a reproducible retrieval benchmark; cited in 2 internal write-ups at vendor companies.
Technical Skills
RAG resumes do not win on framework names. They win on the corpus size, the retrieval metric with a baseline, the latency budget, and one defensible system design choice. Everything else is decoration.