
GenAI, LLM, and RAG resume bullets, marked up by hand

Seven GenAI resume bullets every senior reviewer mentally skips, and the rewrites that survive a real conversation about your work.

Thejus Sunny · Rejectless · Engineering + hiring perspective

I have been reviewing a lot of GenAI resumes lately. Volume has roughly tripled in the last year. And the strangest thing keeps happening: I read one, set it aside, pick up the next, and feel like I just read the same document twice. RAG with LangChain. Fine-tuned an LLM. Vector database. Prompt engineering. Same words, same structure, same level of vagueness, applicant after applicant.

This is what happens when a wave of new tooling hits a job market all at once. Two years ago every ML resume opened with collaborative filtering. Last year it was transformers and CNNs. This year it is RAG and fine-tuning. The technologies change. The pattern does not. Generic verb, popular framework name, no measurement, no scope, no signal. A senior engineer reviewing GenAI applications has read the LangChain bullet a hundred times by lunch. They are not reading it anymore. Their eyes find it and skip past it like boilerplate at the bottom of an email.

If your bullet sits in that pile, it does nothing for you. The fix is not better wording. It is showing the specific things a senior reviewer is actually looking for: the retrieval metric, the eval method, the scale, the system design choice, the failure mode you handled. Below are seven bullets I see every single week, why each one gets skipped, and the rewrite that does not.

The Seven Bullets Every Senior Reviewer Has Already Skipped

These are not strawmen. I pulled the wording from real resumes that landed in my queue this month. Names removed, language verbatim.

1. Built a RAG system using LangChain

This is the most common GenAI bullet on resumes right now. It is also the least informative. LangChain is glue code. Saying you used it tells me roughly as much as saying you used a for-loop. The interesting questions all live underneath: what was your chunking strategy, what embedding model did you pick and why, did you use a reranker, what was your top-k, how did you measure retrieval quality, what was the corpus size, what was the latency budget.

A reviewer reading the bare version learns nothing about your engineering judgment. They learn that you followed a tutorial.

Before

Built a RAG system using LangChain and OpenAI embeddings.

After

Built a retrieval system over 220k internal support tickets using BGE-large embeddings with a Cohere reranker on the top 50. Recall@10 improved from 0.61 (BM25 baseline) to 0.87 on a 500-query labeled eval set. P95 query latency under 400ms with HNSW indexing in pgvector.

The rewrite does not mention LangChain because LangChain is not the interesting part. The interesting parts are the corpus size, the choice to use a reranker, the named baseline, the measured improvement, and the latency budget. A reviewer reads this and thinks: this person made decisions and measured outcomes.
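Recall@k, for reference, is a five-line computation once the labeled set exists. A minimal sketch, assuming each query maps to a set of human-labeled relevant doc IDs; the names here are mine, not from any particular stack.

```python
def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, set[str]],
                k: int = 10) -> float:
    """Mean fraction of labeled-relevant docs found in the top k.

    results:  query -> ranked doc IDs from the system under test
    relevant: query -> doc IDs a human labeled as relevant
    """
    per_query = []
    for query, ranked in results.items():
        gold = relevant.get(query, set())
        if gold:
            per_query.append(len(gold & set(ranked[:k])) / len(gold))
    return sum(per_query) / max(len(per_query), 1)

# Same 500-query labeled set, two systems, one honest comparison:
# recall_at_k(bm25_results, labels)      -> e.g. 0.61
# recall_at_k(reranked_results, labels)  -> e.g. 0.87
```

The hard part is never this function. It is building the 500 labeled queries that make the number mean something.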

2. Fine-tuned a model on a custom dataset

This bullet is a black box. Which model. How big. Whose dataset. What size. Full fine-tuning or LoRA or QLoRA. Against what baseline. Evaluated how. I cannot answer any of those questions from the bullet, which means the bullet is doing no work.

Fine-tuning is also one of the easiest places to overclaim. Running the Hugging Face Trainer on a 500-row dataset for two epochs counts as fine-tuning in a literal sense. It does not count as the kind of thing a senior engineer wants to hear about, unless you can show what you measured and why it mattered.

Before

Fine-tuned a large language model on a custom dataset.

After

Fine-tuned Llama-3-8B with QLoRA on 14k human-labeled customer intent examples. Lifted intent classification F1 from 0.73 (few-shot GPT-4o baseline) to 0.89 on a held-out test set of 1,500 examples. Replaced the GPT-4o classifier in production, cutting inference cost per ticket from $0.018 to $0.0007.

The cost number at the end is the part senior engineers care about most and the part new GenAI candidates leave out most often. Building something that works is one milestone. Building something that costs less than the off-the-shelf alternative is the milestone that gets you hired at a company actually running this in production.
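For anyone who has not run it, the QLoRA setup itself is small; the judgment lives in the data and the eval. A minimal sketch on the transformers + peft + bitsandbytes stack. The Llama-3-8B checkpoint matches the bullet, but the LoRA hyperparameters are illustrative defaults, not the ones behind those numbers.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb
)

# Attach low-rank adapters; only these small matrices train.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically under 1% of total params
```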

3. Implemented prompt engineering for X

Prompt engineering is not a skill in 2026. It is a verb at best. When a bullet says 'implemented prompt engineering,' the reviewer reads 'wrote some prompts.' That is not a credential. Everyone has written some prompts.

The interesting version of this work is not the prompt. It is everything around the prompt. What was the failure mode you were trying to fix. What did you change in the system. How did you measure that the change helped. Did you A/B it against the previous version. Did you build an eval set, or are you eyeballing it.

Before

Implemented prompt engineering techniques to improve LLM output quality.

After

Diagnosed hallucination on 22% of customer-facing answers; added a structured citation step and a retrieval-grounded refusal path. Hallucination rate (human-graded on 300-sample weekly eval) dropped from 22% to 4%. No measurable regression in helpfulness score.

Notice this bullet never uses the phrase 'prompt engineering.' It does not need to. The work is described concretely enough that the reviewer can infer what changed and judge whether the work was real.
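What a retrieval-grounded refusal path can look like, as a minimal sketch. The `retriever` and `llm` objects and the 0.55 similarity threshold are hypothetical; the shape is the point: no sufficiently similar chunk, no answer.

```python
REFUSAL = "I can't find support for this in our docs, so I won't guess."

def grounded_answer(query: str, retriever, llm, min_score: float = 0.55) -> str:
    # retriever.search returns (chunk_text, similarity) pairs -- hypothetical interface
    chunks = retriever.search(query, top_k=8)
    grounded = [(text, s) for text, s in chunks if s >= min_score]
    if not grounded:
        # Nothing in the corpus clears the bar: refuse rather than hallucinate.
        return REFUSAL
    sources = "\n".join(f"[{i}] {text}" for i, (text, _) in enumerate(grounded, 1))
    prompt = (
        "Answer using ONLY the numbered sources below, citing each claim as [n].\n"
        f"If the sources do not cover the question, reply exactly: {REFUSAL}\n\n"
        f"{sources}\n\nQuestion: {query}"
    )
    return llm.complete(prompt)  # hypothetical client call
```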

4. Used GPT-4 / Claude / Gemini for X

An API call is not engineering. Saying you 'used GPT-4' tells me you have a credit card. The engineering shows up in everything around the API: how you handled rate limits, what you cached, how you routed between models on cost vs. quality, what you did when the model failed, how you bounded the input.

If your bullet stops at 'used GPT-4,' you are signaling that you have not yet thought about the systems problems that come up when LLM features hit real traffic. That is fine if you are a beginner. It is a problem if you are claiming to be a GenAI engineer.

Before

Used GPT-4 to summarize meeting transcripts for users.

After

Built a summarization pipeline routing transcripts under 4k tokens to Haiku and longer transcripts to Sonnet with map-reduce chunking. Aggregate cost across 18k summaries per month dropped 71% vs. all-Sonnet baseline; user-rated summary quality (1k-sample A/B) unchanged. Added retry-with-fallback on Anthropic 529s; pipeline success rate moved from 96.4% to 99.8%.

This bullet is about systems, not models. The model is incidental. The cost routing, the fallback handling, the A/B against a baseline, those are the parts a senior reviewer is looking for.
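Here is the routing-plus-retry shape, sketched with the Anthropic Python SDK. This is not the pipeline from the bullet: the model aliases, token threshold, and backoff are illustrative, and a production version adds map-reduce chunking for long transcripts plus a real fallback tier.

```python
import time

import anthropic

client = anthropic.Anthropic()
CHEAP, STRONG = "claude-3-5-haiku-latest", "claude-3-5-sonnet-latest"

def summarize(transcript: str, token_count: int) -> str:
    # Route on input size: short transcripts take the cheap model.
    model = CHEAP if token_count < 4_000 else STRONG
    for attempt in range(3):
        try:
            resp = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user",
                           "content": f"Summarize this transcript:\n\n{transcript}"}],
            )
            return resp.content[0].text
        except anthropic.APIStatusError as e:
            # 529 = overloaded. Back off and retry; anything else, re-raise.
            if e.status_code != 529 or attempt == 2:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError("unreachable")
```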

5. Deployed an AI agent

'Agent' is the most overloaded word in this market right now. A chain with a single tool call is not what a senior engineer means by an agent. A scripted workflow that calls an LLM is not an agent. Saying you 'deployed an AI agent' without specifics tells a reviewer that either you are using the word loosely, or you are not sure what it means yourself.

If you actually built something that does multi-step reasoning with tools, the interesting things are: how many tools, how you handled tool selection errors, what your turn limit was, how you caught loops, what your observability looked like, and what fraction of runs completed successfully.

Before

Deployed an AI agent to automate internal workflows.

After

Built a Claude-based agent with 6 internal tools (ticket lookup, refund issue, escalation, KB search, customer email, audit log). Median 3.2 tool calls per run; 94% task completion across 4,200 production runs in the first quarter. Loop detection on repeated identical tool calls cut average runaway-run cost from $1.40 to $0.03.

That last sentence about loop detection is the kind of detail you only have if you shipped this thing and watched it burn money for a week. That is exactly the signal a senior reviewer is looking for.
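Loop detection of that kind is not exotic, which is part of why its absence is so telling. A minimal sketch: fingerprint every tool call and abort when an identical call repeats. The class and thresholds are hypothetical, not from the system in the bullet.

```python
import json
from collections import Counter

class LoopGuard:
    """Abort an agent run when it repeats the same tool call or runs too long."""

    def __init__(self, max_repeats: int = 3, max_turns: int = 20):
        self.max_repeats = max_repeats
        self.max_turns = max_turns
        self.calls: Counter = Counter()
        self.turns = 0

    def check(self, tool: str, args: dict) -> None:
        self.turns += 1
        if self.turns > self.max_turns:
            raise RuntimeError(f"turn limit ({self.max_turns}) exceeded")
        # Canonical fingerprint: same tool + same args = same key.
        key = (tool, json.dumps(args, sort_keys=True, default=str))
        self.calls[key] += 1
        if self.calls[key] > self.max_repeats:
            raise RuntimeError(
                f"loop detected: {tool} called {self.calls[key]}x with identical args"
            )
```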

6. Vectorized documents in a vector database

This bullet is the Pinecone tutorial output. Every bootcamp and every getting-started guide ends with a version of this sentence. It tells a reviewer that you ran the example, not that you built a system.

What would tell them you built a system: the corpus size, the embedding model choice and why, the chunking strategy and why, the index type, the recall metric you optimized for, and the production latency you held to.

Before

Vectorized 10,000 documents in Pinecone for semantic search.

After

Indexed 1.4M product reviews in pgvector using bge-small-en (chosen over text-embedding-3-large after a 2k-query bake-off: 91% of the recall at 8% of the embedding cost). Recursive chunking at 512 tokens with 64-token overlap. Recall@5 of 0.84 on a 600-query labeled eval; p95 retrieval latency 38ms.

The bake-off detail is what sells this. A reviewer reads that and immediately knows: this person did not just pick the most expensive embedding model and call it done. They tested alternatives and made a cost-quality trade-off.
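The 512/64 overlap chunking named in the rewrite is also only a few lines. A minimal fixed-window sketch; a genuinely 'recursive' chunker splits on document structure first and falls back to windows like these.

```python
def chunk_with_overlap(tokens: list[str], size: int = 512,
                       overlap: int = 64) -> list[list[str]]:
    """Fixed windows of `size` tokens, each sharing `overlap` tokens with the last."""
    if not tokens:
        return []
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```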

7. Improved chatbot accuracy by X percent

This is the most dangerous bullet on the list, because it sounds like quantified impact when it is actually a hallucinated metric.

Accuracy of what. Measured how. Against what baseline. On which test set. By whose judgment. If you cannot answer those questions in the interview, the number is going to hurt you, not help you. A senior interviewer will ask, you will pause, and the credibility cost will outweigh whatever lift you got from the metric being on the page.

Before

Improved chatbot accuracy by 34%.

After

Built a labeled eval set of 800 historical support conversations graded by 2 senior agents on a 5-point helpfulness rubric. Iterated on retrieval and prompting until average helpfulness moved from 2.8 to 4.1 on a held-out set of 200 conversations. Used GPT-4o as a judge for weekly regression checks (correlation 0.78 with human grading).

The rewrite does not look as punchy. It also describes a thing that actually happened, and you can talk about every part of it in an interview. That trade is worth making every time.
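Validating a judge the way that bullet does is mechanical once you have paired scores. A sketch using scipy; the two lists are illustrative stand-ins for the real 300 graded samples.

```python
from scipy.stats import pearsonr, spearmanr

# Paired grades on the same sample: a human on the rubric, the judge model on
# the same rubric. Strong correlation is what licenses the judge for weekly checks.
human = [4, 5, 2, 3, 5, 1, 4, 2]   # illustrative; the real set is all 300 grades
judge = [4, 4, 2, 3, 5, 2, 4, 2]

r, _ = pearsonr(human, judge)
rho, _ = spearmanr(human, judge)
print(f"pearson r = {r:.2f}, spearman rho = {rho:.2f}")
```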

What a Credible GenAI Bullet Contains

If you read the seven rewrites and squint, the same scaffolding appears in each one. It is not coincidence. There is a small set of dimensions a senior reviewer is unconsciously checking for on every GenAI bullet. The bullets that include these dimensions survive screening. The ones that do not get skipped.

  • The retrieval or quality metric. Recall@k, NDCG, F1, hallucination rate, helpfulness score, exact-match. A specific named metric the reader can interpret.
  • The evaluation method. Held-out set with size, human grading with rubric, LLM-as-judge with the judge model named and correlation reported, or A/B against a baseline. 'Accuracy went up' without an eval is hand-waving.
  • The scale. Corpus size, query volume per day, document count, request volume. Numbers that tell a reader what kind of system this was.
  • The cost or latency budget. Dollars per request, tokens per response, p95 latency in ms, monthly inference spend. GenAI work that ignores cost is work that has not been shipped.
  • The system design choice. Chunking strategy and reasoning, embedding model and reasoning, model routing logic, reranker yes or no, fallback path. A decision you made and can defend.

A GenAI bullet that hits 3 of these 5 dimensions is doing real work. A bullet that hits 0 of 5 is decoration. Run your own bullets through this list and you will see immediately which ones are pulling weight.

Three Worked Examples

Pattern is one thing. Specific rewrites are another. Here are three full before-after examples from common GenAI roles, with reasoning.

RAG Engineer

Before

Developed a RAG application for enterprise knowledge management using LangChain, OpenAI, and Pinecone. Improved information retrieval accuracy and reduced response time.

After

Shipped a retrieval system over 380k internal engineering documents (Confluence + Slack archives) for a 600-engineer org. Hybrid retrieval: BM25 + dense (BGE-large) with reciprocal rank fusion, Cohere rerank on top 40. Recall@10 of 0.91 on a 1,200-query labeled set built with internal SMEs. P95 query latency 520ms end-to-end. Production traffic of 4-7k queries per day.

The first version lists three recognizable framework names and tells a reviewer nothing. The rewrite names two technical choices that show judgment (hybrid retrieval, reranker), reports two quality and latency metrics, and gives a sense of scale (org size, traffic). Same project. Completely different signal.
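Reciprocal rank fusion, named in that rewrite, is small enough to show whole. This is the standard formula (sum 1/(k + rank) across rankers, k = 60 by convention), not code from the project described.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g. BM25 and dense) by summing 1 / (k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([bm25_top100, dense_top100])[:40]  # then rerank
```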

Fine-tuning and Model Training

Before

Fine-tuned LLMs on proprietary data to improve task performance for downstream applications.

After

Trained a domain-specific classifier by fine-tuning Mistral-7B with LoRA on 28k legal contract clauses labeled across 14 categories. Macro-F1 of 0.91 on held-out test set of 3k clauses, vs. 0.74 for the prior few-shot GPT-4 classifier. Quantized to 4-bit for inference on a single A10G; runs at $0.0004 per classification vs. $0.011 for the GPT-4 baseline.

Two things this rewrite does that the original does not. First, it names the trade-off: a smaller model plus fine-tuning beat a larger frontier model on quality and cost for this specific task. Second, it grounds itself in dollars and macro-F1, both of which the candidate can defend in an interview.

LLM Evaluation and RLHF

Before

Conducted LLM evaluations using human feedback to align model outputs with user preferences.

After

Built an internal eval harness running 24 task-specific suites (summarization, intent classification, tool use, refusal) on a nightly schedule against 4 candidate model versions. Each suite combined LLM-as-judge (GPT-4o, validated against 500 human-graded responses, rho = 0.81) with exact-match where applicable. Surfaced 3 silent regressions in the last 6 months that would have shipped under previous spot-check workflow.

Eval work is hard to write about because the output is a process, not a product. The rewrite handles that by quantifying the eval surface (24 suites, 4 versions, nightly), validating the judge (rho against human grading), and naming the saves (3 regressions caught). It tells a story about a system that does something measurable.
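The skeleton of a harness like that matters less than the two properties it has: it runs on a schedule and it compares against a stored baseline. A minimal sketch; the suite runners and the regression tolerance are hypothetical stand-ins.

```python
import datetime
import json

def run_nightly(suites: dict[str, dict], versions: list[str],
                tolerance: float = 0.02) -> list[dict]:
    """Run every suite against every candidate version; flag drops vs. baseline.

    Each suite dict holds a stored `baseline` score and a `run(version)` callable
    returning a scalar metric -- both hypothetical stand-ins here.
    """
    regressions = []
    for name, suite in suites.items():
        for version in versions:
            score = suite["run"](version)
            if suite["baseline"] - score > tolerance:  # silent regression
                regressions.append({"suite": name, "version": version,
                                    "baseline": suite["baseline"], "score": score})
    with open(f"eval-{datetime.date.today()}.json", "w") as f:
        json.dump(regressions, f, indent=2)  # nightly artifact for the paper trail
    return regressions
```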

GenAI-Specific Anti-Patterns

There is a set of mistakes that show up almost exclusively on GenAI resumes. The general resume advice does not catch them. They deserve their own list.

The vague baseline problem

'Improved accuracy by 30%' compared to what. Random guessing. GPT-3.5. The prior in-house model. A coin flip. A version of the same model with a worse prompt. Without a named baseline, any percentage is a number without a denominator.

This matters more for GenAI than for traditional ML: for traditional ML the baseline is usually obvious (your prior model), while for GenAI work it is completely ambiguous (no prior system, or a different model entirely). State the baseline explicitly. 'Recall@10 improved from 0.61 (BM25 baseline) to 0.87.' That sentence has a denominator. The reader knows what you compared against.

The demo-not-prod problem

This is endemic to GenAI resumes. A Streamlit demo built in a weekend gets written as 'launched an internal AI tool.' A Colab notebook with a working RAG pipeline gets written as 'deployed a production retrieval system.' A chat interface hacked together in an hour gets written as 'shipped to users.'

Recruiters and engineers can tell the difference. The tells are predictable: no traffic numbers, no latency numbers, no failure handling, no cost mentioned. A real production system has all of these because production forces you to care about them. A demo does not.

If you built a demo, call it a demo or a prototype. 'Prototyped a RAG-based internal search tool; presented to the team and validated retrieval quality on a 100-query test set' is honest and respectable. Calling the same thing 'shipped a production RAG system' is not, and it will collapse the first time an interviewer asks about your queries-per-second.

The framework name dump

LangChain plus LlamaIndex plus Pinecone plus ChromaDB plus Weaviate plus Qdrant plus FAISS, all listed in one bullet or one skills line, with no project that ever used more than two of them.

This pattern shows up in resumes trying to keyword-match every job description in the GenAI space at once. The result is the opposite of what was intended. A senior reviewer scans the list, notices no project description names more than two of the tools, and concludes the candidate has played with examples but not built with intent. List what you have actually used and shipped. The rest is noise.

The 'I read the paper' bullet

'Implemented attention mechanism from Transformer paper.' 'Studied RLHF training methodology.' 'Built understanding of mixture-of-experts architectures.' These bullets appear because the candidate genuinely understands the material and wants credit for it. The problem is that understanding a paper is not shipped work. It is preparation.

If you reimplemented something from a paper, write about what you built, what you measured, and what you learned that the paper did not tell you. 'Reimplemented FlashAttention from the paper in a small training loop; matched the reported speedup on H100 to within 8% and identified a memory-layout gotcha that cost me a week.' That is interesting because it has scope, measurement, and an honest moment of difficulty. The bullet about 'studying' the paper is decoration.

When Side Projects Are All You Have

Most candidates breaking into GenAI roles in 2026 do not have GenAI production experience. They have a few months of API tinkering, a couple of side projects, maybe a Kaggle competition or a Hugging Face submission. The resume challenge is showing real engineering judgment from a portfolio that is mostly self-directed.

The bullets above work the same way for side projects as they do for production work, with one caveat: scope honesty matters more, because the reviewer already knows this is a side project and is mentally adjusting for that. Overclaiming on a side project breaks trust faster than anywhere else.

Things that should appear on a strong side-project GenAI bullet:

  • A dataset, even a small one, that you built yourself or sourced and cleaned. 'Labeled 400 examples from r/legaladvice for jurisdiction classification' is more impressive than the model architecture, because it shows you understand that evaluation requires labels.
  • A reproducible eval. A test set, a metric, a baseline. 'Compared my fine-tune against GPT-4o-mini few-shot on the same 100-question test split.' A side project without an eval is a demo.
  • A repository link with the code. Most reviewers will not click. The link still does work, because it signals that you are not afraid of the code being seen.
  • An honest scope statement. 'Weekend project; dataset is 400 rows; conclusions hold on this distribution only.' Self-aware scope reads as competence, not weakness.

Things to avoid:

  • Production scale claims. 'Deployed to thousands of users' for a side project is almost always either wrong or technically true in a way that misrepresents the engineering. Do not do this.
  • Vague impact. 'Helped users save time' has no place on any resume, but it is especially weak on a side project where you usually do not have users at all.
  • Framework name decoration. The point of a side project bullet is to show how you think, not how many libraries you have heard of.

A good side-project bullet for someone breaking in: 'Built and labeled a 600-row dataset of customer support tickets across 8 intent categories; fine-tuned DistilBERT with LoRA and compared against GPT-4o-mini few-shot on a held-out 150-row split. DistilBERT macro-F1 of 0.84 vs. 0.79 for the few-shot baseline. Code and dataset on GitHub.' That bullet does more for a junior GenAI candidate than ten LangChain mentions ever will.

Run Your Own Bullets Through the Filter

If you take one thing from this guide, take this: every GenAI bullet on your resume should answer at least three of these five questions. What did you measure. How did you measure it. What did you compare against. How big was the system. What did it cost, or how fast did it have to be.

If a bullet answers zero of those, you have written a sentence that looks like a GenAI bullet but is functionally invisible to a senior reviewer. Rewrite it or cut it. The third option, leaving it sitting there, is the worst one, because it takes up space a stronger bullet could have used.

GenAI hiring in 2026 is competitive enough that the bar is not 'has worked with LLMs.' Everyone has worked with LLMs. The bar is 'has worked with LLMs in a way that included measurement, trade-offs, and shipping discipline.' The bullets on your resume are the only evidence a reviewer has of which side of that line you are on.
