RAG engineer at a B2B SaaS
A mid-level RAG engineer who shipped a retrieval system over internal support tickets. The bullets that land share a pattern: every retrieval metric has a baseline, every model choice has a reason, every latency number names the p95. Where they slip is the framework-list bullet, which gets rewritten in front of you.
Maya Rao
Education
Experience
- Shipped the company's first production retrieval system over 220k internal support tickets; hybrid retrieval (BM25 + BGE-large) with a Cohere reranker on the top 50; Recall@10 improved from a 0.61 BM25-only baseline to 0.87 on a 500-query labeled eval; p95 query latency under 400ms with HNSW indexing in pgvector.
- Authored the team's first labeled retrieval eval set: 500 queries with relevance graded by 3 senior support engineers on a 4-point scale; harness runs nightly and gates 100% of retriever changes.
- Ran a 2,000-query embedding-model bake-off across BGE-large, text-embedding-3-large, and a quantized BGE-small; kept BGE-large for the high-recall path and routed long-tail queries to BGE-small, cutting monthly embedding cost by 34% with no measurable Recall@10 regression.
- Used vector databases and LLMs to build search.
- Cut p95 retrieval latency from 920ms to 380ms by switching from a flat IVF index to HNSW and moving rerank batching off the hot path; held a Recall@10 regression budget of <1.5% and shipped at 0.4%.
- Built the chunk-strategy A/B harness: recursive 512-token with 64-token overlap beat fixed 256-token and full-doc on Recall@5 by 11 points; documented the test and removed three competing chunkers from the codebase.
- Owned the retrieval on-call rotation for 2 quarters across 38 incidents; median time to detect 4 minutes, median time to mitigate 22 minutes; wrote the retrieval-quality runbook now used by 2 adjacent teams.
- Built the first ranking model for the support search surface: gradient-boosted ranker on 14M historical click pairs; nDCG@10 from 0.42 to 0.58 on a 1,200-query holdout; served at 6K req/s on a single instance with 110ms p95.
- Migrated the embedding pipeline from a managed API to a self-hosted Sentence-Transformers serving stack on Triton; cut monthly inference cost from $11.4k to $1.8k with no measurable Recall@5 regression on the production eval.
- Worked closely with the product team on improving search relevance.
- Designed the feature store for the ranking model on top of a Postgres + Redis split; cut feature-fetch p95 from 78ms to 14ms and removed 4 hot-path API calls from the search path.
- Authored the team's offline-vs-online eval harness; surfaced a 9% offline-online gap caused by stale features and shipped a 30-minute refresh cadence that closed the gap to 1.2%.
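The hybrid-retrieval bullet doesn't name its fusion step, so as a minimal sketch, assuming reciprocal-rank fusion over the two ranked lists before the reranker sees the top 50 (function and argument names are hypothetical):

```python
def rrf_fuse(bm25_ids, dense_ids, k=60, top_n=50):
    """Reciprocal-rank fusion: merge two ranked doc-id lists by
    summing 1/(k + rank) across rankings. k=60 is a common default."""
    scores = {}
    for ranking in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; this list would then go to the reranker.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

RRF needs no score calibration between BM25 and cosine similarity, which is one reason it is a common choice for hybrid retrieval; the system's actual fusion step may differ.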
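Recall@k against a labeled eval set like the 500-query one reduces to a few lines; a sketch with hypothetical names (a real nightly harness also handles grade thresholds and scheduling — on a 4-point scale you'd first binarize, e.g. grade >= 2 counts as relevant, which is an assumption here):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of labeled-relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall(runs, k=10):
    """runs: iterable of (retrieved_ids, relevant_ids), one per query."""
    runs = list(runs)
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
```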
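Latency claims like "p95 from 920ms to 380ms" are only comparable when everyone computes the percentile the same way; a nearest-rank p95 in integer arithmetic, avoiding float rounding at the rank boundary (helper name hypothetical):

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    s = sorted(latencies_ms)
    rank = (95 * len(s) + 99) // 100  # integer ceiling of 0.95 * n
    return s[rank - 1]
```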
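The winning chunk strategy (512 tokens, 64-token overlap) has the shape of a sliding window; a simplified sketch — the bullet's splitter is recursive and boundary-aware, while this one just windows an already-tokenized list:

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into `size`-token chunks, with `overlap`
    tokens shared between consecutive chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```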
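The ranking bullet's nDCG@10 numbers, for reference, come from graded labels; a standard log2-discount formulation (the production metric may use a different gain scheme):

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal > 0 else 0.0
```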
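The Postgres + Redis split in the feature-store bullet is a classic read-through cache; a sketch with injected client callables, all hypothetical stand-ins for real Redis/Postgres clients:

```python
def get_features(entity_id, cache_get, db_fetch, cache_set, ttl_s=1800):
    """Read-through cache: try the cache first, fall back to the
    database, then backfill the cache so the next fetch for this
    entity stays off the hot path."""
    cached = cache_get(entity_id)
    if cached is not None:
        return cached
    row = db_fetch(entity_id)
    cache_set(entity_id, row, ttl_s)
    return row
```

The 30-minute TTL default mirrors the refresh cadence in the eval-harness bullet, but that pairing is an assumption.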
Projects
- 1,100 queries with 4-point relevance labels over a public support corpus; used by 6 third-party teams as a reproducible retrieval benchmark; cited in 2 internal write-ups at vendor companies.
Technical Skills
RAG resumes do not win on framework names. They win on the corpus size, the retrieval metric with a baseline, the latency budget, and one defensible system design choice. Everything else is decoration.