AI medical scribe RFP framework 2026: 12 questions vendors won’t answer without a push
By 2026 there are 30+ AI scribe vendors actively selling into US healthcare. Every one of them does the same demo: hand them an audio sample, watch the SOAP note appear in 90 seconds, gasp appropriately. The demo is a sales tool optimized for the wow moment, not for the 18-month-from-now experience of being locked into the wrong vendor. This post is the 12-question RFP every clinic, group, or DSO should send to AI scribe vendors before signing — the questions designed to extract the answers vendors won’t volunteer.
This is not a buying guide that recommends a specific vendor. It’s the diligence checklist that lets you compare apples to apples and surface the trade-offs you’ll regret missing.
The 12 questions
1. What is the all-in cost per provider per month at our anticipated volume? Include any volume tiers, overage fees, EHR integration fees, training fees, and minimum commitments.
Why ask: list pricing is rarely what you actually pay. Suki’s $299 list price may come with a $99 EHR integration setup fee, a $199 implementation fee, $0.05/min overage above 30 hours/provider/mo, and a 12-month minimum. Real cost in year 1 can be 1.5-2x list. Get the all-in number before you compare it to other vendors’.
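To make the comparison concrete, here’s a minimal sketch of the year-1 math using the hypothetical fee structure above (all numbers illustrative, not any vendor’s actual pricing):

```python
# Year-1 all-in cost per provider. All fees are hypothetical, mirroring
# the example above; plug in each vendor's actual RFP answers.
LIST_PRICE = 299.00      # $/provider/mo list price
EHR_SETUP = 99.00        # one-time EHR integration setup fee
IMPLEMENTATION = 199.00  # one-time implementation fee
OVERAGE_RATE = 0.05      # $/min above the included allowance
INCLUDED_MIN = 30 * 60   # 30 hours/provider/mo included
USAGE_MIN = 38 * 60      # assumed actual dictation: 38 hours/mo

overage = max(0, USAGE_MIN - INCLUDED_MIN) * OVERAGE_RATE   # $24/mo
monthly = LIST_PRICE + overage                              # $323/mo
year_one = 12 * monthly + EHR_SETUP + IMPLEMENTATION        # $4,174

print(f"year-1 all-in: ${year_one:,.2f} "
      f"({year_one / (12 * LIST_PRICE):.2f}x list)")        # 1.16x
```

Even these modest extras add 16% over list in year one; heavier overage, per-provider training fees, and multi-provider minimums are what push the multiple toward 1.5-2x.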
2. Who is the underlying ASR vendor? Whisper, Deepgram, AssemblyAI, in-house, or other?
Why ask: most AI scribes are wrappers around the same 3-4 ASR engines. If two vendors both use Whisper Large v3, the transcription quality is essentially identical; you’re paying for the LLM template + workflow, not the ASR. If a vendor refuses to disclose, that’s data on its own.
3. Which LLM generates the structured note? OpenAI GPT-4o, Anthropic Claude, Google Gemini, in-house fine-tuned, or other? Has it changed in the last 6 months?
Why ask: the LLM is the difference. A vendor that quietly swapped from Claude Opus to a cheaper model degraded their note quality but kept the same price. You want to know what model is generating your notes today and whether the vendor commits to model versioning.
4. Is your model fine-tuned on customer audio? Specifically, will our audio be used to train your future models?
Why ask: this matters for HIPAA, for competitive moat (your clinical patterns leaving your control), and for note quality (overfitting risk). Some vendors treat customer-data training as opt-in; others claim no-train-on-your-data; some BAAs are ambiguous. Get it in writing, and ask specifically whether de-identified data falls under the same restriction.
5. Show us the BAA chain end-to-end. List every subcontractor that touches PHI, including the LLM provider.
Why ask: under HIPAA you are responsible for ensuring every link in the chain has a BAA with the link upstream. Many vendors have a BAA with you but a weak BAA chain to OpenAI/Anthropic upstream. If their LLM provider drops their HIPAA tier, your data is suddenly exposed and you may not even know. The vendor should give you a chain diagram.
6. What happens to our notes and audio if we cancel? Do we get exports? In what format? How long do you retain after termination?
Why ask: vendor lock-in via data hostage is a real pattern. The right answer: full export available in standard formats (FHIR, .docx, .csv) with no fee, and deletion with certified destruction within 30-90 days post-termination. The wrong answer: “we provide read-only access for 90 days,” which means you can’t migrate cleanly.
7. What EHR integrations are production-quality vs “in development”? Specifically Epic, Cerner, Athena, eClinicalWorks, NextGen, Allscripts, Practice Fusion, and DrChrono.
Why ask: every vendor lists every EHR. Most have one or two production-quality integrations and the rest are “custom for enterprise” or “Phase 2” or “via Showroom marketplace partner”. Demand a list of customers using each integration in production. If they can’t name a customer for your EHR, the integration is theoretical.
8. Show us your edit-rate metrics. What percentage of notes generated by your model require clinician edits before signing? Broken down by specialty.
Why ask: this is the only honest quality metric. “95% accuracy” in vendor decks usually means word-level transcription accuracy (the inverse of WER) on clean audio, not note-level accuracy after clinician edits. The right answer comes with a specialty breakdown (primary care 8-12% edit rate, behavioral health 20-35%, etc.) and is gathered from production customer telemetry, not internal benchmarks.
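If a vendor won’t share telemetry, you can measure edit rate yourself during a pilot by diffing each AI draft against the note the clinician actually signed. A minimal sketch; the character-level ratio and the 0.98 “edited” threshold are illustrative choices, not an industry standard:

```python
import difflib

def was_edited(draft: str, signed: str, threshold: float = 0.98) -> bool:
    """Flag a note as 'edited' if the signed version differs from the
    AI draft by more than the similarity threshold allows."""
    ratio = difflib.SequenceMatcher(None, draft, signed).ratio()
    return ratio < threshold

def edit_rate(pairs: list[tuple[str, str]]) -> float:
    """Percentage of (draft, signed) note pairs that required edits."""
    edited = sum(was_edited(d, s) for d, s in pairs)
    return 100 * edited / len(pairs)
```

Run it per specialty and compare the result against the numbers the vendor quotes.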
9. What does your roadmap look like for the next 12 months? Which features that are paid add-ons today will become standard, and vice versa?
Why ask: AI scribe pricing is reshuffling rapidly. Speaker diarization was a $50/mo add-on in 2024, standard in 2026. The next 12 months will likely commoditize specialty templates and EHR write-back. You don’t want to lock into a 36-month contract on a feature that’s about to be free.
10. What is your incident response SLA for a model regression? If your note quality degrades next month, what is the customer notification + remediation process?
Why ask: silent model swaps happen. The right answer is a published SLA for proactive notification of model changes, a sandbox where you can compare old vs new on your own audio, and a rollback path if the new model is worse for your specialty. The wrong answer is “we continuously improve our models” with no governance.
11. Do you offer a per-encounter or pay-as-you-go pricing tier? If not, why?
Why ask: per-provider seat pricing is great for vendor revenue but often bad for clinics with variable-volume providers (vacations, sabbaticals, part-time schedules). Vendors that refuse to offer per-encounter pricing are usually the ones with the worst unit economics for low-volume customers. For low or variable volume, the DIY $0.05/min Whisper path almost always wins on cost; see the break-even sketch below.
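A quick break-even calculation makes the point; the seat and per-encounter prices here are hypothetical:

```python
SEAT_PRICE = 299.00    # $/provider/mo, hypothetical seat tier
PER_ENCOUNTER = 1.50   # $/encounter, hypothetical pay-as-you-go tier
DIY_PER_MIN = 0.05     # $/min DIY Whisper path (ASR only; LLM tokens extra)
AVG_MIN = 15           # assumed average encounter length

print(f"seat beats pay-as-you-go above "
      f"{SEAT_PRICE / PER_ENCOUNTER:.0f} encounters/mo")   # ~199
for n in (40, 100, 200):
    print(f"{n:>3} encounters/mo: pay-as-you-go ${n * PER_ENCOUNTER:.0f}, "
          f"DIY ASR ${n * AVG_MIN * DIY_PER_MIN:.0f}, seat ${SEAT_PRICE:.0f}")
```

Under these assumptions, a part-time provider at 40 encounters/mo pays roughly 5x more per note on a seat than on pay-as-you-go.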
12. Provide 3 customer references at clinics of our size and specialty, including one customer who switched away from a competitor and one who has been with you for 18+ months.
Why ask: references self-selected by the vendor are biased but still informative. You want to talk to a customer who switched (will tell you what was bad about the previous vendor and may volunteer what’s annoying about the current one), and a long-tenure customer (will tell you what’s changed in 18 months — the issues that didn’t exist at signing).
The 3 contract clauses that will bite you
Auto-renewal
Most enterprise SaaS auto-renews for the same term unless cancelled 60-90 days before expiration. AI scribe vendors are aggressive here because annual prepay smooths their cash flow. The clinic-friendly clause: month-to-month after the initial term, or auto-renewal with a 30-day cancellation right.
Price escalation
Many enterprise contracts build in a 5-7% annual price escalation. That may be reasonable for legacy software, but in a category where unit costs (LLM tokens, ASR compute) are dropping 30-50% per year, paying more each year is indefensible. The clinic-friendly clause: price held flat for the term and any auto-renewal, or a transparent escalation indexed to a specific input cost.
Data ownership and use rights
Read the clause that defines who owns the audio, the transcript, the structured note, and any aggregated analytics. The default in most vendor templates: they own derivative works (analytics, anonymized training data) and you own the patient-identifiable note. The clinic-friendly clause: clinic owns all forms of the data including aggregated and anonymized; vendor cannot use derivatives without explicit per-instance consent.
The DIY benchmark every RFP should include
Even if you ultimately pick a brand-name vendor, run the same audio through a DIY LessRec-style stack as your benchmark. The DIY stack costs ~$0.50-1.50/encounter and uses the same Whisper + Claude pipeline most vendors use under the hood. If the vendor’s output isn’t materially better than your DIY benchmark on your own audio, you’re paying for the EHR integration, the BAA chain, and the support — not the AI quality.
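A minimal sketch of that pipeline, assuming the OpenAI transcription endpoint and the Anthropic Messages API; the model names and SOAP prompt are illustrative, and you need BAAs with both providers before real PHI goes anywhere near it:

```python
# DIY benchmark: Whisper for ASR, Claude for note structuring.
# Illustrative only; do NOT send real PHI without BAAs in place.
from openai import OpenAI
from anthropic import Anthropic

def diy_soap_note(audio_path: str) -> str:
    # Step 1: transcribe the encounter audio
    with open(audio_path, "rb") as f:
        transcript = OpenAI().audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # Step 2: structure the transcript into a SOAP note
    response = Anthropic().messages.create(
        model="claude-sonnet-4-5",  # assumption: any frontier model works here
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Write a SOAP note from this encounter transcript. "
                       "Flag ambiguities instead of guessing.\n\n" + transcript,
        }],
    )
    return response.content[0].text
```

Run the same demo audio you gave each vendor through this and diff the outputs; that diff is your benchmark.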
This benchmark also gives you negotiating leverage. “Your output and our DIY output are clinically identical. The price delta needs to be justified by integration and support, which is worth $X to us.” That’s a defensible negotiation framing, not a feeling.
How to score vendor responses
| Question | Weight | What “good” looks like |
|---|---|---|
| 1. All-in cost | 15% | Itemized; no surprise overages; minimum ≤ 12 mo |
| 2. ASR vendor | 5% | Disclosed; Whisper Large v3 or equivalent |
| 3. LLM vendor | 10% | Frontier model (Claude Sonnet 4.6, GPT-4o, Gemini 2.5); committed versioning |
| 4. Training opt-out | 10% | Default no-train; explicit clause; covers de-identified |
| 5. BAA chain | 10% | Diagram provided; full chain to LLM in writing |
| 6. Exit terms | 10% | Free export, standard format, ≤ 90 day post-term retention |
| 7. EHR integration | 10% | Production reference for your specific EHR |
| 8. Edit rate | 15% | Specialty-specific number from production telemetry; matches DIY benchmark |
| 9. Roadmap | 5% | Specific features, not vague themes; commitment to inclusion not paid add-on |
| 10. SLA | 5% | Published model-change notification policy |
| 11. Per-encounter pricing | 3% | Available, even if not the recommended tier |
| 12. References | 2% | Switched-from + long-tenure both available |
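A trivial way to turn the table into a number: rate each answer 0-5 (0 = refused to answer, 5 = specific and in writing) and weight by the column above. The 0-5 scale and the example ratings are arbitrary:

```python
WEIGHTS = {  # percent, matching the table; sums to 100
    "all_in_cost": 15, "asr_vendor": 5, "llm_vendor": 10,
    "training_opt_out": 10, "baa_chain": 10, "exit_terms": 10,
    "ehr_integration": 10, "edit_rate": 15, "roadmap": 5,
    "sla": 5, "per_encounter": 3, "references": 2,
}

def score(ratings: dict[str, int]) -> float:
    """Weighted score on a 0-100 scale; ratings are 0-5 per question."""
    assert ratings.keys() == WEIGHTS.keys()
    return sum(WEIGHTS[q] * r for q, r in ratings.items()) / 5

vendor_a = dict.fromkeys(WEIGHTS, 4) | {"edit_rate": 2, "exit_terms": 5}
print(f"Vendor A: {score(vendor_a):.0f}/100")  # 76/100
```

The weighting matters more than the scale: a vendor that stonewalls on edit rate and all-in cost loses 30 points before anything else is graded.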
The bottom line
Most AI scribe RFPs in 2026 will get vendor answers that look glossy and identical. The questions above are designed to surface the differences vendors hide. Run the 12 questions, weight the responses, and benchmark against a DIY stack. The vendor whose answers are most specific and whose contract terms are most clinic-friendly is almost always the better long-term partner, even if their list price isn’t the cheapest.
If you want to see what the DIY benchmark looks like for your own audio, send a 5-minute sample through LessRec — first 10 minutes free, no signup — and use that output as the floor in your vendor evaluation.
Try LessRec free → benchmark vendors against $0.05/min Whisper