Medical Data Privacy & AI Training
The Uncomfortable Truth About Patient Data in AI Models
⚠️ The Question Nobody Answers:
When you use a commercial AI system with patient data, is that data being used to train
their models? Most vendors evade this question. The answer determines whether your patients'
PHI is becoming part of a commercial product sold to your competitors.
1. Is Your Patient Data Training AI Models?
This is the billion-dollar question. Many AI vendors' business models depend on continuously improving
their models using customer data. Here's what's actually happening:
How Patient Data Becomes Training Data
Your EHR (patient records) → AI Vendor (processes PHI) → Training Pipeline ("de-identified") → Improved Model (sold to competitors)
Vendor Business Models:
| Model Type | Uses Your Data for Training? | Can You Opt Out? | Examples |
| Traditional SaaS | No (subscription-only revenue) | N/A (not collected) | Some EHR-embedded AI |
| Platform Model | Yes (data improves product for all) | Sometimes (check BAA) | Many cloud AI services |
| Freemium | Yes (you're the product) | Rarely | Free AI tools, trials |
| Research Partnerships | Yes (explicit research agreements) | Requires patient consent | Academic medical centers |
🚩 Evasive Answers to Watch For:
- "We use data to improve our services": This almost always means training. Ask: "Does 'improve' mean updating models with customer data?"
- "All data is de-identified before use": De-identification often fails (see below). Ask: "What specific de-identification method? HIPAA Safe Harbor or Expert Determination?"
- "We don't sell your data": Technically true if they're selling model improvements derived from your data, not the raw data itself.
- "Data is only used with consent": Whose consent? Yours? The patient's? Clarify who can consent and what they're consenting to.
2. Why De-Identification Often Fails
HIPAA allows two methods for de-identification: Safe Harbor (remove 18 identifiers)
and Expert Determination (statistical proof of low re-identification risk). Both have
serious limitations in the AI era.
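To make the Safe Harbor idea concrete, here is a minimal Python sketch of the generalization rules for three of the eighteen identifier categories: geography (keep only the first three ZIP digits, or "000" for sparsely populated areas), dates (keep only the year), and ages over 89 (collapse into a single "90+" bucket). The record layout, the field names, and the contents of `RESTRICTED_ZIP3` and `DROP_FIELDS` are hypothetical placeholders, not any vendor's actual pipeline. A real implementation has to cover all eighteen categories, including identifiers buried in free text.

```python
from datetime import date

# Hypothetical placeholder: 3-digit ZIP prefixes covering 20,000 or fewer people.
# A real list must be derived from current Census data.
RESTRICTED_ZIP3 = {"036", "059", "102"}

# Direct identifiers that this sketch simply drops (hypothetical field names).
DROP_FIELDS = {"name", "ssn", "phone", "email", "mrn"}


def generalize_record(record: dict, today: date) -> dict:
    """Return a copy of `record` with Safe Harbor-style generalizations applied."""
    out = {k: v for k, v in record.items() if k not in DROP_FIELDS}

    # Geography: keep only the first three ZIP digits, or "000" for small areas.
    zip3 = str(out.pop("zip"))[:3]
    out["zip3"] = "000" if zip3 in RESTRICTED_ZIP3 else zip3

    # Dates: keep only the year. Ages over 89 collapse into one "90+" bucket.
    birth = out.pop("birth_date")
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    age = today.year - birth.year - (0 if had_birthday else 1)
    if age > 89:
        out["age"] = "90+"
        out["birth_year"] = None  # the year itself would reveal an age over 89
    else:
        out["age"] = age
        out["birth_year"] = birth.year
    return out
```

Note that everything not explicitly dropped or generalized (diagnoses, visit patterns, free-text notes) passes through untouched, which is exactly where the re-identification risk discussed next comes from.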
🔍 Re-identification Attacks That Work:
Linkage Attacks: Combine "de-identified" medical data with public datasets
(voter records, social media) to re-identify patients. Latanya Sweeney's widely cited analysis
of 1990 census data estimated that 87% of the U.S. population is uniquely identified by 5-digit
ZIP code, birth date, and gender; a 2006 re-analysis put the figure near 63%.
Model Inversion: Query an AI model repeatedly to reconstruct training data.
Researchers have extracted faces and medical conditions from supposedly private models.
Membership Inference: Determine if a specific person's data was in the training
set. This alone can reveal sensitive information (e.g., "Was this patient in the HIV clinic dataset?").
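One way to get a feel for linkage risk is to count how many records in a dataset are unique on a handful of quasi-identifiers. The sketch below uses the standard notion of k-anonymity (a term not used elsewhere in this article): the size of the smallest group of records sharing the same quasi-identifier values. The column names and sample records are invented for illustration, and the function assumes a non-empty dataset.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")  # hypothetical column names


def uniqueness_report(records, keys=QUASI_IDENTIFIERS):
    """Return (smallest group size k, fraction of records that are unique)."""
    groups = Counter(tuple(r[k] for k in keys) for r in records)
    k_anonymity = min(groups.values())
    unique = sum(1 for size in groups.values() if size == 1)
    return k_anonymity, unique / len(records)


# Invented sample data: no names, yet two of four rows are one of a kind.
records = [
    {"zip": "02139", "birth_date": "1984-03-02", "gender": "F", "dx": "asthma"},
    {"zip": "02139", "birth_date": "1984-03-02", "gender": "F", "dx": "diabetes"},
    {"zip": "02139", "birth_date": "1990-11-17", "gender": "M", "dx": "HIV"},
    {"zip": "60614", "birth_date": "1975-06-30", "gender": "F", "dx": "depression"},
]
k, frac = uniqueness_report(records)
print(f"k = {k}, unique fraction = {frac:.2f}")  # prints "k = 1, unique fraction = 0.50"
```

A result of k = 1 means at least one patient can be singled out by anyone holding a public dataset with the same three fields. Asking a vendor for this kind of number on their "de-identified" training set is a reasonable starting point, though it does not capture model inversion or membership inference.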
📊 Landmark Study (Nature Communications, 2019):
Researchers estimated that 99.98% of Americans would be correctly re-identified in any
dataset using 15 demographic attributes. "De-identified" medical data is far
from anonymous when cross-referenced with other data sources.
✓ What to Demand:
If a vendor claims data is "de-identified," ask:
(1) Which HIPAA method (Safe Harbor or Expert Determination)?
(2) Who performed the expert determination (if applicable)?
(3) What re-identification risk assessment was done?
(4) Do you apply differential privacy or other modern techniques?
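For readers unfamiliar with differential privacy (question 4), the sketch below shows its simplest form, the Laplace mechanism applied to a count query. It is an illustration of the concept under stated assumptions, not how any particular vendor trains models: differentially private model training uses more involved techniques, and a production system would rely on a vetted library rather than Python's `random` module. The function names and the example count are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one patient changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # seeded only to make the sketch repeatable
for eps in (0.1, 1.0, 10.0):
    # Smaller epsilon means stronger privacy and a noisier released count.
    print(eps, round(dp_count(128, eps, rng), 1))
```

The useful follow-up question for a vendor is what epsilon they use and what it is applied to. A claim of "differential privacy" without a stated privacy budget says very little.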
3. Data Sales & Third-Party Sharing
Even if a vendor doesn't directly "sell" your data, they may share it with partners, subsidiaries,
or "trusted third parties" in ways that effectively monetize your PHI.
Common Data Sharing Scenarios:
| Scenario | Is This "Selling"? | HIPAA Status |
| Selling raw patient data to data brokers | Yes | HIPAA violation without authorization |
| Selling model improvements derived from your data | Debatable | Legal gray area (not explicitly prohibited) |
| Sharing with "affiliates" or subsidiaries | Depends on contracts | May be permitted under an organized health care arrangement |
| Sharing with cloud providers (AWS, Azure, GCP) | No (infrastructure) | Permitted with BAA |
| Sharing with research partners | Depends on agreements | Requires IRB approval or patient authorization |
| Using data to train models sold to competitors | Indirectly, yes | Not prohibited by HIPAA (BAA should address) |
🚩 Read the Fine Print:
Check the vendor's privacy policy and BAA for these phrases:
- "We may share data with our corporate family": Could mean any subsidiary, anywhere, with varying privacy standards.
- "We may share with trusted partners": Who? For what purpose? This is often undefined.
- "We may use data for research purposes": Whose research? Published? Proprietary? Patient consent obtained?
- "Data may be transferred internationally": HIPAA doesn't restrict this, but other laws might (GDPR, state laws).
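A first-pass review of a long BAA or privacy policy can be partly automated. The sketch below flags sentences that match patterns loosely based on the phrases quoted in this article. The pattern list, labels, and sample text are illustrative assumptions, and a keyword scan is only a triage aid: it will miss rephrased clauses and cannot replace a legal review.

```python
import re

# Patterns paraphrase the phrases quoted in this article; extend as needed.
RED_FLAGS = {
    "vague improvement language": r"improve (our|the) (services?|products?)",
    "corporate family sharing": r"corporate family|affiliates?|subsidiar(y|ies)",
    "undefined partners": r"trusted (third )?part(y|ies|ners?)",
    "research use": r"research purposes?",
    "international transfer": r"transferred internationally|outside (of )?the united states",
}


def find_red_flags(text: str) -> list[tuple[str, str]]:
    """Return (label, matched sentence) pairs for every red-flag hit."""
    hits = []
    # Naive sentence split on periods and semicolons; enough for triage.
    for sentence in re.split(r"(?<=[.;])\s+", text):
        for label, pattern in RED_FLAGS.items():
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                hits.append((label, sentence.strip()))
    return hits


# Invented contract language for demonstration.
sample = (
    "Vendor may use Customer Data to improve our services. "
    "Vendor may share data with trusted partners. "
    "Vendor will encrypt data at rest."
)
for label, sentence in find_red_flags(sample):
    print(f"[{label}] {sentence}")
```

On the sample text this flags the first two sentences and leaves the encryption clause alone. Each hit is a prompt to ask the vendor the follow-up questions listed above, not proof of misuse.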
4. Patient Opt-Out Rights
Do patients have the right to opt out of having their data used for AI training? The answer is
complicated and depends on how the data is used.
Opt-Out Scenarios:
✓ Treatment, Payment, Healthcare Operations (TPO)
HIPAA permits using PHI for TPO without patient consent. If AI is used directly for patient
care (e.g., diagnostic support), patients generally cannot opt out.
✓ Research
Using PHI for research typically requires IRB approval and patient authorization (opt-in
consent). Some research can use waivers, but this is narrowly defined.
✗ Commercial Product Development
Using PHI to train commercial AI models sold to third parties is NOT TPO. This should
require patient authorization, but enforcement is weak and many vendors operate in
this gray area.
✗ De-identified Data
Once data is de-identified per HIPAA, it's no longer PHI and patients have no opt-out
rights. This is why the de-identification method matters critically.
📋 Best Practice:
Implement a transparent patient notification process: "We use AI tools to assist with care.
Your data may be used to improve these tools. Here's what that means, here's what's protected,
and here's how to ask questions." Even if not legally required, transparency builds trust.
5. Questions to Ask Every AI Vendor
Data Usage Interrogation:
| Question | Acceptable Answer | 🚩 Red Flag |
| Is our data used to train your models? | "No" or "Only with explicit written consent" | "We use data to improve services" (vague) |
| Can we opt out of training data use? | "Yes, via contract amendment" | "Not possible with our architecture" |
| Do you sell or share data with third parties? | "No, except infrastructure providers under BAA" | "We share with partners to enhance offerings" |
| What de-identification method do you use? | HIPAA Safe Harbor or Expert Determination (with documentation) | "We anonymize data" (no specifics) |
| Do you apply differential privacy? | "Yes" or "Not applicable (we don't train on customer data)" | "What's that?" or silence |
| Can we audit your data usage? | "Yes, annual audit rights in BAA" | "Our systems are proprietary" |
| Where is data processed (geographically)? | US-only data centers | "Global infrastructure" or unclear |
💼 Service Details:
Avondale.AI offers AI Privacy Audits including BAA review for data usage clauses,
vendor interrogation support, patient notification template creation, and de-identification
methodology assessment. We help you protect patient privacy in the AI era.
Key Takeaways:
- Many AI vendors use customer data to train models — get explicit answers in writing
- De-identification is not anonymity — re-identification attacks are increasingly successful
- "We don't sell data" may be technically true while still monetizing your PHI indirectly
- Patient opt-out rights depend on data use (TPO vs research vs commercial)
- BAA should explicitly prohibit using PHI for model training without consent
- Transparency with patients about AI use builds trust even when not legally required