Selected Publication:
Krieger, K; Hameed, I; Quer, G; Mack, C; Savic, M; Mantaj, P; Hirofuji, A; Gregg, A; Soletti, G; Rossi, CS; Rahouma, M; Gaudino, M.
Generative pre-trained transformer reinforces historical gender bias in diagnosing women's cardiovascular symptoms
EUR HEART J-DIGIT HL. 2025;
DOI: 10.1093/ehjdh/ztaf131
- Co-authors Med Uni Graz: Mantaj Polina
- Abstract:
Aims: Large language models (LLMs) such as GPT are increasingly used to generate clinical teaching cases and support diagnostic reasoning. However, biases in their training data may skew the portrayal and interpretation of cardiovascular symptoms in women, potentially leading to delayed or inaccurate diagnoses. We assessed GPT-4o's and GPT-4's gender representation in simulated cardiovascular cases and GPT-4o's diagnostic performance across genders using real patient notes.
Methods and results: First, GPT-4o and GPT-4 were each prompted to generate 15 000 simulated cases spanning 15 cardiovascular conditions with known gender prevalence differences. The models' gender distributions were compared to U.S. prevalence data from large national datasets (Centers for Disease Control and Prevention and National Inpatient Sample) using FDR-corrected χ² tests, revealing a significant deviation (P < 0.0001). In 14 GPT-4-generated conditions (93%), male patients were overrepresented relative to female patients by a mean of 30% (SD 8.6%). Second, 50 de-identified cardiovascular patient notes were extracted from the MIMIC-IV-Note database. Patient gender was systematically swapped in each note, and GPT-4o was asked to produce differential diagnoses for each version (10 000 total prompts). Diagnostic accuracy across genders was determined by comparing model outputs to actual discharge diagnoses via FDR-corrected Mann-Whitney U tests, revealing significant differences in diagnostic accuracy in 11 cases (22%). Female patients received lower accuracy scores than males for key conditions such as coronary artery disease (P < 0.01), abdominal aortic aneurysm (P < 1.0 × 10⁻⁹), and atrial fibrillation (P < 0.01).
Conclusion: GPT-4o underrepresented women in simulated cardiovascular scenarios and diagnosed female patients with critical conditions less accurately. These biases risk reinforcing historical disparities in cardiovascular care. Future efforts should focus on bias detection and mitigation.
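As a rough illustration of the first analysis described above (a hypothetical sketch, not the authors' code), per-condition gender counts from generated cases can be tested against expected prevalence proportions with χ² goodness-of-fit tests and Benjamini-Hochberg FDR correction; all counts and proportions below are invented placeholders.

```python
# Hypothetical sketch: per-condition chi-square goodness-of-fit tests of
# generated gender counts against national prevalence proportions, with
# Benjamini-Hochberg FDR correction across conditions.
# All counts and proportions below are invented placeholders.
from scipy.stats import chisquare
from statsmodels.stats.multitest import multipletests

# condition -> ((male, female counts in generated cases),
#               (expected male, female proportions from prevalence data))
conditions = {
    "coronary artery disease":   ((720, 280), (0.55, 0.45)),
    "atrial fibrillation":       ((680, 320), (0.52, 0.48)),
    "abdominal aortic aneurysm": ((810, 190), (0.65, 0.35)),
}

raw_p = []
for (m_obs, f_obs), (m_exp, f_exp) in conditions.values():
    n = m_obs + f_obs
    _, p = chisquare([m_obs, f_obs], f_exp=[m_exp * n, f_exp * n])
    raw_p.append(p)

# FDR correction across all per-condition tests
significant, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
for name, p, q, sig in zip(conditions, raw_p, adj_p, significant):
    print(f"{name}: p={p:.3g}, FDR-adjusted p={q:.3g}, significant={sig}")
```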
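The second analysis can be sketched the same way (again hypothetical, with invented scores standing in for graded agreement between model differentials and discharge diagnoses): per-note accuracy scores for the male and female versions are compared with a Mann-Whitney U test, and the resulting per-note p-values would then be FDR-corrected as above.

```python
# Hypothetical sketch: comparing diagnostic-accuracy scores for the male
# and female versions of one de-identified note with a Mann-Whitney U test.
# Scores are invented placeholders.
from scipy.stats import mannwhitneyu

# Accuracy scores across repeated prompts for one note, by presented gender
male_scores   = [0.9, 0.8, 1.0, 0.7, 0.9, 0.8, 1.0, 0.9]
female_scores = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.5, 0.6]

stat, p = mannwhitneyu(male_scores, female_scores, alternative="two-sided")
print(f"U={stat}, p={p:.3g}")
# In the full analysis, one such p-value per note would be collected and
# corrected for multiple testing, e.g. multipletests(ps, method="fdr_bh").
```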
- Find related publications in this database (Keywords):
- Large Language Model
- GPT
- Gender Bias
- Cardiovascular Diagnosis
- Medical Education