Data & Preprocessing
The data comes from the MIMIC-IV dataset, an open-source dataset of de-identified health data from patients admitted to the Beth Israel Deaconess Medical Center (BIDMC).
In particular, we selected data from the mimic-cxr and mimic-iv tables to create a dataset that includes patient physiology measurements, chest X-ray images, encoded chest X-ray pathologies, and physician notes.
The MIMIC-IV dataset, which is managed by MIT, has a well-documented schema and is widely used in the medical research community.
The following describes the steps taken to preprocess and consolidate the datasets for analysis. Our primary data sources are Hospital Transfers, Physician Notes, Chest XRay Images, Chest XRay Metadata, and Chest XRay Encoded Pathologies, which contain the multilabel target variables.
- Hospital Transfers
  - Consolidation: Merged careunit and event types into a single column for care type, and combined admissions and transfer IDs to denote unique hospital visits.
  - Sequencing: Assigned sequence numbers to visits and transfers per patient, and calculated the earliest admission and the latest discharge times.
- Chest XRay Encoded Pathologies
  - Pathology Encoding: Processed the pathology categories to encode pathologies for each patient, retaining only records with one or more pathologies.
- Chest XRay Metadata
  - Deduplication: Removed duplicate instances within the patient studies to ensure unique study records.
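The consolidation, sequencing, and deduplication steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code. The field names (hadm_id, transfer_id, eventtype, intime, outtime) and the rule of falling back to the transfer ID when an admission ID is missing are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

def consolidate_transfers(rows):
    """Add a care_type column, a visit key, and a per-patient sequence number."""
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["subject_id"]].append(row)

    out = []
    for group in by_patient.values():
        group.sort(key=lambda r: r["intime"])  # chronological order per patient
        for seq, row in enumerate(group, start=1):
            # Assumption: fall back to the transfer ID when no admission ID exists.
            visit = row["hadm_id"] if row["hadm_id"] is not None else row["transfer_id"]
            out.append({
                **row,
                "care_type": f'{row["careunit"]} / {row["eventtype"]}',
                "hosp_transfer_id": visit,
                "transfer_seq": seq,
            })
    return out

def visit_windows(rows):
    """Earliest admission and latest discharge time per visit key."""
    windows = {}
    for row in rows:
        key = row["hosp_transfer_id"]
        start, end = windows.get(key, (row["intime"], row["outtime"]))
        windows[key] = (min(start, row["intime"]), max(end, row["outtime"]))
    return windows

def dedupe_studies(metadata):
    """Keep the first metadata record for each (subject_id, study_id) pair."""
    seen, unique = set(), []
    for rec in metadata:
        key = (rec["subject_id"], rec["study_id"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Keeping the visit key and the visit window as separate steps means the window can later be used to test whether an X-ray falls inside a hospital stay.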
Dataset Merging
- Hospital Transfers and Chest XRay Metadata
  - Merged on subject_id, filtering for XRay dates between the admission and discharge dates, then re-merged to the hospital transfers using hosp_transfer_id.
- Processed Hospital Transfers and Chest XRay Encoded Pathologies
  - Combined using subject_id and study_id, filtering out duplicate studies, those originating from the ED, and studies without pathologies.
- Including Physician Notes
  - Integrated physician notes (radiology and discharge) using subject_id, study_id, and note_hadm_id, removing any records with missing data.
- Notes Categorization and Cleanup
  - Segregated notes into individual categories, parsing the history of illness sections for cleanliness and relevance, excluding records with histories shorter than 30 words.
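As a rough sketch of two of the merging steps above, the following shows a date-window join between studies and visits, and the 30-word filter on history sections. The record layout and the field names study_time, admit_time, discharge_time, and history are hypothetical. Counting words by splitting on whitespace is also an assumption.

```python
from datetime import datetime

def link_studies_to_visits(studies, visits):
    """Attach each study to the same patient's visit whose window contains it."""
    visits_by_patient = {}
    for v in visits:
        visits_by_patient.setdefault(v["subject_id"], []).append(v)

    linked = []
    for s in studies:
        for v in visits_by_patient.get(s["subject_id"], []):
            # Keep the study only if it was taken between admission and discharge.
            if v["admit_time"] <= s["study_time"] <= v["discharge_time"]:
                linked.append({**s, "hosp_transfer_id": v["hosp_transfer_id"]})
                break  # one visit per study
    return linked

def keep_long_histories(notes, min_words=30):
    """Drop notes whose history section has fewer than min_words words."""
    return [n for n in notes if len(n["history"].split()) >= min_words]
```

Studies with no matching visit window are dropped, which mirrors an inner join followed by a date filter.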
Together, these encoding, merging, and cleaning steps produce a single consolidated dataset that is ready for analysis.
Refining the Dataset
Originally, the dataset contained 8 different pathologies, but we decided to focus on the 5 most common: Atelectasis, Cardiomegaly, Pleural Effusion, Lung Opacity, and No Finding.
We then encoded these pathologies as binary labels for each patient, with 1 indicating the presence of the pathology and 0 indicating its absence.
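A minimal sketch of this encoding, assuming each record carries a list of finding names. The function name and the choice to ignore findings outside the five selected labels are illustrative.

```python
LABELS = ["Atelectasis", "Cardiomegaly", "Pleural Effusion", "Lung Opacity", "No Finding"]

def encode_labels(findings):
    """Return a 0/1 vector over LABELS for one record's findings."""
    present = set(findings)
    # 1 marks a label that is present, 0 marks one that is absent.
    return [1 if label in present else 0 for label in LABELS]
```

For example, a record with Cardiomegaly and Lung Opacity would be encoded with ones in the second and fourth positions and zeros elsewhere.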
We further balanced the dataset by oversampling the minority classes and undersampling the majority class, giving the models a better chance of learning the specific features of each pathology.
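One simple way to combine over- and undersampling is random resampling toward a common class size. The sketch below is an assumption about how this could be done, not the procedure actually used. It treats each record as having a single label under a hypothetical label key, and it defaults the target to the median class size.

```python
import random
from collections import defaultdict

def rebalance(records, label_key="label", target=None, seed=0):
    """Resample so every class has `target` records (default: median class size)."""
    if not records:
        return []
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[label_key]].append(rec)

    if target is None:
        sizes = sorted(len(group) for group in by_class.values())
        target = sizes[len(sizes) // 2]

    balanced = []
    for group in by_class.values():
        if len(group) >= target:
            # Undersample: draw without replacement.
            balanced.extend(rng.sample(group, target))
        else:
            # Oversample: keep every original, then draw extras with replacement.
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced
```

With multilabel records, a record can belong to several classes at once, so a real pipeline would need a rule for which label drives the resampling.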