Abstract

The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators. In version 1.0 of the database, ECGs from Massachusetts General Brigham hospital sites were provided without labels or metadata, to enable pre-training of ECG analysis models. In version 2.0, metadata is included. In version 3.0, Emory ECGs are included together with metadata, labels from the 12SL ECG analysis program (GE Healthcare), and ICD-9/10 codes. In version 4.0, typos were corrected in the data description. HEEDB is published as part of the Human Sleep Project (HSP), funded by a grant (R01HL161253) from the National Heart Lung and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children's Hospital, and Beth Israel Deaconess Medical Center.

Background

The database comprises clinical ECGs captured during routine clinical care over several decades. They are intended for studying associations between cardiac abnormalities (e.g., abnormal rhythms) and sleep, sleep-related medical conditions, and health outcomes.

Methods

The dataset consists of standard 12-lead ECG recordings, each 10 seconds long, acquired at a sampling rate of 250 or 500 Hz. Collection began in the 1980s and continues to the present day. Version 4 of the database includes 10,471,531 ECGs from 1,818,247 unique patients at institution I0001 and 968,680 ECGs from 349,548 patients at institution I0006. All recordings were obtained as part of routine clinical care.

Data preprocessing: Data was de-identified following the Safe Harbor method.

Data Description

ECG data is stored in a format compatible with both WFDB (Waveform Database) and MATLAB (V4). Each ECG recording includes one waveform data file (.mat for I0001 and .dat for I0006) and one header file (.hea).
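The recording parameters given in Methods (12 leads, 10 seconds, 250 or 500 Hz) imply a fixed number of samples per record, which can serve as a quick sanity check after loading a waveform. The following is a minimal Python sketch of that arithmetic; `expected_shape` is a hypothetical helper written for this illustration, not part of any WFDB library, and whether a loader returns samples-by-leads or leads-by-samples depends on the tool used.

```python
N_LEADS = 12                     # standard 12-lead ECG
DURATION_S = 10                  # each recording is 10 seconds long
SAMPLING_RATES_HZ = (250, 500)   # sampling rates stated for the dataset


def expected_shape(fs_hz):
    """Return (samples_per_lead, n_leads) for a recording sampled at fs_hz."""
    if fs_hz not in SAMPLING_RATES_HZ:
        raise ValueError(f"unexpected sampling rate: {fs_hz} Hz")
    return (fs_hz * DURATION_S, N_LEADS)


for fs in SAMPLING_RATES_HZ:
    print(fs, expected_shape(fs))
```

A loaded array whose dimensions differ from these values (after allowing for transposition) may point to a truncated file or a header with an unexpected sampling rate.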
The waveform data file can be read by WFDB library functions, WFDB applications, or the WFDB Toolbox, or loaded into MATLAB directly. Waveform files contain 12-lead ECG signals recorded at 250 or 500 Hz for 10 s and encoded as 16-bit samples. The header file specifies the name of the associated waveform file and its attributes, including sampling rate, units, channel names, and signal gain. It contains line-oriented and field-oriented ASCII text and can be read by the WFDB library or any text editor.

The directory structure of HEEDB is organized as follows:

ECG/
├── I0006/
│   ├── 12SL_diagnoses/
│   │   ├── diagnoses.csv
│   │   ├── diagnoses_dictionary.csv
│   │   └── README
│   ├── ICD_codes/
│   │   ├── icd9_codes.csv
│   │   ├── icd10_codes.csv
│   │   └── README
│   ├── metadata/
│   │   ├── metadata.csv
│   │   └── README
│   └── WFDB/
│       ├── 2010/
│       ├── 2011/
│       ├── ...
│       └── 2018/
└── I0001/
    ├── 12SL_diagnoses/
    │   ├── diagnoses.csv
    │   ├── diagnoses_dictionary.csv
    │   └── README
    ├── ICD_codes/
    │   ├── icd9_codes.csv
    │   ├── icd10_codes.csv
    │   └── README
    ├── metadata/
    │   ├── metadata.csv
    │   └── README
    └── WFDB/
        ├── S0001/
        ├── S0002/
        ├── S0003/
        └── S0004/

Each institution (I0001 and I0006) has its own subfolders for diagnoses, ICD codes, metadata, and waveform files. The WFDB/ directory contains the ECG waveform data, organized by year for I0006 and by session identifier for I0001.

12SL Diagnoses Description

The 12SL_diagnoses/ folder contains diagnostic outputs from 12SL (GE Healthcare) software, version 1.

File: diagnoses.csv

This file contains two columns:
FileName – Path to the corresponding WFDB file
codes – Diagnostic codes, which can be mapped to text labels using diagnoses_dictionary.csv

File: diagnoses_dictionary.csv

This file provides human-readable mappings for 12SL diagnostic codes.
It contains the following columns:
codes – Integer codes for diagnoses
acronym – Abbreviated diagnosis labels
diagnoses – Full textual descriptions of diagnoses

ICD Codes Description

The ICD_codes/ folder contains diagnostic information extracted from Electronic Health Records (EHR) for each patient.

File: icd10_codes.csv

This file contains diagnostic codes from the 10th revision of the International Classification of Diseases (ICD-10), developed by the World Health Organization (WHO). These alphanumeric codes represent diagnoses and health conditions.

Columns:
BDSPPatientID – Brain Data Science Platform Patient ID
RECORDED_DT – Shifted date of the diagnosis
DIAGNOSIS_ICD10_CD – Full ICD-10 diagnosis code
DIAGNOSIS_ICD10_DESC – Description of the ICD-10 diagnosis code

File: icd9_codes.csv

This file contains diagnostic codes from the 9th revision of the International Classification of Diseases (ICD-9), also developed by the WHO. These codes are also sourced from the EHR system.

Columns:
BDSPPatientID – Brain Data Science Platform Patient ID
RECORDED_DT – Shifted date of the diagnosis
DIAGNOSIS_ICD9_CD – Full ICD-9 diagnosis code
DIAGNOSIS_ICD9_DESC – Description of the ICD-9 diagnosis code

Metadata Description

The metadata/ folder contains demographic and temporal information associated with each ECG recording, including ECG acquisition time, date of birth, date of death, and derived age-related fields.
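As an illustration of how the ICD code tables described above might be queried, the following minimal Python sketch selects the patients who have at least one ICD-10 code starting with a given prefix. The column names are those listed for icd10_codes.csv; the sample rows, the date format, the `patients_with_prefix` helper, and the choice of the I48 prefix (atrial fibrillation and flutter) are assumptions made for this example and are not taken from the dataset.

```python
import csv
import io

# Hypothetical rows in the layout of icd10_codes.csv; not real HEEDB data.
SAMPLE_ICD10 = """BDSPPatientID,RECORDED_DT,DIAGNOSIS_ICD10_CD,DIAGNOSIS_ICD10_DESC
111,2015-03-02,I48.0,Paroxysmal atrial fibrillation
222,2016-07-19,G47.33,Obstructive sleep apnea
111,2017-01-05,I10,Essential (primary) hypertension
333,2018-11-30,I48.91,Unspecified atrial fibrillation
"""


def patients_with_prefix(csv_file, prefix):
    """Return the set of BDSPPatientID values with a code starting with prefix."""
    reader = csv.DictReader(csv_file)
    wanted = prefix.upper()
    return {
        row["BDSPPatientID"]
        for row in reader
        if row["DIAGNOSIS_ICD10_CD"].strip().upper().startswith(wanted)
    }


# With the real file, pass open("icd10_codes.csv", newline="") instead.
afib_patients = patients_with_prefix(io.StringIO(SAMPLE_ICD10), "I48")
print(sorted(afib_patients))
```

The same pattern should apply to icd9_codes.csv by substituting the DIAGNOSIS_ICD9_CD column. The resulting patient IDs could then be matched against the BDSPPatientID column of metadata.csv to find the corresponding ECG files.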
File: metadata.csv

Columns:
BDSPPatientID – Patient ID
FileName – Path to the WFDB file
FileID – Basename of the WFDB file
PatientRace
EthnicGroupDSC
MaritalStatusDSC
ReligionDSC
LanguageDSC
VeteranStatusDSC
SexDSC
PrimaryCauseOfDeathDSC
PrimaryCauseOfDeathUNOS
FirstContributoryCauseOfDeathDSC
FirstContributoryCauseOfDeathUNOS
SecondContributoryCauseOfDeathDSC
SecondContributoryCauseOfDeathUNOS
EducationLevelDSC
GenderIdentityDSC
SexAssignedAtBirthDSC
DateOfDeath
DateOfDeathMARegistryData – Massachusetts (MA) state death registry date of death
LastKnownVisitDate – Last time the patient had contact with the hospital system
ECGAcquisitionTime – Time of ECG acquisition
DateOfBirth
AgeAtAcquisition – Age at ECG acquisition
AgeAtDeath – Age at time of death
AgeAtDeathMA – Age at death according to MA state registry
AgeAtLastVisit – Age at the last hospital contact

For I0006, the following columns are missing from the metadata.csv file: EthnicGroupDSC, MaritalStatusDSC, ReligionDSC, LanguageDSC, VeteranStatusDSC, PrimaryCauseOfDeathDSC, PrimaryCauseOfDeathUNOS, FirstContributoryCauseOfDeathDSC, FirstContributoryCauseOfDeathUNOS, SecondContributoryCauseOfDeathDSC, SecondContributoryCauseOfDeathUNOS, EducationLevelDSC, GenderIdentityDSC, SexAssignedAtBirthDSC, and DateOfDeathMARegistryData.

Usage Notes

HEEDB is intended to support a wide range of ECG studies, in particular those exploring the relationship between ECG conditions and sleep.

Release Notes

v1.0: Initial release containing 10,608,417 ECGs from 1,818,247 subjects (I0001 site only).
v2.0: Added additional data files.
v3.0: Expanded to include two ECG institutions: I0001 (10,608,417 ECGs from 1,818,247 subjects) and I0006 (1,061,598 ECGs from 349,548 patients). Metadata, 12SL diagnostic codes, and ICD-9/10 diagnosis codes were also added. Duplicate ECGs were removed from the I0001 site, and incorrect sampling frequencies in header files were corrected.
v4.0: Corrected typos in the data description.

Ethics

The study protocol was approved by the Institutional Review Boards of the Massachusetts General Hospital (protocol # 2013P001024) and Beth Israel Deaconess Medical Center (protocol # 2022P000417). The requirement for written informed consent was waived because of the retrospective study design. The study also complied with the Declaration of Helsinki.

Acknowledgements

Publication of HEEDB is supported by a grant (R01HL161253) from the National Heart Lung and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children's Hospital, and Beth Israel Deaconess Medical Center.

Conflicts of Interest

Dr. Westover is a co-founder of, scientific advisor to, and consultant to Beacon Biosignals, and has a personal equity interest in the company. The other authors declare that they have no conflicts of interest.