Phenotyping: high precision in the creation of patient cohorts based on electronic health record data.

SporeData services

Jack is on a mission. He is trying to build a cohort of Intensive Care Unit patients diagnosed with a ruptured cerebral aneurysm. Jack is excited, since this cohort would become a significant platform for his research program, but the challenges of selecting patients from an electronic health record (EHR) keep piling up. For one, ICD-10 codes in the EHR are not entirely reliable: they can miss true cases (false negatives) and can include patients who do not have a ruptured aneurysm (false positives). Extracting information that is not stored in a spreadsheet-like (data frame) format, such as free-text clinical notes, is not trivial either, and Jack doesn’t have the resources to have chart abstractors go through thousands of cases.
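To make the false-negative and false-positive problem concrete, here is a minimal sketch of how an ICD-10 selection rule could be checked against a small chart-reviewed sample. Everything in it is an assumption for illustration: the five toy patients, the `chart_confirmed` field, and the choice of the `I60` prefix (nontraumatic subarachnoid hemorrhage) as a stand-in selection rule. It is not a validated case definition.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    patient_id: str
    icd10_codes: set          # codes recorded in the EHR
    chart_confirmed: bool     # hypothetical result of manual chart review


def flagged_by_icd(patient, prefix="I60"):
    """Return True if any recorded code starts with the chosen prefix."""
    return any(code.startswith(prefix) for code in patient.icd10_codes)


def evaluate_icd_rule(patients, prefix="I60"):
    """Compare the ICD rule with chart review and summarize the errors."""
    tp = fp = fn = tn = 0
    for p in patients:
        flagged = flagged_by_icd(p, prefix)
        if flagged and p.chart_confirmed:
            tp += 1
        elif flagged:
            fp += 1           # coded, but chart review disagrees
        elif p.chart_confirmed:
            fn += 1           # true case the code-based rule missed
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "sensitivity": sensitivity, "ppv": ppv}


# Toy, invented sample: not real patient data.
PATIENTS = [
    Patient("P1", {"I60.7"}, True),
    Patient("P2", {"I60.9"}, False),           # false positive
    Patient("P3", {"I67.1", "G93.6"}, True),   # false negative
    Patient("P4", {"I61.9"}, False),
    Patient("P5", {"I60.2"}, True),
]

result = evaluate_icd_rule(PATIENTS)
print(result)
```

On a real project the chart-reviewed sample would be drawn at random from both flagged and unflagged patients, so that sensitivity and positive predictive value are not biased by how the sample was chosen.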

Jack is tapping into something big, and there are certainly brand new resources to assist him. Below we make some suggestions about the most promising approaches we have been tracking and testing in projects as part of researchdesigneR, our Artificial Intelligence-based system for decision support:

  1. PheCAP (Zhang et al. 2019). PheCAP is a phenotyping approach in which the data are first extracted from the EHR, often using ICD codes. PheCAP then combines spreadsheet-like (structured) variables with variables extracted from free text (admission and discharge reports, progress notes) to feed machine learning models, both unsupervised and supervised. The output of these models is the probability that a patient has a certain condition or intervention. The models are trained and evaluated against a gold standard (usually around 200 patients) created, in parallel, by chart abstractors.
  2. Noisy silver standard. Known for its association with the APHRODITE statistical package (Banda et al. 2017), the noisy silver standard uses an iterative approach. It starts by using regular expressions to search for terms related to the diagnosis or treatment of interest. This first, imperfect set of labels (the silver standard) is used to train a supervised model (frequently a convolutional neural network), which then labels thousands or even millions of patient records. Humans review a sample of those records to improve the precision of the original regular expression algorithm. Data scientists repeat this cycle until performance against a gold standard reaches a plateau.
  3. BERT. Google created a new frenzy in 2018 when it released BERT as an open-source Natural Language Processing (NLP) project. BERT is powerful because it analyzes free text in a manner somewhat similar to what we humans do: rather than analyzing words in isolation, it analyzes them in context. For example, when we say that the diagnosis was right and that the affected arm was the right one, the two uses of the word “right” mean completely different things. BERT captures that. While BERT is not a phenotyping method per se, using it within a phenotyping pipeline can improve the precision of the whole process.
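One pass of the noisy silver standard cycle described in item 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the APHRODITE implementation: the regular expressions, the invented clinical notes, and the small naive Bayes classifier (used here as a lightweight stand-in for the supervised model) are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical patterns: a positive mention and a crude negation check.
POSITIVE = re.compile(r"\bruptured\s+(?:cerebral|intracranial)?\s*aneurysm\b", re.I)
NEGATED = re.compile(r"\b(?:no|not|without|ruled\s+out)\b[^.]*\baneurysm", re.I)


def silver_label(note):
    """Noisy label: 1 if a non-negated positive mention is found, else 0."""
    return int(bool(POSITIVE.search(note)) and not NEGATED.search(note))


def tokenize(note):
    return re.findall(r"[a-z]+", note.lower())


def train_naive_bayes(notes, labels):
    """Count words per class; the counts are the whole model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for note, label in zip(notes, labels):
        word_counts[label].update(tokenize(note))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict_proba(model, note):
    """Probability of class 1, using Laplace smoothing throughout."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in (0, 1):
        score = math.log((class_counts[label] + 1) / (total + 2))
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(note):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp[1] / (exp[0] + exp[1])


# Invented notes, for illustration only.
NOTES = [
    "CT shows subarachnoid hemorrhage from ruptured cerebral aneurysm, coiling planned.",
    "Ruptured aneurysm of the anterior communicating artery with subarachnoid hemorrhage.",
    "Unruptured aneurysm found incidentally, no hemorrhage.",
    "No evidence of ruptured aneurysm on CTA.",
    "Admitted for pneumonia, no aneurysm on imaging.",
    "Ischemic stroke, no hemorrhage, started on aspirin.",
]

silver = [silver_label(n) for n in NOTES]   # step 1: regex-based silver labels
model = train_naive_bayes(NOTES, silver)    # step 2: train on the noisy labels

# Step 3: score a note the regex misses; a reviewer would inspect such
# disagreements and refine the patterns before the next iteration.
missed = "Subarachnoid hemorrhage, ruptured ACoA aneurysm secured by coiling."
print(silver_label(missed), round(predict_proba(model, missed), 3))
```

On this toy data the regular expression does not match the "ruptured ACoA aneurysm" note, while the classifier trained on the silver labels still scores it as a likely case. Disagreements of this kind are what the human review step is meant to surface.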


Banda, Juan M, Yoni Halpern, David Sontag, and Nigam H Shah. 2017. “Electronic Phenotyping with APHRODITE and the Observational Health Sciences and Informatics (OHDSI) Data Network.” AMIA Summits on Translational Science Proceedings 2017: 48.

Zhang, Yichi, Tianrun Cai, Sheng Yu, Kelly Cho, Chuan Hong, Jiehuan Sun, Jie Huang, et al. 2019. “High-Throughput Phenotyping with Electronic Medical Record Data Using a Common Semi-Supervised Approach (PheCAP).” Nature Protocols, 1–19.