I'm an epidemiologist turned data scientist, currently working as a Staff Data Scientist at One Medical. My work spans program/intervention evaluation, surfacing data-driven insights, and building machine learning models. Often, this involves being a "data thought partner" for stakeholders, helping break nebulous problems down into a series of actionable experiments. While my work history has primarily been in healthcare, my interests extend to virtually all applications of data in consumer technology. In my free time, I enjoy volunteering on marine mammal rescues and rehabilitation.
Patients seek care for hundreds of different reasons; being able to predict why a patient might book an appointment in the near future has valuable implications for both visit forecasting and design of a patient self-booking flow. However, this problem is quite difficult: data is incredibly sparse and naturally limited, and there are hundreds of classes we’re attempting to predict.
Rather than use traditional classification methods like random forests or regression, I created a network based on historical appointment data and used a version of PageRank to identify likely reasons for visit. This approach was successful about one-third of the time, which was substantially better than other attempted approaches.
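The idea can be sketched with toy data and a hand-rolled power iteration standing in for the production pipeline: build a co-occurrence graph over reasons for visit, then run a personalized PageRank that restarts at the reasons already in a patient's history. The graph, weights, and reason names below are all invented for illustration.

```python
import numpy as np

# Toy co-visit graph over reasons-for-visit (illustrative data only):
# edge weight = how often two reasons co-occur in patients' histories.
reasons = ["annual physical", "flu shot", "lab review", "back pain", "pt referral"]
idx = {r: i for i, r in enumerate(reasons)}
W = np.zeros((5, 5))
for a, b, w in [("annual physical", "flu shot", 30),
                ("annual physical", "lab review", 25),
                ("flu shot", "lab review", 8),
                ("back pain", "pt referral", 12)]:
    W[idx[a], idx[b]] = W[idx[b], idx[a]] = w

def personalized_pagerank(W, restart, alpha=0.85, iters=100):
    """Random walk with restart; returns each node's stationary probability."""
    P = W / W.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    x = restart.copy()
    for _ in range(iters):
        x = (1 - alpha) * restart + alpha * (x @ P)
    return x

restart = np.zeros(5)
restart[idx["annual physical"]] = 1.0  # this patient's known history
scores = personalized_pagerank(W, restart)
best = max((r for r in reasons if r != "annual physical"),
           key=lambda r: scores[idx[r]])
```

Ranking every reason by its stationary probability turns the sparse history into a full ordering of likely next visit reasons.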
We started out just wanting to improve depression screening rates in our clinics, and intuitively understood the importance of clinics using data to drive that improvement. However, as I built out the reports, it was clear there were two different beliefs on our team: one group believed trended rates were the most valuable, while the other believed that raw numbers of patients would be more impactful.
So, in response, I built two versions of the reports, designed a randomized trial to test their effectiveness, launched it, and analyzed the results. We found that patient-level reporting increased the odds of a patient being screened for depression by roughly 20%. These results were presented at the 2020 Virtual AMIA Informatics Summit.
Our technology team implemented a new feature that allowed synonyms to be associated within medication search results. We added branded medication names as synonyms for their corresponding generics; this way, when a provider searches for a brand (e.g., “Lipitor”), they are first shown the corresponding generic medication (e.g., “atorvastatin”). Generic medications tend to be substantially less expensive than their branded counterparts, so this change can have direct financial impacts for patients, providers, and payors.
I evaluated the impact of this change using a pre-post observational design. The size of the effect varied by medication, but every medication studied showed an improvement in the percentage of prescriptions written as generic. We subsequently expanded this four-medication trial to all relevant medications.
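At its simplest, a pre-post comparison of generic-prescribing rates reduces to a two-proportion z-test; the counts below are invented for illustration and are not the study's data.

```python
import math

# Hypothetical pre/post counts for one medication (not the actual study data):
# prescriptions written as generic, out of all prescriptions, before and
# after the synonym feature launched.
generic = [120, 180]  # pre, post generic prescriptions
total = [400, 400]    # pre, post total prescriptions

p_pre, p_post = generic[0] / total[0], generic[1] / total[1]
p_pool = sum(generic) / sum(total)  # pooled rate under the null hypothesis
se = math.sqrt(p_pool * (1 - p_pool) * (1 / total[0] + 1 / total[1]))
z = (p_post - p_pre) / se           # two-proportion z statistic

print(f"generic rate: {p_pre:.0%} -> {p_post:.0%}, z = {z:.2f}")
```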
One major problem with using electronic health record data for secondary purposes is that many relationships end up confounded by patient complexity. That is, treatments are given to sicker patients, so you must disentangle the effect of the treatment from the fact that the patients receiving it were sicker to begin with. In this case, we wanted to identify patients whose high blood pressure was not improving in order to target them for additional outreach.
I built a deep neural network to predict a patient's hypertension status six months out. In addition to standard features, I included an additional layer of nodes for treatment propensity. These individually trained features were able to account for the confounded relationship between complexity and treatment, and led to improvements in model performance.
More Info: Abstract
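The gist of the approach, a propensity model whose output feeds the outcome model as an extra input, can be sketched on synthetic data. Here sklearn's small MLP stands in for the actual deep network, and the single "complexity" feature and all numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for EHR data: sicker patients (higher "complexity")
# are both more likely to be treated and more likely to stay hypertensive.
n = 1000
complexity = rng.normal(size=n)
treated = (rng.random(n) < 1 / (1 + np.exp(-complexity))).astype(int)
outcome = (rng.random(n) < 1 / (1 + np.exp(-(complexity - treated)))).astype(int)

# Step 1: model the propensity to receive treatment from baseline features.
propensity = LogisticRegression().fit(complexity.reshape(-1, 1), treated)
p_treat = propensity.predict_proba(complexity.reshape(-1, 1))[:, 1]

# Step 2: feed the propensity score to the outcome network as an extra
# input, helping separate the treatment effect from baseline severity.
X_aug = np.column_stack([complexity, treated, p_treat])
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                    random_state=0).fit(X_aug, outcome)
```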
Application usability is a frequent concern, especially in the context of electronic health records and physician burnout. Part of why physicians burn out is the tedium of doing the same activities over and over again. These activities quickly become predictable: if a patient is coming in for insomnia, there is a finite set of medications a provider is likely to prescribe. Why make them type those into a search box when you could present them up front?
I did exactly that using association rule mining (sometimes called “market basket analysis”), looking at which reason-for-visit concepts were associated with activities like lab orders, medications, referrals, and diagnoses. This approach is also plainly interpretable, which makes it easy to show providers why a suggestion appears, keeping the decisions transparent (and thus avoiding classification as a medical device under FDA rules). This won the “most geektastic” award at an internal hackathon, and was later presented at an American Medical Informatics Association meeting.
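A minimal version of the mining step computes support and confidence for one-to-one rules over visit "baskets". The visits below are invented; the real system used reason-for-visit concepts and full order data.

```python
from collections import Counter
from itertools import combinations

# Invented visit "baskets": a reason for visit plus the orders that followed.
visits = [
    {"insomnia", "trazodone", "sleep hygiene handout"},
    {"insomnia", "trazodone"},
    {"insomnia", "melatonin"},
    {"sore throat", "strep test"},
    {"sore throat", "strep test", "amoxicillin"},
]

n = len(visits)
item_counts = Counter(item for v in visits for item in v)
pair_counts = Counter(frozenset(p) for v in visits
                      for p in combinations(sorted(v), 2))

def rules(min_support=0.2, min_confidence=0.6):
    """One-to-one rules (lhs -> rhs) passing support/confidence thresholds."""
    out = {}
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        for lhs, rhs in ((a, b), (b, a)):
            # support = P(lhs and rhs); confidence = P(rhs | lhs)
            if count / n >= min_support and count / item_counts[lhs] >= min_confidence:
                out[(lhs, rhs)] = round(count / item_counts[lhs], 2)
    return out

found = rules()
```

The confidence values double as the transparent explanation shown to providers: a rule like "insomnia → trazodone" is simply "this order followed this reason for visit X% of the time."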
For many acute conditions (e.g., a sprained ankle or a cold), patients do not follow up unless their condition gets substantially worse. This leaves many unresolved problems cluttering electronic health records, as well as a lack of data about the efficacy of treatments for those conditions. The technology team implemented a new feature that sent automated follow-up check-ins to patients recently seen for acute problems. Providers could opt out of having these sent, and patients could simply not respond.
To better understand how this feature was being used, I built two penalized logistic regression models: one predicting whether a provider would opt out, and another predicting whether a patient would respond. These models surfaced multiple interesting associations, which led to adjustments in the feature's roll-out.
More Info: Abstract
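The modeling idea can be sketched on synthetic data: an L1 penalty shrinks uninformative coefficients to exactly zero, which keeps models like these sparse and interpretable. The features and data here are invented stand-ins, not the EHR-derived features actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Invented design matrix: 10 candidate features, of which only columns 0
# and 3 actually influence the (binary) outcome.
X = rng.normal(size=(200, 10))
y = (X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# L1-penalized ("lasso") logistic regression: the penalty zeroes out weak
# coefficients, leaving a short list of drivers to interpret.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
kept = np.flatnonzero(model.coef_[0])  # indices of surviving features
```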
Patients send thousands of messages to their doctors each year, and many of them concern non-clinical issues that end up being resolved by administrative staff. This is thought to contribute to provider burnout, and likely leads to increased response times for patients.
Colleagues of mine created a training dataset of messages and built an initial NLP model to identify messages that could be moved out of clinical inboxes. I did the performance evaluation and model tweaking, evaluated additional (non-naive-Bayes) models, and provided interpretations of the models. This problem proved to be a good use case for NLP, and versions of this model are currently in production.
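A toy version of the classification task, using TF-IDF bag-of-words features into a linear classifier as one of the non-naive-Bayes alternatives; the messages below are invented and nothing like the real training data in scale.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented messages: 1 = administrative, 0 = clinical.
messages = [
    "can you update my billing address", "question about my invoice",
    "need to reschedule my appointment", "my copay seems wrong",
    "my rash is getting worse", "side effects from the new medication",
    "chest pain when climbing stairs", "refill request for my inhaler",
]
is_admin = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words baseline: TF-IDF features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(messages, is_admin)
```

A linear model over TF-IDF weights also makes interpretation straightforward: the largest coefficients are simply the words most indicative of each inbox.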
Anxiety is a highly prevalent condition, and is often managed with medications or expensive one-on-one therapy. My collaborators developed a new mindfulness-based group visit program to help patients with anxiety. After running the program for over a year, they approached me for assistance in evaluating its effectiveness.
I conducted a pre-/post- statistical analysis of the program, finding significant improvements in both clinic utilization and the severity of anxiety symptoms. The results of this study were presented at the Academy Health conference.
Gene expression data can contain tens of thousands of data points indicating how actively particular genes are being used. When someone becomes infected with a disease, say the common cold, their body kicks off a series of changes to fight that infection. I was curious whether those changes were detectable at the gene level.
Using published data from multiple studies, with statistical corrections for differences among those studies, I trained multiple support vector machines to predict the infectious agent a particular patient had. I then used recursive feature elimination to identify the specific genes that were most over- or under-expressed in patients with each infection.
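The SVM-plus-RFE loop can be sketched on a synthetic expression matrix in which only two "genes" actually shift with infection status; the real data had tens of thousands of genes pooled across studies.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(2)

# Synthetic "expression matrix": 100 samples x 50 genes; only genes 0 and 1
# actually shift with infection status (illustrative data only).
X = rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)
X[y == 1, 0] += 2.0  # gene 0 over-expressed in infected samples
X[y == 1, 1] -= 2.0  # gene 1 under-expressed in infected samples

# Recursive feature elimination: repeatedly refit the linear SVM and drop
# the gene with the smallest weight until only the most discriminative remain.
selector = RFE(LinearSVC(dual=False, max_iter=5000),
               n_features_to_select=2).fit(X, y)
top_genes = np.flatnonzero(selector.support_)
```

The sign of the surviving SVM weights then indicates whether each retained gene was over- or under-expressed in infection.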
HIV treatment was revolutionized by the development of “highly active antiretroviral therapy” (HAART) in 1996: a three-drug cocktail targeting two distinct mechanisms unique to HIV. This proved incredibly effective at stopping the progression of HIV. While new drugs continue to be developed, HIV’s rapid mutation rate has led to the development of resistant strains. I sought to understand the impact of antiretroviral medications on the genomic evolution of HIV-1.
Using publicly available HIV genome data, I calculated nucleotide diversity measures and used those to calculate the rate of genomic evolution. I then trained a support vector machine to predict treatment resistance, and used recursive feature elimination to identify specific regions of the genome that appear to be driving resistance.
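Nucleotide diversity (pi) is just the average pairwise difference rate across aligned sequences. With a toy alignment (not real HIV genomes):

```python
from itertools import combinations

# Toy alignment of four sequences (real inputs were aligned HIV-1 genomes).
seqs = ["ATGCGT", "ATGAGT", "TTGCGT", "ATGCGA"]

def pairwise_diff(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Nucleotide diversity: mean pairwise difference over all sequence pairs.
pairs = list(combinations(seqs, 2))
pi = sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)
```

Tracking this quantity over time (e.g., by sample year) gives the rate of genomic evolution that the resistance analysis builds on.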
As we’ve gained more data and information about disease processes, we’ve learned that diseases we once thought of as homogeneous (e.g., cancer) actually comprise many different diseases that all present similarly. Now, with ubiquitous electronic health record data, it’s increasingly possible to detect heterogeneity that was previously invisible (informaticists call this “EHR-based clinical phenotyping”). The research group I worked for set out to do this for pre-diabetes.
I used association rule mining (ARM) to identify groups of patients that were likely to progress to diabetes. I then used penalized logistic regression to generate propensity scores for patients receiving statins, and used those scores to match patients within the ARM-defined groups. Through this process, three different groups of patients, with different risks of diabetes progression, were identified. This work was presented at an American Medical Informatics Association meeting.
More Info: Paper
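The propensity-scoring and matching steps might be sketched as follows, here applied to one synthetic cohort rather than within the ARM-defined groups, and with invented features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Invented cohort: feature 0 drives the chance of receiving a statin.
X = rng.normal(size=(300, 4))
statin = (rng.random(300) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# Penalized logistic regression yields each patient's propensity score:
# the probability of receiving a statin given their features.
ps = LogisticRegression(penalty="l2", C=1.0).fit(X, statin).predict_proba(X)[:, 1]

# Greedy 1:1 nearest-neighbor matching on the propensity score.
treated = np.flatnonzero(statin == 1)
available = set(np.flatnonzero(statin == 0))
matches = {}
for t in treated:
    if not available:
        break  # ran out of unmatched controls
    best = min(available, key=lambda c: abs(ps[t] - ps[c]))
    matches[int(t)] = best
    available.remove(best)
```

Comparing outcomes across matched pairs then approximates a like-for-like comparison between statin and non-statin patients.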
Resourcing care coordinators has been an ongoing issue, made particularly prominent by changing payment schemes in healthcare (e.g., patient-centered medical homes, accountable care organizations, and value-based contracts). One common approach is to pay clinics based on the medical complexity of their patients, reasoning that more complex patients take more of care coordinators' time and resources. Anecdotally, however, that was not the case; this project set out to identify the actual drivers of care coordination time.
Two different data sources were identified: a database compiled by case managers and care coordinators (including utilization billed in 15-minute increments) and data from electronic health records. I combined these data and used a linear model trained with stepwise regression to identify drivers of care coordination time. Social factors, like housing status, were associated with substantially more coordination time than medical complexity was. This work was presented at an American Public Health Association meeting.
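Forward stepwise selection with an AIC stopping rule can be sketched on synthetic data; the "housing" and "complexity" stand-in columns below are invented, not the study's variables.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented stand-ins: column 0 ~ unstable housing, column 1 ~ medical
# complexity score, columns 2-5 pure noise; outcome is coordination minutes.
X = rng.normal(size=(200, 6))
minutes = 60 + 25 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=10, size=200)

def rss(cols):
    """Residual sum of squares for an OLS fit on the given columns."""
    A = np.column_stack([np.ones(200)] + [X[:, c] for c in cols])
    resid = minutes - A @ np.linalg.lstsq(A, minutes, rcond=None)[0]
    return resid @ resid

def aic(cols):
    # Gaussian-likelihood AIC up to a constant: n*log(RSS/n) + 2*(params)
    return 200 * np.log(rss(cols) / 200) + 2 * (len(cols) + 1)

# Forward stepwise: repeatedly add the predictor that most lowers AIC,
# stopping when no remaining predictor improves it.
selected = []
while True:
    candidates = [c for c in range(6) if c not in selected]
    if not candidates:
        break
    best = min(candidates, key=lambda c: aic(selected + [c]))
    if aic(selected + [best]) >= aic(selected):
        break
    selected.append(best)
```

The order of entry mirrors the study's finding: the large "social" effect enters first, with the smaller complexity effect behind it.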