How finalist Pre-syndromic Surveillance uses text analysis to detect disease outbreaks

This is the fifth post in our Q&A series featuring the five Hidden Signals Challenge finalists. These solutions demonstrate the exciting potential for open data to help identify biothreats in real time.

Our final post features a Q&A with finalists Daniel B. Neill and Mallory Nobles. Their machine learning system, Pre-syndromic Surveillance, overlays real-time emergency room chief complaint data with social media and news data using the semantic scan, a novel approach to text analysis. The model detects emerging clusters of rare disease cases that do not correspond to known syndrome types.

“We are developing a ‘human in the loop’ interface that allows public health officials to define events of interest and distinguish between events of high interest (‘notify me of any meningitis exposures’) and low interest (‘don’t show me clusters from motor vehicle accidents’). The system will learn continually from this feedback, enabling it to better identify emerging threats, while limiting false positives.”

– Pre-syndromic Surveillance

What inspired you to submit this concept?

Our goal is to provide practical tools that can be used by multiple local, state, and national health departments to improve day-to-day disease monitoring practice and enhance situational awareness of rare and novel bio-threats. When developing these decision-making aids, we are motivated by the challenge of surfacing the right insights for the user at the right time.

Through a long-standing relationship with New York City’s Department of Health and Mental Hygiene, we have developed an approach for uncovering clusters of disease cases that have high relevance to local public health. When our methodology was applied to historical Emergency Department data from New York, North Carolina, and western Pennsylvania, we found clusters related to rare but periodic outbreaks of contagious diseases, like bacterial meningitis and scabies, clusters of drug overdoses, instances of carbon monoxide poisoning or smoke inhalation, and novel events like a cluster of cases from contaminated coffee.

We are excited to have the opportunity to further refine our concept through the Hidden Signals Challenge. The feedback, insight, and expert mentorship have been invaluable as we develop plans for implementation and scaling, and consider a system design that best meets local and national health department needs.

How will your concept enable city-level operators to make critical and proactive decisions?

As local public health practitioners use our tool, they will see a ranked set of detected clusters, each consisting of a list of emergency department cases. The public health practitioner can use the system to further investigate each detected cluster. The system will display data for each case in the cluster. The user will also see graphs and tables that summarize information about the spatiotemporal extent, affected demographics, and textual topic of the cluster, as well as social media data associated with that cluster.

We are developing a “human in the loop” interface that allows public health officials to define events of interest and distinguish between events of high interest (“notify me of any meningitis exposures”) and low interest (“don’t show me clusters from motor vehicle accidents”). The system will learn continually from this feedback, enabling it to better identify emerging threats, while limiting false positives.
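To make the feedback idea concrete, here is a minimal sketch in Python of how high-interest and low-interest preferences could reorder and filter a ranked list of clusters. It is an illustration only, not the team's implementation: the `Cluster` and `FeedbackRules` names, the keyword-matching logic, and the scores are all hypothetical, and the sketch uses fixed rules rather than a system that learns from feedback over time.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical detected cluster: an id, an anomaly score, and topic keywords."""
    cluster_id: str
    score: float
    keywords: set


@dataclass
class FeedbackRules:
    """Hypothetical user preferences expressed as keyword sets."""
    high_interest: set = field(default_factory=set)
    low_interest: set = field(default_factory=set)

    def apply(self, clusters):
        """Drop low-interest clusters, then list high-interest ones first."""
        kept = []
        for cluster in clusters:
            is_high = bool(cluster.keywords & self.high_interest)
            is_low = bool(cluster.keywords & self.low_interest)
            # Assumption: a high-interest match overrides a low-interest match.
            if is_low and not is_high:
                continue
            kept.append((is_high, cluster))
        # High-interest clusters first, then by descending score.
        kept.sort(key=lambda pair: (not pair[0], -pair[1].score))
        return [cluster for _, cluster in kept]
```

With rules such as `FeedbackRules(high_interest={"meningitis"}, low_interest={"mvc"})`, a motor vehicle cluster would be hidden and a meningitis cluster would move to the top of the list, regardless of its raw score.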

When a local public health practitioner determines that a detected cluster may be of interest to other jurisdictions, they can choose to share a summary of the cluster with other local and national users, who will then be able to comment on the summary or automatically search for similar clusters in their own data.

What sets your concept apart from existing solutions?

Many existing systems take a “syndromic surveillance” approach, monitoring pre-established syndrome categories using either diagnostic codes or text classification. Our concept is a novel methodological approach for “pre-syndromic surveillance” that instead automatically learns the disease categories that are emerging in free-text, pre-diagnostic clinical data. By integrating free text and structured data, we also enable the detection of disease clusters that are geographically localized or that differentially affect sub-populations.

Using free-text data can improve timeliness of detection by several days, a significant margin for rapidly emerging outbreaks, since patients are often not assigned International Classification of Diseases diagnostic codes until discharge from the hospital. Free-text data also often includes not only symptoms, but other information (recent travel, relevant events and locations), enabling detection of clusters of “fainting on the subway” or “exhibiting unusual symptoms after travel”. Because novel bio-threats can take so many different forms, developing syndromes in advance for every imaginable event is impossible, necessitating a data-driven approach.

While syndromic surveillance has improved greatly over our 15 years of work in the field, we strongly believe that pre-syndromic surveillance is a necessary next step that public health agencies need for timely, targeted, and effective responses to emerging threats.

What is the biggest insight you’ve uncovered through the Challenge thus far?

At this stage in the Challenge, we have already enjoyed the opportunity to meet with a diverse set of expert mentors, including those from the Department of Health and Human Services (HHS) Office of the Assistant Secretary for Preparedness and Response (ASPR), the National Biosurveillance Integration Center (NBIC) at the Department of Homeland Security (DHS), and the Emergency Management Department for the City of San Francisco.

These conversations have strengthened our understanding of the complexity of how local and national organizations work together to detect and respond to emerging bio-threats. This understanding has helped us anticipate the needs of our various end users and consider how we can best scale our system to a national level. These conversations have also highlighted the need for our concept to both enhance local detection and contribute to inter-jurisdictional situational awareness.

Learn more about other finalists’ solutions on the Challenge blog and subscribe to our newsletter to continue receiving updates. The winner(s), who will receive up to $200,000 in cash prizes, will be announced later this spring.