Natural Language Processing in Healthcare: The Pros, Cons, and Potential Impact

Natural language processing (NLP) is a powerful tool revolutionizing how information is collected, stored and used in healthcare and health science. Due to the complexity of data in a healthcare setting, natural language processing is a particularly useful data management tool. In this blog, I will go over the advantages of natural language processing, its limitations, and its potential impact on the healthcare industry.

The current state of data

The modern world is becoming increasingly focused on how data guides decisions. Even the most mundane internet search for the day’s weather or viewing the weekly specials at a local grocery store generates data bought and sold by companies. Those and other data are used to specifically market products and services to potential customers by using targeted ads.

Similarly, data are being used more and more across industries, institutions and settings. Applications can range from gaining insights on how governments can best deliver services to delivering the safest and most effective treatments in healthcare. 

While many industries that interact directly with people have extremely complex relationships between data and its successful application, few match the complexity seen in healthcare.

Healthcare data is mostly unstructured

Healthcare data is unique because of the variation in the patients and clients, the practices of medical professionals, policies, procedures, processes in service delivery, and more. 

As a function of this complexity, documentation by medical professionals is virtually impossible to completely standardize into a series of boxes to tick and choices from pick lists. Instead, healthcare-related documentation relies heavily on spoken dictations transcribed into records or written case notes in sentence and paragraph form. These dictations often contain abbreviations or jargon.

Data that are usable for descriptive and analytical reporting, analysis and visualization must be structured. Often this means each value can be assigned to the specific trait it represents and fits in a single cell at the intersection of a row and a column in a dataset.

The more specific the criterion for categorizing data is, the more reliable and valid the insights derived from those data will be. 

Further, traits and variables that are not specifically categorized in a dataset cannot be applied with any validity against other data in relational data systems.

Meeting these requirements for usable data is a very well-known challenge within healthcare. Because clinician documentation is spoken or written in sentences and paragraphs, and is full of abbreviations and jargon, a large volume of healthcare information does not fit in a structured dataset.

Using Electronic Medical Records and Health Information Management

There have been efforts to categorize healthcare data into a structure compatible with datasets. This has included creating structured diagnostic, intervention, event, billing, patient, provider, institution, location, and many other codes.  

The conversion of healthcare information to these codes is a hybrid of two processes: Electronic Medical Records (EMRs), where healthcare providers select the most applicable option for the scenario, and Health Information Management (HIM), where staff review case notes and dictations and place that information into data categories.

Both EMR-facilitated data entry and HIM staff-facilitated coding have led to tremendous gains in reliable and available data. These coded data have led to improvements in efficacy and safety across practice, interventions, administration, technologies and many other domains of healthcare.

However, there are still limitations with EMR- and HIM-facilitated coding:

  • EMR data entry requires distinct categories and available options
  • If an observed phenomenon is not an option in the EMR, those data are lost or held in notes that cannot easily be used to populate the dataset
  • HIM staff are costly
  • Often there is an overwhelming amount of data to code, leading to difficulties in adding, expanding or amending content

Natural Language Processing (NLP) as a solution

Realizing the limitations of EMR- and HIM-facilitated coding, healthcare and health sciences have begun to apply text and speech recognition tools that feature Artificial Intelligence (AI). These tools are known as Natural Language Processing (NLP).

NLP examines written and spoken information for themes and categories by applying the integrated knowledge of linguistics, psychology, cognitive science, neuroscience and computer science.   

Through countless applications of NLP to uncategorized and generally unstructured language-based communications, NLP systems and software have become generally reliable at identifying, categorizing and storing data in a structured way.

These structured data can then be applied to datasets that can be aggregated, combined with other data, described, analyzed and visualized with high reliability and validity.

How does NLP work in healthcare?

Natural Language Processing (NLP) is applied in a healthcare and health sciences context by feeding the NLP program or application lists and/or groups of terms applicable to the question(s) at hand.

Some NLP programs and applications may have suitable predefined groups of terms built in. NLP then examines unstructured records for the presence and/or absence of terms, or of proxies for terms or themes. The findings from this record analysis are then written to a dataset.

If programmed to do so, parallel or consecutive categorizations on the same record can occur. For example, if drug X is detected in the record, it could be placed in an observed list of drugs. Further, when a drug is detected, the dosage interval and volume could be recorded in their respective categories as well.
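To make the idea of term lists and parallel categorization concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular NLP product works: the drug list, the dose and interval patterns, and the function name extract_medications are all assumptions I made up for the example, and a real deployment would rely on a curated clinical vocabulary rather than three hard-coded terms.

```python
import re

# Hypothetical term list; a real project would use a curated drug vocabulary.
DRUG_TERMS = ["metformin", "lisinopril", "warfarin"]

# Assumed patterns: a number followed by a unit, and a few common interval phrases.
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml)\b", re.IGNORECASE)
INTERVAL_PATTERN = re.compile(
    r"\b(once daily|twice daily|every \d+ hours|daily|bid|tid|qid)\b",
    re.IGNORECASE,
)


def extract_medications(note):
    """Return one structured row per drug term found in a free-text note."""
    rows = []
    # Split into rough sentences so a dose is only linked to a drug named nearby.
    for sentence in re.split(r"(?<=[.;])\s+", note):
        for drug in DRUG_TERMS:
            if re.search(rf"\b{re.escape(drug)}\b", sentence, re.IGNORECASE):
                dose = DOSE_PATTERN.search(sentence)
                interval = INTERVAL_PATTERN.search(sentence)
                rows.append({
                    "drug": drug,
                    "dose": f"{dose.group(1)} {dose.group(2).lower()}" if dose else None,
                    "interval": interval.group(1).lower() if interval else None,
                })
    return rows


note = "Patient started on Metformin 500 mg twice daily. Continue warfarin as before."
for row in extract_medications(note):
    print(row)
```

Each returned row maps directly onto the columns of a structured dataset. Where the note does not state a dose or interval, the corresponding field is left empty rather than guessed.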

This makes applying NLP to unstructured data very attractive in healthcare and health sciences environments, as NLP tools (in theory) can be deployed with little human training or human resourcing, and results are generally made available quickly.

NLP can also be used as a complementary data source for validating coded data. For example, the weights of newborns may be flagged as outliers in the coded data, while analysis of the case notes shows that the person documenting recorded the weights in grams, not pounds.
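The newborn weight scenario can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the plausible range in pounds, the regular expression for gram mentions, and the function name review_birth_weight are all hypothetical choices for the example, not clinical reference values or part of any specific tool.

```python
import re

# Assumed plausible range for a newborn weight recorded in pounds (illustrative only).
PLAUSIBLE_LB = (2.0, 14.0)
GRAMS_PER_POUND = 453.592

# Looks for a three- or four-digit number followed by an explicit gram unit.
GRAM_MENTION = re.compile(r"(\d{3,4})\s*(?:g|grams?)\b", re.IGNORECASE)


def review_birth_weight(coded_weight_lb, case_note):
    """Flag an out-of-range coded weight and suggest a fix if the note uses grams."""
    low, high = PLAUSIBLE_LB
    if low <= coded_weight_lb <= high:
        return {"status": "ok", "weight_lb": coded_weight_lb}

    match = GRAM_MENTION.search(case_note)
    if match and float(match.group(1)) == coded_weight_lb:
        # The coded value matches a gram figure in the note: likely a unit mix-up.
        return {
            "status": "likely_grams",
            "weight_lb": round(coded_weight_lb / GRAMS_PER_POUND, 2),
        }
    # Out of range and no explanation found in the note: send to a human.
    return {"status": "needs_review", "weight_lb": None}


print(review_birth_weight(3400, "Infant delivered at term, birth weight 3400 g."))
```

The point of the design is that the note is used to explain an outlier, not to silently overwrite it. Anything the note cannot explain is routed to a person for review.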

Can NLP keep up with the complexity of human language? 

Despite the power of Natural Language Processing (NLP) to detect and categorize the information from unstructured data, certain scenarios compromise the validity of such categorizations.  

Language, as used by humans, has evolved to be incredibly complex, diverse and nuanced. Language is further complicated by the language(s) spoken, dialect, devices such as sarcasm and many other factors.

Computer science practitioners have made tremendous progress in training NLP to detect negation, where words such as “not”, “without”, “unseen”, “ruled out” and others signal that a trait should not be categorized as present in a record. For example, a record with “The radiologist report confirms the presence of a glioblastoma” would be an appropriate record from which to enter glioblastoma into a patient diagnosis field. On the other hand, a record with “The radiologist report shows the patient free of a glioblastoma” would not be an appropriate record from which to apply that diagnosis.

Computer scientists have managed to control for many negations such as this by feeding NLP tools countless records to be analyzed and appraised for validity.
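For readers who want to see what negation handling looks like at its simplest, below is a rule-based sketch in Python. Note that this is a hand-written cue list with a fixed word window, which is a much cruder technique than the trained systems described above. The cue lists, the window size and the function name is_affirmed are assumptions chosen for illustration.

```python
import re

# Small illustrative cue lists; real systems use much larger, validated lexicons.
PRE_CUES = ["no", "not", "without", "free of", "negative for", "ruled out"]
POST_CUES = ["ruled out", "not seen", "unseen"]
WINDOW = 6  # number of words on each side of the term to search for a cue


def is_affirmed(sentence, term):
    """Return True if the term appears with no negation cue nearby,
    False if it appears negated, and None if the term is absent."""
    text = sentence.lower()
    match = re.search(rf"\b{re.escape(term.lower())}\b", text)
    if match is None:
        return None
    # Keep only the few words immediately before and after the term.
    before = " ".join(text[:match.start()].split()[-WINDOW:])
    after = " ".join(text[match.end():].split()[:WINDOW])
    if any(re.search(rf"\b{re.escape(cue)}\b", before) for cue in PRE_CUES):
        return False
    if any(re.search(rf"\b{re.escape(cue)}\b", after) for cue in POST_CUES):
        return False
    return True


examples = [
    "The radiologist report confirms the presence of a glioblastoma",
    "The radiologist report shows the patient free of a glioblastoma",
    "Glioblastoma was ruled out.",
]
for sentence in examples:
    print(is_affirmed(sentence, "glioblastoma"), "-", sentence)
```

A fixed cue list like this will miss phrasings it has never seen, which is exactly why false positives and negatives remain a concern.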

That said, due to the complexity and diversity of language, false positives and negatives can still occur. Some NLP packages provide an estimated probability that a categorization is valid.

It is critical that those considering the application of NLP to data both acknowledge and plan for the possibility of language nuances creating false positives and negatives.
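One way to plan for false positives and negatives is to compare NLP output against a manually reviewed sample and count the errors at a chosen probability cutoff. The sketch below assumes the NLP tool reports a probability per record and that a human reviewer has labelled the same records. The sample values and the function name evaluate are invented for illustration and are not results from any real system.

```python
def evaluate(predictions, threshold=0.5):
    """Count errors at a probability cutoff.

    predictions: list of (probability, reviewer_says_present) pairs, where the
    probability comes from the NLP tool and the boolean from manual review.
    """
    tp = fp = fn = tn = 0
    for probability, truly_present in predictions:
        flagged = probability >= threshold
        if flagged and truly_present:
            tp += 1
        elif flagged and not truly_present:
            fp += 1  # false positive: NLP flagged a trait the reviewer did not find
        elif not flagged and truly_present:
            fn += 1  # false negative: NLP missed a trait the reviewer found
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Made-up sample: (NLP probability, manual review result).
sample = [(0.95, True), (0.80, True), (0.65, False), (0.40, True), (0.10, False)]
print(evaluate(sample, threshold=0.5))
print(evaluate(sample, threshold=0.7))
```

Raising the cutoff in this sample removes the false positive but does nothing for the missed record. That trade-off is worth examining on your own data before relying on the output.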

NLP offers powerful insights into data

While the limitations of Natural Language Processing (NLP) discussed above must always be considered when applying NLP to unstructured data, NLP offers powerful insight into data that otherwise could not be combined with other data, aggregated, described, analyzed and visualized without tremendous effort, human resources and cost.

The application of NLP in a healthcare and health sciences context for the developing world is a fantastic example of its utility.

NLP is an option in locations where complex Electronic Medical Record (EMR) systems, along with the deployment, IT staff, coding staff and other resources needed to operate and maintain them, are too costly or unavailable. In this case, written clinician notes are analyzed by NLP, which categorizes and stores the data for knowledge work at a much-reduced cost.

Similarly, even where advanced EMR systems are in place, NLP allows those working with data to quickly apply new views to existing data and to begin tracking new data with relative ease, without a great deal of staff training on classification, coding and input, or IT reconfiguration of the EMR.

By: Martin Bauwens MSc, Senior Data Scientist

MMS developed a technology platform, Datacise, that assists in performing data curation on NLP data. To learn more about Datacise or Natural Language Processing, email us to be connected to an expert.