This hybrid NLP classifier was presented at the 2021 DC R Conference (hosted by Lander Analytics) and at Georgetown University’s Interdisciplinary Text Analysis Research group (hosted by the Massive Data Institute). Medical terms are linguistically very specific: changing a letter or two can produce an entirely different word, and shared prefixes and suffixes can link two words that otherwise look wildly different. As a result, many standard natural language processing (NLP) methods are poorly suited to medical records, with their specialized vocabulary and syntax. When a government client needed to classify medical conditions for record processing, IBM built a hybrid ensemble model that combines rules-based and machine learning classification to accommodate the client’s system structure while flexibly handling the nuances of medical terminology.
Full presentation available here.
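
To make the idea of combining rules with a learned model concrete, here is a minimal sketch in Python. It is not the model from the presentation: the suffix rules, the category labels, the tiny training set, and the character n-gram centroid classifier (a stand-in for the machine learning component) are all assumptions chosen for illustration. It also assumes one simple way of combining the two parts, where a rule decides when it matches and the learned model handles everything else.

```python
import math
import re
from collections import Counter

# Hypothetical suffix rules mapped to illustrative categories (assumptions).
RULES = [
    (re.compile(r"\b\w+itis\b"), "inflammatory"),
    (re.compile(r"\b\w+ectomy\b"), "surgical"),
]


def rule_classify(text):
    """Return the label of the first matching rule, or None if no rule fires."""
    lowered = text.lower()
    for pattern, label in RULES:
        if pattern.search(lowered):
            return label
    return None


def char_ngrams(text, n=3):
    """Count character n-grams, so shared prefixes and suffixes overlap."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class NgramCentroidClassifier:
    """Toy learned component: one summed n-gram profile per label."""

    def fit(self, texts, labels):
        self.centroids = {}
        for text, label in zip(texts, labels):
            self.centroids.setdefault(label, Counter()).update(char_ngrams(text))
        return self

    def predict(self, text):
        grams = char_ngrams(text)
        return max(self.centroids, key=lambda lbl: cosine(grams, self.centroids[lbl]))


def hybrid_classify(text, model):
    """Apply the rules first; fall back to the learned model if none match."""
    label = rule_classify(text)
    if label is not None:
        return label, "rule"
    return model.predict(text), "model"


# Made-up training examples, only to exercise the sketch.
model = NgramCentroidClassifier().fit(
    ["fractured femur", "broken wrist fracture", "type 2 diabetes", "diabetic neuropathy"],
    ["injury", "injury", "endocrine", "endocrine"],
)
print(hybrid_classify("acute appendicitis", model))
print(hybrid_classify("diabetes mellitus", model))
```

Character n-grams are used here because they pick up on shared word parts rather than whole words, which matches the point above about prefixes and suffixes. A production system would need a far larger rule set and training corpus, and could combine the two components differently, for example by weighting their votes.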