In a recent study published in NEJM AI, scientists from the Icahn School of Medicine at Mount Sinai shed light on the limitations of cutting-edge artificial intelligence (AI) systems, particularly in the realm of medical coding.
The findings emphasise the importance of refining and validating these technologies before their potential integration into clinical practice.
The research, conducted over a period of 12 months within the Mount Sinai Health System, involved the analysis of more than 27,000 distinct diagnosis and procedure codes, carefully excluding any identifiable patient information.
By providing descriptions for each code, the team tested various AI models from OpenAI, Google, and Meta to determine their accuracy in generating corresponding medical codes. Discrepancies between the AI-generated codes and the originals were meticulously examined for patterns.
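For readers who want a concrete sense of how such an exact-match comparison might work, the short Python sketch below scores model output against the original codes for a list of description/code pairs. The query_model helper, the record format, and the sample values are hypothetical stand-ins for illustration, not the study's published pipeline, prompts, or data.

```python
# Minimal sketch of an exact-match evaluation for LLM-generated medical codes.
# query_model and the record format are illustrative assumptions, not the
# study's actual methodology or data.

def query_model(description: str) -> str:
    """Placeholder for an API call that asks an LLM (e.g. GPT-4, Gemini,
    Llama 2) to return a billing code for the given description."""
    raise NotImplementedError("swap in a real model call here")

def exact_match_rate(records: list[dict]) -> float:
    """Share of descriptions for which the generated code equals the original."""
    if not records:
        return 0.0
    hits = 0
    for rec in records:
        predicted = query_model(rec["description"]).strip().upper()
        if predicted == rec["code"].strip().upper():
            hits += 1
    return hits / len(records)

# Expected input shape (values here are purely hypothetical):
# records = [{"code": "XXX.X", "description": "nodular prostate without urinary obstruction"}]
# print(exact_match_rate(records))
```

Under this kind of scoring, a generated code that is semantically close but not identical to the original still counts as a miss, which is one reason exact-match rates can look low even when a model broadly understands the terminology.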
According to the study, all examined large language models, including GPT-4, GPT-3.5, Gemini-pro, and Llama-2-70b, demonstrated subpar accuracy, with success rates falling below 50 percent.
Notably, GPT-4 exhibited the highest performance among the models, achieving the top exact match rates for ICD-9-CM (45.9 percent), ICD-10-CM (33.9 percent), and CPT codes (49.8 percent).
Even when GPT-4 produced codes that technically conveyed the correct meaning, many still counted as errors. When presented with the description “nodular prostate without urinary obstruction,” for example, GPT-4 generated a code for the broader “nodular prostate,” demonstrating a grasp of medical terminology nuances but still falling short of an exact match.
The study also highlighted the importance of rigorous evaluation and ongoing development of AI technologies in sensitive areas like medical coding.
Dr. Ali Soroush, the corresponding author of the study and Assistant Professor at Icahn Mount Sinai, stressed the need for caution in approaching AI’s potential in healthcare.
While acknowledging its promise, Dr. Soroush emphasised the necessity of ensuring AI’s reliability and efficacy through continued refinement.
“Our findings underscore the critical need for rigorous evaluation and refinement before deploying AI technologies in sensitive operational areas like medical coding,” Dr. Soroush said.
One potential application identified by the researchers is the automation of medical code assignment for reimbursement and research purposes based on clinical text.
Despite the challenges uncovered, the study provides valuable insights into the complexities of integrating AI into healthcare operations.
Meanwhile, in Australia, the AI Trends for Healthcare report identifies the opportunities and challenges facing the continued and inevitable integration of AI into Australia’s healthcare sector, from clinical decision support to administrative tasks.
Research Director of CSIRO’s Australian e-Health Research Centre (AEHRC), Dr David Hansen, said that the use of AI in healthcare is unique because the accuracy of models could mean the difference between life and death, or between ongoing health and illness.
“A key difference between the use of AI in healthcare compared to other industries is the use of AI in decision making for prevention, diagnosis, monitoring and treatment,” Dr Hansen said.
“As we strive to create newer and better digital tools to harness the benefits of AI in healthcare, frameworks and ethical implementation along with established safety, quality and monitoring guidelines continue to be imperative,” he said.
While the promise of artificial intelligence to revolutionise medical coding appears vast, recent research underscores the complexity of the task. Despite advancements, large language models still face significant hurdles in accurately coding medical information.
The latest study at Mount Sinai serves as a crucial reminder that while AI holds immense potential, its integration into highly specialised sectors like healthcare requires careful consideration, ongoing refinement, and collaboration.