Automating Oncology Data Structuring Using LLMs on Danny Platform

Over the past seven years, Danny Platform has emerged as a leading solution for structuring and validating extensive oncology electronic health record (EHR) data. Through a combination of sophisticated algorithms and rigorous manual quality assurance, Danny Platform has compiled a high-quality, labeled dataset capturing key clinical insights across various cancer types. This dataset draws from a comprehensive range of data sources, including inpatient care records, radiology and pathology reports, clinical protocols, and ambulatory procedure documentation.

We are entering a new phase in the development of Danny Platform with the adoption of Large Language Models (LLMs) for data structuring. By the end of 2024, Danny Platform is expected to have used LLMs to structure over 2,000,000 clinically significant variables from a cohort of more than 150,000 cancer patients in Romania alone.

This initiative has further increased our capacity for developing disease-specific, anonymized, research-ready databases that serve as the basis for oncological research and clinical advancements.

The step toward automation with Large Language Models (LLMs)

The next evolutionary step for Danny Platform has been automating its well-established data structuring process. This phase harnessed the power of AI/ML natural language processing (NLP) models, particularly the latest generation of LLMs, to facilitate information extraction, building upon the rich structured real-world data (RWD) that had been manually and algorithmically curated.

Danny DataStruct, a component of Danny Platform, combines two approaches to real-world data structuring: proprietary AI/ML NLP algorithms, which extract specific items of information with precision, and LLMs, which comprehend the broader context.

These LLMs were trained specifically to analyze and extract information from unstructured clinical documents with an accuracy level comparable to that of expert clinical researchers. The deployment of LLMs marked a transformative phase, allowing for scalable, precise data analysis across various oncology use cases.

Implementation and model architecture

Danny Platform employed a diverse selection of LLM architectures, tailored to extract clinically relevant variables across multiple data categories, including:

  • Binary variables such as identifying positive or negative biomarker test results.
  • Numerical variables, structuring quantitative data such as biomarker expression percentages.
  • Categorical unordered variables, capturing classifications like smoking status (e.g., never smoker, history of smoking, current smoker).
  • Categorical ordered variables, extracting data on cancer staging (e.g., TNM classification).
  • Date variables, extracting and validating important timelines (e.g., diagnosis dates, treatment initiation dates).

A unified LLM was employed for these extractions, with additional specialized models developed to standardize and normalize the data after extraction.
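To make the five variable categories concrete, the following is a minimal sketch of how raw extracted strings could be coerced into typed values. It is not Danny Platform's implementation: the names `VarKind`, `VariableSpec`, and `coerce`, the accepted "positive"/"negative" labels, and the ISO date format are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from datetime import date, datetime
from enum import Enum
from typing import Optional, Sequence, Union


class VarKind(Enum):
    """The five variable categories listed above (names are hypothetical)."""
    BINARY = "binary"
    NUMERICAL = "numerical"
    CATEGORICAL_UNORDERED = "categorical_unordered"
    CATEGORICAL_ORDERED = "categorical_ordered"
    DATE = "date"


@dataclass(frozen=True)
class VariableSpec:
    name: str
    kind: VarKind
    # Allowed labels for categorical kinds; the order is meaningful
    # only for CATEGORICAL_ORDERED (e.g. stage I < II < III < IV).
    categories: Sequence[str] = ()


Value = Union[bool, float, str, date]


def coerce(spec: VariableSpec, raw: str) -> Optional[Value]:
    """Convert a raw extracted string to a typed value, or None if invalid."""
    text = raw.strip().lower()
    if spec.kind is VarKind.BINARY:
        # Assumed label vocabulary; a real system would need a richer mapping.
        return {"positive": True, "negative": False}.get(text)
    if spec.kind is VarKind.NUMERICAL:
        try:
            number = float(text.rstrip("%").strip())
        except ValueError:
            return None
        return number if math.isfinite(number) else None
    if spec.kind in (VarKind.CATEGORICAL_UNORDERED, VarKind.CATEGORICAL_ORDERED):
        # Case-insensitive match that returns the canonical label.
        by_lower = {label.lower(): label for label in spec.categories}
        return by_lower.get(text)
    if spec.kind is VarKind.DATE:
        try:
            # Assumes dates were already standardized to YYYY-MM-DD.
            return datetime.strptime(text, "%Y-%m-%d").date()
        except ValueError:
            return None
    return None


stage = VariableSpec("overall_stage", VarKind.CATEGORICAL_ORDERED, ("I", "II", "III", "IV"))
print(coerce(stage, "iii"))
```

Returning `None` for anything that fails validation keeps unparseable values out of the structured dataset and leaves them available for manual review, which fits the combination of automated extraction and manual quality assurance described earlier.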

High-performance accuracy

The LLMs achieved an average accuracy of 92%, demonstrating a robust ability to process complex clinical data. This capability extended across a variety of use cases, illustrating the models' adaptability and precision. Key examples include:

  • Cancer diagnosis and corresponding dates. LLMs were trained to identify diagnoses – both initial and advanced – alongside the relevant diagnostic dates, facilitating comprehensive patient profiling.
  • Disease stage and histology extraction. Models extracted and detailed cancer staging, covering TNM classification and overall stage (I-IV) with sub-stage specificity (A-C). In disease-specific scenarios, such as non-small cell lung cancer (NSCLC), LLMs effectively distinguished between histological subtypes like non-squamous and squamous cell carcinoma.
  • Biomarker testing and results. LLMs were used to identify documents containing biomarker test results (e.g., ALK, EGFR, KRAS, PD-L1, ER, PR, HER2) and to extract results, such as positive/negative status or expression percentages. A separate model normalized these values, allowing researchers to determine biomarker status at multiple clinical milestones (e.g., advanced diagnosis date, treatment start date).
  • Treatment identification and timing. The models also extracted detailed treatment information, identifying specific chemotherapy and targeted therapy drugs and their corresponding start and end dates. These treatment lists were curated by oncology professionals and customized to reflect the disease focus of each dataset.
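As an illustration of the biomarker use case above, the sketch below normalizes raw result strings and then looks up biomarker status at a clinical milestone. It is a simplified, rule-based stand-in for the normalization model, not the model itself. The synonym sets, the names `BiomarkerResult`, `normalize`, and `status_at`, and the rule "take the most recent result on or before the milestone date" are assumptions for illustration only.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class BiomarkerResult:
    biomarker: str
    test_date: date
    status: Optional[str]     # "positive", "negative", or None if unknown
    percent: Optional[float]  # expression percentage, if one was reported


# Illustrative synonym sets, not a clinically validated vocabulary.
_POSITIVE = {"positive", "pos", "detected", "mutated"}
_NEGATIVE = {"negative", "neg", "not detected", "wild type"}
_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def normalize(biomarker: str, test_date: date, raw: str) -> BiomarkerResult:
    """Map a raw extracted result string to a normalized record."""
    text = raw.strip().lower()
    match = _PERCENT.search(text)
    percent = float(match.group(1)) if match else None
    if text in _POSITIVE:
        status = "positive"
    elif text in _NEGATIVE:
        status = "negative"
    else:
        status = None
    return BiomarkerResult(biomarker.upper(), test_date, status, percent)


def status_at(
    results: Iterable[BiomarkerResult], biomarker: str, milestone: date
) -> Optional[BiomarkerResult]:
    """Return the most recent result dated on or before the milestone."""
    eligible = [
        r for r in results
        if r.biomarker == biomarker.upper() and r.test_date <= milestone
    ]
    return max(eligible, key=lambda r: r.test_date, default=None)


results = [
    normalize("EGFR", date(2023, 1, 10), "Negative"),
    normalize("EGFR", date(2023, 6, 1), "positive"),
    normalize("PD-L1", date(2023, 1, 12), "TPS 60 %"),
]
treatment_start = date(2023, 3, 1)
print(status_at(results, "EGFR", treatment_start))
```

Keeping the test date on every normalized record is what makes the milestone lookup possible: the same set of results can answer "status at advanced diagnosis" and "status at treatment start" without re-running extraction.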

Scalability and multilingual support

Sqilline is focused on enhancing the scalability of its data analytics platform by developing proprietary large language models (LLMs) optimized for clinical real-world data (RWD) in German, Italian, Spanish, Romanian, and Serbian. These models are designed to address the linguistic complexities inherent in multilingual oncology data, enabling accurate and contextually appropriate processing across diverse languages.

The internally developed translation library standardizes multilingual input documents into English. This capability ensures consistency and interoperability in downstream analyses, addressing the critical need for harmonization in cross-regional healthcare research. The integration of native language processing and translation into Danny Platform’s pipeline is designed to support broader and more rapid scalability.
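The standardization step can be pictured as a dispatcher that routes each document to a translator for its source language. The sketch below is a guess at the general shape only: the internal translation library is not public, so `Document`, `Translator`, and `to_english` are hypothetical names, and the translator shown is a trivial stub rather than a real translation model.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Document:
    doc_id: str
    language: str  # ISO 639-1 code, e.g. "de", "it", "es", "ro", "sr"
    text: str


# A translator is any callable that maps source-language text to English.
Translator = Callable[[str], str]


def to_english(doc: Document, translators: Dict[str, Translator]) -> Document:
    """Return an English version of the document for downstream analysis."""
    if doc.language == "en":
        return doc  # already in the target language
    if doc.language not in translators:
        raise ValueError(f"no translator registered for language {doc.language!r}")
    translated = translators[doc.language](doc.text)
    return Document(doc.doc_id, "en", translated)


# Stub translator for demonstration only; it swaps a single Romanian word.
stub_translators: Dict[str, Translator] = {
    "ro": lambda text: text.replace("pozitiv", "positive"),
}

source = Document("doc-001", "ro", "EGFR: pozitiv")
print(to_english(source, stub_translators).text)
```

Raising an error for an unregistered language, rather than passing the text through untranslated, prevents documents in unsupported languages from silently entering an English-only downstream pipeline.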

Future implications and research potential

The integration of LLMs for data structuring on Danny Platform represents a significant step forward in oncology data processing. By automating extraction tasks with high precision, researchers gain access to rich, structured data faster and more efficiently. This facilitates deeper analysis, accelerated research, and, ultimately, a better understanding of cancer treatment outcomes.

Danny Platform’s approach underscores a promising future for AI in healthcare, where innovative technology can give access to comprehensive, real-world clinical insights. In 2025 and beyond, continued advancements in AI-driven data structuring will further support oncological research, allowing healthcare professionals and researchers to make informed decisions that improve patient outcomes.
