Principal Investigators
Inclusive Dates 
07/07/2025 to 09/23/2025

Background

The original objective of the project was to compare fine-tuned open-source LLMs and cloud LLMs, together with a validation framework, for producing molecular-encoding SMILES strings for improved lead optimization. The intent is a fine-tuned LLM that can correctly interpret and compare the SMILES strings of two molecules that may be matched pairs, differing by a functional-group change or a small number of atom substitutions. SMILES strings are the current standard for language-based encoding and generation of molecular structures. At the Phase I project start date (12/08/2024), the validity of SMILES strings produced by the LLM ChatGPT-4o was no greater than 50%. During Phase I, a Llama LLM was fine-tuned on a matched-pair dataset using the LoRA training method, and the initial results showed 89% correct choices when the model was asked to identify a drug-like compound versus a similar, non-approved decoy structure. During the first week of Phase B, we found that the Llama-LoRA model's choice of the preferred SMILES string depended heavily on prompt phrasing, specifically whether the correct drug or the decoy was presented first. After removing this unintended bias, the model produced invalid SMILES output 33% of the time and chose correctly among the remaining valid SMILES strings only 56% of the time.
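The prompt-order bias described above can be controlled for by scoring each matched pair under both presentation orders and counting a pair correct only when the model's choice is consistent. A minimal sketch of such an order-balanced evaluation follows; the `choose` function is a hypothetical stand-in for the actual model call, not part of our codebase.

```python
# Sketch of an order-balanced evaluation for matched-pair prompts.
# `choose(first, second)` is a hypothetical stand-in for the model call;
# it returns 0 if the model prefers `first`, 1 if it prefers `second`.

def order_balanced_accuracy(pairs, choose):
    """pairs: list of (drug_smiles, decoy_smiles) tuples."""
    correct = 0
    for drug, decoy in pairs:
        picked_drug_first = choose(drug, decoy) == 0   # drug presented first
        picked_drug_second = choose(decoy, drug) == 1  # drug presented second
        # Count the pair correct only if the choice is consistent across orders.
        if picked_drug_first and picked_drug_second:
            correct += 1
    return correct / len(pairs)
```

An order-biased model (one that always picks whichever string is presented first) scores 0 under this metric, which is what exposes the bias.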

Approach

For Phase 2, the technical plan has two parts: 1) refine a non-LLM baseline method as a comparator, specifically the graph convolutional network (GCN) testbed we began in Phase I, and 2) solve the tokenization problem for the LLM with a custom vocabulary focused specifically on processing SMILES, excluding ordinary natural language to suppress interference. We will drop the general Llama-LoRA fine-tuning approach and focus on a less general but more task-specific decoder-only language model to improve accuracy. The intent is to solve the SMILES string recognition issues arising from inappropriate tokenization. We do not drop the language-model approach altogether because we want to preserve the generative benefits of this architecture for future work. This scaled-down language model will use a custom decoder-only transformer architecture, which is relatively easy to implement and gives us full control over architecture and training decisions. The final task will be further training the transformer by curriculum learning, progressing from basic organic chemistry tasks to drug-like property comparisons.
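A common way to realize a SMILES-specific vocabulary is a regular-expression tokenizer that keeps bracketed atoms and two-letter elements as single tokens, rather than letting a natural-language tokenizer split them arbitrarily. The sketch below illustrates the idea; the exact pattern is illustrative, not our final vocabulary.

```python
import re

# Illustrative SMILES token pattern: bracket expressions and two-letter
# elements must come before single-character alternatives.
SMILES_TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|Se|se|@@|@|%\d{2}|[BCNOPSFIbcnops]|\d|[=#$/\\().+-]"
)

def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Lossless check: the tokens must reassemble the original string.
    assert "".join(tokens) == smiles, f"untokenizable SMILES: {smiles}"
    return tokens
```

For example, `tokenize("CCCl")` keeps chlorine as a single `Cl` token instead of splitting it into `C` and `l`, which is exactly the failure mode a general-purpose tokenizer introduces.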

Accomplishments

We determined that a central issue with our Phase I approach was inappropriate tokenization of SMILES strings by the initial fine-tuned Llama-LoRA language model. We completed a prototype custom vocabulary for SMILES interpretation for use in a smaller, more focused language model. For the non-language comparator model, we continued work on our GCN. Completed tasks for the GCN comparator model are:
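The prototype vocabulary can be built by tokenizing a SMILES corpus and assigning integer ids by token frequency, with a few reserved special tokens. The following is a minimal sketch under the assumption of a regex tokenizer; the pattern, special-token names, and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical SMILES token pattern (bracket atoms and two-letter elements first).
TOKEN_RE = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|Se|se|@@|@|%\d{2}|[BCNOPSFIbcnops]|\d|[=#$/\\().+-]"
)

def build_vocab(corpus, specials=("<pad>", "<bos>", "<eos>", "<unk>")):
    """Map each SMILES token to an integer id, most frequent tokens first."""
    counts = Counter(t for smi in corpus for t in TOKEN_RE.findall(smi))
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab

def encode(smiles, vocab):
    """Turn a SMILES string into a list of ids framed by <bos>/<eos>."""
    unk = vocab["<unk>"]
    body = [vocab.get(t, unk) for t in TOKEN_RE.findall(smiles)]
    return [vocab["<bos>"]] + body + [vocab["<eos>"]]
```

A vocabulary built this way stays small (tens of tokens rather than tens of thousands), which is part of what makes the scaled-down decoder-only model tractable.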

  • Completed the GCN edge architecture, with edge weights based on bond order.
  • Pretrained the GCN using a contrastive loss on the matched-pairs dataset.
  • Designed the GCN to process each compound independently and output a “drug-like” preference score.
  • Preliminary result: 90.5% correct identification of authentic drugs on the Phase A matched-pairs dataset (drugs vs. decoys).
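The GCN design above (bond-order edge weights, per-compound scoring) can be sketched in a few lines of numpy. This is a single untrained layer with hypothetical weights, shown only to illustrate the data flow; the actual model has learned parameters and more layers.

```python
import numpy as np

def gcn_score(adj, feats, w, v):
    """One-layer GCN preference score for a single compound.

    adj:   (n, n) adjacency weighted by bond order (1.0 single, 2.0 double, ...)
    feats: (n, f) initial atom features
    w:     (f, h) layer weights; v: (h,) readout weights (hypothetical, untrained)
    """
    a = adj + np.eye(adj.shape[0])              # add self-loops
    d = np.diag(1.0 / np.sqrt(a.sum(axis=1)))   # D^{-1/2} symmetric normalization
    h = np.tanh(d @ a @ d @ feats @ w)          # one message-passing layer
    return float(h.mean(axis=0) @ v)            # mean-pool atoms -> scalar score
```

Because each compound is scored independently, comparing a drug against its decoy is just two calls to `gcn_score` followed by a comparison, which avoids the presentation-order bias seen in the paired-prompt LLM setup.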