Thinking like a CHEMIST: Combined Heterogeneous Embedding Model Integrating Structure and Tokens

Representing molecular structures effectively in chemistry remains a challenging task. Language models and graph-based models are extensively utilized within this domain, consistently achieving state-of-the-art results across an array of tasks. However, the prevailing practice of representing chemical compounds in the SMILES format – used by most data sets and many language models – presents notable limitations as a training data format. In this study, we present a novel approach that decompose