Fascinating work!

#1
by tomaarsen - opened

Hello!

Sentence Transformers maintainer here - this is fascinating work! The natural compounds and their chemical notations go way beyond what I'm familiar with, but it looks like the Spearman Cosine similarity is very high, and the t-SNE embeddings look quite nice!

I see that you have some plans to extend this further in the future. I wanted to point you to a potential direction of advancements: the tokenizer.
Each tokenizer tokenizes text differently, and the one that you're using (from MiniLM-L6-H384-uncased) is not aware of the natural compound notations. As a result, it uses multiple tokens to denote something that might best be denoted with just one token, e.g. [C]. See an example here:

[Image: example tokenization of a compound notation]

From https://huggingface.co/spaces/Xenova/the-tokenizer-playground
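
For illustration, a rough sketch of that behaviour, assuming the nreimers/MiniLM-L6-H384-uncased checkpoint and an arbitrary bracketed compound string (both are assumptions for the example, not the exact setup discussed here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")
# A single bracketed symbol such as [C] is split into several subword pieces
print(tokenizer.tokenize("[C][C][=Branch1][C][=O][O][C]"))
# e.g. '[', 'c', ']', '[', 'c', ']', ... rather than one token per symbol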

In short: it might make sense to 1) take an existing tokenizer trained on chemical compound notations or 2) train one yourself.
Do note that you'd likely not be able to use a pretrained model with your custom tokenizer, so you would have to perform the training from random weights. With a much smaller tokenizer, you'll also get higher throughput/faster training, I suspect.
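
For option 2, one possible minimal sketch, assuming a word-level scheme where each bracketed SELFIES symbol becomes exactly one token and using a toy placeholder corpus (not any tokenizer actually trained in this thread):

from tokenizers import Tokenizer, Regex, models, pre_tokenizers, trainers

# Word-level model: every distinct SELFIES symbol maps to one vocabulary entry
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
# Split the input into its bracketed symbols, keeping each symbol intact
tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex(r"\[[^\]]*\]"), behavior="isolated")

trainer = trainers.WordLevelTrainer(
    vocab_size=1000, special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"]
)
toy_corpus = ["[C][C][O]", "[C][=C][C][O]"]  # placeholder data, not the real training set
tokenizer.train_from_iterator(toy_corpus, trainer=trainer)

print(tokenizer.encode("[C][C][O]").tokens)  # -> ['[C]', '[C]', '[O]']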

Anyways, you're free to go this route or continue finetuning "ready to go" embedding models like MiniLM-L6-H384-uncased: clearly it's also working well.

  • Tom Aarsen

Hello!

Thank you so much for your feedback; I really appreciate your recommendations. Currently, I am trying to either adapt zpn's SELFIES tokenizer or train a custom tokenizer for this, since in chemistry molecules are usually represented with SMILES, which is known to be a bit messy to train a model on, and SELFIES seems better due to its consistency. I plan to start testing them shortly, and will then proceed with training a base model from randomized weights along with a reduced vocabulary size.
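
As a small aside, a sketch of the SMILES-to-SELFIES conversion mentioned above, using the selfies package; the molecule is an arbitrary example:

import selfies as sf

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, as an arbitrary example
selfies_str = sf.encoder(smiles)      # SMILES -> SELFIES
print(selfies_str)
print(list(sf.split_selfies(selfies_str)))  # one symbol per bracket, e.g. '[C]', '[=Branch1]', ...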

Thanks again for taking the time to engage with my work and for pointing me in this direction. I am relatively new to ML/AI, so I am excited to see the results!

  • G Bayu

Excellent! I think you're well on the right track then!
Your work reminds me somewhat of the Protein Similarity and Matryoshka Embeddings blogpost by @monsoon-nlp from a few months ago, except that it deals with proteins instead. He also used Matryoshka Embeddings (blogpost, documentation), in case that strikes your fancy. In short: Matryoshka Embeddings can be truncated on the fly with minor loss in performance, allowing for faster retrieval/clustering. This can be quite nice when your use case deals with a lot of data.
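
For reference, a minimal sketch of wrapping a base loss with MatryoshkaLoss in Sentence Transformers; the checkpoint, base loss, and truncation dimensions below are illustrative assumptions, not a prescribed recipe:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, MatryoshkaLoss

model = SentenceTransformer("nreimers/MiniLM-L6-H384-uncased")
base_loss = CoSENTLoss(model)  # a pair + similarity-score style objective
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[384, 256, 128, 64])
# After training with this loss, embeddings can be truncated at query time, e.g.:
# embeddings = model.encode(["[C][C][O]"])[:, :128]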

  • Tom Aarsen

I didn't know about Matryoshka, but after reading both blogs a bit, I agree it would be nice for dealing with large chemical databases. I will read those blogs again and try experimenting with them once training with the base model and custom tokenizer looks good enough. Again, thank you!

  • G Bayu
gbyuvd changed discussion status to closed

Hi, how do you like the idea of becoming a co-author of an article?

Owner

@Grigory1968 Sorry for the late reply, feel free to DM/Email me for further discussion

could you tell us more about how you did the labeling of the models?

Please correct me if I am wrong: you are asking about how I prepared the training data and similarity labels for the chemical compounds, right?
The core logic of pairing and labelling is pretty much as described here. In short: after making the chemical pairs, I used RDKit to generate MACCS fingerprints from the SMILES of each compound, and then calculated the cosine similarity between those fingerprints. That metric is used as the ground truth/label to guide training and evaluation. As for why MACCS: it is simpler and captures general substructure motifs compared to more complex fingerprints like ECFP4/6; besides, since this model is a prototype, MACCS fingerprints are well-suited due to their faster generation time during data preparation.

Digging into the old code I was using:

import cupy as cp  # GPU array library used for the similarity computation
from rdkit import Chem
from rdkit.Chem import MACCSkeys

# Function to generate a MACCS keys fingerprint from a SMILES string
def generate_maccs_fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES
        return None
    maccs = MACCSkeys.GenMACCSKeys(mol)
    return cp.asarray(maccs, dtype=cp.uint8)  # Convert the RDKit bit vector to a CuPy array

# Function to calculate cosine similarity between two fingerprints using CuPy
def cosine_similarity_cupy(fp1, fp2):
    if fp1 is None or fp2 is None:
        return 0
    dot_product = cp.dot(fp1, fp2)
    norm_fp1 = cp.linalg.norm(fp1)
    norm_fp2 = cp.linalg.norm(fp2)
    if norm_fp1 == 0 or norm_fp2 == 0:  # guard against zero vectors
        return 0
    return dot_product / (norm_fp1 * norm_fp2)
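
For example, labelling a single pair with the two helpers above could look like this (the SMILES pair is purely illustrative):

fp_a = generate_maccs_fingerprint("CCO")       # ethanol
fp_b = generate_maccs_fingerprint("CC(=O)O")   # acetic acid
label = float(cosine_similarity_cupy(fp_a, fp_b))  # similarity in [0, 1], used as the training label
print(label)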

Got it. Tell me please, what is the point of the modeling then, if you use the fingerprints from RDKit as the true value?

Pretrained/fine-tuned embedding models could capture deeper "semantic meaning" from the 1D chemical structures (relations between atoms, their sequence, positions, etc.) compared to a pre-programmed substructure encoding like MACCS fingerprints. The model maps these into a vector space that potentially captures more nuanced chemical relationships, as you can see from the clustering differences between using MACCS fingerprints vs. the model's embeddings.

The MACCS fingerprints there, I suppose, are just to guide the model in learning these semantic relations, not to replicate them 1-to-1, which would indeed be redundant. The goal is for the model to represent chemical similarity in a richer way that might capture relationships MACCS fingerprints miss, while staying fast.
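
A rough sketch of that contrast; the model path below is a placeholder for the fine-tuned checkpoint and the inputs are arbitrary example strings:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("path/to/finetuned-chem-embedder")  # placeholder, not the actual repo id
emb = model.encode(["[C][C][O]", "[C][C][=Branch1][C][=O][O]"])
print(util.cos_sim(emb[0], emb[1]))  # learned embedding similarity, to compare against the MACCS-based score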

Also, the base model was pretrained on general text, and I only fine-tuned it here as a proof of concept. Whether this actually improves performance for finding compounds similar to existing bioactives still needs more data and evaluation. So far I have only managed to build the pipeline for testing it (the FAISS demo on GitHub), so I can't say much about effectiveness yet. It has also been hands-on learning in ML/AI for me.

Oh, and SELFIES is a relatively new representation compared to SMILES.
