The mighty RDKit

Mapping coarse-grained models back to all-atom models is a tricky problem. E.g., for polymeric systems, CG models are often used to polymerize monomers to cross-linked polymer networks. Mapping back these systems to all-atom models need to handle reactions amongst monomers. Dealing back-mapping case by case is dumb: for one specific kind of reactions, there are a family of polymers, i.e., the reaction of Polyethylene could also be used for Polystyrene.

Brief introduction to SMARTS

SMARTS is a language for describing molecular patterns, one may find details from this link. SMARTS is not only aimed to search substructures from a molecule, it could also define abstractly a reaction in the way of functional groups, for example:

'[C:1](=[O:2])O.[N:3]>>[C:1](=[O:2])[N:3]'

defines an amidation reaction, any RCOOH and RNH2 are connected as RCO-NHR. Therefore, we may use SMARTS to match the functional groups in reactants and build productions from the pre-defined rules.

RDKit: Open-Source Cheminformatics Software

Luckily, we needn’t to handle SMARTS by hand, the RDKit offers convenient tools dealing with reactions. There is an example from RDKit:

from rdkit import Chem
from rdkit.Chem import rdChemReactions
rxn = rdChemReactions.ReactionFromSmarts('[C:1](=[O:2])O.[N:3]>>[C:1](=[O:2])[N:3]')
reacts = (Chem.MolFromSmiles('C(=O)O'),Chem.MolFromSmiles('CNC'))
products = rxn.RunReactants(reacts)
len(products)
1
len(products[0])
1
Chem.MolToSmiles(products[0][0])
'CN(C)C=O'

we can see that RDKit compiles the SMARTS reaction rule and applied it on reactants by rxn.RunReactants method. Notably, the number of outcomes is not necessarily 1, for C-C + C-C -> C-C-C-C reaction, there are 4 possible outcomes, which are permutations of C11-C12-C21-C22, C11-C12-C22-C21, …, we will use this property in the following work.

Situation: Making huge molecules

For huge molecules (polymers for example), we of course may initiate reactions naturally, e.g., for chain growth, we can code as p=rxn.RunReactants([p, monomer]). However, for large molecules (with 10000+ atoms), this is very inefficient. Especially for cross-linked polymer network, since cross link may happen on any available sites, the number of products given by RDKit may be explosively large. Moreover, in case of back-mapping, the topology of CG model is fixed, i.e., reactions on CG level are already done. It is very dumb that we filtrate out from all possible products instead of directly “react” on the given sites, so we need to react “locally” according to the CG topology. Now the problem is that we cannot actually let the 2 active monomers “react”, for the polymers grow stepwisely: for instance, a polymer ABC may grow from following sequences, A+B->AB, AB+C->ABC, or B+C->BC, BC+A->ABC. The “element” reactions (or connection rules of monomers to be more precise) are A+B and B+C, if A and B are choosen first, after reaction, reactants become products, AB. In the next reaction, B and C, we’ll have to handle AB+C rather than B+C. To avoid troubles like this, we have to keep the monomer un-reacted and only trace the changes, then apply all changes at one-time after all reactions. We therefore need to make a map which maps atoms in products to reactants.

Atom and bond maps

Atom map

In any reaction, atoms are classified into 2 categories, reacting atoms and un-reacting atoms. Here I give an example to map atoms between products and reactants after rxn.RunReactants. We first need to preprocess the reactants, in RDKit, we can use atom.SetIntProp to assign an integer property to atom, in this case, if we have multiple reactants, i.e., (r1, r2, r3,…)->(p1, p2, p3,…), we will add property “reactant_idx=1,2,3,..” to atoms in (r1, r2, r3,…) accordingly.

We also need a container to record information of an atom: for an atom in one product (product_id and product_atom_id), we need to known where the atom come from (reactant_id), which atom of the corresponding reactant (reactant_atom_id), for example, for reaction ‘A-B+C-D->A-B-D+C’, atom D has product_id=0, reactant_id=1, reactant_atom_id=1, product_atom_id=2. The codes are shown blow, property “reactant_idx” set for every atom in RDKit.Chem.Mol helps to map un-reacted atoms.

Bond map

The bond map is also straightforward, we need to know a bond (i, j, bond_type) from production and where its atoms came from. In a reaction, a bond in reactant may change its bond type, e.g., double bond to single bond, or break. Also bonds in production may generate, as in former example, bond B-D.

atom_info = namedtuple('atom_info',
                       ('reactant_id', 'reactant_atom_id', 'product_id', 'product_atom_id', 'atomic_number'))
bond_info = namedtuple('bond_info',
                       ('product_id', 'product_atoms_id', 'reactants_id', 'reactant_atoms_id', 'bond_type', 'status'))
def process_reactants(reactants):
    r"""Give reactant index for reactants, please do not use SAME molecules, e.g., [m, m]
    :param reactants: list of reactants
    :return: list of reactants
    """
    for r_idx, m in enumerate(reactants):
        for atom in m.GetAtoms():
            atom.SetIntProp('reactant_idx', r_idx)
    return reactants


def map_reacting_atoms(reaction):
    r"""
    :param reaction: Reaction object
    :return: dict, reacting atoms: reactant index
    """
    reacting_map = {}
    for r_idx in range(reaction.GetNumReactantTemplates()):
        # print(r_idx, reaction.GetNumReactantTemplates())
        rt = reaction.GetReactantTemplate(r_idx)
        for atom in rt.GetAtoms():
            if atom.GetAtomMapNum():
                reacting_map[atom.GetAtomMapNum()] = r_idx
    return reacting_map


def map_atoms(products, reacting_map):
    r"""
    :param products:  list of products
    :param reacting_map: reacting atom map
    :return: atom map and reacting atoms
    """
    amap = []
    ra_dict = {}
    for ip, p in enumerate(products):
        p_idx = ip
        for a in p.GetAtoms():
            p_aidx = a.GetIdx()
            old_mapno = a.GetPropsAsDict().get('old_mapno')
            r_aidx = a.GetPropsAsDict().get("react_atom_idx")
            if old_mapno is not None:  # reacting atoms
                r_idx = reacting_map[old_mapno]
                if ra_dict.get(r_idx) is None:
                    ra_dict[r_idx] = []
                ra_dict[r_idx].append(r_aidx)
            else:
                r_idx = a.GetPropsAsDict().get('reactant_idx')
            amap.append(atom_info(r_idx, r_aidx, p_idx, p_aidx, a.GetAtomicNum()))  # all Atoms
    return amap, ra_dict


def atom_map(products, reaction):
    r"""Map atoms between reactants and products
    :param products: product list
    :param reaction: reaction object
    :return: atom map and reacting atoms
    """
    reacting_map = map_reacting_atoms(reaction)
    amap, reacting_atoms = map_atoms(products, reacting_map)
    return amap, reacting_atoms


def bond_map(reactants, products, reaction):
    r"""Map bonds between reactants and products
    :param reactants: list of reactants
    :param products:  list of products
    :param reaction:  reaction object
    :return:  list of bond map
    """
    amap, reacting_atoms = atom_map(products, reaction)
    res = []
    for ir, r in enumerate(reactants):
        for bond in r.GetBonds():  # exist in reactants, but not in production
            a, b, t = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType()
            for atom in amap:
                if atom.reactant_id == ir:
                    if atom.reactant_atom_id == a:
                        a_pid = atom.product_id
                        a_paid = atom.product_atom_id
                    if atom.reactant_atom_id == b:
                        b_pid = atom.product_id
                        b_paid = atom.product_atom_id
            if a_pid == b_pid:
                p_bond = products[a_pid].GetBondBetweenAtoms(a_paid, b_paid)
                if p_bond is None:
                    res.append(bond_info(a_pid, (a_paid, b_paid), (ir, ir), (a, b), t, 'deleted'))

    for ip, p in enumerate(products):
        for bond in p.GetBonds():  # exist in productions, compare with reactants
            a, b, t = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType()
            for atom in amap:
                if atom.product_id == ip:
                    if atom.product_atom_id == a:
                        a_rid = atom.reactant_id
                        a_raid = atom.reactant_atom_id
                    if atom.product_atom_id == b:
                        b_rid = atom.reactant_id
                        b_raid = atom.reactant_atom_id
            if a_rid != b_rid:
                res.append(bond_info(ip, (a, b), (a_rid, b_rid), (a_raid, b_raid), t, 'new'))
            if a_rid == b_rid:
                r_bond = reactants[a_rid].GetBondBetweenAtoms(a_raid, b_raid)
                if r_bond is None:
                    res.append(bond_info(ip, (a, b), (a_rid, b_rid), (a_raid, b_raid), t, 'new'))
                else:
                    if r_bond.GetBondType() != t:
                        res.append(bond_info(ip, (a, b), (a_rid, b_rid), (a_raid, b_raid), t, 'changed'))
    return res

Exmaple: Making polymers

2-monomer reactions

If the system only contains 2-monomer reactions, for example, for PS or PE, we only have A-A where the reactions always take form as P+M->PM. For some cross-linking system, a branch site B on the chain, ..A-B-A…, if B has a cross-linking functional group which reacts with C, i.e., …A-B(C)-A…, the reaction is …A-B-A… + C -> …A-B(C)-A…., the reaction are actually happened on B+C. For such cases, we only need to iterate over the edges (CG polymer is a GRAPH made from conjunctions (EDGES) of monomers (NODES)) and using RDKit to handle the reactions between the 2 active monomers.

Complex reactions involve 3 or more monomers and multi-step reactions

For example, melamine. Where 3 molecules react into an aromatic ring. And phenol (A) formaldehyde (B) resins, the formaldehyde (B) changes after first step reaction with phenol (A), the actual reaction is a 2-step reaction: A+B->AB, AB+A->ABA. For these 2 cases, we need more detailed information in CG models, the reactions and corresponding SMARTS should be defined as an overall reaction, e.g., A+B+A->ABA.

  1. Get monomers by monomer index;
  2. Apply reaction rules and get atom and bond maps;
  3. Record the changing of atoms and bonds;
  4. Record which atoms are involved in this reaction;

Step.4 is important, remember that we don’t allow actual reactions happen, for every reaction, we all take the original molecules to run the reactions (if there are 2-or-more-step reactions, simply add them up). For some symmetrical monomers, e.g., H2N-Ar-NH2, which has 2 same sites. If one site has already reacted, then this site is forbidden in the next reaction. There are at least 4 possible outcomes for 2 symmetrical monomers, we need to iterate over these 4 outcomes to pick out the reaction with un-reacted functional groups for each monomer.

Force field

Force field parameters of a given atom is defined from its chemical environment. After back-mapping of the topology, we can further use Foyer or openforcefield to generate force field parameters, both Foyer and openforcefield can easily get topology from RDKit. Or we can simply use rdkit.Chem.rdMolChemicalFeatures model to find parameters from pre-defined force field database, e.g., “PARAMETER” “CHEMICAL ENV SMARTS” or compare molecule similarities from database.

Back2Top ^