Saturday, May 23, 2020

Fancy way to calculate polymer chain model characteristics via NumPy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:
  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

The ndarray is also the basic data structure for JIT tools such as Numba, which can speed up Python dramatically. Recently, I wrote a program to calculate the gyration tensor, center of mass, and end-to-end vector of polymer chain models under periodic boundary conditions (PBC). I found that NumPy and Numba let me finish the task in very few lines of code that also apply to a wide range of situations. Here is the code:
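import numpy as np

def pbc(r, d):
    # Minimum-image convention: wrap each component of r into [-d/2, d/2].
    return r - d * np.round(r / d)

def rgTensor(x, mass, box):
    # x: (..., C, L, D) coordinates; mass: (L,); box: (D,) or (..., 1, 1, D).
    # Unwrap each chain: PBC-corrected bond vectors, summed from the 1st monomer.
    bond_vecs = pbc(np.diff(x, axis=-2, prepend=x[..., :1, :]),
                    box).cumsum(axis=-2)
    # Copy: bond_vecs is shifted in place below, and a view would change with it.
    ree = bond_vecs[..., -1, :].copy()
    # Mass-weighted center of mass, relative to the 1st monomer.
    cm = np.sum(bond_vecs * np.expand_dims(mass, axis=-1), axis=-2) / mass.sum(
        axis=-1, keepdims=True)
    bond_vecs -= np.expand_dims(cm, -2)
    # Wrap the center of mass back into the box; the temporary length-1 axis
    # lets cm broadcast against a (..., 1, 1, D) box as well as a (D,) box.
    cm = pbc(np.expand_dims(x[..., 0, :] + cm, -2), box)[..., 0, :]
    return np.einsum('...pi,...pj->...ij', bond_vecs,
                     bond_vecs) / bond_vecs.shape[-2], cm, ree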
This short function handles many situations. For example, suppose we have N trajectories, each containing M time frames, with C chains per frame, chain length L, and D spatial dimensions. The coordinates then form an array of shape (N, M, C, L, D). Because PBC is applied, we also need the box. For an NVT simulation, box is generally a 1-D NumPy array of shape (D,). For NPT simulations, the box changes from frame to frame and is given with shape (N, M, D), i.e. one box per trajectory and time frame.

NumPy's broadcasting rules tell us that for pbc to work, box must have shape either (D,) or (N, M, 1, 1, D); otherwise it cannot broadcast over the chain-count and chain-length axes. For NPT simulations we therefore expand the box with box = np.expand_dims(box, axis=(-2, -3)). The reasoning is as follows: if each frame has its own box, then every axis of box except the last one (D) must exactly match the leading axes of the data (number of trajectories, number of frames, and so on). The box is independent only of the two chain axes (C, L), so inserting two axes before the last one changes the shape of box from (..., D) to (..., 1, 1, D), which aligns with the sample shape (..., C, L, D).

The algorithm for the gyration tensor is straightforward:
  1. unwrap the chain from the PBC box;
  2. shift the center of mass of the chain to 0;
  3. compute the gyration tensor, a (D, D) matrix, as $T=R^TR/L$, where $R$ is the $(L,D)$ matrix of Cartesian coordinates of the $L$ monomers in $D$-dimensional space.
In the 1st step, we assume that no component of a bond vector exceeds half the corresponding box length, so that the minimum-image convention recovers the true bond. Under this assumption, calling pbc(bond_vector, box) unwraps the bond vectors.
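To make the box shapes concrete, here is a minimal sketch of the broadcasting described above. The sizes, the random data, and the variable names (box_nvt, box_npt, and so on) are made up for illustration, and pbc is written in the usual minimum-image form r - d*round(r/d), which is an assumption of this sketch.

```python
import numpy as np

def pbc(r, d):
    # Minimum-image wrap (assumed form): maps each component into [-d/2, d/2].
    return r - d * np.round(r / d)

# Hypothetical sizes: N trajectories, M frames, C chains, L monomers, D dimensions.
N, M, C, L, D = 2, 3, 4, 5, 3
rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=(N, M, C, L, D))

# NVT: a single box of shape (D,) broadcasts against x directly.
box_nvt = np.full(D, 10.0)
wrapped_nvt = pbc(x, box_nvt)

# NPT: one box per frame, shape (N, M, D). Used as is, it does not broadcast.
box_npt = rng.uniform(9.0, 11.0, size=(N, M, D))
try:
    pbc(x, box_npt)
    failed = False
except ValueError:
    failed = True

# Insert two length-1 axes for the chain-count and chain-length axes.
box_npt_expanded = np.expand_dims(box_npt, axis=(-2, -3))
wrapped_npt = pbc(x, box_npt_expanded)

print(failed)                  # True
print(box_npt_expanded.shape)  # (2, 3, 1, 1, 3)
print(wrapped_npt.shape)       # (2, 3, 4, 5, 3)
```

Passing a tuple to the axis argument of np.expand_dims requires NumPy 1.18 or newer.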
After unwrapping all bond vectors, we take their cumulative sum, which gives an unwrapped chain with its 1st monomer at $\mathbf{0}$. The center of mass of the unwrapped chain is then simply the mass-weighted mean of the coordinates. After subtracting the center of mass, we calculate the gyration tensor with the matrix product mentioned above.

In the code above, x holds the coordinates with shape (..., C, L, D), so the 2nd axis from the last (axis=-2) is always the chain-length axis. The bond vectors are the differences between the positions of consecutive monomers. In general, the difference of a length-n array has length n-1, because the first element has no predecessor. Here we use the prepend option to pad the left boundary with an array of shape (..., C, 1, D) equal to the position of the 1st monomer of each chain. The slice :1 keeps the dimension: if we set prepend=x[..., 0, :], an error is raised, because x[..., 0, :] has shape (..., C, D), one dimension fewer than x. After the difference operation, the "1st" bond of each chain is always $\mathbf{0}$.

We can now call pbc. As mentioned above, it takes bond vectors of shape (..., C, L, D) and boxes of shape (..., 1, 1, D) or just (D,). After pbc, a cumulative sum is taken along axis -2, the chain-length axis, so that the 1st coordinate of each chain is $\mathbf{0}$ and the last, bond_vecs[..., -1, :], is the end-to-end vector. The center of mass is the mass-weighted mean of the monomer positions. Since mass is generally a 1-D array of shape (L,), we extend it to (L, 1) so that it broadcasts along the L axis (axis -2).
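The unwrapping step can be followed by hand on a tiny example. The sketch below uses a hypothetical 1-D chain of five monomers with unit bonds that crosses the boundary of a box of length 10; the numbers are chosen only for illustration, and pbc is again assumed to be the minimum-image form r - d*round(r/d).

```python
import numpy as np

def pbc(r, d):
    # Minimum-image wrap (assumed form): maps each component into [-d/2, d/2].
    return r - d * np.round(r / d)

box = np.array([10.0])
mass = np.ones(5)

# True positions 3.5, 4.5, 5.5, 6.5, 7.5 stored wrapped into [-5, 5).
x = np.array([[3.5], [4.5], [-4.5], [-3.5], [-2.5]])  # shape (L, D) = (5, 1)

# Raw differences; the prepended 1st monomer makes the "1st" bond zero.
raw_bonds = np.diff(x, axis=-2, prepend=x[..., :1, :])
print(raw_bonds.ravel())   # [ 0.  1. -9.  1.  1.]

# PBC fixes the bond that crosses the boundary; cumsum rebuilds the chain.
unwrapped = pbc(raw_bonds, box).cumsum(axis=-2)
print(unwrapped.ravel())   # [0. 1. 2. 3. 4.]

# End-to-end vector: last coordinate of the unwrapped chain.
ree = unwrapped[..., -1, :]
print(ree)                 # [4.]

# Mass-weighted center of mass relative to the 1st monomer, then wrapped back.
cm_rel = np.sum(unwrapped * np.expand_dims(mass, axis=-1), axis=-2) / mass.sum()
cm = pbc(x[..., 0, :] + cm_rel, box)
print(cm)                  # [-4.5]
```

The true center of mass is at 5.5, which lies outside the box, so it is reported as -4.5 after wrapping.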
After removing the center of mass, we calculate the matrix product (..., D, L),(..., L, D)->(..., D, D). This part is a little tricky. We could use np.matmul, which here computes the same contraction as np.einsum: matmul automatically operates on the last 2 axes. For example, if A consists of N matrices of shape (a, b) and B consists of N matrices of shape (b, c), np.matmul(A, B) is a new (N, a, c) array in which each (a, c) matrix is the product of the corresponding (a, b) and (b, c) matrices, i.e., $A_i\cdot B_i=C_i$ for $i=0,...,N-1$. With np.matmul, the expression is np.matmul(np.swapaxes(bond_vecs, -2, -1), bond_vecs), which is $R^TR$ (before dividing by $L$). Using np.einsum avoids the explicit transpose: we simply sum along the L axis, which in the code is the repeated index p. This program covers almost all cases as long as the chain number, chain length, and spatial dimension are the last 3 axes of the data.

The code can be further optimized with Numba, which allows one to define generalized ufuncs with the @guvectorize decorator, e.g.,
from numba import float64
from numba import guvectorize

# Generalized ufunc: the core operation is an (n,p) x (p,m) matrix product,
# and Numba broadcasts it over all leading axes.
@guvectorize([(float64[:, :], float64[:, :], float64[:, :])],
             '(n,p),(p,m)->(n,m)', target='parallel')  # other targets: 'cpu', 'cuda'
def batch_dot(a, b, ret):
    for i in range(ret.shape[0]):
        for j in range(ret.shape[1]):
            tmp = 0.
            for k in range(a.shape[1]):
                tmp += a[i, k] * b[k, j]
            ret[i, j] = tmp
and simply replace the np.einsum(...) call with batch_dot(np.swapaxes(bond_vecs, -2, -1), bond_vecs), keeping the division by bond_vecs.shape[-2]. With batch_dot, this step can be ~20x faster than with np.einsum.
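The equivalence of the np.einsum and np.matmul forms of $R^TR$ discussed above can be checked on random data. The following is only a sketch: the batch of 6 chains with 8 monomers in 3-D is made up, and the coordinates are centered with a plain (unweighted) mean for simplicity. It also checks that the trace of $T$ equals the mean squared distance of the monomers from the center, i.e. the squared radius of gyration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical batch: 6 chains, 8 monomers each, 3-D, centered per chain.
r = rng.normal(size=(6, 8, 3))
r -= r.mean(axis=-2, keepdims=True)
n_monomers = r.shape[-2]

# einsum: sum over the repeated monomer index p, no explicit transpose needed.
t_einsum = np.einsum('...pi,...pj->...ij', r, r) / n_monomers

# matmul: transpose the last two axes, then multiply batch-wise.
t_matmul = np.matmul(np.swapaxes(r, -2, -1), r) / n_monomers

print(t_einsum.shape)                   # (6, 3, 3)
print(np.allclose(t_einsum, t_matmul))  # True

# trace(T) is the squared radius of gyration of each chain.
rg2_from_tensor = np.trace(t_einsum, axis1=-2, axis2=-1)
rg2_direct = (r ** 2).sum(axis=-1).mean(axis=-1)
print(np.allclose(rg2_from_tensor, rg2_direct))  # True
```

A Numba batch_dot would be expected to return the same array as t_matmul before the division; that is not checked here, to keep the sketch free of the Numba dependency.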