## Saturday, May 23, 2020

### Fancy way to calculate polymer chain model characteristics via NumPy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:
• a powerful N-dimensional array object
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

The ndarray is also the basic data structure of JIT tools like Numba, which boosts python dramatically. Recently, I wrote a program about calculation of the gyration tensor, center of mass and end to end vectors of polymer chain models under periodic boundary condition, and I found that using NumPy and Numba would finish the task in very few lines of codes, and also applicable to various situations. Here is the code:
def pbc(r, d):
return r - np.round(r / d)

def rgTensor(x, mass, box):
bond_vecs = pbc(np.diff(x, axis=-2, prepend=x[..., :1, :]),
box).cumsum(axis=-2)
ree = bond_vecs[...,-1,:]
cm = np.sum(bond_vecs * np.expand_dims(mass, axis=-1), axis=-2) / mass.sum(
axis=-1, keepdims=True)
bond_vecs -= np.expand_dims(cm, -2)
cm = pbc(x[..., 0, :] + cm, box)
return np.einsum('...pi,...pj->...ij', bond_vecs,
bond_vecs) / bond_vecs.shape[-2], cm, ree
This simple code could handle lots of situations, for example, suppose that we have N trajectories, each trajectory contains M time frames, and C chains in each time frame with L be the chain length, in D dimensional space, then we have an array of shape (N, M, C, L, D). Due to PBC is applied, we also need information of box, if NVT simulation is performed, then box would be, generally, a 1-D NumPy array of shape (D,). For NPT simulations, the box should be given in (N, M, D) means that for N trajectories and M time frames for each. The broadcasting property of NumPy array tells us if we want function pbc work, we need box be, for example, either (D,) or (N, M, 1, 1, D), otherwise, we cannot broadcast through the chain length and chain count dimension. Therefore, we need to expand dimensions to box if we perform NPT simulations, by calling box = np.expand_dims(box, axis=(-2, -3)), this is because that if we have different box for each frame, then except for the last axis (D) of box, the other axes must exactly match number of trajectories, number of frames..., etc. The box information is only irrelevant to chain information, which contains 2 dimensions (C, L), therefore, expand box twice from the last axis transforms the shape of box from (..., D) to (..., 1, 1, D) to align with the sample, whose shape is (..., C, L, D). The algorithm of calculation of gyration tensor is straightforward, 1. we unwrap the chain from pbc box; 2. set center of mass of the chain to 0; 3. the gyration tensor is a (D, D) matrix which is the product of Cartesian coordinate matrix, gyration tensor $T=R^TR/L$ for $R$ is $(L,D)$ the Cartesian coordinate matrix that represents $D$ dimensional space and $L$ coordinates. In the 1st step, we simply assume that no bond vector contains components larger than corresponding box components. Therefore, using function pbc(bond_vector, box) will unwrap bond vectors. After we unwrap all bond vectors, a cumulation sum is performed on bond vectors, we then have an unwrapped chain with its 1st monomer at $\mathbf{0}$. Now the center of mass of unwrapped chain is simply the mean of coordinates. After removing the center of mass, we calculate the gyration tensor by using the matrix production mentioned above. In the above code, let x be the coordinates, which has shape of (..., C, L, D), the 2nd (axis=-2) axis from last is always the chain length axis. The bond vectors are the difference of the positions of chain monomers, generally, the difference of an n-array is (n-1)-array due to the boundary condition, here we use prepend option to set the left boundary of the array with shape (..., C, 1, D) which equals to the position of 1st monomer of each chain, :1 used here to keep the dimension, if we set prepned=x[..., 0, :], an error will be raised, for x[..., 0, :] has shape of (..., C, D), which is 1-dimensional lesser than x. After the difference operation, the bond vectors are obtained, for each chain, the "1st" bond is always $\mathbf{0}$, we now could call pbc function, as mentioned above, pbc function takes bond vectors (..., C, L, D) and boxes (..., 1, 1, D) or just (D,), after pbc function being called, a cumulation sum is called along axis -2, which is the chain length axis, bond vectors on each chain were cumulatively summed, with 1st coordinate be $\mathbf{0}$ and last, i.e. bond_vecs[..., -1, :] be the end-to-end vector. The center of mass is just the weighted mean of mass of each monomer; mass, which is generally a 1-D array with shape of chain length (L,), we need to extend its shape to (L, 1) to broadcast to the -2 nd L-axis. After removing the center of mass, we calculate the matrix product of (..., D, L),(..., L, D)->(...,D, D), this part is a little bit tricky, we could use np.matmul which is essentially np.einsum: the matmul function will automatically perform on last 2 axis, for example, for an array A consists of N matrices of (a, b), and B consists of N matrices of (b, c), np.matmul(A, B) will be a new (N, a, c) array with each (a, c) matrix be the corresponding (a, b) and (b, c) matrices, i.e., $A_i\cdot B_i=C_i$ for $i=0,...,N-1$. Therefore, using np.matmul, it will be np.matmul(np.swapaxes(bond_vecs, -2, -1), bond_vecs), which is $T=R^TR$. Using np.einsum will bypass the transpose, we just summing up along the L-axis ---- in the code, it's the duplicated indices p. This program will cover almost all cases as long as the chain number, chain length and spatial dimension are the last 3 axis of data. This code could be further optimized by Numba, Numba allows one to define generalized ufuncs by @guvectorize decorator of Numba, e.g.,
from numba import float64
from numba import guvectorize
@guvectorize([(float64[:, :], float64[:, :], float64[:, :])],
'(n,p),(p,m)->(n,m)', target='parallel')  # target='cpu','gpu'
def batch_dot(a, b, ret):
for i in range(ret.shape[0]):
for j in range(ret.shape[1]):
tmp = 0.
for k in range(a.shape[1]):
tmp += a[i, k] * b[k, j]
ret[i, j] = tmp
and simply replace the np.einsum... with batch_dot(np.swapaxes(bond_vecs, -2, -1), bond_vecs). There would be ~20x faster using batch_dot than np.einsum.

## Saturday, January 18, 2020

### Not SQ again...

Structure factor (SQ) is a characterization quantity which is frequently calculated in molecular simulations. It is easily to code the calculation program according to the definition. In most molecular simulations, periodic boundary condition is adopted therefore
$$S(\mathbf{q}):=\mathcal{FT}\lbrace\langle\rho(\mathbf{r})\rho(\mathbf{0})\rangle\rbrace(\mathbf{q})$$
is actually a circular convolution, where FFT can give a dramatic boost in contrast to calculate directly. The steps using FFT are:
1. Calculate $\rho(\mathbf{r})$, which is a summation of Dirac-Delta function that can be estimated as a hisotgram;
2. Calculate $\hat{\rho}(\mathbf{q})$ by FFT;
3. $S(\mathbf{q})=|\hat{\rho}(\mathbf{q})|^2$;
4. Calculate mean over modulus: $S(q)=\int S(\mathbf{q})\delta(|\mathbf{q}|-q)\mathrm{d}\mathbf{q}/\int \delta(|\mathbf{q}|-q)\mathrm{d}\mathbf{q}$
Efficiency
If the simulation box is divided into $N$ (in 3D systems, $N=N_xN_yN_z$ for example) bins, the FFT gives $O(N\log(N))$ complexity and step 4 is $O(N)$. Generally, in comparison with the direct method, one needs at least loop over number of particles and $N$ bins for $\mathbf{q}$, the number of particles is obviously, way larger than $\log(N)$. For data of most simulations are real numbers, rFFT designed for real inputs could further boost the program. Here is code for a 3D example, i, j, k represent $N_x$, $N_y$, $N_z$ respectively.
fftn(a) == np.concatenate([rfftn(a), conj(rfftn(a))[-np.arange(i),-np.arange(j),np.arange(k-k//2-1,0,-1)]], axis=-1)

Accuracy
1. Binsize effect.
2. The binsize in histogram should be smaller than half of the minimun of the pair distances in the system according to Nyquist-Shannon sampling theorem. For example, in Lennard-Jones systems, $0.4\sigma$ is a good choice. However, if values of only some small $q$ are concerned, i.e., in some self-assembly systems, there is no need to calculate in the "pair accuracy", the sampling distance smaller than half of the interested domain size is fine.
3. Randomness in the bin
4. Particles in bins are randomly distributed, especially for large binsize mentioned above. The summation $\sum_i \exp(-\mathbb{I}\mathbf{q}\cdot\mathbf{r}_i)$ can be decomposed into $$\sum_\mathrm{bin} \exp(-\mathbb{I}\mathbf{q}\cdot\mathbf{r}_\mathrm{bin})\left(\sum_i \exp( -\mathbb{I}\mathbf{q}\cdot\delta\mathbf{r}_i)\right)$$ The idea is straightforward: for particles inside a bin, i.e., for particle $i$,  position $\mathbf{r}_i$ can be decomposed as $\mathbf{r}_i=\mathbf{r}_\mathrm{bin}+\delta\mathbf{r}_i$, and for particles in the bin, $\delta\mathbf{r}$ is assumed randomly distributed in the bin. The latter summation in the decomposition is thus represented as $n_\mathrm{bin}\overline{\exp(-\mathbb{I}\mathbf{q}\cdot\delta\mathbf{r})}$, the average is approximated according to the distribution of $\delta\mathbf{r}$. If we consider $\delta r$ (1D case, for example) uniformly distributed in $(-L/2, L/2)$ with $L$ be the binsize, the average is $\mathrm{sinc}(qL/2)$. Multiply the sinc function during step 4, the $S(q)$ will be corrected.

## Saturday, January 11, 2020

### Free energy calculation: umbrella integration

Formulae of umbrella sampling method can be found on wikipedia. In umbrella sampling, a reaction coordinate $\xi$ is pre-defined from the atomic coordinates, a bias potential is added to the atoms of interest to keep $\xi$ of the system at a specific window $\xi_w$. The bias form is usually a harmonic potential:
$$u^b_w(\xi)=0.5k_w(\xi-\xi_w)^2$$
Therefore, the energy of biased system $A^b_w = A^{ub}_w + u^b_w(\xi)$. The superscript $ub$ is short for "un-biased". In the simulations, we can sample the reaction coordinate in each window and evaluate their distribution $P^b(\xi)$, since the free energy $A=-k_BT\ln(P)$, we have:
$$A_w^{ub}(\xi) = -k_BT\ln(P^b_w(\xi))-u^b_w(\xi)-F_w$$
with $F_w$ is a reference free energy of each window and remains an unknown constant. One method to derive $F_w$ is WHAM, in year 2005, Kästner et al. (The Journal of Chemical Physics 123, no. 14 (October 8, 2005): 144104.) have proposed a new method whose idea is to take derivative of $A^u_w$ with respect to $\xi$ to eliminate $F_w$, In this method, $P(\xi)$ is assumed to be analytic, e.g., a Gaussian distribution. The general idea is to expand $A(\xi)$ into Taylor series (at $\langle \xi \rangle_w$):
$$A(\xi)=a_1 \xi + a_2 \xi^2 + a_3 \xi^3 +\ldots$$
if $a_i=0$ with $i\ge 3$, the Gaussian form is restored. For systems with higher order terms, the distributions, for example, generally have a non-zero skewness. The crucial step is to determine $\lbrace a_i\rbrace$. In practice, the probability with form of $\exp(-\sum_i a_i \xi^i)$ is difficult to determine, in year 2012, Kästner et al. (The Journal of Chemical Physics 136, no. 23 (June 21, 2012): 234102) studied cases with order $4$ and small $a_3, a_4$. I have tried another more "numerical" method to calculate $\lbrace a_i\rbrace$ when datasets are poorer: fit $\exp(-\sum_i a_i \xi ^i)$ with a Gaussian KDE to find $\lbrace a_i \rbrace$. For some "poor" datasets, higher-order terms of expansion of free energy are required. The normalization factor of the distribution is estimated as $n=\int_{\xi_\mathrm{min}}^{\xi_\mathrm{max}} P(\xi)\mathrm{d}\xi$, the fitting range should be carefully chosen to ensure the convergence of $n$.

## Wednesday, December 11, 2019

### Di-block copolymer analysis: GPC

Gel Permeation Chromatography is a well known method of measuring molecular weight distribution of polymers. For di-block copolymers, a proper calibration curve can be constructed by using Mark-Houwink parameters of homopolymers (see here). Usually, di-block copolymers are obtained by synthesis one block first then "grow" the other block from the end of preceding block. The molecular weight distributions of final di-block copolymer and preceding block are accessible by GPC, the molecular weight distribution of the 2nd block, however, is hard to obtain.

Here I present a simple method to calculate the molecular weight distribution of 2nd block by GPC data of preceding block and final di-block copolymer. Assuming that we already have the molecular weight distribution of the preceding block, $P_1(n)$, it is generally that the length of the 2nd block "grow" from the depends on the preceding chain length $n$, therefore, the pdf of the 2 blocks should be a joint pdf $Q(n, m)=P_1(n)P_2(m, n)$, and the pdf of the final di-block copolymer is straightforward:
$$P(x)=\int Q(n, x-n) \mathrm{d}n$$
However, it is reasonable to assume that the $P_2(m, n)$ can be expressed as some $f(m)g(n)$ form by dropping some higher order correlations between $m,n$; hence, the $g(n)$ can be considered as a "fixing parameter" on the distribution of $P_1(n)$, the calculation of $P(x)$ is therefore a convolution. The simplest case is $g(n)=1$, which means the $P_1(n)$ and $P_2(m)$ are independent, the growth of 2nd block on the end of the 1st block is not effected by the length of the preceding block; or $g(n)\sim n^{-1}$ by considering the diffusion of the preceding block: small preceding chains tend to "grow" more 2nd block.

The evaluation of pdf of the 2nd block follows 3 steps:

1. Determine the range of molecular weight: $(x_{min}-n_{max}, x_{max}-n_{min})$, where $x$ is the chain length of the di-block copolymer and $n$ is the length of the preceding block;
2. Interpolate the distributions obtained from GPC into equally spaced molecular weight, negative part is simply 0;
3. Deconvolute $P(x)$ with $P_1(n)$.

## Monday, October 28, 2019

### A note on Cython Parallelism

Instead of using thread-local ndarrays, a very convenient method is to create an array shaped in (num_of_threads, <whatever the structure>) to make the changes thread-safe. For example:
cdef np.ndarray[np.double_t, ndim=2] ret
ret = np.zeros((num_threads, nbins), dtype=np.float)
and in the prange loop:
thread_num = openmp.omp_get_thread_num()
...
Afterwards, ret.sum(axis=0) is returned. The loop is set to use num_threads threads: with nogil, parallel(num_threads=num_threads):. By default, I set the prange with schedule='dynamic' option, on my laptop (ArchLinux), with openmp=201511 and gcc=9.2.0, I found interesting outputs, the summation of ret is exactly num_threads times the desired results, i.e. ret.sum(axis=0)[0] should be 200, but if 4 threads are used in parallel computation, then the output becomes 800. However, this does not happen on my server (CentOS 7.6) with gcc=4.2.0 and openmp=201107. All others are the same (Python, Cython, NumPy...) including .c files generated by Cython, for I use Anaconda. This wired problem is solved after setting schedule='static' on my laptop. I think this is some new feature of openmp, I shall read the docs to figure this out when I have some time.