### A note on Cython Parallelism

Instead of using thread-local ndarrays, a very convenient method is to create an array shaped in (num_of_threads, <whatever the structure>) to make the changes thread-safe. For example:
cdef np.ndarray[np.double_t, ndim=2] ret
ret = np.zeros((num_threads, nbins), dtype=np.float)
and in the prange loop:
thread_num = openmp.omp_get_thread_num()
...
Afterwards, ret.sum(axis=0) is returned. The loop is set to use num_threads threads: with nogil, parallel(num_threads=num_threads):. By default, I set the prange with schedule='dynamic' option, on my laptop (ArchLinux), with openmp=201511 and gcc=9.2.0, I found interesting outputs, the summation of ret is exactly num_threads times the desired results, i.e. ret.sum(axis=0)[0] should be 200, but if 4 threads are used in parallel computation, then the output becomes 800. However, this does not happen on my server (CentOS 7.6) with gcc=4.2.0 and openmp=201107. All others are the same (Python, Cython, NumPy...) including .c files generated by Cython, for I use Anaconda. This wired problem is solved after setting schedule='static' on my laptop. I think this is some new feature of openmp, I shall read the docs to figure this out when I have some time.