A note on Cython Parallelism
Instead of using thread-local ndarrays, a very convenient method is to create an array shaped in (num_of_threads, <whatever the structure>) to make the changes thread-safe. For example: cdef np.ndarray[np.double_t, ndim=2] ret num_threads = multiprocessing.cpu_count() ret = np.zeros((num_threads, nbins), dtype=np.float) and in the prange loop: thread_num = openmp.omp_get_thread_num() ret[thread_num, l]+=1 ... Afterwards, ret.sum(axis=0) is returned. The loop is set to use num_threads threads: with nogil, parallel(num_threads=num_threads): . By default, I set the prange with schedule='dynamic' option, on my laptop (ArchLinux), with openmp=201511 and gcc=9.2.0 , I found interesting outputs, the summation of ret is exactly num_threads times the desired results, i.e. ret.sum(axis=0)[0] should be 200, but if 4 threads are used in parallel computation, then the output becomes 800. However, this does not happen on my server (CentOS 7.6) with gcc=4.2.0 and openmp=201