# Pytorch Custom CUDA kernel for searchsorted This repository is an implementation of the searchsorted function to work for pytorch CUDA Tensors. Initially derived from the great [C extension tutorial](https://github.com/chrischoy/pytorch-custom-cuda-tutorial), but totally changed since then because building C extensions is not available anymore on pytorch 1.0. > Warnings: > * only works with pytorch > v1.3 and CUDA >= v10.1 > * **NOTE** When using `searchsorted()` for practical applications, tensors need to be contiguous in memory. This can be easily achieved by calling `tensor.contiguous()` on the input tensors. Failing to do so _will_ lead to inconsistent results across applications. ## Description Implements a function `searchsorted(a, v, out, side)` that works just like the [numpy version](https://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html#numpy.searchsorted) except that `a` and `v` are matrices. * `a` is of shape either `(1, ncols_a)` or `(nrows, ncols_a)`, and is contiguous in memory (do `a.contiguous()` to ensure this). * `v` is of shape either `(1, ncols_v)` or `(nrows, ncols_v)`, and is contiguous in memory (do `v.contiguous()` to ensure this). * `out` is either `None` or of shape `(nrows, ncols_v)`. If provided and of the right shape, the result is put there. This is to avoid costly memory allocations if the user already did it. If provided, `out` should be contiguous in memory too (do `out.contiguous()` to ensure this). * `side` is either "left" or "right". See the [numpy doc](https://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html#numpy.searchsorted). Please not that the current implementation *does not correctly handle this parameter*. Help welcome to improve the speed of [this PR](https://github.com/aliutkus/torchsearchsorted/pull/7) the output is of size as `(nrows, ncols_v)`. If all input tensors are on GPU, a cuda version will be called. Otherwise, it will be on CPU. **Disclaimers** * This function has not been heavily tested. Use at your own risks * When `a` is not sorted, the results vary from numpy's version. But I decided not to care about this because the function should not be called in this case. * In some cases, the results vary from numpy's version. However, as far as I could see, this only happens when values are equal, which means we actually don't care about the order in which this value is added. I decided not to care about this also. * vectors have to be contiguous for torchsearchsorted to give consistant results. use `.contiguous()` on all tensor arguments before calling ## Installation Just `pip install .`, in the root folder of this repo. This will compile and install the torchsearchsorted module. be careful that sometimes, `nvcc` needs versions of `gcc` and `g++` that are older than those found by default on the system. If so, just create symbolic links to the right versions in your cuda/bin folder (where `nvcc` is) For instance, on my machine, I had `gcc` and `g++` v9 installed, but `nvcc` required v8. So I had to do: > sudo apt-get install g++-8 gcc-8 > sudo ln -s /usr/bin/gcc-8 /usr/local/cuda-10.1/bin/gcc > sudo ln -s /usr/bin/g++-8 /usr/local/cuda-10.1/bin/g++ be careful that you need pytorch to be installed on your system. The code was tested on pytorch v1.3 ## Usage Just import the torchsearchsorted package after installation. I typically do: ``` from torchsearchsorted import searchsorted ``` ## Testing Under the `examples` subfolder, you may: 1. try `python test.py` with `torch` available. ``` Looking for 50000x1000 values in 50000x300 entries NUMPY: searchsorted in 4851.592ms CPU: searchsorted in 4805.432ms difference between CPU and NUMPY: 0.000 GPU: searchsorted in 1.055ms difference between GPU and NUMPY: 0.000 Looking for 50000x1000 values in 50000x300 entries NUMPY: searchsorted in 4333.964ms CPU: searchsorted in 4753.958ms difference between CPU and NUMPY: 0.000 GPU: searchsorted in 0.391ms difference between GPU and NUMPY: 0.000 ``` The first run comprises the time of allocation, while the second one does not. 2. You may also use the nice `benchmark.py` code written by [@baldassarreFe](https://github.com/baldassarreFe), that tests `searchsorted` on many runs: ``` Benchmark searchsorted: - a [5000 x 300] - v [5000 x 100] - reporting fastest time of 20 runs - each run executes searchsorted 100 times Numpy: 4.6302046799100935 CPU: 5.041533078998327 CUDA: 0.0007955809123814106 ```