ze. Finally, the GPU approach presented here is far from optimized. The first concern is the amount of memory sent from the CPU to the GPU. GPUs have limited memory, and a worker may be able to store more data in main memory than fits on the device. A common approach for this scenario is to split the GPU memory into chunks. While the GPU is processing one chunk, the CUDA driver asynchronously sends the remaining chunks. By synchronizing data copies with kernel executions, it is possible to process more data than actually fits in GPU memory. The overhead of this process can be lower than expected, since copies can be performed in parallel with computations that do not target the same data.

The organization of work-groups and work-items can also be improved. Our solution left many GPU cores idle while trying to maximize cache
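The chunked copy/compute overlap described above can be sketched with CUDA streams as follows. This is a minimal illustration, not the paper's implementation: the kernel `process`, the chunk size, and the stream count are all assumptions, and the host buffers are expected to be pinned (allocated with `cudaMallocHost`) so that `cudaMemcpyAsync` can truly overlap with kernel execution.

```cuda
#include <cuda_runtime.h>

#define NSTREAMS 2
#define CHUNK (1u << 20)               // elements per chunk (assumption)

// Placeholder kernel standing in for the real per-chunk computation.
__global__ void process(float *buf, size_t len) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < len) buf[i] *= 2.0f;
}

void stream_data(const float *h_in, float *h_out, size_t n) {
    cudaStream_t streams[NSTREAMS];
    float *d_buf[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_buf[i], CHUNK * sizeof(float));
    }
    for (size_t off = 0, s = 0; off < n; off += CHUNK, s = (s + 1) % NSTREAMS) {
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;
        // Operations within a stream execute in order, so d_buf[s] is not
        // overwritten until the previous chunk in stream s has been copied
        // back; meanwhile the other stream keeps the GPU and the copy
        // engine busy with a different chunk.
        cudaMemcpyAsync(d_buf[s], h_in + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(unsigned)((len + 255) / 256), 256, 0, streams[s]>>>(d_buf[s], len);
        cudaMemcpyAsync(h_out + off, d_buf[s], len * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();           // wait for all in-flight chunks
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

With two streams, the host-to-device copy of one chunk can proceed on the copy engine while the kernel for the previous chunk runs, which is the overlap that makes the transfer overhead lower than expected.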