CUDA graph phase N - Device-side graph launch

This task should be simple: 
- Allow users to opt in, and internally pass `cudaGraphInstantiateFlagDeviceLaunch` to `cudaGraphInstantiate()`
- Add code samples to showcase how this can be done
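
A rough sketch of the opt-in plumbing, to make the first bullet concrete. Everything named here (`InstantiateFlag`, `GraphInstantiateOptions`, `device_launch`, `instantiate_flags`) is hypothetical and not an existing cuda-core API. The enum is a local stand-in so the snippet runs without a GPU; its value is assumed to mirror `cudaGraphInstantiateFlagDeviceLaunch` from the CUDA runtime headers, and real code would take the flag from the CUDA bindings instead.

```python
from dataclasses import dataclass
from enum import IntFlag


class InstantiateFlag(IntFlag):
    # Local stand-in (assumption): the value is meant to mirror
    # cudaGraphInstantiateFlagDeviceLaunch in the CUDA runtime headers.
    NONE = 0
    DEVICE_LAUNCH = 4


@dataclass
class GraphInstantiateOptions:
    # Hypothetical user-facing options object; off by default so that
    # existing host-launched graphs are unaffected.
    device_launch: bool = False


def instantiate_flags(options: GraphInstantiateOptions) -> int:
    """Translate user options into the bitmask passed at instantiation."""
    flags = InstantiateFlag.NONE
    if options.device_launch:
        flags |= InstantiateFlag.DEVICE_LAUNCH
    return int(flags)
```

The resulting integer is what would be forwarded as the flags argument when the graph is instantiated. Keeping the translation in one small function would make it easy to add further opt-in flags later.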

Most of the work would happen on the JIT compiler side (numba-cuda); cuda-core would only set up the infrastructure.