Implement code detecting max work size and optimal vector width. Use this to patch the kernel to suit the idea values for the card. Then use these values when invoking the kernel.