Hi there,

I'm trying to implement speculative decoding with the TensorRT-LLM backend, specifically with Qwen2.5-72B-Instruct-AWQ as the target model and Qwen2.5-3B-Instruct-AWQ as the draft model. However, I've run into some difficulties:
1. There seems to be no clear documentation or examples demonstrating how to configure tensorrtllm_backend to serve speculative decoding.
2. trtllm-serve doesn't appear to support speculative decoding.
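For context, here is how far I've gotten on the engine-build side. This is only my rough reading of the trtllm-build options for draft-target speculative decoding (the draft_tokens_external mode), so please treat the flags, paths, and values below as unverified assumptions rather than a working recipe:

```bash
# Sketch only -- paths are placeholders and I am assuming both checkpoints have
# already been converted/quantized into TensorRT-LLM checkpoint format.

# Draft engine: Qwen2.5-3B-Instruct-AWQ, built as a normal engine.
# (I am assuming --gather_generation_logits is only needed if draft logits are
# later used for acceptance.)
trtllm-build \
    --checkpoint_dir ./ckpt/qwen2.5-3b-instruct-awq \
    --output_dir ./engines/qwen2.5-3b-draft \
    --gemm_plugin auto \
    --max_batch_size 8 \
    --gather_generation_logits

# Target engine: Qwen2.5-72B-Instruct-AWQ, built so it can accept externally
# generated draft tokens (my understanding of draft_tokens_external mode).
trtllm-build \
    --checkpoint_dir ./ckpt/qwen2.5-72b-instruct-awq \
    --output_dir ./engines/qwen2.5-72b-target \
    --gemm_plugin auto \
    --max_batch_size 8 \
    --max_draft_len 10 \
    --speculative_decoding_mode draft_tokens_external
```

If any of these flags are wrong or unnecessary, corrections are welcome.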
Could someone provide guidance on:
1. Proper configuration steps for speculative decoding with tensorrtllm_backend (my rough attempt at the Triton wiring is sketched below)
2. Whether there are any workarounds for trtllm-serve to support this feature
3. Any existing examples or reference implementations
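On the Triton side, the closest thing I have found is the tensorrt_llm_bls model, so below is my current (unverified) guess at the wiring. The tensorrt_llm_draft model entry, the tensorrt_llm_draft_model_name parameter key, and the per-request num_draft_tokens input are all assumptions on my part:

```bash
# Assumed model repository layout, based on the standard tensorrtllm_backend
# templates (the extra tensorrt_llm_draft entry is my guess):
#   triton_model_repo/
#     preprocessing/
#     postprocessing/
#     tensorrt_llm/          # target engine (./engines/qwen2.5-72b-target)
#     tensorrt_llm_draft/    # copy of tensorrt_llm pointing at ./engines/qwen2.5-3b-draft
#     tensorrt_llm_bls/      # BLS model that would orchestrate draft + target

# Fill in the BLS config so it knows about both engines; the parameter names
# here are my best guess at how the draft model is referenced.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt \
    triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False,tensorrt_llm_model_name:tensorrt_llm,tensorrt_llm_draft_model_name:tensorrt_llm_draft

# After launching Triton, I would expect requests to go to tensorrt_llm_bls with
# something like a num_draft_tokens input controlling how many tokens the draft
# model proposes per step -- but I have not been able to confirm this.
```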
Any help would be greatly appreciated. Thanks!