Help needed: No clear documentation/examples for implementing speculative decoding with backend serve #2671

e1ijah1 · 2025-01-08T13:16:53Z

Hi there,

I'm trying to implement speculative decoding with the TensorRT-LLM backend, using Qwen2.5-72B-Instruct-AWQ as the target model and Qwen2.5-3B-Instruct-AWQ as the draft model. However, I've run into two difficulties:

- There seems to be no clear documentation or example showing how to configure tensorrtllm_backend to serve speculative decoding.
- trtllm-serve doesn't appear to support speculative decoding.

Could someone provide guidance on:

- Proper configuration steps for speculative decoding with tensorrtllm_backend
- Whether there are any workarounds for trtllm-serve to support this feature
- Any existing examples or reference implementations

Any help would be greatly appreciated. Thanks!
