
[QST] Why is the f16xs8 mixed GEMM implementation different between TRT-LLM and the native CUTLASS mixed GEMM example? #2659

Open
danielhua23 opened this issue Jan 5, 2025 · 2 comments

@danielhua23

Dear TRT-LLM team,

Let's consider sm80 and f16s8. The CUTLASS example of the f16s8 TN mixed GEMM shown here differs from the TRT-LLM implementation: specifically, to my knowledge, the TRT-LLM version applies the dequantization scale inside the kernel, while the native CUTLASS one does not. My questions are:

  1. Is the performance or accuracy of TRT-LLM's approach (fusing the dequantization scale) better than the native CUTLASS one for LLM linear layers? A minimal sketch of the math I mean follows this list.
  2. From here, I see that the TRT-LLM version seems to load operand B (s8) using LDS rather than LDSM, but I can't find an f16s8 LDS specialization in MmaTensorOpMultiplicandTileIterator; I only find an LDS specialization for TF32, which confuses me about the "LDS" part. Am I missing something? A sketch contrasting the two load paths also follows below.
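
For concreteness, here is a minimal host-side reference of the math I mean in question 1, assuming per-output-channel scales (the names `mixed_gemm_reference` and `scale` are mine, not from either codebase):

```cpp
#include <cstdint>

// Reference semantics of a weight-only-quantized linear layer:
//   D[m][n] = sum_k A[m][k] * (scale[n] * (float)B_s8[k][n])
// As far as I can tell, TRT-LLM fuses the per-output-channel scale[n] into
// the GEMM mainloop right after the in-register s8 -> f16 conversion, while
// the plain CUTLASS example only performs the s8 -> f16 conversion and
// leaves the scaling to the caller (or a separate epilogue).
void mixed_gemm_reference(const float* A, const int8_t* B, const float* scale,
                          float* D, int M, int N, int K) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += A[m * K + k] * (scale[n] * static_cast<float>(B[k * N + n]));
      }
      D[m * N + n] = acc;
    }
  }
}
```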
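And for question 2, a sketch of the two shared-memory load paths as I understand them (the device-function names are mine; this is not the actual iterator code, just an illustration of LDSM vs. LDS):

```cpp
#include <cstdint>

// LDSM path: ldmatrix loads 8x8 tiles of 16-bit elements directly in the
// fragment layout that mma.m16n8k16 expects. On sm80, ldmatrix only supports
// .b16, which is presumably why it fits f16 operands but not raw s8 weights.
__device__ void load_b_ldsm(uint32_t (&frag)[4], const void* smem_ptr) {
  uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile(
      "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
      : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
      : "r"(addr));
}

// LDS path: plain vectorized shared-memory loads. Each 32-bit load carries
// four s8 values, which are then converted to f16 in registers; the smem
// layout has to be arranged so that, after conversion, the registers match
// the fragment layout the f16 tensor-core MMA expects.
__device__ void load_b_lds(uint32_t (&frag)[4], const uint32_t* smem_ptr) {
#pragma unroll
  for (int i = 0; i < 4; ++i) {
    frag[i] = smem_ptr[i];
  }
}
```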

Thanks for your time!

@nv-guomingz added the Performance label on Jan 6, 2025
@nv-guomingz (Collaborator)

@Barry-Delaney could you please comment on this question?

@github-actions bot added the triaged and Investigating labels on Jan 6, 2025