llama: handling `max_tokens` when on a `run` op #37

iboB · 2024-07-31T06:09:38Z

iboB
Jul 31, 2024
Maintainer

Currently, on a `run` job, if `max_tokens` is not provided, we cap generation at 2000 tokens.
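A minimal sketch of that fallback, for reference. It only illustrates the behavior described above and is not the project's actual code: the names `kDefaultMaxTokens` and `resolveMaxTokens` are hypothetical, and modeling the parameter as a `std::optional` is an assumption.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical constant: the fallback limit mentioned in this post.
constexpr std::uint32_t kDefaultMaxTokens = 2000;

// Hypothetical helper: returns the effective token limit for a run op.
// If the caller did not provide max_tokens, fall back to the default
// instead of an unbounded value.
std::uint32_t resolveMaxTokens(std::optional<std::uint32_t> requested) {
    return requested.value_or(kDefaultMaxTokens);
}
```

With this shape, `resolveMaxTokens(std::nullopt)` yields the 2000-token cap, while an explicit value such as `resolveMaxTokens(50)` is passed through unchanged.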

With infinite context, many models can just spew endless BS forever, so we don't want to default to `uint32_max`, as a single run could then consume gigabytes of memory and minutes of time.

We should think of a way to handle it better.
