Add eager flag and set compile to be always true #734

Open · wants to merge 2 commits into `main`
20 changes: 18 additions & 2 deletions launcher/src/main.rs
```diff
@@ -342,9 +342,18 @@ struct Args {

     /// Whether you want to compile the model into a CUDA graph.
     /// This will speed up decoding but increase GPU memory usage.
-    #[clap(long, env, value_enum)]
+    /// Only use either `--compile` or `--eager`. Using both at the same time will
+    /// result in an error.
+    #[clap(default_value = "true", long, env, value_enum)]
```
Contributor: Since this defaults to true here, will this raise an error if the user specifies `--eager` without setting anything for `compile`?

Typically we would change the type of `compile` here to `Option<bool>` so we can distinguish between explicitly specified args and unspecified ones.
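For illustration, here is a minimal sketch of that suggestion using clap's derive API. The resolution logic in `main` is an assumption about the intended semantics, not code from this PR:

```rust
use clap::Parser;

#[derive(Parser, Debug)]
struct Args {
    /// `None` means the user never passed `--compile`, which lets us
    /// distinguish "unspecified" from an explicit `--compile true/false`.
    #[clap(long, env)]
    compile: Option<bool>,

    /// Run without CUDA graph compilation (plain presence flag).
    #[clap(long, env)]
    eager: bool,
}

fn main() {
    let args = Args::parse();
    let compile = match (args.compile, args.eager) {
        // An explicit --compile true together with --eager is a real conflict.
        (Some(true), true) => panic!("Cannot use both --compile and --eager."),
        // Otherwise an explicit value wins.
        (Some(value), _) => value,
        // Unspecified: compile by default, unless --eager was given.
        (None, eager) => !eager,
    };
    println!("effective compile = {compile}");
}
```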

```diff
     compile: bool,

+    /// Whether you want to run the model in eager mode, without
+    /// CUDA graph compilation, or run it with compilation.
+    /// Only use either `--compile` or `--eager`. Using both at the same time will
+    /// result in an error.
+    #[clap(default_value = "false", long, env, value_enum)]
+    eager: bool,
+
     // The maximum batch size past which CUDA graphs are disabled.
     #[clap(default_value = "128", long, env)]
     compile_max_batch_size: usize,
@@ -656,6 +665,7 @@ fn shard_manager(
     adapter_source: String,
     quantize: Option<Quantization>,
     compile: bool,
+    eager: bool,
     compile_max_batch_size: usize,
     compile_max_rank: usize,
     compile_batch_size: usize,
@@ -738,10 +748,14 @@
     }

     // CUDA graph compilation
-    if compile {
+    if compile && !eager {
         shard_args.push("--compile".to_string());
     }
+
+    if (compile && eager) || (!compile && !eager) {
+        panic!("Exactly one of --compile and --eager must be enabled.");
+    }

     // Speculative decoding
     if let Some(speculative_tokens) = speculative_tokens {
         shard_args.push("--speculative-tokens".to_string());
```
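To make the four flag combinations explicit, the snippet below mirrors the launcher logic above as a hypothetical standalone function (the name `resolve_compile` and the `Result`-based signature are illustrative, not part of the PR):

```rust
/// Decide whether the shard should receive `--compile`.
/// Exactly one of the two flags must be enabled; any other
/// combination is rejected, matching the panic above.
fn resolve_compile(compile: bool, eager: bool) -> Result<bool, &'static str> {
    match (compile, eager) {
        (true, false) => Ok(true),   // default: compile the CUDA graph
        (false, true) => Ok(false),  // eager mode: skip compilation
        // Both enabled, or both disabled: the launcher panics here.
        _ => Err("Exactly one of --compile and --eager must be enabled."),
    }
}

#[cfg(test)]
mod tests {
    use super::resolve_compile;

    #[test]
    fn flag_combinations() {
        assert_eq!(resolve_compile(true, false), Ok(true));
        assert_eq!(resolve_compile(false, true), Ok(false));
        assert!(resolve_compile(true, true).is_err());
        assert!(resolve_compile(false, false).is_err());
    }
}
```

Note that because `compile` defaults to true, simply passing `--eager` trips the conflict branch; this is the behavior the review comment above questions.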
```diff
@@ -1303,6 +1317,7 @@ fn spawn_shards(
     let otlp_endpoint = args.otlp_endpoint.clone();
     let quantize = args.quantize;
     let compile = args.compile;
+    let eager = args.eager;
     let compile_max_batch_size = args.compile_max_batch_size;
     let compile_max_rank = args.compile_max_rank;
     let compile_batch_size = args.compile_batch_size;
@@ -1335,6 +1350,7 @@
         adapter_source,
         quantize,
         compile,
+        eager,
         compile_max_batch_size,
         compile_max_rank,
         compile_batch_size,
```
5 changes: 4 additions & 1 deletion server/lorax_server/models/flash_causal_lm.py
```diff
@@ -1181,7 +1181,10 @@ def __init__(
             SLIDING_WINDOW = sliding_window
             SLIDING_WINDOW_BLOCKS = math.ceil(sliding_window / BLOCK_SIZE)

-        self.compile = compile
+        self.compile = compile and self.supports_cuda_graph_compilation
+        if compile and not self.supports_cuda_graph_compilation:
+            logger.info("Model does not support CUDA graph compilation. Skipping compilation.")
+
         self.model_graph_wrapper: GraphCache = None
         self.kv_cache = []
```
4 changes: 4 additions & 0 deletions server/lorax_server/models/model.py
```diff
@@ -228,6 +228,10 @@ def check_initialized(self):
f"found uninitialized parameters in model {self.__class__.__name__}: {uninitialized_parameters}"
)

@property
def supports_cuda_graph_compilation(self) -> bool:
return True
Contributor: Do we need to set this to False for some models?

Contributor Author: At least with the limited models I tested, compile worked, so we don't set this property to False for any model, though we may want to in the future. It's not needed at the moment, so I'm happy to get rid of it.


```diff
+
     @property
     def supports_adapter_loading(self) -> bool:
         return False
```