
Can't utilize my GPU #444

Open
ghost opened this issue Jan 5, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@ghost

ghost commented Jan 5, 2025

Hi,
I have a 4090; no matter how hard I try, it doesn't seem to run on the GPU.
I set the UUID in CUDA_VISIBLE_DEVICES,
and it still runs on my CPU.
I have no problems with other apps.
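
For reference, a minimal sketch (assuming nvidia-smi is available in the same environment the Ollama server runs in; the comparison is only illustrative) of how to double-check that the UUID set in CUDA_VISIBLE_DEVICES matches what the driver reports:

```python
# Illustrative check: compare the GPU UUID reported by the driver with the
# value set in CUDA_VISIBLE_DEVICES. Assumes nvidia-smi is on PATH.
import os
import subprocess

driver_uuid = subprocess.run(
    ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()

print("driver reports:      ", driver_uuid)
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
```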

@ghost ghost added the bug Something isn't working label Jan 5, 2025
@Jeffser
Owner

Jeffser commented Jan 5, 2025

Hi, can I get Alpaca's logs? You can find them in About Alpaca > Troubleshooting > Debugging Information.

@Jeffser
Owner

Jeffser commented Jan 5, 2025

Just as a warning: the logs might include your username, so you may want to remove it before sending them to me.

@ghost
Author

ghost commented Jan 5, 2025

> Hi, can I get Alpaca's logs? You can find them in About Alpaca > Troubleshooting > Debugging Information.

logs.txt

@sequencerr

https://github.com/ollama/ollama/blob/main/docs/gpu.md#laptop-suspend-resume I admit I'm on a laptop; trying that didn't help though.

logs.txt
Couldn't find '/home/yrch/.ollama/id_ed25519'. Generating new private key.
Your new public key is: 

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINce7hTiqcOrqSWyie0nWovja7ZtIBJB346QNHieW6EQ

2025/01/07 23:50:45 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES:GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11435 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-01-07T23:50:45.581+01:00 level=INFO source=images.go:757 msg="total blobs: 5"
time=2025-01-07T23:50:45.581+01:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-01-07T23:50:45.581+01:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11435 (version 0.5.4)"
time=2025-01-07T23:50:45.582+01:00 level=DEBUG source=common.go:80 msg="runners located" dir=/app/lib/ollama/runners
time=2025-01-07T23:50:45.582+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cpu_avx/ollama_llama_server
time=2025-01-07T23:50:45.582+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cpu_avx2/ollama_llama_server
time=2025-01-07T23:50:45.582+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cuda_v11_avx/ollama_llama_server
time=2025-01-07T23:50:45.582+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cuda_v12_avx/ollama_llama_server
time=2025-01-07T23:50:45.582+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/rocm_avx/ollama_llama_server
time=2025-01-07T23:50:45.582+01:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[rocm_avx cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx]"
time=2025-01-07T23:50:45.582+01:00 level=DEBUG source=routes.go:1340 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2025-01-07T23:50:45.582+01:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2025-01-07T23:50:45.582+01:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-07T23:50:45.583+01:00 level=DEBUG source=gpu.go:99 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-01-07T23:50:45.583+01:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=libcuda.so*
time=2025-01-07T23:50:45.583+01:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[/app/lib/ollama/libcuda.so* /app/lib/ollama/libcuda.so* /app/lib/libcuda.so* /usr/lib/x86_64-linux-gnu/GL/default/lib/libcuda.so* /usr/lib/x86_64-linux-gnu/openh264/extra/libcuda.so* /usr/lib/x86_64-linux-gnu/openh264/extra/libcuda.so* /usr/lib/sdk/llvm15/lib/libcuda.so* /usr/lib/x86_64-linux-gnu/GL/default/lib/libcuda.so* /usr/lib/ollama/libcuda.so* /app/plugins/AMD/lib/ollama/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-01-07T23:50:45.584+01:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths=[]
time=2025-01-07T23:50:45.584+01:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=libcudart.so*
time=2025-01-07T23:50:45.584+01:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[/app/lib/ollama/libcudart.so* /app/lib/ollama/libcudart.so* /app/lib/libcudart.so* /usr/lib/x86_64-linux-gnu/GL/default/lib/libcudart.so* /usr/lib/x86_64-linux-gnu/openh264/extra/libcudart.so* /usr/lib/x86_64-linux-gnu/openh264/extra/libcudart.so* /usr/lib/sdk/llvm15/lib/libcudart.so* /usr/lib/x86_64-linux-gnu/GL/default/lib/libcudart.so* /usr/lib/ollama/libcudart.so* /app/plugins/AMD/lib/ollama/libcudart.so* /app/lib/ollama/libcudart.so* /app/lib/ollama/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2025-01-07T23:50:45.585+01:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths="[/app/lib/ollama/libcudart.so.11.3.109 /app/lib/ollama/libcudart.so.12.4.127]"
CUDA driver version: 12-2
time=2025-01-07T23:50:45.709+01:00 level=DEBUG source=gpu.go:149 msg="detected GPUs" library=/app/lib/ollama/libcudart.so.11.3.109 count=1
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] CUDA totalMem 8325824512
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] CUDA freeMem 8215592960
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] CUDA usedMem 0
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] Compute Capability 8.9
time=2025-01-07T23:50:45.762+01:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2025-01-07T23:50:45.762+01:00 level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/0/properties"
time=2025-01-07T23:50:45.762+01:00 level=DEBUG source=amd_linux.go:122 msg="detected CPU /sys/class/kfd/kfd/topology/nodes/0/properties"
time=2025-01-07T23:50:45.762+01:00 level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/1/properties"
time=2025-01-07T23:50:45.762+01:00 level=DEBUG source=amd_linux.go:207 msg="mapping amdgpu to drm sysfs nodes" amdgpu=/sys/class/kfd/kfd/topology/nodes/1/properties vendor=4098 device=6400 unique_id=0
time=2025-01-07T23:50:45.763+01:00 level=DEBUG source=amd_linux.go:241 msg=matched amdgpu=/sys/class/kfd/kfd/topology/nodes/1/properties drm=/sys/class/drm/card0/device
time=2025-01-07T23:50:45.763+01:00 level=INFO source=amd_linux.go:297 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB"
time=2025-01-07T23:50:45.763+01:00 level=INFO source=amd_linux.go:404 msg="no compatible amdgpu devices detected"
releasing cudart library
time=2025-01-07T23:50:45.781+01:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7 library=cuda variant=v11 compute=8.9 driver=0.0 name="" total="7.8 GiB" available="7.7 GiB"
[GIN] 2025/01/07 - 23:50:45 | 200 |     549.407µs |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/01/07 - 23:50:45 | 200 |    14.87177ms |       127.0.0.1 | POST     "/api/show"
time=2025-01-07T23:51:30.731+01:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="30.6 GiB" before.free="21.4 GiB" before.free_swap="250.8 MiB" now.total="30.6 GiB" now.free="21.2 GiB" now.free_swap="250.8 MiB"
CUDA driver version: 12-2
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] CUDA totalMem 8325824512
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] CUDA freeMem 8215592960
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] CUDA usedMem 0
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] Compute Capability 8.9
time=2025-01-07T23:51:30.781+01:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7 name="" overhead="0 B" before.total="7.8 GiB" before.free="7.7 GiB" now.total="7.8 GiB" now.free="7.7 GiB" now.used="0 B"
releasing cudart library
time=2025-01-07T23:51:30.800+01:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x563123aea780 gpu_count=1
time=2025-01-07T23:51:30.823+01:00 level=DEBUG source=sched.go:224 msg="loading first model" model=/home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41
time=2025-01-07T23:51:30.823+01:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[7.7 GiB]"
time=2025-01-07T23:51:30.823+01:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[7.7 GiB]"
time=2025-01-07T23:51:30.824+01:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[7.7 GiB]"
time=2025-01-07T23:51:30.824+01:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[7.7 GiB]"
time=2025-01-07T23:51:30.825+01:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="30.6 GiB" before.free="21.2 GiB" before.free_swap="250.8 MiB" now.total="30.6 GiB" now.free="21.2 GiB" now.free_swap="250.8 MiB"
CUDA driver version: 12-2
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] CUDA totalMem 8325824512
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] CUDA freeMem 8215592960
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] CUDA usedMem 0
[GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7] Compute Capability 8.9
time=2025-01-07T23:51:30.859+01:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7 name="" overhead="0 B" before.total="7.8 GiB" before.free="7.7 GiB" now.total="7.8 GiB" now.free="7.7 GiB" now.used="0 B"
releasing cudart library
time=2025-01-07T23:51:30.878+01:00 level=INFO source=server.go:104 msg="system memory" total="30.6 GiB" free="21.2 GiB" free_swap="250.8 MiB"
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[7.7 GiB]"
time=2025-01-07T23:51:30.878+01:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=21 layers.split="" memory.available="[7.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.0 GiB" memory.required.partial="7.6 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[7.6 GiB]" memory.weights.total="18.0 GiB" memory.weights.repeating="17.4 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cpu_avx/ollama_llama_server
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cpu_avx2/ollama_llama_server
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cuda_v11_avx/ollama_llama_server
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cuda_v12_avx/ollama_llama_server
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/rocm_avx/ollama_llama_server
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cpu_avx/ollama_llama_server
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cpu_avx2/ollama_llama_server
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cuda_v11_avx/ollama_llama_server
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/cuda_v12_avx/ollama_llama_server
time=2025-01-07T23:51:30.878+01:00 level=DEBUG source=common.go:124 msg="availableServers : found" file=/app/lib/ollama/runners/rocm_avx/ollama_llama_server
time=2025-01-07T23:51:30.879+01:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/app/lib/ollama/runners/cuda_v11_avx/ollama_llama_server runner --model /home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 --ctx-size 2048 --batch-size 512 --n-gpu-layers 21 --verbose --threads 8 --parallel 1 --port 42899"
time=2025-01-07T23:51:30.879+01:00 level=DEBUG source=server.go:393 msg=subprocess environment="[LD_LIBRARY_PATH=/app/lib/ollama:/app/lib/ollama:/app/lib/ollama/runners/cuda_v11_avx:/app/lib:/usr/lib/x86_64-linux-gnu/GL/default/lib:/usr/lib/x86_64-linux-gnu/openh264/extra:/usr/lib/x86_64-linux-gnu/openh264/extra:/usr/lib/sdk/llvm15/lib:/usr/lib/x86_64-linux-gnu/GL/default/lib:/usr/lib/ollama:/app/plugins/AMD/lib/ollama PATH=/app/bin:/usr/bin CUDA_VISIBLE_DEVICES=GPU-afbafdd9-505a-55ff-00c0-1f07a0045ec7]"
time=2025-01-07T23:51:30.880+01:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-07T23:51:30.880+01:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41
time=2025-01-07T23:51:30.880+01:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-07T23:51:30.880+01:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-07T23:51:30.902+01:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
time=2025-01-07T23:51:31.020+01:00 level=INFO source=runner.go:946 msg=system info="CUDA : USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
time=2025-01-07T23:51:31.020+01:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:42899"
time=2025-01-07T23:51:31.132+01:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Laptop GPU) - 7765 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 771 tensors from /home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 32B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 32B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv  12:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 64
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
llm_load_vocab: control token: 151648 '<|box_start|>' is not marked as EOG
llm_load_vocab: control token: 151646 '<|object_ref_start|>' is not marked as EOG
llm_load_vocab: control token: 151649 '<|box_end|>' is not marked as EOG
llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
llm_load_vocab: control token: 151647 '<|object_ref_end|>' is not marked as EOG
llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
llm_load_vocab: control token: 151644 '<|im_start|>' is not marked as EOG
llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 27648
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 32B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 32.76 B
llm_load_print_meta: model size       = 18.48 GiB (4.85 BPW) 
llm_load_print_meta: general.name     = Qwen2.5 32B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 518 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 21 repeating layers to GPU
llm_load_tensors: offloaded 21/65 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size = 18926.01 MiB
llm_load_tensors:        CUDA0 model buffer size =  5963.43 MiB
time=2025-01-07T23:51:40.913+01:00 level=DEBUG source=server.go:600 msg="model load progress 0.73"
time=2025-01-07T23:51:41.164+01:00 level=DEBUG source=server.go:600 msg="model load progress 0.83"
time=2025-01-07T23:51:41.414+01:00 level=DEBUG source=server.go:600 msg="model load progress 0.96"
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size =   344.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   168.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.60 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   916.08 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 606 (with bs=512), 3 (with bs=1)
time=2025-01-07T23:51:41.665+01:00 level=INFO source=server.go:594 msg="llama runner started in 10.79 seconds"
time=2025-01-07T23:51:41.666+01:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41
time=2025-01-07T23:51:41.667+01:00 level=DEBUG source=routes.go:290 msg="generate request" images=0 prompt="<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n\nGenerate a title following these rules:\n    - The title should be based on the prompt at the end\n    - Keep it in the same language as the prompt\n    - The title needs to be less than 30 characters\n    - Use only alphanumeric characters and spaces\n    - Just write the title, NOTHING ELSE\n\n```PROMPT\nhello\n```<|im_end|>\n<|im_start|>assistant\n"
time=2025-01-07T23:51:41.668+01:00 level=DEBUG source=server.go:967 msg="new runner detected, loading model for cgo tokenization"
time=2025-01-07T23:51:41.741+01:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=103 used=0 remaining=103
llama_model_loader: loaded meta data with 34 key-value pairs and 771 tensors from /home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 32B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 32B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv  12:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 64
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 32.76 B
llm_load_print_meta: model size       = 18.48 GiB (4.85 BPW) 
llm_load_print_meta: general.name     = Qwen2.5 32B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-01-07T23:51:42.126+01:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n"
[GIN] 2025/01/07 - 23:51:49 | 200 | 18.589418215s |       127.0.0.1 | POST     "/api/generate"
time=2025-01-07T23:51:49.310+01:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2025-01-07T23:51:49.310+01:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 refCount=1
time=2025-01-07T23:51:49.311+01:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=107 prompt=30 used=23 remaining=7
time=2025-01-07T23:52:01.281+01:00 level=DEBUG source=sched.go:407 msg="context for request finished"
[GIN] 2025/01/07 - 23:52:01 | 200 | 30.561707659s |       127.0.0.1 | POST     "/api/chat"
time=2025-01-07T23:52:01.282+01:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 duration=5m0s
time=2025-01-07T23:52:01.282+01:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 refCount=0
time=2025-01-07T23:54:09.445+01:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/home/yrch/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41
time=2025-01-07T23:54:09.456+01:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today? Feel free to ask any questions or let me know if you need help with anything specific.<|im_end|>\n<|im_start|>user\nded<|im_end|>\n<|im_start|>assistant\n"
time=2025-01-07T23:54:09.460+01:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=57 prompt=68 used=57 remaining=11

[screenshots: task manager showing GPU and CPU usage]

The task manager shows some GPU usage spikes at the beginning, but overall it keeps using and fully loading the CPU instead.

[screenshot: GPU memory usage]
GPU memory is full, though (also visible in the previous task manager screenshot). Even when idle, I guess the model is cached there. Is my GPU too weak?
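
From the log above, only 21 of the model's 65 layers were offloaded (`llm_load_tensors: offloaded 21/65 layers to GPU`), so the remaining 44 layers run on the CPU for every generated token. A rough sketch of why that ends up CPU-bound (layer counts are from the log; the per-layer times are made-up placeholders, not measurements):

```python
# Rough illustration of why a partial offload is CPU-bound.
# Layer counts come from the log; per-layer times are invented placeholders.
layers_total = 65
layers_gpu = 21                    # "offloaded 21/65 layers to GPU"
layers_cpu = layers_total - layers_gpu

gpu_ms_per_layer = 0.5             # hypothetical per-layer time on the GPU
cpu_ms_per_layer = 8.0             # hypothetical per-layer time on the CPU

token_ms = layers_gpu * gpu_ms_per_layer + layers_cpu * cpu_ms_per_layer
cpu_share = layers_cpu * cpu_ms_per_layer / token_ms
print(f"{layers_cpu} layers on CPU -> ~{token_ms:.0f} ms/token, "
      f"{cpu_share:.0%} of the time spent on the CPU")
```

So the GPU finishes its share of each token quickly and then waits on the CPU, which matches the task manager showing the CPU fully loaded while GPU usage only spikes.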

@Jeffser
Owner

Jeffser commented Jan 8, 2025

Alright, I figured out what's happening here.

@dvs81 is using a 70B model and @sequencerr is using a 32B model.

I think Ollama can't make full use of models that don't fit completely in VRAM, and these are really big models.

So yeah, you guys should probably use medium-sized models so they can fit in VRAM.
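
As a back-of-the-envelope sketch (not Ollama's actual scheduling logic; the overhead figure is a guess), you can estimate whether a quantized model's weights fit in VRAM from the parameter count and bits per weight reported in the log (32.76 B parameters at 4.85 BPW, roughly 18.5 GiB, against 8 GiB of VRAM):

```python
# Back-of-the-envelope check: do a quantized model's weights fit in VRAM?
# Numbers are approximations, not Ollama's real memory accounting.
def fits_in_vram(params_b: float, bits_per_weight: float,
                 overhead_gib: float, vram_gib: float):
    """params_b: parameters in billions; bits_per_weight: e.g. ~4.85 for Q4_K_M."""
    weights_gib = params_b * 1e9 * bits_per_weight / 8 / 1024**3
    return weights_gib + overhead_gib <= vram_gib, weights_gib

fits, weights = fits_in_vram(params_b=32.76, bits_per_weight=4.85,
                             overhead_gib=1.5, vram_gib=8.0)
print(f"~{weights:.1f} GiB of weights vs 8 GiB VRAM -> fits: {fits}")
# -> ~18.5 GiB of weights vs 8 GiB VRAM -> fits: False
```

Anything that doesn't fit is split with the CPU, which is what the `offloaded 21/65 layers` line in the logs shows.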
