github
GitHub API
Keeps: repo, release, stars delta
- 2026-06-01 ggerganov/llama.cpp b9464
<details open>
speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988)
* speculative : add common_speculative_n_max helper function
Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function i
github:ggerganov/llama.cpp
- 2026-06-01 ggerganov/llama.cpp b9460
<details open>
llama: limit max outputs of `llama_context` (#23861)
* llama: save more VRAM by reserving n_outputs == n_seqs when possible
* add n_outputs_per_seq
* move n_outputs_max to server-context
* change ubatch to batch everywhere
</details>
**macOS/iOS:** - [macOS
github:ggerganov/llama.cpp
- 2026-06-01 ggerganov/llama.cpp b9459
<details open>
metal: template GLU kernels to support f16/f32 (#23882)
Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid expl
github:ggerganov/llama.cpp
- 2026-06-01 ggerganov/llama.cpp b9458
<details open>
vulkan: don't hold the device mutex while compiling pipelines (#23641)
* vulkan: don't hold the device mutex while compiling pipelines
We need to hold a lock while we traverse all pipelines and lazily initialize them, but we don't need to hold it while the pipel
github:ggerganov/llama.cpp
- 2026-06-01 ggerganov/llama.cpp b9457
<details open>
vulkan: reduce host memory lock contention (#23376)
* vulkan: reduces lock contention
* replace unique_lock with lock_guard
</details>
**macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9457/llama-b9457-bin-
github:ggerganov/llama.cpp
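
The two Vulkan entries above (b9458 and b9457) both concern how long a mutex is held. As a rough illustration of the general pattern only, the sketch below claims uncompiled pipelines while holding a `std::lock_guard`, then runs the slow compile step after the lock is released. All names here (`Pipeline`, `Device`, `compile_pipeline`, `ensure_pipelines_compiled`) are hypothetical and are not taken from the llama.cpp Vulkan backend, whose actual implementation may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a GPU pipeline; not llama.cpp's actual type.
struct Pipeline {
    std::string name;
    std::atomic<bool> compiled{false};
    bool compile_claimed = false;  // guarded by Device::mutex
};

// Hypothetical device holding the pipeline list and the mutex that guards it.
struct Device {
    std::mutex mutex;
    std::vector<std::shared_ptr<Pipeline>> pipelines;  // guarded by mutex
};

// Stand-in for an expensive shader compile. Called WITHOUT the device mutex.
void compile_pipeline(Pipeline & p) {
    p.compiled.store(true, std::memory_order_release);
}

// Claims every unclaimed pipeline under the lock, then compiles the claimed
// ones after the lock is released. Returns how many this call compiled.
std::size_t ensure_pipelines_compiled(Device & dev) {
    std::vector<std::shared_ptr<Pipeline>> todo;
    {
        // lock_guard suffices: the lock is never released early or moved.
        std::lock_guard<std::mutex> guard(dev.mutex);
        for (auto & p : dev.pipelines) {
            assert(p != nullptr);
            if (!p->compile_claimed) {
                p->compile_claimed = true;
                todo.push_back(p);
            }
        }
    }  // mutex released here, before any compilation starts

    for (auto & p : todo) {
        compile_pipeline(*p);
    }
    return todo.size();
}
```

One limitation of this sketch: a thread that finds a pipeline already claimed does not wait for it to finish compiling, so a real implementation would also need some way to block on or check `compiled` before using the pipeline.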