github
GitHub API. Keeps: repo, release, stars delta
- 2026-05-14 ggerganov/llama.cpp b9158
<details open> HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (#22880) Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with the FP16 accumulation for VKQ, the tiles need to be 32 logical units long in the direction of the attention head; for
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9156
<details open> ggml-webgpu: Enable NVIDIA self-hosted CI (#22976) * Enable nvidia ci for webgpu * Address precision issues * fix placement * Relax more set_rows and div * Try relaxing all f16 * formatting and naming * Add comment explaining max_nmse_err logic Added comme
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9151
<details open> logs : reduce (#23021) * logs : reduce * args : fix envs * server : fix build * common : print verbosity level at start * server : clean-up logs * server : print prompt processing timings + sampling params * minor : whitespaces </details> **macOS/iOS:** -
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9150
<details open> ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (#22863) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9150/llama-b9150-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, Kl
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9148
<details open> unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110) * unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests - Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtrackin
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9145
<details open> SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (#21597) * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend. sycl::
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9144
<details open> ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (#23020) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9144/llama-b9144-bin-macos-arm64.tar.gz) -
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9143
<details open> Fix for issue #22974. Cast intermediate results to float before adding and casting the result to the destination type. Avoids half+half operator ambiguity. (#22994) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/r
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9142
<details open> opencl: add q5_0 and q5_1 MoE for Adreno (#22985) * opencl: add q5_0 moe support * opencl: add q5_1 moe support * opencl: avoid potential leak * opencl: suppress unused var warning when building for non-Adreno --------- Co-authored-by: Li He <lih@qti.qualcom
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9141
<details open> server, webui: accept continue_final_message flag for vLLM API compat (#23012) * server, webui: accept continue_final_message flag for vLLM API compat Add the continue_final_message body flag from the vLLM and transformers API. When set together with add_generat
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9140
<details open> opencl: fix crash when warming up MoE on Adreno (#22876) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9140/llama-b9140-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)]
github:ggerganov/llama.cpp