github
GitHub API
Keeps: repo, release, stars delta
- 2026-05-25ggerganov/llama.cpp b9319: b9319
<details open> ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (#22341) * ggml: implement `gguf_init_from_buffer` * test: `gguf_init_from_buffer` * fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer
AAPLgithub:ggerganov/llama.cpp - 2026-05-25ggerganov/llama.cpp b9318: b9318
<details open> server: MTP layer kv-cache should respect draft type ctk (#23646) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9318/llama-b9318-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI
AAPLgithub:ggerganov/llama.cpp - 2026-05-25ggerganov/llama.cpp b9315: b9315
<details open> llama : document that only one on-device state can be saved per sequence (#23520) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9315/llama-b9315-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (
AAPLgithub:ggerganov/llama.cpp - 2026-05-24ggerganov/llama.cpp b9305: b9305
<details open> cmake : fix ui build (#23592) * cmake/ui : add -fPIC to llama-ui static lib * cmake : rename host compiled embed helper </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9305/llama-b9305-bin-maco
AAPLgithub:ggerganov/llama.cpp - 2026-05-23ggerganov/llama.cpp b9297: b9297
<details open> model : add NVFP4 MTP scale tensors (#23563) * Add NVFP4 MTP scale tensors * Link Qwen3.5 MTP tensors * Aligned nullptr </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9297/llama-b9297-bin-mac
AAPLgithub:ggerganov/llama.cpp - 2026-05-23ggerganov/llama.cpp b9296: b9296
<details open> ggml : Check the right iface method before using the fallback 2d get (#23514) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9296/llama-b9296-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm6
AAPLgithub:ggerganov/llama.cpp - 2026-05-23ggerganov/llama.cpp b9295: b9295
<details open> vulkan: fix windows find_package of SPIRV-Headers (#23215) * vulkan: fix windows find_package of SPIRV-Headers * not windows-only </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9295/llama-b929
AAPLgithub:ggerganov/llama.cpp - 2026-05-23ggerganov/llama.cpp b9294: b9294
<details open> opencl: generalize Adreno MoE kernels on M (#23449) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9294/llama-b9294-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](http
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9291: b9291
<details open> SYCL: improve MoE prefill throughput (#23142) - change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a known start and end - switch the `O(n_as * n_routed_rows)` contraption to
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9292: b9292
<details open> perplexity : fix integer overflow (#23496) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9292/llama-b9292-bin-macos-arm64.tar.gz) - [mac
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9290: b9290
<details open> sycl : Level Zero detection in ggml_sycl_init (#23097) * [SYCL] Centralize Level Zero detection in ggml_sycl_init * use the same wording * get back the warning </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/rel
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9289: b9289
<details open> SYCL : gated_delta_net K>1 (#23174) * sycl_gated_delta_net K>1 * editor_config </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9289/llama-b9289-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (a
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9286: b9286
<details open> ggml-zendnn : add Q8_0 quantization support (#23414) * ggml-zendnn : add Q8_0 quantization support * ggml-zendnn : sync with latest ZenDNN * ggml-zendnn : address review comments for Q8_0 </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://githu
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9285: b9285
<details open> cmake : build router app only during standalone builds (#23521) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9285/llama-b9285-bin-macos
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9284: b9284
<details open> vocab : fix HybridDNA tokenizer (#23466) * vocab : mark hybriddna k-mers to avoid BPE token collisions * improved loop --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9283: b9283
<details open> cmake : add install() for impl libraries + fix apple builds (#23511) * pi : update * ci : fix ios build * ci : fix android * ci : fix apple builds * cmake : add install() for impl libraries Add install(TARGETS <target> LIBRARY) for all -impl libraries that
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9279: b9279
<details open> vulkan: fuse snake activation (mul, sin, sqr, mul, add) (#22855) * vulkan: fuse snake activation (mul, sin, sqr, mul, add) Add snake.comp shader with F32 / F16 / BF16 pipelines and ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op decomposition
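To illustrate what fusing the five-op chain (mul, sin, sqr, mul, add) buys, here is a minimal Python sketch. It assumes the usual snake form `x + sin(alpha * x)^2 / alpha` with a single scalar `alpha`; the function names are hypothetical and this is not the shader code from the PR.

```python
import math

def snake_unfused(x, alpha):
    """Naive decomposition: five separate element-wise passes."""
    inv_alpha = 1.0 / alpha
    t = [alpha * v for v in x]             # mul
    t = [math.sin(v) for v in t]           # sin
    t = [v * v for v in t]                 # sqr
    t = [inv_alpha * v for v in t]         # mul
    return [a + b for a, b in zip(x, t)]   # add

def snake_fused(x, alpha):
    """Single pass computing the same value per element."""
    inv_alpha = 1.0 / alpha
    return [v + inv_alpha * math.sin(alpha * v) ** 2 for v in x]
```

The fused version reads and writes each element once instead of materializing four intermediate arrays, which is the general motivation for this kind of kernel fusion.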
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9277: b9277
<details open> tests : move save-load-state from examples to tests (#23336) * tests : move save-load-state from examples to tests - Move examples/save-load-state/ to tests/test-save-load-state.cpp - Remove subdirectory reference from examples/CMakeLists.txt - Add test to tests
AAPLgithub:ggerganov/llama.cpp - 2026-05-22ggerganov/llama.cpp b9276: b9276
<details open> server: expose prompt token counts in /slots endpoint (#23454) Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache to the /slots JSON response. These fields are already tracked internally but were not exposed, making it impossible for client
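As a rough sketch of how a client might consume the three new fields, the Python below parses a `/slots`-style payload. Only the field names `n_prompt_tokens`, `n_prompt_tokens_processed`, and `n_prompt_tokens_cache` come from the release note; the sample payload, the list-of-objects shape, the `id` key, and the reading of the cache field as "tokens served from cache" are assumptions made for illustration.

```python
import json

# Hypothetical sample payload; the real response shape may differ.
SAMPLE = json.dumps([
    {"id": 0, "n_prompt_tokens": 100,
     "n_prompt_tokens_processed": 40, "n_prompt_tokens_cache": 60},
])

def prompt_stats(payload):
    """Summarize prompt token counts per slot from a JSON string."""
    stats = []
    for slot in json.loads(payload):
        total = slot.get("n_prompt_tokens", 0)
        cached = slot.get("n_prompt_tokens_cache", 0)
        stats.append({
            "id": slot.get("id"),
            "total": total,
            "processed": slot.get("n_prompt_tokens_processed", 0),
            # Guard against division by zero for idle slots.
            "cached_fraction": cached / total if total else 0.0,
        })
    return stats
```

Using `.get` with defaults keeps the sketch tolerant of servers built before b9276, where these keys would be absent.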
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9275: b9275
<details open> metal : optimize concat kernel and fix set kernel threads (#23411) * metal : fix GGML_OP_SET kernel threads * tests : extend test_cpy to support different src/dst shapes Extend test_cpy to support different source and destination tensor shapes for CPY operation
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9274: b9274
<details open> server : free draft/MTP resources on sleep to fix VRAM leak (#23461) The destroy() function in server_context_impl only cleaned up the main model and context (via llama_init.reset()) but did not free the speculative decoder (spec), draft context (ctx_dft), or dra
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9273: b9273
<details open> server: re-inject subcommand when router spawns children under unified binary (#23442) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9273/llama-b9273-bin-macos-arm64.tar.gz) - [macOS Apple Sili
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9272: b9272
<details open> app : add batched-bench, fit-params, quantize & perplexity (#23459) * app : add batched-bench, fit-params, quantize & perplexity Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add missing main.cpp Signed-off-by: Adrien Gallouët <angt@huggingface.co> *
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9271: b9271
<details open> mtp: use inp_out_ids for skipping logit computation (#23433) when doing a follow-up decode for the draft model, we were always doing the logit computation even though it is not required. </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.c
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9270: b9270
<details open> vocab : add Carbon-3B (HybridDNATokenizer) support (#23410) * vocab : add Carbon-3B (HybridDNATokenizer) support Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-B
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9267: b9267
<details open> ggml : Check the right iface method before using the fallback 2d get (#23306) Probably no backends implement only one of 2d get/set, but this might be annoying for some future backend developer trying to add 2d get/set. </details> **macOS/iOS:** - [macOS Apple
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9264: b9264
<details open> app : show version (#23426) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9264/llama-b9264-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9263: b9263
<details open> mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (#23329) - HunyuanOCR shares the same HF arch and vision layout as HunyuanVL but was split into a separate path that skipped the +0.1 bilinear sampler used by the HF reference. - Collapse O
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9260: b9260
<details open> opencl: refactor backend initialization (#23318) * opencl: refactor initialization * opencl: refactor GPU identification * opencl: rename for consistency * opencl: cache global mem size in dev_ctx * opencl: adjust log level * opencl: load argsort and flash_at
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9259: b9259
<details open> common/speculative : fix nullptr crash in get_devices_str (#23386) ggml_backend_dev_by_name always appends a nullptr sentinel to the devices vector. Skipping nullptr entries prevents assertion failure in ggml_backend_dev_name. Assisted-by: llama.cpp:local pi </
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9258: b9258
<details open> mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor (#23345) * mtmd : deepseek-ocr fixes, improvements and refactoring - image processing changes to achieve full parity with Pillow (reference impl) - SAM mask casting only when flash-att
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9257: b9257
<details open> vulkan: optimize operations in the IM2COL shader (#22685) * vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases
AAPLgithub:ggerganov/llama.cpp - 2026-05-21ggerganov/llama.cpp b9255: b9255
<details open> hexagon: HMX quantized matmul rework (#23368) * hmx-mm: update debug logging in hmx-mm * hmx-mm: update dequant logic to use HVX_vector_x2/4 * hmx-mm: remove non-pipelined version of the quantize matmul It seems that we don't really need the non-pipelined version
AAPLgithub:ggerganov/llama.cpp - 2026-05-20ggerganov/llama.cpp b9254: b9254
<details open> Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (#22522) * Adds initial PDL setup. * Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tenso
AAPLgithub:ggerganov/llama.cpp - 2026-05-20ggerganov/llama.cpp b9253: b9253
<details open> app : introduce the llama unified executable (#23296) * app : introduce the llama unified executable Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use serve for server Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Hide completion and bench,
AAPLgithub:ggerganov/llama.cpp - 2026-05-20ggerganov/llama.cpp b9251: b9251
<details open> mtmd: fit_params now take into account mmproj (#21489) * mtmd: fit_params now take into account mmproj * rename alloc_compute_meta to reserve_compute_meta * rm unused functions * add ggml_backend_dev_t support * add debug log </details> **macOS/iOS:** - [ma
AAPLgithub:ggerganov/llama.cpp # Release v5.9.0 ## New Model additions ### Cohere2Moe Command A+ is a Mixture-of-Experts (MoE) language model from Cohere that features a hybrid attention pattern combining sliding window and full attention layers. The model incorporates both shared and routed experts
COHEREgithub:huggingface/transformers - 2026-05-19ggerganov/llama.cpp b9222: b9222
<details open> hexagon: add support for TRI op (#22822) * Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context * addressed PR review comments for TRI op * hexagon: clang format * hex-unary: remove merge conflict markers * hex-ggml: remove duplicate op cases
AAPLgithub:ggerganov/llama.cpp - 2026-05-18ggerganov/llama.cpp b9221: b9221
<details open> ggml-hexagon: add PAD op HVX kernel (#23078) * ggml-hexagon: add PAD op HVX kernel Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized kernels. Supports zero-padding and circular padding across all 4 tensor dimensions. * hex-ggml: remove dupl
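The difference between the two padding modes mentioned above can be shown on a single row. This is a plain-Python sketch along one dimension only; the function name and the left/right parameters are hypothetical and do not reflect the `GGML_OP_PAD` parameter layout or the HVX kernel.

```python
def pad_1d(row, left, right, circular=False):
    """Pad a 1-D sequence with zeros, or by wrapping around its own values."""
    n = len(row)
    if circular:
        if n == 0:
            raise ValueError("cannot circularly pad an empty row")
        # Output index i maps back to source index (i - left) modulo n.
        return [row[(i - left) % n] for i in range(n + left + right)]
    return [0] * left + list(row) + [0] * right

print(pad_1d([1, 2, 3], 1, 2))                 # [0, 1, 2, 3, 0, 0]
print(pad_1d([1, 2, 3], 1, 2, circular=True))  # [3, 1, 2, 3, 1, 2]
```

Applying the same rule independently along each axis gives the multi-dimensional case.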
AAPLgithub:ggerganov/llama.cpp - 2026-05-18ggerganov/llama.cpp b9219: b9219
<details open> common : remove hf cache migration (#23266) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9219/llama-b9219-bin-macos-arm64.tar.gz) - [macOS
AAPLgithub:ggerganov/llama.cpp - 2026-05-18ggerganov/llama.cpp b9216: b9216
<details open> ui: Refactor models store, MCP service, and gate logs behind VITE_DEBUG (#23236) * refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars * refactor: skip MCP proxy probe when no server requires it * refactor: suppress expected disconnect errors during M
AAPLgithub:ggerganov/llama.cpp - 2026-05-18ggerganov/llama.cpp b9213: b9213
<details open> llama: initialize pre-norm embedding mask flag (#23256) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9213/llama-b9213-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](
AAPLgithub:ggerganov/llama.cpp - 2026-05-18ggerganov/llama.cpp b9209: b9209
<details open> sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (#22156) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases
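For readers unfamiliar with the term, SWAR ("SIMD within a register") treats one 32-bit word as four packed bytes and operates on all of them with ordinary integer instructions. The sketch below shows the classic borrow-free byte-wise subtract in Python; it illustrates the general technique only, the function names are hypothetical, and it is not the Q6_K kernel code from the PR.

```python
HIGH = 0x80808080  # top bit of each byte lane
LOW = 0x7F7F7F7F   # low seven bits of each byte lane
MASK32 = 0xFFFFFFFF

def swar_sub_bytes(x, y):
    """Subtract four packed bytes lane-wise (mod 256) with no cross-lane borrow."""
    x &= MASK32
    y &= MASK32
    # Setting the top bit of each x lane and clearing it in y guarantees
    # that no lane needs to borrow from its neighbour.
    partial = (x | HIGH) - (y & LOW)
    # Restore the correct top bit of each lane.
    fix = (x ^ ~y) & HIGH
    return (partial ^ fix) & MASK32

def ref_sub_bytes(x, y):
    """Reference: subtract each byte separately."""
    out = 0
    for shift in (0, 8, 16, 24):
        d = ((x >> shift) & 0xFF) - ((y >> shift) & 0xFF)
        out |= (d & 0xFF) << shift
    return out
```

A single word-wide subtract plus a few bitwise operations replaces four separate byte subtractions, which is the appeal of the approach in a dot-product inner loop.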
AAPLgithub:ggerganov/llama.cpp - 2026-05-18ggerganov/llama.cpp b9208: b9208
<details open> sycl: route small f32 matmuls to oneMKL, bypass oneDNN (#22150) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/d
AAPLgithub:ggerganov/llama.cpp - 2026-05-18ggerganov/llama.cpp b9204: b9204
<details open> feat: Support d_conv=15 for ssm-conv.cu (#23017) Branch: ModalityConditionalAdapters AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download
AAPLgithub:ggerganov/llama.cpp - 2026-05-18ggerganov/llama.cpp b9203: b9203
<details open> cmake : fix LLAMA_BUILD_UI logic (#23190) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9203/llama-b9203-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github
AAPLgithub:ggerganov/llama.cpp - 2026-05-17ggerganov/llama.cpp b9202: b9202
<details open> cmake : do not install conversion script (#23204) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9202/llama-b9202-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https:
AAPLgithub:ggerganov/llama.cpp - 2026-05-17ggerganov/llama.cpp b9200: b9200
<details open> llama: avoid copying logits during prompt decode in MTP (#23198) * llama: avoid copying logits during prompt decode in MTP * review: update comment * llama-graph: call set_output for t_h_pre_norm </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https
AAPLgithub:ggerganov/llama.cpp - 2026-05-17ggerganov/llama.cpp b9198: b9198
<details open> ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (#22009) * ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI For whatever reason, the files are under additional sub-path `vulkan/` under the cmake directory, which does not match either cur
AAPLgithub:ggerganov/llama.cpp - 2026-05-17ggerganov/llama.cpp b9197: b9197
<details open> vulkan: add cpy bf16 -> f32 pipelines (#22677) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9197/llama-b9197-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://g
AAPLgithub:ggerganov/llama.cpp