github
GitHub API
Keeps: repo, release, stars delta
- 2026-05-30 ggerganov/llama.cpp b9433: b9433
<details open> metal : restore im2col implementation for large kernels (#23901) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9433/llama-b9433-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI en
github:ggerganov/llama.cpp
- 2026-05-30 ggerganov/llama.cpp b9432: b9432
<details open> test: (test-llama-archs) log the config name first (#23885) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9432/llama-b9432-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled
github:ggerganov/llama.cpp
- 2026-05-30 ggerganov/llama.cpp b9431: b9431
<details open> ci : update ios-xcode release job to macos-26 (#23906) * ci : disable libcommon build from xcframework * ocd : fix name * ci : ios-xcode change to macos-26 * cont : pin xcode * cont : pin xcode to minor version </details> **macOS/iOS:** - [macOS Apple Silic
github:ggerganov/llama.cpp
- 2026-05-30 ggerganov/llama.cpp b9430: b9430
<details open> ggml : add some lsx support (#23798) * loongarch : optimize LSX fp16 load/store with native intrinsics Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in __lsx_f16x4_load and __lsx_f16x4_store. * loongarch : add LSX implementation for q8_0 dot
github:ggerganov/llama.cpp
- 2026-05-30 ggerganov/llama.cpp b9428: b9428
<details open> ci : fix s390x release job (#23898) * ci : fix s390x release job * ci : multi-thread build for `ios-xcode` * ocd : names </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9428/llama-b9428-bin-ma
github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9415: b9415
<details open> download: add option to skip_download (#23059) * download: add option to skip_download * fix * fix 2 * if file doesn't exist, respect skip_download flag </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/
github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9414: b9414
<details open> mtmd: Add DeepSeekOCR 2 Support (#20975) * mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution * introduced clip_image_f32::add_viewsep * address PR review - drop redundant ggml_cpy ops in both deepseekocr versions build - drop no-op ggml_cont in
github:ggerganov/llama.cpp
- 2026-05-29 ROCm/ROCm rocm-7.2.4: ROCm 7.2.4 Release
# ROCm 7.2.4 release notes ROCm 7.2.4 is a quality release focused on performance and stability fixes for AI inference workloads on AMD Instinct GPUs. - [Release highlights](#release-highlights) - [ROCm binaries](#rocm-binaries) <a id="release-highlights"></a> ## Relea
AMD github:ROCm/ROCm
- 2026-05-29 ggerganov/llama.cpp b9413: b9413
<details open> CUDA: Check PTX version on host side to guard PDL dispatch (#23530) * CUDA: Check PTX version on host side to guard PDL dispatch Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this variable doesn't differentiate between compiling for say sm_9
github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9412: b9412
<details open> server: bump timeout to 3600s (#23842) * server: bump timeout to 3600s * nits: change wording </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9412/llama-b9412-bin-macos-arm64.tar.gz) - macOS Ap
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9411: b9411
<details open> model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (#23346) * llama : support DeepSeek V3.2 model family (with DSA lightning indexer) * convert : handle DeepseekV32ForCausalLM architecture * ggml : support for
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9410: b9410
<details open> llama: use f16 mask for FA to save VRAM (#23764) * llama: use f16 mask for FA * review: add llama_cast + formatting * simplify </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9410/llama-b9410-
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9409: b9409
<details open> sync : ggml </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9409/llama-b9409-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](https://github.com/ggml-org/llama.c
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9406: b9406
<details open> llama: add llm_graph_input_mtp (#23643) * llama: add llm_graph_input_mtp * rename input_mtp -> input_token_embd * add TODO about mtmd embedding * cont : clean-up --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> </details> **macOS/iOS:** - [m
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9405: b9405
<details open> app : move licences to llama-app (#23824) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9405/llama-b9405-bin-macos-arm64.tar.gz) - macOS Ap
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9404: b9404
<details open> cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9404/llama-b9404-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, Kle
AAPL github:ggerganov/llama.cpp
- 2026-05-29 vllm-project/vllm v0.22.0: v0.22.0
## Highlights This release features 459 commits from 230 contributors (63 new)! * **DeepSeek V4 maturity**: DeepSeek V4 received a major hardening pass this cycle — the model was reorganized into a dedicated `vllm/models/deepseek_v4/` package (#43004, #43039, #43073, #43077
github:vllm-project/vllm
- 2026-05-29 ggerganov/llama.cpp b9403: b9403
<details open> meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (#23480) Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output. </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9402: b9402
<details open> hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23835) Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case. </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/downloa
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9401: b9401
<details open> mtmd-debug: add color and rainbow mode (#23829) * mtmd-debug: add color and rainbow mode * fix M_PI * max_dist </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9401/llama-b9401-bin-macos-arm64.
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9400: b9400
<details open> mtmd: fix gemma 4 projector pre_norm (#23822) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9400/llama-b9400-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](h
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9399: b9399
<details open> opencl: move backend info printing into its own function (#23702) * opencl: move backend info print into its own function * opencl: move new log line * opencl: fix for non adreno path </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.co
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9395: b9395
<details open> app : improve help output (#23805) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9395/llama-b9395-bin-macos-arm64.tar.gz) - macOS Apple Sil
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9394: b9394
<details open> mtmd: n_head_kv defaults to n_head (#23782) removed AI-generated comment </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9394/llama-b9394-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, Kl
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9393: b9393
<details open> mtmd: fix gemma 4 audio rms norm eps (#23815) * mtmd: fix gemma 4 audio rms norm eps * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> </detail
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9391: b9391
<details open> arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9391/llama-b9391-bin-macos-arm64.tar.gz) - macOS Apple Silicon (ar
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9389: b9389
<details open> ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9389/llama-b9389-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI e
AAPL github:ggerganov/llama.cpp
- 2026-05-29 ggerganov/llama.cpp b9388: b9388
<details open> mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729) * mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING * avoid a mismatch for JIT compilation of Turing device code for Ampere or newer Co-authored-by: J
AAPL github:ggerganov/llama.cpp
<details><summary>Changelog Details</summary> - beep boop 🤖: Bumping versions by @svcnvidia-nemo-ci :: PR: #4349 - cp: `NVFP4 native weights for DDP (4005)` into `core_r0.17.0` by @ko3n1g :: PR: #4290 - docs: bump project.json and versions1.json to 0.17.0 by @ko3n1g :: PR: #
NVDA github:NVIDIA/Megatron-LM
- 2026-05-27 ggerganov/llama.cpp b9371: b9371
<details open> ggml-webgpu: remove legacy constants (#23672) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9371/llama-b9371-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](h
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9370: b9370
<details open> hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647) * hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9368: b9368
<details open> vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (#22887) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some te
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9369: b9369
<details open> ggml-webgpu: Fix how to dispatch WG to some ops (#23750) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9369/llama-b9369-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9367: b9367
<details open> vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (#23541) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9367/llama-b9367-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9366: b9366
<details open> vulkan: add REPEAT op support for f16 to f16. (#23298) * feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9365: b9365
<details open> ci : move ARM jobs to self-hosted + disable kleidiai mac release (#23780) * ci : move ARM jobs to 3rd-party runners + disable kleidiai release * cont : fix deps + fix names * ocd : fix names * cont : fix PR links </details> **macOS/iOS:** - [macOS Apple Sili
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9360: b9360
<details open> common : fix env names to all have LLAMA_ARG_ prefix (#23778) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9360/llama-b9360-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enab
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9357: b9357
<details open> vulkan: avoid preferring transfer queue on AMD UMA devices (#22455) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9357/llama-b9357-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiA
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9354: b9354
<details open> convert: add MiniCPM5 tokenizer support (#23384) Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and implement hardcoded regex handling in llama-vocab.cpp, consistent with other BPE pre-tokenizers. Co-authored-by: zhangtao <zhangtao2@modelbest.c
AAPL github:ggerganov/llama.cpp
- 2026-05-27 ggerganov/llama.cpp b9353: b9353
<details open> server : fix the log message when using SSL (#23393) When llama-server is started with SSL key and cert, the log says that it listens on http instead of https. This patch fixes this. </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/g
AAPL github:ggerganov/llama.cpp
- 2026-05-27 NVIDIA/cutlass v4.5.1: CUTLASS 4.5.1
## CuTe DSL * Bug fixing and improvements - Fixed following issues: https://github.com/NVIDIA/cutlass/issues/3219 https://github.com/NVIDIA/cutlass/issues/3218 https://github.com/NVIDIA/cutlass/issues/3212 https://github.com/NVIDIA/cutlass/issues/3210
NVDA github:NVIDIA/cutlass
- 2026-05-26 ggerganov/llama.cpp b9352: b9352
<details open> ggml-zendnn : fixed naming of matmul function (#20964) * ggml-zendnn: fixed naming of matmul function * ggml-zendnn: fixed naming of mul_mat_id function * ggml-zendnn: fixed print in mul_mat_id --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru> <
AAPL github:ggerganov/llama.cpp
- 2026-05-26 ggerganov/llama.cpp b9351: b9351
<details open> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9351/llama-b9351-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releases/download
AAPL github:ggerganov/llama.cpp
- 2026-05-26 ggerganov/llama.cpp b9334: b9334
<details open> CUDA: missing PDL sync for FWHT, better fallback (#23690) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9334/llama-b9334-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)
AAPL github:ggerganov/llama.cpp
- 2026-05-26 ggerganov/llama.cpp b9333: b9333
<details open> metal : add apple device id (#23566) Co-authored-by: lvyichen <lvyichen@stepfun.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9333/llama-b9333-bin-macos-arm64.tar.gz) - [macOS Apple Silic
AAPL github:ggerganov/llama.cpp
- 2026-05-26 ggerganov/llama.cpp b9331: b9331
<details open> ci : reduce PR jobs by matching backend paths (#23675) * ci : disable SYCL f16 builds * ci : extract android and hip into separate workflows * ci : move webgpu to separate workflow * ci : move the rpc to a separate workflow * ci : extract s309x and ppcl jobs
AAPL github:ggerganov/llama.cpp
- 2026-05-26 ggerganov/llama.cpp b9330: b9330
<details open> model: tag ffn_latent as MUL_MAT to fix buft probe (#23664) ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks the backend about the declared op, so it tested an elementwise M
AAPL github:ggerganov/llama.cpp
- 2026-05-26 ggerganov/llama.cpp b9329: b9329
<details open> CUDA: add fast walsh-hadamard transform (#23615) * CUDA: add fast walsh-hadamard transform * review: add unrolls + change size_t -> int * warp size 64 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> </details> **macOS/iOS:** - [macOS Apple Sili
AAPL github:ggerganov/llama.cpp
- 2026-05-26 ggerganov/llama.cpp b9326: b9326
<details open> sync : ggml </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9326/llama-b9326-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releas
AAPL github:ggerganov/llama.cpp
- 2026-05-25 ggerganov/llama.cpp b9320: b9320
<details open> TP: fix ggml context size calculation (#22616) * TP: fix ggml context size calculation, memory leak * move split state cache back into the context * revert to constant ggml context size for cgraphs * increase headroom for statically allocated tensors * remove
AAPL github:ggerganov/llama.cpp