github
GitHub API
Keeps: repo, release, stars delta
- 2026-06-05 vllm-project/vllm v0.22.1: v0.22.1
## Highlights This release features 8 commits from 6 contributors (1 new)! v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen
github:vllm-project/vllm
- 2026-06-05 ggerganov/llama.cpp b9524: b9524
<details open> minor : fix lint issues (#24165) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9524/llama-b9524-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](https://github
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9523: b9523
<details open> hparams : refactor `hparams.n_layer` (#24060) * hparams : refactor hparams.n_layer * cont : remove `n_layer_kv()`, use n_layer_all instead * cont : type consistency * pi : update SYSTEM.md * models : fix Step3.5 MTP * cont : remove duplicate switch cases *
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9522: b9522
<details open> kleidiai : dynamic chunk-based scheduling for hybrid execution (#23819) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9522/llama-b9522-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, Kle
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9521: b9521
<details open> CUDA: enroll mul_mat_vec_q_moe into pdl (#24087) * Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW Data collected on a B4500: Before ``` (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc=
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9519: b9519
<details open> sycl : port multi-column MMVQ from CUDA backend (#21845) mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K,
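
The "read weights once per dispatch instead of once per column" idea can be illustrated with a small sketch. This is plain Python over unquantized lists, not the SYCL kernel; the function names and the `reads` counter are hypothetical and exist only to make the difference in weight traffic visible.

```python
def matvec_per_column(weights, columns):
    """Baseline: every input column triggers a fresh pass over all weight rows."""
    reads = 0
    out = []
    for x in columns:
        col_out = []
        for row in weights:
            reads += 1  # the weight row is fetched again for this column
            col_out.append(sum(w * v for w, v in zip(row, x)))
        out.append(col_out)
    return out, reads


def matvec_multi_column(weights, columns):
    """Multi-column variant: fetch each weight row once, reuse it for all columns."""
    reads = 0
    out = [[0] * len(weights) for _ in columns]
    for r, row in enumerate(weights):
        reads += 1  # single fetch of this weight row
        for c, x in enumerate(columns):
            out[c][r] = sum(w * v for w, v in zip(row, x))
    return out, reads


weights = [[1, 2], [3, 4], [5, 6]]
columns = [[1, 0], [0, 1], [1, 1], [2, 2]]
baseline, baseline_reads = matvec_per_column(weights, columns)
fused, fused_reads = matvec_multi_column(weights, columns)
```

Both functions produce the same results; only the number of weight-row fetches differs (rows times columns for the baseline, rows for the multi-column variant).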
github:ggerganov/llama.cpp
# Patch release v5.10.2 There was a big bug in the model conversion of models related to clip, this affected models like sam3 and others. Please make sure to update :pray: * Fix conversion for clip models by @zucchini-nlp (#46406) **Full Changelog**: https://github.com/
github:huggingface/transformers
- 2026-06-04 ggerganov/llama.cpp b9518: b9518
<details open> server : disable on-device spec checkpoints (#24108) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9518/llama-b9518-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISA
github:ggerganov/llama.cpp
- 2026-06-04 ggerganov/llama.cpp b9515: b9515
<details open> Move duplicated imatrix code into single common imatrix-loader.cpp (#22445) * Deduplicate imatrix loading code * Add back LLAMA_TRACE, early exit on quantize missing metadata </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org
github:ggerganov/llama.cpp
- 2026-06-04 ggerganov/llama.cpp b9512: b9512
<details open> return filter to save memory (#24125) Co-authored-by: lvyichen <lvyichen@stepfun.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9512/llama-b9512-bin-macos-arm64.tar.gz) - macOS Apple Silic
github:ggerganov/llama.cpp
- 2026-06-04 ggerganov/llama.cpp b9510: b9510
<details open> ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (#22209) * ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 Optimize the inner loop of ggml_vec_dot_q4_1_q8_1_generic using WASM SIMD128 intrinsics, gated behind #ifdef __wasm_simd128__ so non-wasm
github:ggerganov/llama.cpp
- 2026-06-04 ggerganov/llama.cpp b9509: b9509
<details open> server: avoid unnecessary checkpoint restore when new tokens are present (#24110) * server: avoid unnecessary checkpoint restore when new tokens are present The pos_min_thold calculation unconditionally subtracts 1 to ensure at least one token is evaluated for l
github:ggerganov/llama.cpp
- 2026-06-04 ggerganov/llama.cpp b9500: b9500
<details open> metal : reduce rset heartbeat from 500ms -> 5ms (#24074) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9500/llama-b9500-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [
github:ggerganov/llama.cpp
- 2026-06-04 ggerganov/llama.cpp b9499: b9499
<details open> ggml-webgpu: FlashAttention refactor + standardize quantization support (#23834) * Start work on flash_attn refactor * Refactor * Split k/v quantization * Refactor and abstract quantization logic for flash_attn and mul_mat * Add quantization support to tile p
github:ggerganov/llama.cpp
- 2026-06-04 ggerganov/llama.cpp b9498: b9498
<details open> ggml-cpu: extend RVV quantization vec dot to higher VLENs (#22754) * ggml-cpu: add rvv 512b,1024b impls for iq4_xs * ggml-cpu: refactor; add rvv 512b, 1024b impls for q6_K, i-quants * ggml-cpu: refactor; add 512 and 1024 implementations of tq3_s, iq3_xxs, iq2_s
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9496: b9496
<details open> mtmd: fix Gemma 4 unified FPE (#24088) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9496/llama-b9496-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](https://
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9495: b9495
<details open> qwen35: use post-norm hidden state for MTP (#24025) * qwen35: use post-norm hidden state for MTP * rename pre_norm to nextn * fix step35 </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9495/ll
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9494: b9494
<details open> mtmd: enable non-causal vision for gemma 4 unified (#24082) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9494/llama-b9494-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9493: b9493
<details open> mtmd, model: allow skip build_vit() (#24077) * add model * nits </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9493/llama-b9493-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI e
github:ggerganov/llama.cpp
# Release v5.10.1 v5.10.0 was yanked as we publish on a corrupted branch. Sorry everyone, this happens when we rush a release!!! ## New Model additions ### Gemma4 unified+ Gemma4 MTP <img width="2000" height="400" alt="image" src="https://github.com/user-attachments/asse
github:huggingface/transformers
- 2026-06-03 ggerganov/llama.cpp b9491: b9491
<details open> Avoid PDL race conditions by disabling __restrict__ when PDL is used (#24030) * Removes __restrict__ from PDL kernel headers due to incompatibility with PDL. Adds preprocessor directives based on arch in kernel body to add __restrict__ to retain performance on ol
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9490: b9490
<details open> ggml-cpu: use runtime SVE width in FWHT (#24059) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9490/llama-b9490-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9489: b9489
<details open> cuda: reserve space for quantize kv-cache at startup (#23907) * cuda: reserve space for quantize kv-cache at startup * address review comments * remove forward decl Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * remove assert in ggml-cuda.cu Co-authore
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9488: b9488
<details open> tests : add support for qwen3 SSM archs (#24031) * tests : add support for qwen3 SSM archs * arch : add LLM_KV_ATTENTION_RECURRENT_LAYERS * cont : naming + TODOs </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/r
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9487: b9487
<details open> update BoringSSL to 0.20260526.0 (#23794) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9487/llama-b9487-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](https
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9486: b9486
<details open> ci : disable ccache for msvc windows release jobs (#23911) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9486/llama-b9486-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled)
github:ggerganov/llama.cpp
- 2026-06-03 ggerganov/llama.cpp b9485: b9485
<details open> arg : removed unnecessary mmproj download when users pass --no-mmproj (#23425) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9485/llama-b9485-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64,
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9484: b9484
<details open> opencl: use flat variants of q4_K and q6_K gemv for very large M (#24006) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9484/llama-b9484-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, Kl
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9483: b9483
<details open> hexagon: profiler output fix and script updates (#24042) * hex-ops: fix profiler output (ie remove the redundant NONEs) * hex-prof: update profiling script to support tot.usec column </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9482: b9482
<details open> model: add Mellum architecture (#23966) * model: support for Mellum architecture * model: improve mellum.py formatting * model: improve mellum.py formatting once again * deps: downgrade transformers to 4.57.6 (to fix CI) * deps: remove huggingface_hub depende
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9481: b9481
<details open> model : support granite multilingual embeddings R2 (ibm-granite/granite-embedding-{97,311}m-multilingual-r2) (#22716) * Add support for the ibm-granite/granite-embedding-{97m,311m}-multilingual-r2 embedding models: * Added a version of the gpt4o tokenizer that h
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9480: b9480
<details open> StepFun 3.5 MTP (#23274) * StepFun 3.5 MTP * Simplify to single layer * Rollback core changes * fix flake8 errors * Remove scripts * modify to convention * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> *
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9479: b9479
<details open> common : fix state save in common_prompt_batch_decode (#23468) * common : fix state save in common_prompt_batch_decode This commit addresses a bug in common_prompt_batch_decode that affects the session state store/restore in completion.cpp and save-load-state.cp
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9470: b9470
<details open> hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (#23989) * hex-mm: initial support for F32 * F32 -> F32 matmuls * hex-rms-norm: fix src1 stride use in fused rms_norm_mul * hex-ops: clear spad pointers in the ops tha
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9469: b9469
<details open> hexagon: add gelu_quick (#24007) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9469/llama-b9469-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](https://github
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9468: b9468
<details open> server: real-time reasoning interruption via control endpoint (#23971) * server: real-time reasoning interruption via control endpoint Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot an
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9467: b9467
<details open> clean up unused variables warnings (#23975) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9467/llama-b9467-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](htt
github:ggerganov/llama.cpp
- 2026-06-02 ggerganov/llama.cpp b9466: b9466
<details open> opencl: fix compiler warnings for non-adreno path (#23922) * opencl: fix compiler warnings for non-adreno path * opencl: fix const cast warning </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9
github:ggerganov/llama.cpp
- 2026-06-01 ggerganov/llama.cpp b9464: b9464
<details open> speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988) * speculative : add common_speculative_n_max helper function Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function i
github:ggerganov/llama.cpp
- 2026-06-01 ggerganov/llama.cpp b9460: b9460
<details open> llama: limit max outputs of `llama_context` (#23861) * llama: save more VRAM by reserving n_outputs == n_seqs when possible * add n_outputs_per_seq * move n_outputs_max to server-context * change ubatch to batch everywhere </details> **macOS/iOS:** - [macOS
github:ggerganov/llama.cpp
- 2026-06-01 ggerganov/llama.cpp b9459: b9459
<details open> metal: template GLU kernels to support f16/f32 (#23882) Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid expl
github:ggerganov/llama.cpp
- 2026-06-01 ggerganov/llama.cpp b9458: b9458
<details open> vulkan: don't hold the device mutex while compiling pipelines (#23641) * vulkan: don't hold the device mutex while compiling pipelines We need to hold a lock while we traverse all pipelines and lazily initialize them, but we don't need to hold it while the pipel
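
The locking pattern named in this entry's title (lock while traversing, no lock while compiling) might look roughly like the sketch below. It uses Python's `threading.Lock` as a stand-in for the device mutex; `PipelineCache`, `ensure_compiled`, and `compile_fn` are hypothetical names, and nothing here reflects the actual Vulkan backend code.

```python
import threading


class PipelineCache:
    """Toy cache that compiles pipelines lazily without holding its lock during compilation."""

    def __init__(self, compile_fn):
        self._lock = threading.Lock()
        self._compiled = {}
        self._compile_fn = compile_fn  # assumed to be slow

    def ensure_compiled(self, names):
        # Hold the lock only while traversing the table to find missing entries.
        with self._lock:
            pending = [n for n in names if n not in self._compiled]

        # The slow step runs with the lock released, so other threads
        # can keep reading or updating the table in the meantime.
        results = {n: self._compile_fn(n) for n in pending}

        # Re-acquire briefly to publish results; setdefault keeps the first
        # published pipeline if another thread compiled the same one.
        with self._lock:
            for name, pipeline in results.items():
                self._compiled.setdefault(name, pipeline)
            return [self._compiled[n] for n in names]
```

One trade-off of this shape is that two threads may compile the same pipeline redundantly; the sketch accepts that cost in exchange for a shorter critical section.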
github:ggerganov/llama.cpp
- 2026-06-01 ggerganov/llama.cpp b9457: b9457
<details open> vulkan: reduce host memory lock contention (#23376) * vulkan: reduces lock contention * replace unique_lock with lock_guard </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9457/llama-b9457-bin-
github:ggerganov/llama.cpp
- 2026-05-31 ggerganov/llama.cpp b9444: b9444
<details open> server : handle If-None-Match weak ETags (#23916) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9444/llama-b9444-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLE
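
For context on weak ETags: HTTP specifies that `If-None-Match` uses weak comparison, meaning two entity-tags match when their opaque-tags are identical regardless of a `W/` prefix on either side. A minimal sketch of that comparison follows; the function names are hypothetical, the comma split is a simplification that ignores commas inside quoted tags, and this is not the llama.cpp server code.

```python
def parse_if_none_match(header: str) -> list[str]:
    """Split an If-None-Match header value into its entity-tags (naive comma split)."""
    return [tag.strip() for tag in header.split(",") if tag.strip()]


def opaque_tag(etag: str) -> str:
    """Drop the weak indicator W/ so only the quoted opaque-tag remains."""
    return etag[2:] if etag.startswith("W/") else etag


def if_none_match_hits(header: str, current_etag: str) -> bool:
    """Return True when a listed tag weakly matches, i.e. a 304 response is allowed."""
    tags = parse_if_none_match(header)
    if "*" in tags:
        return True  # "*" matches any current representation
    current = opaque_tag(current_etag)
    return any(opaque_tag(tag) == current for tag in tags)
```

With this comparison, a client that sends `W/"abc"` still gets a cache hit when the server's current tag is the strong `"abc"`.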
github:ggerganov/llama.cpp
- 2026-05-31 ggerganov/llama.cpp b9442: b9442
<details open> vocab : add tokenizer support for jina-embeddings-v2-base-zh (#18756) * vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer) * lowercase defaults to true * type fix --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> </details>
github:ggerganov/llama.cpp
- 2026-05-31 ggerganov/llama.cpp b9441: b9441
<details open> ui: fix ETag truncation with MSVC compiler (#23917) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9441/llama-b9441-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISAB
github:ggerganov/llama.cpp
- 2026-05-31 ggerganov/llama.cpp b9439: b9439
<details open> llama: only use one iGPU device by default (#23897) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9439/llama-b9439-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISAB
github:ggerganov/llama.cpp
- 2026-05-30 ggerganov/llama.cpp b9437: b9437
<details open> Support `-fa auto` in llama-bench (#23714) * Support `-fa auto` in llama-bench Make the default value of `-ngl` -1, similar to other tools. Update README with latest usage and examples * Address review comments </details> **macOS/iOS:** - [macOS Apple Silico
github:ggerganov/llama.cpp
- 2026-05-30 ggerganov/llama.cpp b9436: b9436
<details open> opencl: support bf16 by converting to f16 (#23839) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9436/llama-b9436-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABL
github:ggerganov/llama.cpp
- 2026-05-30 ggerganov/llama.cpp b9434: b9434
<details open> TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (#23843) * TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs * fix afmoe TP </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9434/llama-b9434-bin-macos-
github:ggerganov/llama.cpp