github
GitHub API
Keeps: repo, release, stars delta
- 2026-05-04 ggerganov/llama.cpp b9025: b9025
<details open> kleidiai : update to v1.24.0 and use release archive (#22549) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9025/llama-b9025-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enab
github:ggerganov/llama.cpp
- 2026-05-04 ggerganov/llama.cpp b9023: b9023
<details open> server: implement /models?reload=1 (#21848) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9023/llama-b9023-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://gith
github:ggerganov/llama.cpp
- 2026-05-04 ggerganov/llama.cpp b9020: b9020
<details open> common/autoparser: fixes for newline handling / forced tool calls (#22654) * chat/autoparser: the fixes * Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls. * Trim whitespace on apply inst
github:ggerganov/llama.cpp
- 2026-05-04 ggerganov/llama.cpp b9022: b9022
<details open> examples: refactor diffusion generation (#22590) * examples: refactor diffusion generation * renamed enum values </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9022/llama-b9022-bin-macos-arm64
github:ggerganov/llama.cpp
- 2026-05-04 ggerganov/llama.cpp b9019: b9019
<details open> model: move `load_hparams` and `load_tensors` to per-model definition (#22004) * git-friendly migration * add build_graph * nits * exclude old code from build * wip * add llm_arch_model_i * prepare downstream functions * nits * nits * wip * wip * add b
github:ggerganov/llama.cpp
- 2026-05-04 ggerganov/llama.cpp b9018: b9018
<details open> server: Add a simple get_datetime server tool (#22649) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9018/llama-b9018-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](h
github:ggerganov/llama.cpp
- 2026-05-04 ggerganov/llama.cpp b9016: b9016
<details open> docs : update speculative decoding parameters after refactor (#22397) (#22539) * docs : update speculative decoding parameters after refactor (#22397) Update docs/speculative.md to reflect the new parameter naming scheme introduced in PR #22397: - Replace --dra
github:ggerganov/llama.cpp
- 2026-05-04 ggerganov/llama.cpp b9015: b9015
<details open> vulkan: delete dead GGML_VK_MAX_NODES def (#22621) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9015/llama-b9015-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https
github:ggerganov/llama.cpp
- 2026-05-04 ggerganov/llama.cpp b9014: b9014
<details open> ggml-webgpu: add layer norm ops (#22406) * shader(norm): add layer norm ops * shader(norm): stablize floating point computation with Kahan summation and handle mixed types * shader(norm): remove the non-contiguous strides * shader(norm): use the original imple
github:ggerganov/llama.cpp
- 2026-05-03 ggerganov/llama.cpp b9012: b9012
<details open> convert : Mistral format yarn apply_scale support (#22612) * [BUGFIX] Mistral format apply_scale support. * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix misunderstood boolean parameters --------- Co-autho
github:ggerganov/llama.cpp
- 2026-05-03 vllm-project/vllm v0.20.1: v0.20.1
# vLLM v0.20.1 This is a patch release on top of `v0.20.0` primarily focused on **DeepSeek V4 stabilization and performance improvements**, along with several important bug fixes. ### DeepSeek V4 * Base model support (#41006). * Multi-stream pre-attention GEMM (#41061), c
github:vllm-project/vllm
- 2026-05-02 ggerganov/llama.cpp b9010: b9010
<details open> fix: CUDA device PCI bus ID de-dupe OOMing (ignoring other 3 gpus entirely) (#22533) * fix: CUDA device PCI bus ID detection for multi-GPU de-dupe * HIP, MUSA macros --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> </details> **macOS/iOS:** - [ma
github:ggerganov/llama.cpp
- 2026-05-02 ggerganov/llama.cpp b9009: b9009
<details open> server : avoid checkpoint data host copies (#22558) * server : avoid checkpoint data host copies * llama : refactor llama_io_read_i </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9009/llama-b9
github:ggerganov/llama.cpp
- 2026-05-02 ggerganov/llama.cpp b9008: b9008
<details open> ggml-virtgpu: fix circular dependency in headers (#22557) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9008/llama-b9008-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)
github:ggerganov/llama.cpp
- 2026-05-02 ggerganov/llama.cpp b9006: b9006
<details open> opencl: Adreno optimization for MoE - MxFP4 (#22301) * MoE Mxfp4 CLC kernel added, router reorder on GPU * Pass test-backend-ops for MoE mxfp4 Adreno CLC * remove putenv in llama-model.cpp * fix indent style and whitespace * opencl: remove unnecessary headers
github:ggerganov/llama.cpp
- 2026-05-02 ggerganov/llama.cpp b9004: b9004
<details open> sync : ggml </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9004/llama-b9004-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releas
github:ggerganov/llama.cpp
- 2026-05-02 ggerganov/llama.cpp b9002: b9002
<details open> sync : ggml </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9002/llama-b9002-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releas
github:ggerganov/llama.cpp
- 2026-05-02 ggerganov/llama.cpp b9000: b9000
<details open> hexagon: hmx flash attention (#22347) * hmx: extract shared interleave headers and unify matmul batched * hmx: add HMX-accelerated flash attention for prefill * hmx: replace asm wrappers with Q6_ intrinsics in hmx-utils.h Switches three single-instruction help
github:ggerganov/llama.cpp
- 2026-05-01 ggerganov/llama.cpp b8999: b8999
<details open> llama-quant : fix `--tensor-type` when default `qtype` is overriden (#22572) fix #22544 (my fault!) Credit to @Anai-Guo, ref #22559 - since that one was closed due to the new contributor policy I am taking the liberty of re-submitting that PR here. </details>
github:ggerganov/llama.cpp
- 2026-05-01 ggerganov/llama.cpp b8998: b8998
<details open> hexagon: enable non-contiguous row tensor support for unary ops (#22574) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8998/llama-b8998-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, Kl
github:ggerganov/llama.cpp
- 2026-05-01 ggerganov/llama.cpp b8996: b8996
<details open> ggml-webgpu: Fix vectorized handling in mul-mat and mul-mat-id (#22578) * Fix vectorized condition of mul-mat-fast pipeline and add vectorized variant to mul-mat-id * Apply suggestion from @CISC Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --
github:ggerganov/llama.cpp
- 2026-05-01 ggerganov/llama.cpp b8994: b8994
<details open> ggml-webgpu: add the upscale shader (#22419) * shader(upscale): add the upscale shader with nearest, bilinear and bicubic implementations * shader(upscale): use macro </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.c
github:ggerganov/llama.cpp
- 2026-04-30 ggerganov/llama.cpp b8992: b8992
<details open> Update llama-mmap to use ftello/fseeko (#22497) * Update llama-mmap to work with 32-bit wasm and >2GB models * Update to gguf.cpp style </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8992/llam
github:ggerganov/llama.cpp
- 2026-04-30 ggerganov/llama.cpp b8991: b8991
<details open> common : check for null getpwuid in hf-cache (#22550) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8991/llama-b8991-bin-macos-arm64.tar.gz
github:ggerganov/llama.cpp
- 2026-04-30 ggerganov/llama.cpp b8989: b8989
<details open> spec: fix argument typo (#22552) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8989/llama-b8989-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml
github:ggerganov/llama.cpp
- 2026-04-30 ggerganov/llama.cpp b8990: b8990
<details open> vulkan: add get/set tensor 2d functions (#22514) * vulkan: add get/set_tensor_2d functions * fix backend interface comments * Update ggml/src/ggml-metal/ggml-metal.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> </details> **macOS/iOS:** -
github:ggerganov/llama.cpp
- 2026-04-30 ggerganov/llama.cpp b8987: b8987
<details open> vendor : update cpp-httplib to 0.43.2 (#22548) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8987/llama-b8987-bin-macos-arm64.tar.gz) - [ma
github:ggerganov/llama.cpp
- 2026-04-30 ggerganov/llama.cpp b8986: b8986
<details open> CUDA: fix tile FA kernel on Pascal (#22541) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8986/llama-b8986-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://gith
github:ggerganov/llama.cpp
- 2026-04-30 ggerganov/llama.cpp b8984: b8984
<details open> add fast matmul iquants (#22504) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8984/llama-b8984-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml
github:ggerganov/llama.cpp
- 2026-04-30 ggerganov/llama.cpp b8983: b8983
<details open> spec : fix draft model checkpoints (#22521) * spec : fix draft model checkpoints * cont : clean-up * cont : gate the ngram-mod reset warning behind verbose flag </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/re
github:ggerganov/llama.cpp
- 2026-04-30 ggerganov/llama.cpp b8982: b8982
<details open> spec : fix vocab compat checks in spec example (#22426) * port #22358 PR to examples/speculative/speculative.cpp * use vocab_[tgt,dft] instead of ctx_[tgt,dft] when logging on draft model / target model vocabulary mismatch Co-authored-by: Petros Sideris <petro
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8981: b8981
<details open> common : do not pass prompt tokens to reasoning budget sampler (#22488) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8981/llama-b8981-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, Kle
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8980: b8980
<details open> hexagon: make vmem and buffer-size configurable (#22487) * hexagon: allow host to set max vmem size We use a sane default but it's helpful to allow for an override if needed. * hexagon: add support for measuring vmem space and move pinned mmaping management to
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8979: b8979
<details open> CUDA: fuse SSM_CONV + ADD(bias) + SILU (#22478) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8979/llama-b8979-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8978: b8978
<details open> spec : disacard last drafted token with low prob (#22506) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8978/llama-b8978-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8977: b8977
<details open> sync : ggml </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8977/llama-b8977-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releas
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8974: b8974
<details open> ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (#22293) * ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault vec_xst operations in the tiled path crash on AIX when writing near 4KB page boundaries due to strict memory prot
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8973: b8973
<details open> ggml-cuda: refactor fusion code (#22468) * ggml-cuda: refactor fusion code * apply formatting + make env variable truthy </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8973/llama-b8973-bin-mac
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8972: b8972
<details open> ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (#22317) * ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains inline asm for the vmadot family which requires the xsmtvdotii cust
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8971: b8971
<details open> ggml-webgpu: Fix bug in FlashAttention support check (#22492) * Fix flashattention support check for devices that don't support subgroups * set path to none if kv_tile doesn't fit </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggm
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8970: b8970
<details open> common: Intentionally leak logger instance to fix hanging on Windows (#22273) * Changed to leak logger singleton to prevent hanging on Windows * Fix comment * Stopped using static vector Using std::vector will cause g_col to be released before the logger thre
github:ggerganov/llama.cpp
- 2026-04-29 ggerganov/llama.cpp b8967: b8967
<details open> ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (#22196) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8967/llama-b8967-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiA
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8966: b8966
<details open> ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286) * ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32) Adds MMA-f16 and tile kernel configs, dispatch logic, template instances, and tile .cu file for Mistral
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8964: b8964
<details open> common : re-arm reasoning budget after DONE on new <think> (#22323) DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8963: b8963
<details open> vulkan: Coalesce Q4_K/Q5_K scale loads (#21751) Some SPIR-V compilers (notably mesa) don't handle the current vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well. While reading three `u8`s from the 12-byte scale array should (at least on some hardwar
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8962: b8962
<details open> ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (#22456) * Refactor buffer aliasing to be part of shader lib decisions * cleanup * formatting </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama
github:ggerganov/llama.cpp
# Release v5.7.0 ## New Model additions ### Laguna <img width="699" height="176" alt="image" src="https://github.com/user-attachments/assets/d3bae269-bea7-4ddf-a53f-d4718befdb17" /> Laguna is Poolside's mixture-of-experts language model family that extends standard
github:huggingface/transformers
- 2026-04-28 ggerganov/llama.cpp b8960: b8960
<details open> vulkan: add barrier after writetimestamp (#21865) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8960/llama-b8960-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https:
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8958: b8958
<details open> ggml : skip already registered backends and devices (#22296) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8958/llama-b8958-bin-macos-arm64
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8957: b8957
<details open> ggml : revert to -lm linking instead of find_library (#22355) * ggml : revert to -lm linking instead of find_library `find_library(MATH_LIBRARY m)` was introduced recently, but it breaks CUDA compilation with GGML_STATIC. I could not find any valid use case wher
github:ggerganov/llama.cpp