github
GitHub API
Keeps: repo, release, stars delta
- 2026-04-28 ggerganov/llama.cpp b8966: b8966
<details open> ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286) * ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32) Adds MMA-f16 and tile kernel configs, dispatch logic, template instances, and tile .cu file for Mistral
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8964: b8964
<details open> common : re-arm reasoning budget after DONE on new <think> (#22323) DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks
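The failure mode described in this entry can be illustrated with a toy state machine. This is a minimal sketch, not llama.cpp's code: the `ReasoningBudget` class, the `State` enum, and the per-token `feed` method are hypothetical names, and it assumes the start and end tags each arrive as a single token. The point is the branch taken in the `DONE` state: a new start tag resets the budget instead of being absorbed.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # no think block seen yet
    COUNTING = auto()  # inside a think block, spending budget
    DONE = auto()      # the most recent think block has closed


class ReasoningBudget:
    """Hypothetical tracker; names and token model are illustrative only."""

    def __init__(self, budget, start_tag="<think>", end_tag="</think>"):
        self.budget = budget
        self.start_tag = start_tag
        self.end_tag = end_tag
        self.remaining = budget
        self.state = State.IDLE

    def feed(self, token):
        """Return True if the token fits the budget, False if it exceeds it."""
        if self.state in (State.IDLE, State.DONE):
            if token == self.start_tag:
                # Re-arm: a start tag after DONE opens a freshly budgeted block.
                self.remaining = self.budget
                self.state = State.COUNTING
            return True
        if token == self.end_tag:
            self.state = State.DONE
            return True
        if self.remaining == 0:
            return False
        self.remaining -= 1
        return True


tracker = ReasoningBudget(budget=2)
stream = ["<think>", "a", "b", "c", "</think>", "x",
          "<think>", "d", "e", "f", "</think>"]
# Tokens that overran the budget in either think block.
over_budget = [tok for tok in stream if not tracker.feed(tok)]
```

With a budget of two tokens per block, the third token of each think block is flagged. If the `DONE` branch ignored the start tag, the second block would pass through unflagged.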
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8963: b8963
<details open> vulkan: Coalesce Q4_K/Q5_K scale loads (#21751) Some SPIR-V compilers (notably mesa) don't handle the current vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well. While reading three `u8`s from the 12-byte scale array should (at least on some hardwar
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8962: b8962
<details open> ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (#22456) * Refactor buffer aliasing to be part of shader lib decisions * cleanup * formatting </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama
github:ggerganov/llama.cpp
# Release v5.7.0 ## New Model additions ### Laguna <img width="699" height="176" alt="image" src="https://github.com/user-attachments/assets/d3bae269-bea7-4ddf-a53f-d4718befdb17" /> Laguna is Poolside's mixture-of-experts language model family that extends standard
github:huggingface/transformers
- 2026-04-28 ggerganov/llama.cpp b8960: b8960
<details open> vulkan: add barrier after writetimestamp (#21865) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8960/llama-b8960-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https:
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8958: b8958
<details open> ggml : skip already registered backends and devices (#22296) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8958/llama-b8958-bin-macos-arm64
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8957: b8957
<details open> ggml : revert to -lm linking instead of find_library (#22355) * ggml : revert to -lm linking instead of find_library `find_library(MATH_LIBRARY m)` was introduced recently, but it breaks CUDA compilation with GGML_STATIC. I could not find any valid use case wher
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8956: b8956
<details open> CANN: add new ops, optimize existing ops (#21204) New operators: - GGML_OP_SET: implement via aclnnInplaceCopy on target region - GGML_OP_CUMSUM: implement via aclnnCumsum - GGML_OP_FILL: implement via aclnnInplaceFillScalar - GGML_OP_DIAG: implement via aclnnInp
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8955: b8955
<details open> spec : refactor params (#22397) * spec : refactor params * cont : fix * cont : rename "sparam" to "sampling" * cont : add spec params category * cont : add info about removed arguments * cont : skip param length check for spec params * cont : adapt server t
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8953: b8953
<details open> ggml-webgpu: add Q1_0 support (#22374) * add fast matmul matvec q1_0 kernel * ggml-webgpu: drop redundant zero-fills in Q1_0 shmem init </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8953/llam
github:ggerganov/llama.cpp