github
GitHub API
Keeps: repo, release, stars delta
- 2026-05-11 ggerganov/llama.cpp b9102: b9102
<details open> [SYCL] Add OP im2col_3d (#22903) * add im2col_3d * format code * update the ops.md </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9102/llama-b9102-bin-macos-arm64.tar.gz) - [macOS Apple Silic
github:ggerganov/llama.cpp
- 2026-05-10 ggerganov/llama.cpp b9101: b9101
<details open> server : print warning when HTTP timeout exceeded (#22907) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9101/llama-b9101-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled
github:ggerganov/llama.cpp
- 2026-05-10 ggerganov/llama.cpp b9100: b9100
<details open> backend sampling: support returning post-sampling probs (#22622) * server: Never return 0.0 post-sampling probabilities * backend sampling: support returning post-sampling probs </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-
github:ggerganov/llama.cpp
- 2026-05-10 ggerganov/llama.cpp b9099: b9099
<details open> vendor : update cpp-httplib to 0.43.4 (#22888) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9099/llama-b9099-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://g
github:ggerganov/llama.cpp
- 2026-05-10 ggerganov/llama.cpp b9097: b9097
<details open> sync : ggml </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9097/llama-b9097-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releas
github:ggerganov/llama.cpp
- 2026-05-10 ggerganov/llama.cpp b9095: b9095
<details open> internal AllReduce kernel for CUDA provider (#22299) * ggml-cuda: add internal AllReduce provider for tensor parallelism Introduces a NCCL-free AllReduce implementation for LLAMA_SPLIT_MODE_TENSOR using a single-phase CUDA kernel that pipelines D2H copy, cross-G
github:ggerganov/llama.cpp
- 2026-05-10 ggerganov/llama.cpp b9094: b9094
<details open> model : fix model type check for granite/llama3 and deepseek2/glm4.7 lite (#22870) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9094/llama-b9094-bin-macos-arm64.tar.gz) - [macOS Apple Silicon
github:ggerganov/llama.cpp
- 2026-05-10 vllm-project/vllm v0.20.2: v0.20.2
# vLLM v0.20.2 ## Highlights This release features 6 commits from 6 contributors (0 new)! This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL ### Bug Fixes * **DeepSeek V4 sparse attention**: Re-enable the persistent topk path on Hopper
github:vllm-project/vllm
- 2026-05-09 ggerganov/llama.cpp b9093: b9093
<details open> model : add sarvam_moe architecture support (#20275) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9093/llama-b9093-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](htt
github:ggerganov/llama.cpp
- 2026-05-09 ggerganov/llama.cpp b9090: b9090
<details open> cmake : update BoringSSL to 0.20260508.0 (#22839) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9090/llama-b9090-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https:
github:ggerganov/llama.cpp
- 2026-05-09 ggerganov/llama.cpp b9089: b9089
<details open> SYCL: reduce allocation overhead during flash attention (#22732) * SYCL: reduce allocation overhead during flash attention * tidy up whitespace * add a note about the flag * move ggml_sycl_fattn_* into fattn-buffers.hpp * refactor implementation into fattn-bu
github:ggerganov/llama.cpp
- 2026-05-09 ggerganov/llama.cpp b9088: b9088
<details open> [SYCL] Add BF16 support to GET_ROWS operation (#21391) Add GGML_TYPE_BF16 to the SYCL backend's GET_ROWS operation, both in supports_op and in the kernel dispatch. This fixes a performance regression where models using BF16 embedding tensors (e.g., Gemma4's per_l
github:ggerganov/llama.cpp
- 2026-05-09 ggerganov/llama.cpp b9087: b9087
<details open> sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path (#22152) * sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path Signed-off-by: Chun Tao <chun.tao@intel.com> * Remove duplicate definitions --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Co-
github:ggerganov/llama.cpp
- 2026-05-09 ggerganov/llama.cpp b9085: b9085
<details open> Add flash attention MMA / Tiles to support MiMo-V2.5 (#22812) * mimo-v2.5: add flash attention mma/tiles for for d_kq=192 d_v=128 * mimo-v2.5: follow (256, 256) fattn templates * mimo-v2.5: cleanup comments * mimo-v2.5: further comment cleanup * mimo-v2.5: ad
github:ggerganov/llama.cpp
- 2026-05-09 ggerganov/llama.cpp b9084: b9084
<details open> hexagon: add HTP kernel for GGML_OP_GATED_DELTA_NET (#22837) Implement the Gated Delta Net recurrence on HVX with: - 4-row fused kernels for PP (prompt processing) path - 8-row fused kernels for TG (token generation) path, reducing K/Q/gate vector reload overhe
github:ggerganov/llama.cpp
- 2026-05-08 ggerganov/llama.cpp b9082: b9082
<details open> Feature hexagon l2 norm (#22816) * L2_NORM Updates * Addressed PR Comments * ggml-hexagon: add L2_NORM HVX kernel for Hexagon backend * hex-unary: remove supported_unary_nc since the outer loop is the same for all unary ops --------- Co-authored-by: Max Kras
github:ggerganov/llama.cpp
- 2026-05-08 ggerganov/llama.cpp b9081: b9081
<details open> common : do not wrap raw strings in schema parser for tagged parsers (#22827) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9081/llama-b9081-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm6
github:ggerganov/llama.cpp
- 2026-05-08 ggerganov/llama.cpp b9080: b9080
<details open> model : support Gemma4_26B_A4B_NVFP4 (#22804) * Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes Signed-off-by: ynankani <ynankani@nvidia.com> * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> *
github:ggerganov/llama.cpp
- 2026-05-08 ggerganov/llama.cpp b9079: b9079
<details open> common : revert reasoning budget +inf logit bias (#22740) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9079/llama-b9079-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)
github:ggerganov/llama.cpp
- 2026-05-08 ggerganov/llama.cpp b9077: b9077
<details open> server: support Vertex AI compatible API (#22545) * server: support Vertex AI compatible API * a bit safer * support other AIP_* env var * various fixes * if AIP_MODE is unset, do nothing * fix test case * fix windows build </details> **macOS/iOS:** - [ma
github:ggerganov/llama.cpp
- 2026-05-08 ggerganov/llama.cpp b9076: b9076
<details open> server: (router) expose child model info from router's /v1/models (#22683) * server: (router) expose child model info from router's /v1/models * update docs </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/release
github:ggerganov/llama.cpp
- 2026-05-08 ggerganov/llama.cpp b9075: b9075
<details open> cuda: fuse snake activation (mul, sin, sqr, mul, add) (#22667) * cuda: fuse snake activation (mul, sin, sqr, mul, add) Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The matcher recognizes the naive 5 op decomposition emitted by audio decoders (Bi
github:ggerganov/llama.cpp
- 2026-05-08 ggerganov/llama.cpp b9073: b9073
<details open> CUDA: lower-case PCI bus id, standardize for ggml (#22820) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9073/llama-b9073-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled
github:ggerganov/llama.cpp
- 2026-05-08 ggerganov/llama.cpp b9070: b9070
<details open> opencl: add q4_0 MoE GEMM for Adreno (#22731) * Q4_0 MoE CLC pass sanity check * release program * opencl: fix whitespace * opencl: remove unused cl_program * opencl: break #if block to make it more clear * opencl: adjust format --------- Co-authored-by: L
github:ggerganov/llama.cpp
- 2026-05-07 ggerganov/llama.cpp b9066: b9066
<details open> CUDA: batch out_prod inner loop with cublasSgemmStridedBatched (#22651) * CUDA: batch out_prod inner loop with cublasSgemmStridedBatched * CUDA: batch out_prod inner loop with cublasSgemmStridedBatched * CUDA: add cublasSgemmStridedBatched mapping for HIP and M
github:ggerganov/llama.cpp
- 2026-05-07 ggerganov/llama.cpp b9064: b9064
<details open> llama : fix device state save/load (#22805) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9064/llama-b9064-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://gith
github:ggerganov/llama.cpp
- 2026-05-07 ggerganov/llama.cpp b9063: b9063
<details open> opencl: add opfilter regex for debugging (#22782) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9063/llama-b9063-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https:
github:ggerganov/llama.cpp
## Table of Contents - [Dialect & Frontend](#dialect--frontend) - [Backend & Compiler](#backend--compiler) - [AMD/HIP Backend](#amdhip-backend) - [NVIDIA Backend](#nvidia-backend) - [Gluon & Layout Improvements](#gluon--layout-improvements) - [Kernels & Benchmarks](#kernels
github:openai/triton
github:triton-lang/triton
- 2026-05-07 ggerganov/llama.cpp b9062: b9062
<details open> common/chat : preserve media markers for typed-content templates (#22634) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9062/llama-b9062-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, K
github:ggerganov/llama.cpp
- 2026-05-07 ggerganov/llama.cpp b9061: b9061
<details open> tests: add long-sequence cases and fix inputs for gated_delta_net (#22794) * tests : add long-seq + tail cases for gated_delta_net * tests : realistic input ranges for gated_delta_net </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com
github:ggerganov/llama.cpp
- 2026-05-07 ggerganov/llama.cpp b9060: b9060
<details open> sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (#22149) * sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET Signed-off-by: Chun Tao <chun.tao@intel.com> * Fix abort during test-backend-ops Signed-off-by: Todd Malsbary <todd
github:ggerganov/llama.cpp
- 2026-05-07 ggerganov/llama.cpp b9058: b9058
<details open> llama : remove unnecessary seq_id check during state restore (#22797) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9058/llama-b9058-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, Kleid
github:ggerganov/llama.cpp
- 2026-05-07 ggerganov/llama.cpp b9057: b9057
<details open> ggml-cpu: Optimized risc-v cpu q1_0 dot </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9057/llama-b9057-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.c
github:ggerganov/llama.cpp
- 2026-05-07 ggerganov/llama.cpp b9056: b9056
<details open> mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT (#22770) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9056/llama-b9056-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (
github:ggerganov/llama.cpp
- 2026-05-07 ggerganov/llama.cpp b9050: b9050
<details open> llama : add missing call to ggml_backend_load_all() (#22752) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9050/llama-b9050-bin-macos-arm64
github:ggerganov/llama.cpp
- NVDA github:NVIDIA/Megatron-LM
- 2026-05-06 ggerganov/llama.cpp b9049: b9049
<details open> mtmd : support MiniCPM-V 4.6 (#22529) * Support MiniCPM-V 4.6 in new branch Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix code bug Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix pre-commit Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix conve
github:ggerganov/llama.cpp
- 2026-05-06 ggerganov/llama.cpp b9048: b9048
<details open> model : don't crash on unsupported architecture (#22742) * model: don't crash on unsupported architecture * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret
github:ggerganov/llama.cpp
- 2026-05-06 ggerganov/llama.cpp b9047: b9047
<details open> common: do not fit to unknown device memory (#22614) * common: do not fit to unknown device memory Signed-off-by: Florian Reinle <f.reinle@otec.de> * common: preserve host fallback for non-GPU fit devices Signed-off-by: Florian Reinle <f.reinle@otec.de> * com
github:ggerganov/llama.cpp
- 2026-05-06 ggerganov/llama.cpp b9045: b9045
<details open> mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101) * mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) Conformer encoder with Shaw relative position encoding, QFormer projector, log-mel spectrogram with frame stackin
github:ggerganov/llama.cpp
- 2026-05-06 ggerganov/llama.cpp b9041: b9041
<details open> ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9041/llama-b9041-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](h
github:ggerganov/llama.cpp
- 2026-05-06 ggerganov/llama.cpp b9038: b9038
<details open> ggml : use `CL_DEVICE_GLOBAL_MEM_SIZE` as memory estimate for OpenCL --fit (#22688) * ggml : report estimated OpenCL memory for --fit Signed-off-by: Florian Reinle <f.reinle@otec.de> * ggml : estimated OpenCL memory backend integrated Signed-off-by: Florian Re
github:ggerganov/llama.cpp
- 2026-05-05 ggerganov/llama.cpp b9037: b9037
<details open> Hexagon: Process M-tail rows on HMX instead of HVX (#22724) * hex-mm: process m-tail rows on HMX instead of HVX * hmx-mm: unroll and optimize padded activation loop --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> </details> **macOS/iOS:** -
github:ggerganov/llama.cpp
# Release v5.8.0 ## New Model additions ### DeepSeek-V4 <img width="6604" height="3574" alt="image" src="https://github.com/user-attachments/assets/4c0fdb29-f770-463c-a97b-d24438896a4c" /> DeepSeek-V4 is the next-generation MoE (Mixture of Experts) language model fr
github:huggingface/transformers
- 2026-05-05 ggerganov/llama.cpp b9033: b9033
<details open> sync : ggml </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9033/llama-b9033-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releas
github:ggerganov/llama.cpp
- 2026-05-05 ggerganov/llama.cpp b9031: b9031
<details open> common : only load backends when required (#22290) * common : only load backends when required Signed-off-by: Adrien Gallouët <angt@huggingface.co> * llama : call ggml_backend_load_all() directly from llama_backend_init() Signed-off-by: Adrien Gallouët <angt@h
github:ggerganov/llama.cpp
- 2026-05-05 ggerganov/llama.cpp b9028: b9028
<details open> llama : add option to save memory in device buffers (#22679) * llama : add option to save memory in device buffers * tests : extend llama-save-load-state </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/d
github:ggerganov/llama.cpp
- 2026-05-05 ggerganov/llama.cpp b9026: b9026
<details open> ggml : implement fast walsh-hadamard transform for kv rotation (#21352) (#22631) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9026/llama-b9026-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (a
github:ggerganov/llama.cpp
- 2026-05-04 ROCm/ROCm rocm-7.2.3: ROCm 7.2.3 Release
AMD github:ROCm/ROCm