github
GitHub API
Keeps: repo, release, stars delta
- 2026-05-17 ggerganov/llama.cpp b9196: b9196
<details open> vulkan: Support unaligned tensors for ROPE (#22637) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9196/llama-b9196-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](http
AAPL github:ggerganov/llama.cpp
- 2026-05-17 ggerganov/llama.cpp b9194: b9194
<details open> vulkan: fuse SSM_CONV + BIAS + SILU (#22653) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9194/llama-b9194-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://git
AAPL github:ggerganov/llama.cpp
- 2026-05-16 ggerganov/llama.cpp b9190: b9190
<details open> server: (router) alloc tmp buffer on heap (#23159) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9190/llama-b9190-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https
AAPL github:ggerganov/llama.cpp
- 2026-05-16 ggerganov/llama.cpp b9189: b9189
<details open> server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9189/llama-b9189-bin-macos-arm64.tar.gz) - [macOS Ap
AAPL github:ggerganov/llama.cpp
- 2026-05-16 ggerganov/llama.cpp b9186: b9186
<details open> sync : ggml </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9186/llama-b9186-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releas
AAPL github:ggerganov/llama.cpp
- 2026-05-16 ggerganov/llama.cpp b9181: b9181
<details open> vendor : update cpp-httplib to 0.45.0 (#23103) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9181/llama-b9181-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://g
AAPL github:ggerganov/llama.cpp
- 2026-05-16 ggerganov/llama.cpp b9180: b9180
<details open> llama + spec: MTP Support (#22673) * spec: support MTP * fix batch size * rename files * cont : simplify (#7) * MTP: clean-up (#9) * MTP: clean-up * review: use llama_context_type instead of llama_graph_type * review: remove llama_model_has_mtp * review:
AAPL github:ggerganov/llama.cpp
- 2026-05-16 ggerganov/llama.cpp b9174: b9174
<details open> ui: Restructure repo to use `tools/ui` folder and `ui` / `UI` / `llama-ui` / `LLAMA_UI` naming (#23064) * webui: Move static build output from `tools/server/public` to `build/ui` directory * refactor: Move to `tools/ui` * refactor: rename CMake variables and pr
AAPL github:ggerganov/llama.cpp
- 2026-05-15 ggerganov/llama.cpp b9173: b9173
<details open> ci : fix release symlinks (#23119) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9173/llama-b9173-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/gg
AAPL github:ggerganov/llama.cpp
- 2026-05-15 ggerganov/llama.cpp b9172: b9172
<details open> webui: Use lowercase hash for HF checksum check (#23107) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9172/llama-b9172-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)]
AAPL github:ggerganov/llama.cpp
- 2026-05-15 ggerganov/llama.cpp b9169: b9169
<details open> mtmd: add chunks and fix preproc for qwen3a (#23073) * mtmd: add chunks and fix preproc for qwen3a * add attn_mask * limit mtmd_chunk size (avoid blow up memory) * correct audio tokens * re-order the set_input case * remove attn_mask </details> **macOS/iOS
AAPL github:ggerganov/llama.cpp
- 2026-05-15 ggerganov/llama.cpp b9165: b9165
<details open> ci : fix transform of top . entry in release archive (#23080) * fix transform of top . entry in release archive * simplify </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9165/llama-b9165-bin-m
github:ggerganov/llama.cpp
- 2026-05-15 ggerganov/llama.cpp b9163: b9163
<details open> reasoning-budget: clone should do a deep-copy (#23095) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9163/llama-b9163-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](h
github:ggerganov/llama.cpp
- 2026-05-15 ggerganov/llama.cpp b9161: b9161
<details open> Support for Codex CLI by skipping unsupported Responses tools (#23041) * Support for Codex CLI by skipping unsupported Responses tools * Warn on skipped Responses tools and preserve gpt-oss apply_patch rejection * Revert gpt-oss apply_patch special handling </
github:ggerganov/llama.cpp
- 2026-05-15 vllm-project/vllm v0.21.0: v0.21.0
## Highlights This release features 367 commits from 202 contributors (49 new)! * **Transformers v4 deprecated**: This release formally deprecates `transformers` v4 support (#40389). Users should migrate to `transformers` v5. * **C++20 build requirement**: vLLM now require
github:vllm-project/vllm
- 2026-05-15 ggerganov/llama.cpp b9159: b9159
<details open> ggml-hexagon: cpy: add contiguous fast-path in reshape copy (#23076) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9159/llama-b9159-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, Kleidi
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9158: b9158
<details open> HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (#22880) Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with the FP16 accumulation for VKQ the tiles they need to be 32 logical units long in direction of the attention head; for
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9156: b9156
<details open> ggml-webgpu: Enable NVIDIA self-hosted CI (#22976) * Enabel nvidia ci for webgpu * Address precision issues * fix placement * Relax more set_rows and div * Try relaxing all f16 * formatting and naming * Add comment explaining max_nmse_err logic Added comme
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9151: b9151
<details open> logs : reduce (#23021) * logs : reduce * args : fix envs * server : fix build * common : print verbosity level at start * server : clean-up logs * server : print prompt processing timings + sampling params * minor : whitespaces </details> **macOS/iOS:** -
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9150: b9150
<details open> ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (#22863) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9150/llama-b9150-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, Kl
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9148: b9148
<details open> unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110) * unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests - Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtrackin
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9145: b9145
<details open> SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (#21597) * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend. sycl::
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9144: b9144
<details open> ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (#23020) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9144/llama-b9144-bin-macos-arm64.tar.gz) -
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9143: b9143
<details open> Fix for issue #22974. Cast intermediate results to float before adding and casting the result to the destination type. Avoids half+half operator ambiguity. (#22994) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/r
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9142: b9142
<details open> opencl: add q5_0 and q5_1 MoE for Adreno (#22985) * opencl: add q5_0 moe support * opencl: add q5_1 moe support * opencl: avoid potential leak * opencl: suppress unused var warning when building for non-Adreno --------- Co-authored-by: Li He <lih@qti.qualcom
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9141: b9141
<details open> server, webui: accept continue_final_message flag for vLLM API compat (#23012) * server, webui: accept continue_final_message flag for vLLM API compat Add the continue_final_message body flag from the vLLM and transformers API. When set together with add_generat
github:ggerganov/llama.cpp
- 2026-05-14 ggerganov/llama.cpp b9140: b9140
<details open> opencl: fix crash when warming up MoE on Adreno (#22876) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9140/llama-b9140-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)]
github:ggerganov/llama.cpp
- 2026-05-13 ggerganov/llama.cpp b9139: b9139
<details open> flush the gpu profile timestamp before the queryset is overflowed (#22995) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9139/llama-b9139-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64,
github:ggerganov/llama.cpp
- 2026-05-13 ggerganov/llama.cpp b9134: b9134
<details open> download: do not exit() on error (#23008) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9134/llama-b9134-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github
github:ggerganov/llama.cpp
# PyTorch 2.12.0 Release Notes - [Highlights](#highlights) - [Backwards Incompatible Changes](#backwards-incompatible-changes) - [Deprecations](#deprecations) - [New Features](#new-features) - [Improvements](#improvements) - [Bug fixes](#bug-fixes) - [Performance](#perfo
github:pytorch/pytorch
- 2026-05-13 ggerganov/llama.cpp b9133: b9133
<details open> server, webui: support continue generation on reasoning models (#22727) * server, webui : support continue generation on reasoning models (#22727) Remove the throw blocking assistant prefill on reasoning models and orchestrate thinking tags around the prefilled
github:ggerganov/llama.cpp
- 2026-05-13 ggerganov/llama.cpp b9131: b9131
<details open> spec : update CLI arguments for better consistency (#22964) * spec : update CLI arguments for better consistency * cont : fix CLI arg message </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b913
github:ggerganov/llama.cpp
- 2026-05-13 NVIDIA/cutlass v4.5.0: CUTLASS 4.5.0
## CuTe DSL * New features - New Block API `block_copy()` to simplify TMA and S2T copy. Users can ignore detail about multicast and 2CTA partition for TMA by `block_copy()` and need not to invoke `tma_partition()`. And users can remove bulk of S2T initialization to simplify S
NVDA github:NVIDIA/cutlass
# Patch release v5.8.1 This release is mainly to fix the Deepseek V4 integration!!! * [fix] Add fatal_error to ContinuousBatchingManager s
github:huggingface/transformers
- 2026-05-13 ggerganov/llama.cpp b9128: b9128
<details open> hexagon: eliminate scalar VTCM loads via HVX splat helpers (#22993) * hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase * hmx-mm: optimize per-group scale handling * hmx-fa: optimize slope load from vtcm * hmx-fa: use aligned access w
github:ggerganov/llama.cpp
- 2026-05-12 ggerganov/llama.cpp b9127: b9127
<details open> opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill (#22755) * ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill * ggml-opencl: address Adreno xmem review comments * ggml-opencl: align xmem gemm kernel naming --------- Co-authored-by: Your Name <your@
github:ggerganov/llama.cpp
- 2026-05-12 ggerganov/llama.cpp b9124: b9124
<details open> mtmd, server, common: expose modalities to /v1/models (#22952) * mtmd, server, common: expose modalities to /v1/models * fix build * rename to mtmd_caps </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/d
github:ggerganov/llama.cpp
- 2026-05-12 ggerganov/llama.cpp b9123: b9123
<details open> ggml-webgpu: Enables running gpt-oss-20b (#22906) * Enable to run gpt-oss-20b and refactor mulmat-q * disable test-backend-ops in ubuntu-24-webgpu </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download
github:ggerganov/llama.cpp
- 2026-05-12 ggerganov/llama.cpp b9122: b9122
<details open> ggml-webgpu: address precision issues for multimodal (#22808) * fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32 * fix(unary): correct the gelu, gelu quick and gelu erf functions * fix(flash-attn-tile): fix the har
github:ggerganov/llama.cpp
- 2026-05-12 ggerganov/llama.cpp b9119: b9119
<details open> vulkan: Fix Windows performance regression on Intel GPU BF16 workloads for Xe2 and newer (#22461) * refactor * Use l_warptile only when coopamt is available for BF16 </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cp
github:ggerganov/llama.cpp
- 2026-05-12 ggerganov/llama.cpp b9118: b9118
<details open> vulkan: Check shared memory size for mmq shaders (#22693) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9118/llama-b9118-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)
github:ggerganov/llama.cpp
- 2026-05-12 ggerganov/llama.cpp b9116: b9116
<details open> mtmd: add MiMo v2.5 vision (#22883) * mimo-v2.5: vision support * mimo-v2.5: use fused qkv for vision * mimi-v2.5: fix f16 vision overflow * mimo-v2.5: comment cleanups * mimo-v2.5: Flash doesn't have mmproj more cleanup remember to use filter_tensors * mimo
github:ggerganov/llama.cpp
- 2026-05-12 ggerganov/llama.cpp b9114: b9114
<details open> metal : promote mul_mv/mul_mm batch divisors to function constants (#22711) * metal : promote mul_mv/mul_mm batch divisors to function constants * metal : take op directly in get_pipeline_mul_mv_ext </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](htt
github:ggerganov/llama.cpp
- 2026-05-11 ggerganov/llama.cpp b9113: b9113
<details open> opencl: add q4_1 MoE for Adreno (#22856) * Q4_1 MoE CLC pass sanity check * remove unnecessary code * opencl: remove unnecessary asserts and reformat * opencl: fix supports_op for q4_1 moe * q4_1 moe is supported by Adreno with certain shapes --------- Co-a
github:ggerganov/llama.cpp
- 2026-05-11 ggerganov/llama.cpp b9112: b9112
<details open> CUDA: handle OW > 65535 in im2col (2D and 3D) (#22944) `im2col_cuda` and `im2col_3d_cuda` both dispatch with `block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet at 11 s lands at
github:ggerganov/llama.cpp
- 2026-05-11 ggerganov/llama.cpp b9110: b9110
<details open> docs: fix metrics endpoint description in server README (#22879) * docs: fix metrics endpoint description in server README Required model query parameter for router mode described. Removed metrics: - llamacpp:kv_cache_usage_ratio - llamacpp:kv_cache_tokens Add
github:ggerganov/llama.cpp
- 2026-05-11 ggerganov/llama.cpp b9109: b9109
<details open> spec : parallel drafting support (#22838) * spec : refactor * spec : drop support for incompatible vocabs * spec : update common_speculative_init() * cont : pass seq_id * cont : dedup ctx_seq_rm_type * server : sketch the ctx_dft decode loop * server : draf
github:ggerganov/llama.cpp
- 2026-05-11 ggerganov/llama.cpp b9106: b9106
<details open> vulkan: Support asymmetric FA in scalar/mmq/coopmat1 paths (#22589) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9106/llama-b9106-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiA
github:ggerganov/llama.cpp
- 2026-05-11 ggerganov/llama.cpp b9105: b9105
<details open> CUDA: directly include cuda/iterator (#22936) Before, we relied on a transient import from `cub/cub.cuh`, which is bad practice to do as cub may not always expose cuda/iterator </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-or
github:ggerganov/llama.cpp
- 2026-05-11 ggerganov/llama.cpp b9103: b9103
<details open> vendor : update cpp-httplib to 0.44.0 (#22919) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9103/llama-b9103-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://g
github:ggerganov/llama.cpp