github
GitHub API
Keeps: repo, release, stars delta
- 2026-07-01 ggerganov/llama.cpp b9859: b9859
<details open> opencl: allow loading precompiled binary kernels from library (#23042) * opencl: allow loading binary kernel * opencl: add libdl.h * ggml-backend-dl is in ggml, which depends backend libs, thus ggml-opencl cannot depend on ggml-backend-dl * add libdl.h to bre
github:ggerganov/llama.cpp
- 2026-07-01 ggerganov/llama.cpp b9858: b9858
<details open> common : use hf primary split as model path (#25194) Fixes #25181 </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9858/llama-b9858-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI
github:ggerganov/llama.cpp
- 2026-07-01 ggerganov/llama.cpp b9857: b9857
<details open> hexagon: flash attention rework (optimizations, accuracy improvements, etc) (#25085) * hex-mm: fold mm quant tasks into the main matmul threads * hex-mm: minor formatting fixes * hex-mm: cleanup is_quant checks in dma dispatch * hex-mm: fix dst-spad alignment
github:ggerganov/llama.cpp
- 2026-07-01 ggerganov/llama.cpp b9856: b9856
<details open> CUDA: consistent use of __restrict__ + PDL for FA (#25185) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9856/llama-b9856-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled)
github:ggerganov/llama.cpp
- 2026-07-01 ggerganov/llama.cpp b9855: b9855
<details open> ggml-cpu: add AVX2 optimization for nvfp4 dot product and use UE4M3 LUT (#23961) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9855/llama-b9855-bin-macos-arm64.tar.gz) - macOS Apple Silicon (ar
github:ggerganov/llama.cpp
- 2026-07-01 ggerganov/llama.cpp b9852: b9852
<details open> opencl: initial q1_0 support (#25160) * opencl: general q1_0 support * opencl: add Adreno GEMM/GEMV for q1_0 </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9852/llama-b9852-bin-macos-arm64.tar
github:ggerganov/llama.cpp
- 2026-06-30 ggerganov/llama.cpp b9851: b9851
<details open> cuda : prevent integer truncation and overflow errors when using KQ mask strides in flash_attn_mask_to_KV_max kernel (#24945) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/gg
github:ggerganov/llama.cpp
- 2026-06-30 ggerganov/llama.cpp b9850: b9850
<details open> model : register t_layer_inp for qwen3next (#25141) * Fix input assignment in layer processing loop Fix DFLASH for qwen-coder-next * add line break Added tensor for attention normalization in Qwen3 model. </details> **macOS/iOS:** - [macOS Apple Silicon (arm
github:ggerganov/llama.cpp
- 2026-06-30 ggerganov/llama.cpp b9849: b9849
<details open> common,server: handle bracketed IPv6 literals in URL authority (#25140) * common,server: handle bracketed IPv6 literals in URL authority Parse the [host]:port form (RFC 3986) and bracket IPv6 hosts when formatting a URL authority: listening log, proxy Host heade
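The bracketed `[host]:port` handling described in this release note can be illustrated with a minimal Python sketch. It is not the llama.cpp implementation: `split_authority` and `format_authority` are hypothetical names, and the sketch assumes an authority with no userinfo part and a purely numeric port.

```python
from typing import Optional, Tuple


def split_authority(authority: str) -> Tuple[str, Optional[int]]:
    """Split 'host', 'host:port', '[v6]' or '[v6]:port' into (host, port).

    Hypothetical helper for illustration; assumes no userinfo ('user@').
    """
    if authority.startswith("["):
        # IP-literal form: everything up to the closing bracket is the host.
        end = authority.find("]")
        if end == -1:
            raise ValueError("unterminated '[' in authority")
        host = authority[1:end]
        rest = authority[end + 1:]
        if rest == "":
            return host, None
        if not rest.startswith(":"):
            raise ValueError("unexpected text after ']'")
        return host, int(rest[1:])

    # Unbracketed form: the last ':' (if any) separates host and port.
    host, sep, port = authority.rpartition(":")
    if sep == "":
        return authority, None
    return host, int(port)


def format_authority(host: str, port: Optional[int] = None) -> str:
    """Format host and port, bracketing hosts that contain ':' (IPv6)."""
    shown = f"[{host}]" if ":" in host else host
    return shown if port is None else f"{shown}:{port}"


if __name__ == "__main__":
    print(split_authority("[::1]:8080"))
    print(format_authority("::1", 8080))
```

Bracketing on output matters because an unbracketed IPv6 host followed by `:port` is ambiguous: the port cannot be told apart from the last address group.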
github:ggerganov/llama.cpp
- 2026-06-30 ggerganov/llama.cpp b9848: b9848
<details open> CUDA: fix get_rows_back for tables with more than 65535 rows (grid-y clamp + stride) (#25103) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9848/llama-b9848-bin-macos-arm64.tar.gz) - macOS Appl
github:ggerganov/llama.cpp
- 2026-06-30 ggerganov/llama.cpp b9847: b9847
<details open> CUDA: fix Gemma E4B MTP FlashAttention (#25148) * CUDA: fix Gemma E4B MTP FlashAttention * remove unused template declaration </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9847/llama-b9847-bi
github:ggerganov/llama.cpp
- 2026-06-30 ggerganov/llama.cpp b9846: b9846
<details open> vulkan: roll bk loop in matmul for asahi linux (#24663) * vulkan: roll bk loop in matmul for asahi linux * vulkan: fix inline comment * vulkan: revert BK-loop unroll change * vulkan: edit spirv directly for asahi roll bk loop * vulkan: remove trailing whitesp
github:ggerganov/llama.cpp
- 2026-06-30 ggerganov/llama.cpp b9844: b9844
<details open> ggml-webgpu: add support for NVFP4 (#25143) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9844/llama-b9844-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](htt
github:ggerganov/llama.cpp
- 2026-06-30 ggerganov/llama.cpp b9843: b9843
<details open> Revert "sched : reintroduce less synchronizations during split compute (#20793)" (#25138) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9843/llama-b9843-bin-macos-arm64.tar.gz) - macOS Apple Si
github:ggerganov/llama.cpp
- 2026-06-29 vllm-project/vllm v0.24.0: v0.24.0
# vLLM v0.24.0 Release Notes ## Highlights This release features 571 commits from 256 contributors (77 new)! * **MiniMax-M3**: Added support for the new **MiniMax-M3** model (#45381), with a fast follow-on of BF16/FP8 indexer via MSA (#45892), MXFP4 support (#45896), FP8
github:vllm-project/vllm
- 2026-06-29 ggerganov/llama.cpp b9842: b9842
<details open> common : dedup preset and cached model entries in /v1/models (#25131) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9842/llama-b9842-bin-ma
github:ggerganov/llama.cpp
- 2026-06-29 ggerganov/llama.cpp b9840: b9840
<details open> DeepSeek V4 (#24162) * convert: add dsv4 conversion * add basic setup * add llm_graph_input_dsv4 * add save-load state * add sinkhorn eps - correction by @fairydreaming * add rope fix * cleanup dead code * fix bugs * support pro model: added by @fairydre
github:ggerganov/llama.cpp
- 2026-06-29 ggerganov/llama.cpp b9839: b9839
<details open> tools/ui: restore Tailwind scanning in ignored worktrees (#24879) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9839/llama-b9839-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI e
github:ggerganov/llama.cpp
- 2026-06-29 ggerganov/llama.cpp b9838: b9838
<details open> common : remove unused regex-partial (#25118) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9838/llama-b9838-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](h
github:ggerganov/llama.cpp
- 2026-06-29 ggerganov/llama.cpp b9837: b9837
<details open> jinja, chat: add --reasoning-preserve flag (#25105) * jinja, chat: add --reasoning-preserve flag * correct help message </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9837/llama-b9837-bin-maco
github:ggerganov/llama.cpp
- 2026-06-28 ggerganov/llama.cpp b9835: b9835
<details open> ui: fix stop and reasoning skip in single-model mode (#25084) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9835/llama-b9835-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabl
github:ggerganov/llama.cpp
- 2026-06-28 ggerganov/llama.cpp b9833: b9833
<details open> chat : implement minicpm5 parser (#24889) * Add minicpm5 tool call parser * Refactor MiniCPM5 PEG parser per review feedback * Fix jinja min/max API to match Jinja2 * modify by review * MiniCPM5: use autoparser for XML tool calls and fix grammar preserved-tok
github:ggerganov/llama.cpp
- 2026-06-28 ggerganov/llama.cpp b9832: b9832
<details open> jinja: add --dump-prog for debugging (#25086) * jinja: add --dump-prog for debugging * Update common/jinja/runtime.cpp Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com> --------- Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.nore
github:ggerganov/llama.cpp
- 2026-06-28 ggerganov/llama.cpp b9831: b9831
<details open> spec : add DFlash support (#22105) * spec: add DFlash v2 support * dflash: support sliding window attention per layer_types * docs: add dflash section --------- Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> </details> **macOS/iOS:** - [macOS Apple S
github:ggerganov/llama.cpp
- 2026-06-28 ggerganov/llama.cpp b9830: b9830
<details open> common : allow --offline in llama download (#25091) Expose the existing --offline flag to `llama download` so a script can run it to check whether a model is already cached and ready to be served without touching the network. Also fix a latent use-after-free in
github:ggerganov/llama.cpp
- 2026-06-28 ggerganov/llama.cpp b9829: b9829
<details open> logs : reduce v2 (#25078) * server : reduce logs * cont : common * cont : spec * cont : CMN_ -> COM_ </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9829/llama-b9829-bin-macos-arm64.tar.gz) -
github:ggerganov/llama.cpp
- 2026-06-27 ggerganov/llama.cpp b9828: b9828
<details open> opencl: flash attention improvement (#25069) * opencl: rework FA kernel for f16 and f32 * opencl: flash-attention prefill prepass kernels - flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple - flash_attn_mask_pad_f16 pads the matching mask ti
github:ggerganov/llama.cpp
- 2026-06-27 ggerganov/llama.cpp b9827: b9827
<details open> [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057) * [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies. When tensors are not
github:ggerganov/llama.cpp
- 2026-06-27 ggerganov/llama.cpp b9826: b9826
<details open> sycl : fix failed ut cases of norm (#25044) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9826/llama-b9826-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](htt
github:ggerganov/llama.cpp
- 2026-06-27 ggerganov/llama.cpp b9825: b9825
<details open> vulkan: fix step operator for 0 input (#25036) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9825/llama-b9825-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](
github:ggerganov/llama.cpp
- 2026-06-27 ggerganov/llama.cpp b9824: b9824
<details open> binaries : Improve rpc-server and export-graph-ops names. (#25045) Tests are generally prefixed with -test, so rename export-graph-ops accordingly. rpc-server is probably too generic a name for /usr/bin. Because it should work with any ggml application, it is re
github:ggerganov/llama.cpp
- 2026-06-27 ggerganov/llama.cpp b9823: b9823
<details open> ci : add windows-openvino to check-release (#25022) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9823/llama-b9823-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISAB
github:ggerganov/llama.cpp
- 2026-06-27 ggerganov/llama.cpp b9822: b9822
<details open> tests : fix test-chat-template --no-common option (#25075) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9822/llama-b9822-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled)
github:ggerganov/llama.cpp
- 2026-06-26 ggerganov/llama.cpp b9821: b9821
<details open> app : allow --version, --licenses & --help (#25054) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9821/llama-b9821-bin-macos-arm64.tar.gz)
github:ggerganov/llama.cpp
<details><summary>Changelog Details</summary> - fix(ci): replace actions/setup-python with apt-get to avoid 429 rate limits by @ko3n1g :: PR: #4072 - ci: Fix package name for code-freeze workflow by @ko3n1g :: PR: #4077 - chore: bump `_code_freeze` workflow to `v0.86.0` by @k
NVDA github:NVIDIA/Megatron-LM
- 2026-06-21 NVIDIA/cutlass v4.2.2: CUTLASS 4.2.2
### CUTLASS C++ * Make [version.h](https://github.com/NVIDIA/cutlass/blob/release/4.2/include/cutlass/version.h) NVRTC JIT compilation compatible. * Allow linking large cutlass library on 64bit platform. * Fix alignment-related miscalculation for pipeline stages of Blackwell b
NVDA github:NVIDIA/cutlass
- 2026-06-06 ggerganov/llama.cpp b9544: b9544
<details open> common/chat : fix LFM2/LFM2.5 reasoning round-trip and <think> leak (#24234) * common/chat : fix LFM2 reasoning round-trip and stray <think> leak * Gate by reasoning format and whether the template supports <think> </details> **macOS/iOS:** - [macOS Apple Silic
github:ggerganov/llama.cpp
- 2026-06-06 ggerganov/llama.cpp b9543: b9543
<details open> mtmd: support "frame merge" for qwen-vl-based models (#21858) * feat: add video support for Qwen3.5 * various clean up * revise the design * fix llava-uhd case * nits * nits 2 --------- Co-authored-by: andrewmd5 <1297077+andrewmd5@users.noreply.github.com>
github:ggerganov/llama.cpp
- 2026-06-06 ggerganov/llama.cpp b9542: b9542
<details open> completion : remove useless statics (#24226) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9542/llama-b9542-bin-macos-arm64.tar.gz) - macOS
github:ggerganov/llama.cpp
- 2026-06-06 ggerganov/llama.cpp b9541: b9541
<details open> completion : fix format specifier in LOG_INF (#24213) Signed-off-by: Adrien Gallouët <angt@huggingface.co> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9541/llama-b9541-bin-macos-arm64.tar.gz
github:ggerganov/llama.cpp
- 2026-06-06 ggerganov/llama.cpp b9538: b9538
<details open> model : rename local n_layer_all variable (#24209) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9538/llama-b9538-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABL
github:ggerganov/llama.cpp
- 2026-06-06 ggerganov/llama.cpp b9537: b9537
<details open> context : fix off-by-one comparisons to n_gpu_layers (#24208) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9537/llama-b9537-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabl
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9536: b9536
<details open> opencl: improve get_rows, cpy, concat and q6_k flat gemv (#24160) * opencl: allow multiple workgroups for large rows * opencl: improve small cpy * opencl: packed concat for small input * opencl: tweak flat q6_K gemv, increase N_DST and remap threads </details
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9535: b9535
<details open> common/chat : unify and fix LFM2/LFM2.5 tool parser (#24178) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9535/llama-b9535-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enable
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9534: b9534
<details open> vulkan: add fwht support for Intel with shmem reduction (#23964) * vulkan: add fwht support for Intel with shmem reduction * don't use N as workgroup size * disable subgroup shuffle on MoltenVK AMD * disable fwht shader on Intel Windows due to driver bug </de
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9533: b9533
<details open> model: fix build failed (#24193) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9533/llama-b9533-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](https://github
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9531: b9531
<details open> TP: round up granularity to 128 (#24180) * TP: round up granularity to 128 * remove assert </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9531/llama-b9531-bin-macos-arm64.tar.gz) - macOS Apple
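For reference, "rounding up to 128" is ordinary round-up-to-a-multiple arithmetic, sketched below in Python. The release note does not say which quantity is rounded or how the change is implemented, so `round_up` and its default of 128 are only an illustrative assumption.

```python
def round_up(value: int, multiple: int = 128) -> int:
    """Return the smallest multiple of `multiple` that is >= `value`.

    Hypothetical helper; the default of 128 mirrors the release title only.
    """
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # Adding (multiple - 1) before integer division rounds up instead of down.
    return ((value + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    for n in (1, 128, 129, 300):
        print(n, "->", round_up(n))
```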
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9530: b9530
<details open> cli: fix model params not propagated (#23893) Fixes #23847 </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9530/llama-b9530-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9529: b9529
<details open> model : fix llama_model::n_gpu_layers() (#24188) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9529/llama-b9529-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED
github:ggerganov/llama.cpp
- 2026-06-05 ggerganov/llama.cpp b9528: b9528
<details open> ui: run npm install when package-lock.json is newer than node_modules (#24171) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9528/llama-b9528-bin-macos-arm64.tar.gz) - macOS Apple Silicon (arm6
github:ggerganov/llama.cpp