github
GitHub API
Keeps: repo, release, stars delta
- 2026-04-28 ggerganov/llama.cpp b8956: b8956
<details open> CANN: add new ops, optimize existing ops (#21204) New operators: - GGML_OP_SET: implement via aclnnInplaceCopy on target region - GGML_OP_CUMSUM: implement via aclnnCumsum - GGML_OP_FILL: implement via aclnnInplaceFillScalar - GGML_OP_DIAG: implement via aclnnInp
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8955: b8955
<details open> spec : refactor params (#22397) * spec : refactor params * cont : fix * cont : rename "sparam" to "sampling" * cont : add spec params category * cont : add info about removed arguments * cont : skip param length check for spec params * cont : adapt server t
github:ggerganov/llama.cpp
- 2026-04-28 ggerganov/llama.cpp b8953: b8953
<details open> ggml-webgpu: add Q1_0 support (#22374) * add fast matmul matvec q1_0 kernel * ggml-webgpu: drop redundant zero-fills in Q1_0 shmem init </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8953/llam
github:ggerganov/llama.cpp
- 2026-04-27 ggerganov/llama.cpp b8952: b8952
<details open> server: (router) Forward form-data to model server (Fixes #22044) (#22118) * This commit enables the router to forward form-data to model server. Fixes #22044 (enabling to use the /v1/audio/transcriptions in router mode) * * Applied the suggestion from Copilots
github:ggerganov/llama.cpp
- 2026-04-27 vllm-project/vllm v0.20.0: v0.20.0
# vLLM v0.20.0
## Highlights
This release features 752 commits from 320 contributors (123 new)!
* **DeepSeek V4**: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the share
github:vllm-project/vllm
- 2026-04-27 ggerganov/llama.cpp b8951: b8951
<details open> add fast mat-vec kernels for i-quants (#22344) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8951/llama-b8951-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://g
github:ggerganov/llama.cpp
- 2026-04-27 ggerganov/llama.cpp b8950: b8950
<details open> Additional test for common/gemma4 : handle parsing edge cases (#22420) * Additional test for common/gemma4 : handle parsing edge cases * Move tests to Gemma 4 test group </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llam
github:ggerganov/llama.cpp
- 2026-04-27 ggerganov/llama.cpp b8949: b8949
<details open> fix: rpc-server cache may not work in Windows environments (#22394) * fix: create directory and log cache file name. * Remove GGML_LOG_INFO conditional compilation. --------- Co-authored-by: kotaro <kotaro.kusunoki@gmail.com> </details> **macOS/iOS:** - [mac
github:ggerganov/llama.cpp
- 2026-04-27 ggerganov/llama.cpp b8948: b8948
<details open> Fix type casting for unaccounted memory calculation (#22424) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8948/llama-b8948-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabl
github:ggerganov/llama.cpp
- 2026-04-27 ggerganov/llama.cpp b8947: b8947
<details open> download : prefer q8_0 when q4_k not available (#22428) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8947/llama-b8947-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](
github:ggerganov/llama.cpp
- 2026-04-27 ggerganov/llama.cpp b8946: b8946
<details open> model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) (#22421) Signed-off-by: Yash Nankani <ynankani@nvidia.com> </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8946/llama-b8946-bi
github:ggerganov/llama.cpp
- 2026-04-27 ggerganov/llama.cpp b8944: b8944
<details open> ggml : use 64 bytes aligned tile buffers (#21058) | Model | Test | t/s OLD | t/s NEW | Speedup | |:---------------------------------|:-------|----------:|----------:|----------:| | qwen35 0.8B BF16 | pp512 |
github:ggerganov/llama.cpp
- 2026-04-27 ggerganov/llama.cpp b8943: b8943
<details open> common: fix missing exports in llama-common (#22340) * common: refactor common/debug to move abort_on_nan into base_callback_data Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO. It should jus
github:ggerganov/llama.cpp
- 2026-04-26 ggerganov/llama.cpp b8941: b8941
<details open> add performance-portable tuning for register-tile and subgroup matmul (#22241) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8941/llama-b8941-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm
github:ggerganov/llama.cpp
- 2026-04-26 ggerganov/llama.cpp b8940: b8940
<details open> Fix recurrent state serialization for partial reads and writes (#22362) The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server. </details> **macOS/iOS:** -
github:ggerganov/llama.cpp
- 2026-04-26 ggerganov/llama.cpp b8937: b8937
<details open> ggml-cpu : re-enable fast gelu_quick_f16 (#22339) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8937/llama-b8937-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https:
github:ggerganov/llama.cpp
- 2026-04-26 ggerganov/llama.cpp b8936: b8936
<details open> ggml-cpu: optimize avx2 q6_k (#22345) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8936/llama-b8936-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com
github:ggerganov/llama.cpp
- 2026-04-26 ggerganov/llama.cpp b8935: b8935
<details open> opencl: add iq4_nl support (#22272) * opencl: add general support for iq4_nl * opencl: add iq4_nl gemm/gemv for adreno * opencl: pack 2 lut entries into a uint </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/rel
github:ggerganov/llama.cpp
- 2026-04-26 ggerganov/llama.cpp b8934: b8934
<details open> hexagon: guard HMX clock request for v75+ platforms (#22377) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8934/llama-b8934-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabl
github:ggerganov/llama.cpp
- 2026-04-25 ggerganov/llama.cpp b8933: b8933
<details open> chat: fix handling of space in reasoning markers (#22353) * chat: fix handling of space in reasoning markers * fix tests * whitespace </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8933/llama
github:ggerganov/llama.cpp
- 2026-04-25 ggerganov/llama.cpp b8931: b8931
<details open> CUDA: reduce MMQ stream-k overhead (#22298) * CUDA: reduce MMQ stream-k overhead * use 32 bit integers for kbc </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8931/llama-b8931-bin-macos-arm64.t
github:ggerganov/llama.cpp
- 2026-04-25 ggerganov/llama.cpp b8929: b8929
<details open> llama-quant : default ftype param `Q5_1` --> `Q8_0` (#20828) Change the default `ftype` in `llama_model_quantize_params` from `LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`. In case some external program naively uses the default quantization params, we s
github:ggerganov/llama.cpp
- 2026-04-25 ggerganov/llama.cpp b8927: b8927
<details open> [SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (#22291) * opt arc770 for Q4_0 * add for Q4_0 * update the script * add help script for windows * update guide * fix format issue * convert from dos to unix for format issue * fix missed -sm parameter <
github:ggerganov/llama.cpp
- 2026-04-25 ggerganov/llama.cpp b8926: b8926
<details open> ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (#22327) * Implement ssm_scan * Remove blocking in graph_compute and check for set rows * Fix bindings * Update op support </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https:/
github:ggerganov/llama.cpp
- 2026-04-24 ggerganov/llama.cpp b8925: b8925
<details open> parser: fix structured output bug (#22302) * fix very stupid structured output bug * Things just cannot be too easy. </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8925/llama-b8925-bin-macos-a
github:ggerganov/llama.cpp
- 2026-04-24 ggerganov/llama.cpp b8924: b8924
<details open> Hexagon: Bump HMX Frequency to Max Corner (#22334) * hexagon: bump HMX freq to max corner * hex-mm: fix error in log msg </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8924/llama-b8924-bin-mac
github:ggerganov/llama.cpp
- 2026-04-24 ggerganov/llama.cpp b8922: b8922
<details open> ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (#22199) * ggml-webgpu: add tile flash attention fallback * ggml-webgpu: add new fields and discard usage of mnk for tile version * ggml-webgpu: modify the vec path to discard the mnk parame
github:ggerganov/llama.cpp
- 2026-04-24 ggerganov/llama.cpp b8920: b8920
<details open> metal : print GPU description (#22318) </details> **macOS/iOS:** - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b8920/llama-b8920-bin-macos-arm64.tar.gz) - [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.co
github:ggerganov/llama.cpp
# Patch release v5.6.2
Qwen 3.5 and 3.6 MoE (text-only) were broken when used with FP8. They should now work again with this release :saluting_face:
* Fix configuration reading and error handling for kernels (https://github.com/huggingface/transformers/pull/45610) by @hmellor
*
github:huggingface/transformers
- NVDA github:NVIDIA/Megatron-LM
# Patch release v5.6.1
Flash attention path was broken! Sorry everyone for this one 🤗
* Fix AttributeError on s_aux=None in flash_attention_forward (https://github.com/huggingface/transformers/pull/45589) by @jamesbraza
github:huggingface/transformers
# Release v5.6.0
## New Model additions
### OpenAI Privacy Filter
OpenAI Privacy Filter is a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text. It is intended for high-throughput data sanitization workf
github:huggingface/transformers
- 2026-04-20 NVIDIA/TensorRT-LLM v1.2.1: v1.2.1
## Highlights
- **Fixed Issue**
  - Fixed an issue that caused KV cache corruption (#12770)
- **Infrastructure Changes**
  - Upgraded xgrammar and flashinfer (#12811)
NVDA github:NVIDIA/TensorRT-LLM