A developer report on the LocalLLaMA subreddit (source) indicates that on Jetson AGX Orin 64GB, running Qwen3.6-27B-MTP-GGUF with llama.cpp yields lower prefill performance for q6_k (190 pp) than for q8_0 (245 pp) and even q4_k_xl (210 pp).
signal brief
A developer report on the LocalLLaMA subreddit (source) indicates that on a Jetson AGX Orin 64GB, running Qwen3.6-27B-MTP-GGUF with llama.cpp yields lower prefill performance for q6_k (190 pp) than for q8_0 (245 pp) and even q4_k_xl (210 pp). The user notes that the EMC (external memory controller) is not saturated, which suggests memory bandwidth is not the bottleneck and that the cause may instead be less-optimized CUDA kernels for lower-bit quantizations on this platform. The observation is single-source and limited to one model, but it hints at a developer experience issue: lower-bit quantizations, which users commonly expect to be faster, underperform the 8-bit format here. If confirmed, this could indicate that CUDA kernel optimization for edge inference is lagging, which would affect developer trust in, and adoption of, Jetson platforms for LLM workloads. Because the evidence is a single anecdote, confidence is low and the signal is weak, but it points to a possible negative trend in CUDA's edge ecosystem.
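To make the size of the reported gap concrete, the following minimal Python sketch computes each quantization's prefill rate relative to q8_0. It uses only the three figures quoted above; the function name `relative_gap` is hypothetical, and the sketch assumes the "pp" values are directly comparable throughput numbers (the brief does not state their units or the test settings).

```python
# Prefill figures exactly as quoted in the brief ("pp" values).
# Assumption: all three were measured under the same settings; units are not stated.
reported_pp = {"q8_0": 245.0, "q6_k": 190.0, "q4_k_xl": 210.0}


def relative_gap(pp, baseline):
    """Return each entry's fractional difference from the baseline entry."""
    base = pp[baseline]
    return {name: (value - base) / base for name, value in pp.items()}


gaps = relative_gap(reported_pp, "q8_0")

# Print fastest first, with the gap versus q8_0 as a signed percentage.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {reported_pp[name]:6.1f} pp  {gap:+.1%} vs q8_0")
```

On these numbers, q6_k comes out roughly 22% below q8_0 and q4_k_xl roughly 14% below it. The gaps are large enough to be worth reproducing, but they still rest on a single report.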
evidence
- https://www.tomshardware.com/desktops/gaming-pcs/save-usd1-280-on-this-4k-ready-alienware-aurora-desktop-pc-with-rtx-5080-high-performance-gaming-at-your-fingertips-for-usd2-919 (web)
- https://blogs.nvidia.com/blog/cvpr-research-grasping-driving-agent-training/ (web)
- https://www.hpcwire.com/off-the-wire/anyon-and-q-ctrl-bring-self-calibrating-quantum-systems-to-enterprise-data-centers/ (web)
- https://www.reddit.com/r/LocalLLaMA/comments/1twgwrf/jetson_agx_orin_64gb_q8_0_good_q6_k_bad/ (reddit)
- https://developer.nvidia.com/cuda-toolkit (web)
- https://manifold.markets/_deleted_/will-cuda-remain-a-monopoly-for-gpu (web)
Decision support, not stock advice. This signal is research with cited evidence — not a recommendation to buy, sell, or hold any security.