Optimize memcmp for Kunpeng 950 with SVE

This patch optimizes memcmp for the Kunpeng 950 using SVE, with reported speedups of 15% to 50% across small to large inputs.

By Weihong Ye <[email protected]>, April 17, 2026. Sentiment: 7 / 10

The patch optimizes memcmp for the Kunpeng 950 by combining SVE predication, 4-way loop unrolling, and optimized mismatch detection. Following review feedback, it uses cntb instead of rdvl and adopts mul vl addressing. Benchmarks show significant speedups, but regressions may occur when offsets fall near 4K boundaries.

Technical Tradeoffs

Uses SVE predication for branch-free handling of short inputs and tails.
Implements 4-way loop unrolling to maximize pipeline utilization.
Optimizes mismatch detection with early exit logic.
Potential regressions in edge cases where offsets are near 4K boundaries.
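
The predication and early-exit points above can be illustrated with a minimal C sketch. This is not the patch's code: the patch is AArch64 assembly, while this sketch uses ACLE SVE intrinsics, and the function name sketch_memcmp is hypothetical. It shows a single-vector loop only (no 4-way unrolling) and falls back to a scalar loop when SVE is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical illustration, not the code from the patch. */
static int sketch_memcmp(const void *a, const void *b, size_t n)
{
    const uint8_t *pa = (const uint8_t *)a;
    const uint8_t *pb = (const uint8_t *)b;

#if defined(__ARM_FEATURE_SVE)
    /* svcntb() returns the vector length in bytes (the cntb instruction). */
    const uint64_t vl = svcntb();

    for (uint64_t i = 0; i < (uint64_t)n; i += vl) {
        /* Predicate is true only for lanes with i + lane < n, so short
           inputs and tails need no separate scalar path. */
        svbool_t pg = svwhilelt_b8_u64(i, (uint64_t)n);

        /* Inactive lanes are not loaded and read as zero. */
        svuint8_t va = svld1_u8(pg, pa + i);
        svuint8_t vb = svld1_u8(pg, pb + i);

        /* Per-lane "not equal" result, restricted to active lanes. */
        svbool_t ne = svcmpne_u8(pg, va, vb);

        /* Early exit as soon as any active lane differs. */
        if (svptest_any(pg, ne)) {
            /* Keep only the lanes before the first mismatch, then count
               them to get the mismatch index within this vector. */
            svbool_t before = svbrkb_b_z(pg, ne);
            uint64_t k = svcntp_b8(pg, before);
            return (int)pa[i + k] - (int)pb[i + k];
        }
    }
    return 0;
#else
    /* Scalar fallback so the sketch builds on targets without SVE. */
    for (size_t i = 0; i < n; i++) {
        if (pa[i] != pb[i])
            return (int)pa[i] - (int)pb[i];
    }
    return 0;
#endif
}
```

An unrolled variant would issue several such loads per iteration. In ACLE, svld1_vnum_u8 takes a vector-count offset that corresponds to the mul vl addressing form, though how the patch arranges its four loads is not shown in this summary.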

Filed Under: aarch64, memcmp, SVE, optimization, kunpeng950
