Qualcomm AI Engine Direct - Support LPAI Backend #18151
abhinaykukkadapu merged 8 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18151
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 7 Pending as of commit 25c600e with merge base 71334aa. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Force-pushed ff3c249 to 550ec45.
Force-pushed eaf3e9b to c7e6b61.
Hi @cccclai, @abhinaykukkadapu, this PR supports the LPAI backend on x86 and Arm (with RPC mode). After the LPAI backend is supported, it should be easy to extend non-RPC support for aDSP. Thanks!
- SXR2330P
- QCM6490
- QCS9100
You will need an adb-connected Android / Linux device running one of the Qualcomm SoCs listed in `QcomChipset`. Please navigate to [qc_schema.py](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/serialization/qc_schema.py).
    ],
    QnnConstants.OpConv2d.op_name,
)
class Conv2d(GeneralOpDef):
Curious: are we going to follow up with validation for the LPAI op def in subsequent PRs?
Yes, once QNN 2.45 is available, validation will be enabled in QNN and we will test using this version.
pl = len(partitions)
if backend == QnnExecuTorchBackendType.kLpaiBackend:
    assert (
        pl != 1
If we are enforcing full partition, shouldn't this be pl == 1?
Whoops, you are right. Let me fix it. Thanks for pointing it out.
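The corrected full-partition check discussed above could look like the sketch below. This is a hedged illustration only: the real code compares against `QnnExecuTorchBackendType.kLpaiBackend`, which is stubbed here as a plain string, and the helper name `check_full_partition` is hypothetical.

```python
# Hypothetical sketch of the corrected check: LPAI requires the whole
# graph to land in exactly one partition.
def check_full_partition(partitions, backend, lpai_backend="kLpaiBackend"):
    """Assert the graph was fully delegated when targeting LPAI."""
    if backend == lpai_backend:
        pl = len(partitions)
        assert pl == 1, f"LPAI backend expects exactly one partition, got {pl}"

# A single partition passes silently; other backends skip the check.
check_full_partition([object()], "kLpaiBackend")
check_full_partition([object(), object()], "kHtpBackend")
```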
@shewu-quic thanks for adding the LPAI backend, can you please rebase?
The compiled model is `./deeplab_v3/dlv3_qnn.pte`.
Note that the model is compiled for a specific backend (e.g., HTP), so you can specify the target backend via `--backend gpu` or `--backend lpai`. If not specified, it defaults to HTP.
Noob question: does the delegate in its current state support multiple QNN backends in a single graph? For example, can we mark some nodes for GPU and others for HTP?
We don't currently have a use case for this, but I think we can run some tests for this scenario. Thanks! May I ask whether you have any use case for it?
Not really, but I wanted to flush out any issues so we can advertise stronger support while staying within the QNN ecosystem. I'm not sure whether we have this, but I remember QnnMem.h was one direction for sharing memory between backends, so I was curious to know whether we support it, in case a use case appears.
This is not supported today but can be in the future once a couple of other in-progress features come in. We can discuss this at some point in the 2nd half of the year.
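The `--backend` selection described in the snippet above could be sketched with argparse as below. The flag name and choices follow the discussion; the actual example scripts in the repo may spell them differently.

```python
import argparse

# Minimal sketch of a backend-selection flag, defaulting to HTP when
# unspecified, as the docs snippet above describes.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--backend",
    choices=["htp", "gpu", "lpai"],
    default="htp",  # defaults to HTP when not specified
    help="Target QNN backend to compile the model for",
)

args = parser.parse_args(["--backend", "lpai"])
print(args.backend)  # → lpai
```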
Hi @abhinaykukkadapu, I was wondering: since you uploaded some files related to Claude, do you have any plans to use Claude for development or maintenance?
Hi @shewu-quic, thanks! These are some test skills we are experimenting with across ExecuTorch. This brings up an important direction that I wanted to discuss in tomorrow's meeting; we would definitely like to have the et-qnn related skills. Some skills I'm thinking of:
Let's chat more tomorrow in the sync.
Sure, I think that would be an interesting topic to include. If you can, please share in tomorrow's meeting how you plan to contribute to and manage the Claude file, and what the current use cases in the executorch repo are. Thanks!
Force-pushed e4a4677 to a404cdd.
@shewu-quic can you fix the lints? I will import and check the internal CI as this is a large diff, just to be on the safe side.
Could you let me know if you recommend having a separate SKILL.md for each skill, or should they all be combined into one SKILL.md?
Fixed. Thanks!
@abhinaykukkadapu has imported this pull request. If you are a Meta employee, you can view this in D99040563.
@shewu-quic there are a few internal build failures; as discussed yesterday, I'm adding the CI to catch buck and cmake drifts like these. Here is the patch for you.
Sure, I have applied it. Thank you.
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

# To support quantize op lowering in AOT
import executorch.extension.pybindings.portable_lib  # noqa[F401]
@shewu-quic I think this doesn't belong here. I'm trying to debug a few other failures for the internal CI after the patch; this is for the Python runtime and not for backends?
If I don't import this, I encounter the following error.
Reproduce command: python3 backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_linear -b build-android -s {device} -m SM8850 -a unit_test --backend lpai
[INFO 2026-04-07 09:36:21,233 __init__.py:390] Failed converting '<EdgeOpOverload: quantized_decomposed.quantize_per_tensor.default>: schema = quantized_decomposed::quantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor' to its out variant with error: 'SchemaKind.out variant of operator quantized_decomposed::quantize_per_tensor can't be found. We've found the schemas of all the overloads: ['quantized_decomposed::quantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor', 'quantized_decomposed::quantize_per_tensor.tensor(Tensor input, Tensor scale, Tensor zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor', 'quantized_decomposed::quantize_per_tensor.tensor2(Tensor input, Tensor scale, Tensor zero_point, Tensor quant_min, Tensor quant_max, ScalarType dtype) -> Tensor']'
[INFO 2026-04-07 09:36:21,233 __init__.py:390] Failed converting '<EdgeOpOverload: quantized_decomposed.dequantize_per_tensor.default>: schema = quantized_decomposed::dequantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype, *, ScalarType? out_dtype=None) -> Tensor' to its out variant with error: 'SchemaKind.out variant of operator quantized_decomposed::dequantize_per_tensor can't be found. We've found the schemas of all the overloads: ['quantized_decomposed::dequantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype, *, ScalarType? out_dtype=None) -> Tensor', 'quantized_decomposed::dequantize_per_tensor.tensor(Tensor input, Tensor scale, Tensor zero_point, int quant_min, int quant_max, ScalarType dtype, *, ScalarType? out_dtype=None) -> Tensor', 'quantized_decomposed::dequantize_per_tensor.tensor2(Tensor input, Tensor scale, Tensor zero_point, Tensor quant_min, Tensor quant_max, ScalarType dtype, *, ScalarType? out_dtype=None) -> Tensor']'
======================================================================
ERROR: test_qnn_backend_linear (__main__.TestQNNQuantizedOperator) (i=0)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/local/mnt/workspace/shewu/executorch/backends/qualcomm/tests/test_qnn_delegate.py", line 3756, in test_qnn_backend_linear
self.lower_module_and_test_output(module, sample_input)
File "/local/mnt/workspace/shewu/executorch/backends/qualcomm/tests/utils.py", line 592, in lower_module_and_test_output
exec_prog = delegated_program.to_executorch(
File "/local/mnt/workspace/shewu/executorch/exir/program/_program.py", line 116, in wrapper
return func(*args, **kwargs)
File "/local/mnt/workspace/shewu/executorch/exir/program/_program.py", line 1751, in to_executorch
new_gm_res = p(new_gm)
File "/local2/mnt/workspace/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/fx/passes/infra/pass_base.py", line 47, in __call__
res = self.call(graph_module)
File "/local/mnt/workspace/shewu/executorch/exir/passes/__init__.py", line 440, in call
raise RuntimeError(f"Missing out variants: {missing_out_vars}")
RuntimeError: Missing out variants: {'quantized_decomposed::dequantize_per_tensor', 'quantized_decomposed::quantize_per_tensor'}
It looks like the quantized kernel will be missing if I don't import it. Are you thinking about moving this import here?
Or should we wrap it in a try-except block? Something like:
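A hedged sketch of the try-except wrapping suggested above. The `portable_lib` module path comes from the diff; the `optional_import` helper name is hypothetical, and the point is simply that environments lacking the heavy runtime dependency degrade gracefully instead of failing at module import time.

```python
import importlib
import logging

def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logging.getLogger(__name__).warning("%s not available", module_name)
        return None

# Registers quantized op out-variants when present (module path from the PR diff).
portable_lib = optional_import("executorch.extension.pybindings.portable_lib")
```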
The problem is that the portable_lib dependency is huge and it has its own registration mechanism; if we do it here, many internal tests fail due to double registration.
I think `import executorch.kernels.quantized` is sufficient, as it registers the required AOT quant library: https://github.com/pytorch/executorch/blob/main/kernels/quantized/__init__.py
I think you also need to enable the cmake target -DEXECUTORCH_BUILD_KERNELS_QUANTIZED_AOT=ON
Update the buck target for quantizer from portable_lib to:
deps = [
    "//executorch/backends/transforms:decompose_sdpa",
    "//executorch/kernels/quantized:aot_lib",
    "//executorch/kernels/quantized:quantized_ops_lib",
],
@claude Review this PR and make sure buck files are updated properly; there are no tests in the CI which may signal buck issues.
Claude finished @digantdesai's task in 4m 51s.
Review: Buck/Build File Analysis for LPAI Backend PR
Buck Files Status
| CMake target | LPAI files included | Status |
|---|---|---|
| `qnn_device` | LpaiDevice.h, ${HOST_ARCHITECTURE_LPAI}/LpaiDevice.cpp | ✅ |
| `qnn_context` | LpaiContext.h, LpaiContext.cpp, LpaiContextCustomConfig.h, ${HOST_ARCHITECTURE_LPAI}/LpaiContextCustomConfig.cpp | ✅ |
| `qnn_graph` | LpaiGraph.h, LpaiGraph.cpp | ✅ |
| `qnn_backend` | LpaiBackend.h, LpaiBackend.cpp, LpaiBackendCustomConfig.h, LpaiBackendCustomConfig.cpp | ✅ |
| Headers exported | backends/lpai/*.h | ✅ |
The HOST_ARCHITECTURE_LPAI variable is set to lpai/${platform} where platform is host for x86_64 and target otherwise — matching the buck convention.
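The host/target selection convention described above can be sketched as below. This is an illustration of the stated rule only; the `lpai_source_dir` helper is hypothetical, and the real logic lives in CMake and buck, not Python.

```python
# Sketch of the convention: HOST_ARCHITECTURE_LPAI = lpai/${platform},
# where platform is "host" for x86_64 (AOT) builds and "target" otherwise.
def lpai_source_dir(host_arch: str) -> str:
    """Return the LPAI source subdirectory for a given host architecture."""
    platform = "host" if host_arch == "x86_64" else "target"
    return f"lpai/{platform}"

print(lpai_source_dir("x86_64"))   # → lpai/host
print(lpai_source_dir("aarch64"))  # → lpai/target
```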
Other Issues Found During Review
1. Naming inconsistency in runtime option member variables (QnnExecuTorchBackend.h:69-70): two member variables use `lapi` instead of `lpai`:
   RuntimeOption qnn_runtime_lapi_fps_{false, 0};        // should be lpai
   RuntimeOption qnn_runtime_lapi_ftrt_ratio_{false, 0}; // should be lpai
   while the remaining three use `lpai` correctly. The macro keys in QnnExecuTorch.h all use `LPAI` correctly (QNN_RUNTIME_LPAI_FPS, etc.), so this is just a member variable naming inconsistency. It works correctly because the key strings match, but it's confusing.
2. Typo in runtime option key (QnnExecuTorch.h:25):
   #define QNN_RUNTIME_LPAI_CLIENT_PERF_TYPE "qnn_runtime_lpai_clent_perf_type"
   `clent` should be `client`. This is the actual string key used at runtime, so it would need to match on both the set_option and get_option sides. Since both sides use the same macro, it's functionally correct but will be confusing for anyone using the string directly.
3. quantizer.py imports (flagged by @abhinaykukkadapu): the `executorch.extension.pybindings.portable_lib` and `executorch.kernels.quantized` imports at the top of quantizer.py pull in heavy runtime dependencies. This needs to be resolved: either move them to the call site that actually needs them, or add a clear explanation of why they must be at module level.
Summary
Buck files are properly updated for the LPAI backend. All new source files are covered by glob patterns in runtime/targets.bzl, and the quantizer deps in quantizer/targets.bzl match the new imports (though the imports themselves may need relocation per reviewer feedback). CMake and buck are consistent in their LPAI file coverage. The minor issues flagged above (naming inconsistency, typo, questionable imports) are not buck-build-breaking but should be cleaned up.
Hi @abhinaykukkadapu,
Just imported again, will monitor; meanwhile can you try this: #18151 (comment)
Updated. I tried this comment and it worked. It appears I didn't rebuild quantized_aot_lib, which resulted in it not loading correctly.
Awesome, just imported again, will look at ci in the morning. |
@claude Could you please check if the buck build is working?
Force-pushed 3ee6c1e to c01d384.
@claude Could you please check if the buck build is working?
Claude finished @abhinaykukkadapu's task in 1m 50s.
Buck Build File Analysis
| Pattern | Files matched | Status |
|---|---|---|
| backends/lpai/*.cpp | LpaiBackend.cpp, LpaiBackendCustomConfig.cpp, LpaiContext.cpp, LpaiGraph.cpp | OK |
| backends/lpai/host/*.cpp (AOT) / backends/lpai/target/*.cpp (device) | host/LpaiDevice.cpp, host/LpaiContextCustomConfig.cpp / target/LpaiDevice.cpp, target/LpaiContextCustomConfig.cpp | OK |
| backends/lpai/*.h | All 6 headers (LpaiBackend.h, LpaiBackendCustomConfig.h, LpaiContext.h, LpaiContextCustomConfig.h, LpaiDevice.h, LpaiGraph.h) | OK |
The host/target directory convention matches the existing pattern used by GPU, HTP, and IR backends. All 8 .cpp files and 6 .h files are covered.
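A coverage check like the one summarized above can be sketched with `fnmatch`. The patterns and file names are reproduced from the review table; the script itself is an illustration, not part of the buck build.

```python
import fnmatch

# Buck glob patterns from the review, and a sample of the LPAI sources
# they should cover.
patterns = [
    "backends/lpai/*.cpp",
    "backends/lpai/host/*.cpp",
    "backends/lpai/target/*.cpp",
    "backends/lpai/*.h",
]
files = [
    "backends/lpai/LpaiBackend.cpp",
    "backends/lpai/LpaiGraph.cpp",
    "backends/lpai/host/LpaiDevice.cpp",
    "backends/lpai/target/LpaiContextCustomConfig.cpp",
    "backends/lpai/LpaiBackend.h",
]

# Any file not matched by at least one pattern would be missing from the build.
uncovered = [f for f in files if not any(fnmatch.fnmatch(f, p) for p in patterns)]
assert not uncovered, f"files missing from buck globs: {uncovered}"
```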
quantizer/targets.bzl — Updated per review feedback
Current deps:
deps = [
    "//executorch/backends/transforms:decompose_sdpa",
    "//executorch/kernels/quantized:aot_lib",
    "//executorch/kernels/quantized:quantized_ops_lib",
],
This reflects the latest review feedback from @abhinaykukkadapu: portable_lib was removed and replaced with aot_lib. The quantizer.py import now only uses `import executorch.kernels.quantized` (wrapped in try/except), which aligns with the aot_lib and quantized_ops_lib deps. This looks correct.
The glob ["*.py", "*/*.py"] properly captures the new annotators/lpai_rules.py file.
aot/python/targets.bzl — No issues
The PyQnnManagerAdaptor target already had quantized:aot_lib and quantized:quantized_ops_lib, and it depends on runtime:runtime which now includes LPAI via the runtime targets.bzl glob patterns. No changes needed here.
CMake vs Buck consistency
All LPAI files present in backends/CMakeLists.txt are also covered by the buck globs in runtime/targets.bzl:
| CMake target | LPAI files | Buck coverage |
|---|---|---|
| `qnn_device` | LpaiDevice.h, ${platform}/LpaiDevice.cpp | backends/lpai/*.h + backends/lpai/{host,target}/*.cpp |
| `qnn_context` | LpaiContext.h, LpaiContext.cpp, LpaiContextCustomConfig.h, ${platform}/LpaiContextCustomConfig.cpp | backends/lpai/*.{h,cpp} + backends/lpai/{host,target}/*.cpp |
| `qnn_graph` | LpaiGraph.h, LpaiGraph.cpp | backends/lpai/*.{h,cpp} |
| `qnn_backend` | LpaiBackend.h, LpaiBackend.cpp, LpaiBackendCustomConfig.h, LpaiBackendCustomConfig.cpp | backends/lpai/*.{h,cpp} |
CMake uses HOST_ARCHITECTURE_LPAI = lpai/${platform} where platform = host for x86_64 and target otherwise — this matches the buck include_aot_qnn_lib conditional exactly. Consistent.
Key difference: CMake is more granular, buck is monolithic
Note that CMake splits LPAI files across multiple fine-grained targets (qnn_device, qnn_context, qnn_graph, qnn_backend), while buck collects everything into a single runtime library via broad globs. This is the existing architectural pattern for all backends (HTP, GPU, IR) and is expected.
Summary
The buck build configuration appears correct. All new LPAI source files (8 .cpp, 6 .h) are properly covered by the glob patterns in runtime/targets.bzl. The quantizer/targets.bzl deps have been updated per @abhinaykukkadapu's feedback (using aot_lib instead of portable_lib). CMake and buck are consistent in their LPAI file coverage. No missing files or dependency issues were found in the static analysis.
Note: I can only do static analysis of the build files — I cannot actually execute a buck build. If you need a live buck build test, that would need to be done via the internal CI.
@shewu-quic sorry, and thanks for patiently updating the PR. One last patch and I'm going to merge it and fix forward if there are any more failures, but I believe this should resolve the failures, as I put a dummy forward diff internally and it seems to pass. The problem is
Summary:
- Add LPAI infra
- Because of the accuracy issue with quantize and dequantize operations in LPAI, Q and DQ will fall back to CPU for the LPAI backend.
- Support runtime options for `fps`, `ftrt_ratio`, `clent_perf_type`, `affinity`, and `core_selection`. Refer to the [QNN doc](https://docs.qualcomm.com/doc/80-63442-10/topic/lpai_backend.html#full-json-scheme) for more details on these options.
- Support LPAI in `test_qnn_delegate.py`
- Refactor documentation for LPAI and GPU
- Add a script to sign the skel library
Force-pushed c01d384 to 25c600e.
Thanks. I have applied this patch and rebased.
Summary:
- Support runtime options for `fps`, `ftrt_ratio`, `clent_perf_type`, `affinity`, and `core_selection`. Refer to the QNN doc for more details on these options.
- Support LPAI in `test_qnn_delegate.py`
Test Plan