-
Notifications
You must be signed in to change notification settings - Fork 2.8k
[NPUW] Async weights bank processing and closure guard #32505
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[NPUW] Async weights bank processing and closure guard #32505
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't checked all the changes but put some thoughts here.
I hope the # of changes can be reduced but with the same effect.
// Need to wrap closure, since finalize_weights_bank() will | ||
// asynchronously evaluate weights and put them in closure. | ||
// Other functions of CompiledModel as well as InferRequest and | ||
// other entities need to wait for the closure to be populated first | ||
// (meaning to wait for async weights processing to end). | ||
class SafeClosureWrapper { | ||
public: | ||
std::vector<ov::Tensor>& unsafe_get_closure() { | ||
return m_closure; | ||
} | ||
std::vector<ov::Tensor>& get_closure() { | ||
if (m_evaluated) { | ||
return m_closure; | ||
} | ||
if (m_evaluation.valid()) { | ||
m_evaluation.wait(); | ||
m_evaluated = true; | ||
} | ||
return m_closure; | ||
} | ||
void set_future(std::shared_future<void>& evaluation) { | ||
m_evaluation = evaluation; | ||
} | ||
|
||
private: | ||
std::vector<ov::Tensor> m_closure; | ||
std::shared_future<void> m_evaluation; | ||
bool m_evaluated = false; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move it to utils
Make it a generic template over type T.
Assume you don't know anything about closure here.
template<class T>
class delayed {
T t;
T& get() { if (...) { ...}
const T& get() const { ... }.
};
as easy as that. I assume you can implement the both above methods with get_impl. so is the unsafe_get()
.
If you really want to go with closure specifics here, as some places in the edited code might suggest, maybe it'd be also worth to override operator[]
to make the integration more seamless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
std::shared_future<void> m_evaluation; | ||
bool m_evaluated = false; | ||
}; | ||
mutable SafeClosureWrapper closure; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to make this field mutable, it should maintain the same contracts as the original vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/plugins/intel_npu/src/plugin/npuw/base_sync_infer_request.cpp
Outdated
Show resolved
Hide resolved
{char{0x4c}, char{0x4c}, char{0x4d}, char{0x43}, char{0x4d}, char{0x4f}}; | ||
|
||
const constexpr char* NPUW_SERIALIZATION_VERSION = "0.13"; | ||
const constexpr char* NPUW_SERIALIZATION_VERSION = "0.14"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What changed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing in the format, however I fixed a potential bug in deserialization. Decided to bump just in case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure how this changes a version tbh
compiled->m_kvcache_compiled->m_import_weights_ctx.reset(); | ||
compiled->m_prefill_compiled->finalize_weights_bank(); | ||
compiled->m_prefill_compiled->m_import_weights_ctx.reset(); | ||
|
||
if (compiled->m_lm_head_compiled) { | ||
compiled->m_lm_head_compiled->m_weights_bank = bank; | ||
|
||
compiled->m_lm_head_compiled->finalize_weights_bank(); | ||
compiled->m_lm_head_compiled->m_import_weights_ctx.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I asked but it seems I didn't. Why do we need all these changes here? Why did it work before and how does it work now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it's done after bank evaluation is finished in finalize_weights_bank()
function. I reset the context to potentially release some memory after the import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its good we have tests now but do they track memory consumption changes as well or it is not supposed to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not supposed to change much (there should be no dangling references or actual mmaped data). But you are right, we don't have memory consumption tests yet
build_jenkins |
The tests is failing:
As well as
|
Alternative to #32424