How can I optimize LLM inference speed using prefix caching?
Should I use a dynamic cache or a static cache?
And when using prefix caching, can I change the input order from
- llm_input : sos_emb, embedding, text (prompt_text + target_text), task_id_emb, prompt_speech_token_emb
to
- llm_input : sos_emb, embedding, prompt_text, task_id_emb, prompt_speech_token_emb, target_text
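For context, here is a minimal sketch of the reuse I'm hoping to get from the reordering. `PrefixCache`, `kv_for`, and `run` are hypothetical names for illustration, not any real library API: the idea is that if `target_text` moves to the end, the leading segments (`sos_emb`, `embedding`, `prompt_text`, `task_id_emb`, `prompt_speech_token_emb`) form an identical prefix across requests, so their cached KV states can be looked up instead of recomputed.

```python
def kv_for(token):
    """Stand-in for computing a transformer KV state for one input segment."""
    return f"kv({token})"

class PrefixCache:
    """Toy prefix cache: maps a tuple of input segments to their KV states."""
    def __init__(self):
        self.store = {}

    def lookup(self, tokens):
        # Find the longest already-cached prefix of `tokens`.
        for n in range(len(tokens), 0, -1):
            hit = self.store.get(tuple(tokens[:n]))
            if hit is not None:
                return n, hit
        return 0, []

    def insert(self, prefix, kv):
        self.store[tuple(prefix)] = kv

def run(cache, tokens):
    """Compute KV only for the suffix not covered by the cache.

    Returns how many segments actually had to be computed.
    """
    n, kv = cache.lookup(tokens)
    kv = list(kv)
    for i in range(n, len(tokens)):
        kv.append(kv_for(tokens[i]))
        cache.insert(tokens[: i + 1], kv[:])  # cache every intermediate prefix
    return len(tokens) - n

cache = PrefixCache()
shared = ["sos_emb", "embedding", "prompt_text",
          "task_id_emb", "prompt_speech_token_emb"]
print(run(cache, shared + ["target_text_A"]))  # → 6 (cold, computes everything)
print(run(cache, shared + ["target_text_B"]))  # → 1 (5 shared segments reused)
```

With the original ordering, `target_text` sits inside `text` before `task_id_emb` and `prompt_speech_token_emb`, so those trailing segments fall after the point where requests diverge and can never be shared; with the reordered layout only `target_text` itself is uncached.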