Add audio RAG pipeline #247
📝 Walkthrough
A new
Changes
Audio RAG Application
Sequence Diagram(s)
sequenceDiagram
actor User
participant Streamlit as Streamlit App
participant build_rag_pipeline
participant Transcribe
participant SpeechmaticsAPI
participant EmbedData as EmbedData (Voyage AI)
participant MongoVDB as MongoVDB (Atlas)
participant RAG as RAG (OpenRouter)
User->>Streamlit: upload audio file (mp3/wav/m4a)
Streamlit->>build_rag_pipeline: build_rag_pipeline(tmp_file_path)
build_rag_pipeline->>Transcribe: transcribe_audio(path)
Transcribe->>SpeechmaticsAPI: async batch submit + poll
SpeechmaticsAPI-->>Transcribe: diarized word results
Transcribe-->>build_rag_pipeline: List[{speaker, text}]
build_rag_pipeline->>EmbedData: embed(speaker_turn_docs)
EmbedData-->>build_rag_pipeline: document vectors
build_rag_pipeline->>MongoVDB: create_collection + ingest_data
MongoVDB-->>build_rag_pipeline: ingestion complete
build_rag_pipeline-->>Streamlit: RAG instance + transcripts
Streamlit-->>User: audio player + transcript shown
User->>Streamlit: enter chat prompt
Streamlit->>RAG: query(prompt)
RAG->>MongoVDB: $vectorSearch (via Retriever)
MongoVDB-->>RAG: top-k transcript contexts
RAG-->>Streamlit: streamed LLM delta chunks
Streamlit-->>User: streamed answer rendered
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 10
🧹 Nitpick comments (1)
audio-rag/README.md (1)
62-71: 💤 Low value
Minor style note: Excessive exclamation marks in the document.
The README contains several exclamation marks (approximately 4 across the section headers and contribution line). Consider toning this down slightly for a more professional tone.
Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@audio-rag/app.py`:
- Around line 290-291: The file_key generation in the uploaded_file handling
(line 290) uses only name and size, which can be identical for different files,
causing incorrect pipeline result reuse. Replace the current file_key
construction that combines uploaded_file.name and uploaded_file.size with a more
unique identifier by computing a hash of the file contents (such as SHA256 or
MD5) and incorporating that into the file_key. This ensures different files,
even with the same name and size, will have distinct keys based on their actual
content. Also apply the same fix to the other location mentioned at line 297
where file_key is referenced.
- Around line 330-338: The streaming query loop that processes chunks from
st.session_state.query_engine.query(prompt) lacks error handling for mid-stream
failures (network or provider errors). Wrap the for loop that iterates over the
chunks in a try-except block to catch any exceptions that occur during
streaming, display an appropriate error message to the user using
message_placeholder, and ensure that the message state in
st.session_state.messages is handled consistently regardless of whether the
stream completes successfully or fails partway through.
- Around line 295-297: The uploaded filename from uploaded_file.name is
user-controlled and used directly in os.path.join() without sanitization, which
allows path traversal attacks where a crafted filename like
"../../../etc/passwd" could write files outside the intended temp_dir. Sanitize
the filename by applying os.path.basename() to uploaded_file.name before passing
it to os.path.join() to extract only the filename component and remove any
directory path traversal sequences.
- Around line 306-308: The exception handler that catches Exception as e and
uses st.error with the raw exception message {e} is exposing sensitive internal
details to end users. Replace the error message shown to users with a generic,
user-friendly message (like "An error occurred while processing your request.
Please try again.") and separately log the actual exception details (the full
error string from variable e) using a server-side logging mechanism that only
developers can access, not the end user interface.
- Around line 21-25: The reset_chat() function clears the processed_file_key but
the file uploader widget with static key "audio_uploader" retains the previous
file selection, causing the condition at line 289 to re-ingest the same file. To
fix this, add a session state counter variable that increments each time the
Clear button is pressed within reset_chat(), and use this counter to make the
file uploader widget key dynamic (e.g., "audio_uploader_{counter}") so it fully
resets its selection state and prevents re-processing the same file.
In `@audio-rag/rag_code.py`:
- Around line 118-127: The ingest_data method currently clears all documents
from the shared collection without any scope or filtering, which causes data
from different users/sessions to interfere with each other. Add a source_id or
session identifier field to each document in the docs list comprehension
alongside the context and embedding field, then modify the delete_many call to
filter by this source_id instead of clearing all documents unconditionally. This
ensures that each upload only removes its own documents from the collection.
Additionally, update the retriever's query logic (the method that uses the
collection for retrieval around lines 138-146) to filter results by the same
source_id field so users only see their own data.
- Around line 252-256: The transcription pipeline continues to embedding and RAG
setup even when the transcripts are empty, resulting in an unusable query
engine. After creating the documents list from the transcripts in the line that
calls format_speaker, add a validation check to ensure the documents list is not
empty. If documents is empty, raise an appropriate exception to stop the
pipeline and prevent proceeding with the embed method call on the EmbedData
instance.
- Around line 122-125: The zip function in the docs list comprehension does not
use the strict parameter, which can silently truncate data if embeddata.contexts
and embeddata.embeddings have mismatched lengths. Since the project requires
Python 3.11 or later, modify the zip call to include strict=True as a parameter
to ensure an error is raised if the two iterables have different lengths,
preventing silent data loss during the collection clearing and embedding
operation.
- Around line 88-90: In the code where you check for existing vector indexes and
return early after verifying the index name matches self.VECTOR_INDEX_NAME, add
a call to self._wait_for_index_ready() before returning to ensure the index is
queryable. Additionally, in the _wait_for_index_ready() method, replace the
silent return when timeout expires with raising a TimeoutError so that setup
failures are detected immediately at configuration time rather than allowing
non-ready indexes to proceed to ingestion or retrieval operations.
In `@audio-rag/README.md`:
- Line 65: The image URL in the markdown link on line 65 references the
incorrect repository path. Update the URL to replace the repository reference
from `patchy631/ai-engineering` to `patchy631/ai-engineering-hub` to match the
correct repository where this PR resides. Ensure the full path to the image
resource, including the `resources/join_ddods.png` file, matches the actual
location in the correct repository so the image displays properly when rendered
on GitHub.
---
Nitpick comments:
In `@audio-rag/README.md`:
- Around line 62-71: The README contains multiple exclamation marks that create
an overly enthusiastic tone, reducing professionalism. Reduce the exclamation
marks throughout the newsletter section header "Stay Updated with Our
Newsletter!", the subscription copy text that includes "Stay in the loop with
the latest tutorials, insights, and exclusive resources", the call-to-action
"Subscribe now!", and the Contribution section text "Contributions are welcome!"
by either removing them entirely or replacing them with periods or no
punctuation to achieve a more measured and professional tone.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 22378a34-fbfb-4f8f-b998-91fc28f1781e
⛔ Files ignored due to path filters (5)
audio-rag/assets/deep-seek.png is excluded by !**/*.png
audio-rag/assets/speechmatics_logo.png is excluded by !**/*.png
audio-rag/assets/voyageai-mongodb-logo.png is excluded by !**/*.png
audio-rag/podcast.mp3 is excluded by !**/*.mp3
audio-rag/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
audio-rag/.env.example
audio-rag/README.md
audio-rag/app.py
audio-rag/pyproject.toml
audio-rag/rag_code.py
def reset_chat():
    st.session_state.messages = []
    st.session_state.transcripts = None
    st.session_state.query_engine = None
    st.session_state.processed_file_key = None
🧩 Analysis chain
🏁 Script executed:
# First, find the audio-rag directory and examine the app.py file
find . -type f -name "app.py" | grep -i audio
Repository: patchy631/ai-engineering-hub
Length of output: 149
🏁 Script executed:
# Get file size and basic structure
wc -l audio-rag/app.py 2>/dev/null || echo "File not found at that path"
Repository: patchy631/ai-engineering-hub
Length of output: 93
🏁 Script executed:
# Read the relevant sections of the file
echo "=== Lines 1-30 (including reset_chat function) ==="
head -30 audio-rag/app.py | cat -n
echo ""
echo "=== Lines 270-310 (additional mentioned sections) ==="
sed -n '270,310p' audio-rag/app.py | cat -n
Repository: patchy631/ai-engineering-hub
Length of output: 2969
🏁 Script executed:
# Let me examine the full context around the file_uploader and the reset_chat usage
echo "=== Lines 15-35 (initialization and reset_chat definition) ==="
sed -n '15,35p' audio-rag/app.py | cat -n
echo ""
echo "=== Full sidebar section to understand context (lines 270-310) ==="
sed -n '270,310p' audio-rag/app.py | cat -n
Repository: patchy631/ai-engineering-hub
Length of output: 2895
🏁 Script executed:
# Search for where reset_chat() is called
echo "=== Searching for reset_chat() calls ==="
grep -n "reset_chat()" audio-rag/app.py
# Search for button interactions that might trigger resets
echo ""
echo "=== Searching for st.button or st.sidebar interactions ==="
grep -n "st.button\|st.column\|st.chat" audio-rag/app.py | head -20
Repository: patchy631/ai-engineering-hub
Length of output: 583
🌐 Web query:
Streamlit file_uploader widget key persistence session_state behavior rerun reset
💡 Result:
In Streamlit, the st.file_uploader widget has specific behaviors regarding key persistence, session state, and resets [1][2].
Key Persistence and Reset Behavior
When you provide a key to st.file_uploader, it stabilizes the widget's identity, allowing Streamlit to preserve the widget's value (the uploaded file) across standard script reruns [2]. However, the widget will reset, clearing the uploaded file, if any of the following occur:
- The widget is removed from the script (e.g., hidden via conditional logic) [1].
- The key itself is changed [3].
- Specific parameters, such as type, accept_multiple_files, or max_upload_size, are modified, even if the key remains the same [2].
- The user manually resets the widget via the UI [4].
Session State Interaction
- Access: You can access the uploaded file via st.session_state[key] as a read-only value [2].
- Modification Limitations: You cannot manually set or modify the value of an st.file_uploader via st.session_state (e.g., trying to force a file into the uploader programmatically) [5][6][7]. Attempting to do so will raise a StreamlitAPIException [6][7].
- Persistence Strategy: Because the file uploader widget itself is ephemeral, the best practice for persisting uploaded data across reruns or navigation (e.g., in multipage apps) is to store the uploaded file object or its processed contents into a separate variable within st.session_state immediately after it is uploaded [1][8].
Handling Reruns
Streamlit reruns the entire script upon user interaction [1]. If an st.file_uploader is present, it will maintain its value across these reruns as long as the widget remains part of the execution flow [1][2].
If you encounter issues where the file seems to be re-processed or "re-uploaded" on every interaction, this is often because the processing logic is placed directly in the main execution flow. Wrapping such logic in an on_change callback associated with the uploader, or checking whether the file is already processed in st.session_state, can prevent redundant operations [4][9].
Citations:
- 1: https://github.com/streamlit/docs/blob/main/content/develop/concepts/architecture/widget-behavior.md
- 2: https://docs.streamlit.io/develop/api-reference/widgets/st.file_uploader
- 3: https://discuss.streamlit.io/t/does-switching-file-uploader-keys-reset-uploaded-files/119465
- 4: file_uploader triggering second time when we click on other buttons in the same page. streamlit/streamlit#3838
- 5: https://github.com/streamlit/docs/blob/main/content/develop/api-reference/caching-and-state/session_state.md
- 6: https://github.com/streamlit/docs/blob/main/content/develop/concepts/architecture/session-state.md
- 7: https://docs.streamlit.io/develop/api-reference/caching-and-state/st.session_state
- 8: https://discuss.streamlit.io/t/how-to-use-session-state-to-save-file-uploads-and-filters/36443
- 9: 🎬 : Improved file uploader and camera input to call its on_change handler only when necessary streamlit/streamlit#4270
🏁 Script executed:
# Check lines around where the Clear button is, to see the full context
echo "=== Lines 240-260 (Clear button and chat area) ==="
sed -n '240,260p' audio-rag/app.py | cat -n
# Check if there are any other interactions that might affect the flow
echo ""
echo "=== Lines 320-338 (chat input and response handling) ==="
sed -n '320,338p' audio-rag/app.py | cat -n
Repository: patchy631/ai-engineering-hub
Length of output: 2058
🏁 Script executed:
# Verify the exact line numbers for the additional references
echo "=== Lines 277-283 (file uploader declaration) ==="
sed -n '277,283p' audio-rag/app.py | cat -n
echo ""
echo "=== Lines 289-304 (file processing logic) ==="
sed -n '289,304p' audio-rag/app.py | cat -n
Repository: patchy631/ai-engineering-hub
Length of output: 1352
Clear button triggers unnecessary re-ingestion of the same audio file.
When the Clear button calls reset_chat(), it sets processed_file_key to None but the file uploader widget selection persists because its key remains static ("audio_uploader"). On the next rerun, uploaded_file still contains the previously uploaded file, causing the condition at line 289 to evaluate to True (since None != current file_key), which re-executes the expensive pipeline (transcription and MongoDB storage) for the same file.
💡 Suggested fix
 if "messages" not in st.session_state:
     st.session_state.messages = []
+if "uploader_nonce" not in st.session_state:
+    st.session_state.uploader_nonce = 0
 def reset_chat():
     st.session_state.messages = []
     st.session_state.transcripts = None
     st.session_state.query_engine = None
     st.session_state.processed_file_key = None
+    st.session_state.uploader_nonce += 1
 uploaded_file = st.file_uploader(
     "Choose your audio file",
     type=["mp3", "wav", "m4a"],
     accept_multiple_files=False,
     label_visibility="collapsed",
-    key="audio_uploader",
+    key=f"audio_uploader_{st.session_state.uploader_nonce}",
 )
file_key = f"{uploaded_file.name}-{uploaded_file.size}"
name-size file key can reuse stale pipeline results.
Two different files can share the same name and size, causing transcript/query_engine reuse for the wrong audio.
🧩 Suggested fix
+import hashlib
...
- file_key = f"{uploaded_file.name}-{uploaded_file.size}"
+ file_bytes = uploaded_file.getvalue()
+ file_hash = hashlib.sha256(file_bytes).hexdigest()
+ file_key = f"{uploaded_file.name}-{uploaded_file.size}-{file_hash}"
...
- f.write(uploaded_file.getvalue())
+ f.write(file_bytes)
Also applies to: 297-297
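To make the collision concrete, here is a minimal, standalone sketch of a content-based key. The helper name `make_file_key` and the sample payloads are assumptions for illustration, not code from the PR; only `hashlib.sha256` is a real API.

```python
import hashlib


def make_file_key(name: str, data: bytes) -> str:
    """Build a cache key from the file name, size, and a SHA-256 content digest."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{name}-{len(data)}-{digest}"


# Two different payloads that share a name and a size (made-up sample data).
first = make_file_key("podcast.mp3", b"aaaa")
second = make_file_key("podcast.mp3", b"bbbb")

# A name-size key would be identical for both; the digest keeps them apart.
print(first == second)
```

Hashing reads the whole upload once, which is usually negligible next to transcription, but that is an assumption worth checking for very large files.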
file_path = os.path.join(temp_dir, uploaded_file.name)
with open(file_path, "wb") as f:
    f.write(uploaded_file.getvalue())
Sanitize uploaded filename before writing to disk.
The uploaded filename is user-controlled and is joined directly into a filesystem path. A crafted filename can escape the temp dir (path traversal) or target unintended locations.
🔒 Suggested fix
+from pathlib import Path
 ...
-file_path = os.path.join(temp_dir, uploaded_file.name)
+safe_name = Path(uploaded_file.name).name
+file_path = os.path.join(temp_dir, safe_name)
 with open(file_path, "wb") as f:
     f.write(uploaded_file.getvalue())
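A small standalone sketch of the sanitization idea, assuming a hypothetical helper `safe_upload_path` (not part of the PR). It keeps only the final path component and rejects names that reduce to nothing usable; treating backslashes as separators is an extra precaution added here, not something the review asked for.

```python
import os
import tempfile


def safe_upload_path(temp_dir: str, client_name: str) -> str:
    """Return a path inside temp_dir, discarding any directory parts of client_name."""
    # Treat backslashes as separators too, so Windows-style names are handled on POSIX.
    base = os.path.basename(client_name.replace("\\", "/"))
    if base in ("", ".", ".."):
        raise ValueError(f"unusable upload name: {client_name!r}")
    return os.path.join(temp_dir, base)


with tempfile.TemporaryDirectory() as temp_dir:
    resolved = safe_upload_path(temp_dir, "../../../etc/passwd")
    # The traversal segments are gone; the file would land directly in temp_dir.
    inside = os.path.dirname(resolved) == temp_dir
```

An alternative design is to ignore the client-supplied name entirely and write to a generated name, keeping the original only for display.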
except Exception as e:
    st.error(f"An error occurred: {e}")
    st.stop()
Don’t expose raw exception text to end users.
Rendering str(e) directly in UI can leak internal/provider details. Show a generic error to users and log exception details server-side.
🛠 Suggested fix
+import logging
 ...
+logger = logging.getLogger(__name__)
 ...
-except Exception as e:
-    st.error(f"An error occurred: {e}")
+except Exception:
+    logger.exception("Failed to build RAG pipeline for uploaded audio")
+    st.error("Failed to process the audio file. Please try again.")
     st.stop()
🧰 Tools
🪛 Ruff (0.15.17)
[warning] 306-306: Do not catch blind exception: Exception
(BLE001)
Source: Linters/SAST tools
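The split between what is logged and what is shown can be sketched without Streamlit. The names `run_step`, `failing_step`, the logger name, and the fake secret below are all hypothetical; `logging.getLogger` and `logger.exception` are standard-library calls.

```python
import logging

logger = logging.getLogger("audio_rag_demo")  # hypothetical logger name

GENERIC_MESSAGE = "Failed to process the audio file. Please try again."


def run_step(step):
    """Run step(); on failure log the traceback and return a generic message."""
    try:
        return True, step()
    except Exception:
        # logger.exception records the message plus the active traceback at ERROR level.
        logger.exception("Pipeline step failed")
        return False, GENERIC_MESSAGE


def failing_step():
    raise RuntimeError("provider rejected key sk-secret")  # made-up internal detail


ok, shown_to_user = run_step(failing_step)
```

The broad `except Exception` still trips Ruff's BLE001; at a UI boundary like this one that may be an acceptable trade-off, but narrowing to the exceptions the pipeline is known to raise would satisfy the linter.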
with st.chat_message("assistant"):
    message_placeholder = st.empty()
    full_response = ""
    for chunk in st.session_state.query_engine.query(prompt):
        full_response += chunk.delta or ""
        message_placeholder.markdown(full_response + "▌")
    message_placeholder.markdown(full_response)

st.session_state.messages.append({"role": "assistant", "content": full_response})
Handle streaming failures in chat response path.
query_engine.query(prompt) can fail mid-stream (network/provider errors). Without local handling, the run crashes and leaves message state inconsistent.
🧯 Suggested fix
 with st.chat_message("assistant"):
     message_placeholder = st.empty()
     full_response = ""
-    for chunk in st.session_state.query_engine.query(prompt):
-        full_response += chunk.delta or ""
-        message_placeholder.markdown(full_response + "▌")
+    try:
+        for chunk in st.session_state.query_engine.query(prompt):
+            full_response += chunk.delta or ""
+            message_placeholder.markdown(full_response + "▌")
+    except Exception:
+        if not full_response:
+            full_response = "I hit an error while generating a response. Please try again."
     message_placeholder.markdown(full_response)
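The same pattern can be exercised outside Streamlit with a fake stream. Everything here (`Chunk`, `consume_stream`, `flaky_stream`) is a hypothetical stand-in; the sketch only assumes, as the snippet above does, that each chunk exposes a `delta` attribute that may be `None`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Chunk:
    """Stand-in for a streamed chunk; only the delta attribute is modeled."""
    delta: Optional[str]


FALLBACK = "I hit an error while generating a response. Please try again."


def consume_stream(chunks: Iterable[Chunk], render: Callable[[str], None]) -> tuple[str, bool]:
    """Accumulate deltas, rendering as they arrive; return (text, completed)."""
    text = ""
    completed = True
    try:
        for chunk in chunks:
            text += chunk.delta or ""
            render(text + "▌")
    except Exception:
        completed = False
        if not text:
            text = FALLBACK
    render(text)  # final render without the cursor, on success or failure
    return text, completed


def flaky_stream():
    yield Chunk("Hello")
    yield Chunk(None)
    raise ConnectionError("stream dropped")  # simulated provider failure


rendered = []
answer, finished = consume_stream(flaky_stream(), rendered.append)
```

Returning a `completed` flag is one way to let the caller decide whether a partial answer should be stored in the chat history or marked as incomplete.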
existing = list(self.collection.list_search_indexes())
if any(idx["name"] == self.VECTOR_INDEX_NAME for idx in existing):
    return
🧩 Analysis chain
🏁 Script executed:
find . -name "rag_code.py" -type f
Repository: patchy631/ai-engineering-hub
Length of output: 311
🏁 Script executed:
cat -n audio-rag/rag_code.py | sed -n '80,125p'
Repository: patchy631/ai-engineering-hub
Length of output: 2214
Fail fast when the vector index is not queryable.
Line 90 returns immediately for existing indexes without verifying they are queryable, and Line 117 silently returns on timeout instead of raising an error. This allows non-ready indexes to proceed to ingestion/retrieval, causing failures to surface later during $vectorSearch operations instead of at setup time.
The fix requires two changes:
- Call self._wait_for_index_ready() after confirming an index exists (line 90)
- Raise TimeoutError in _wait_for_index_ready() when timeout expires (after line 116)
Proposed changes
 existing = list(self.collection.list_search_indexes())
 if any(idx["name"] == self.VECTOR_INDEX_NAME for idx in existing):
+    self._wait_for_index_ready()
     return
@@
 def _wait_for_index_ready(self, timeout=120):
     # Atlas builds search indexes asynchronously; poll until queryable.
     start = time.time()
     while time.time() - start < timeout:
         indexes = list(self.collection.list_search_indexes(self.VECTOR_INDEX_NAME))
         if indexes and indexes[0].get("queryable"):
             return
         time.sleep(2)
+    raise TimeoutError(f"Vector search index {self.VECTOR_INDEX_NAME!r} was not queryable after {timeout}s")
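The fail-fast polling behavior can be sketched independently of Atlas. `wait_until` is a hypothetical helper, and the readiness check is simulated with an iterator rather than a real `list_search_indexes` call; the injectable `clock` and `sleep` parameters are there only to keep the sketch testable.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 120.0, interval: float = 2.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll check() until it returns True; raise TimeoutError once timeout seconds pass."""
    deadline = clock() + timeout
    while True:
        if check():
            return
        if clock() >= deadline:
            raise TimeoutError(f"condition not met after {timeout}s")
        sleep(interval)


# Simulated readiness: the third poll reports ready.
polls = iter([False, False, True])
wait_until(lambda: next(polls), timeout=1.0, interval=0.0)
```

Checking the condition before the deadline means a ready index returns immediately even with a zero timeout, while a never-ready one surfaces as an exception at setup time instead of a later `$vectorSearch` failure.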
def ingest_data(self, embeddata):
    # Clear any previous data for this demo collection so re-uploads don't duplicate.
    self.collection.delete_many({})

    docs = [
        {"context": context, self.EMBEDDING_FIELD: embedding}
        for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
    ]
    for batch in batch_iterate(docs, INGEST_BATCH_SIZE):
        self.collection.insert_many(batch)
Scope vector documents per upload/session before deleting or retrieving.
Every pipeline instance uses the same collection, but Line 120 clears all documents and the retriever has no per-upload filter. In the Streamlit flow, a second upload can replace the data behind an existing session’s query_engine, causing cross-user transcript leakage or wrong answers.
🔒 Suggested direction
+import uuid
@@
 def __init__(
     self,
     db_name=DB_NAME,
     collection_name=COLLECTION_NAME,
     vector_dim=VECTOR_DIM,
+    source_id=None,
 ):
     self.db_name = db_name
     self.collection_name = collection_name
     self.vector_dim = vector_dim
+    self.source_id = source_id or uuid.uuid4().hex
@@
 "fields": [
     {
         "type": "vector",
         "path": self.EMBEDDING_FIELD,
         "numDimensions": self.vector_dim,
         "similarity": "dotProduct",
-    }
+    },
+    {"type": "filter", "path": "source_id"},
 ]
@@
-self.collection.delete_many({})
+self.collection.delete_many({"source_id": self.source_id})
@@
-{"context": context, self.EMBEDDING_FIELD: embedding}
+{"context": context, "source_id": self.source_id, self.EMBEDDING_FIELD: embedding}
 for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
@@
 "$vectorSearch": {
     "index": self.vector_db.VECTOR_INDEX_NAME,
     "path": self.vector_db.EMBEDDING_FIELD,
     "queryVector": query_embedding,
     "numCandidates": num_candidates,
     "limit": top_k,
+    "filter": {"source_id": self.vector_db.source_id},
 }
If the index already exists, this also needs an index migration/recreate path so the source_id filter field is available.
Also applies to: 138-146
🧰 Tools
🪛 Ruff (0.15.17)
[warning] 124-124: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@audio-rag/rag_code.py` around lines 118 - 127, The ingest_data method
currently clears all documents from the shared collection without any scope or
filtering, which causes data from different users/sessions to interfere with
each other. Add a source_id or session identifier field to each document in the
docs list comprehension alongside the context and embedding field, then modify
the delete_many call to filter by this source_id instead of clearing all
documents unconditionally. This ensures that each upload only removes its own
documents from the collection. Additionally, update the retriever's query logic
(the method that uses the collection for retrieval around lines 138-146) to
filter results by the same source_id field so users only see their own data.
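As a rough illustration of the retrieval side, the scoped stage can be built as plain data and inspected before it is ever sent to Atlas. The helper `build_vector_search_stage` and all argument values are assumptions; the stage keys (`index`, `path`, `queryVector`, `numCandidates`, `limit`, `filter`) are the ones used in the suggestion above.

```python
def build_vector_search_stage(index_name, path, query_vector, source_id,
                              top_k=5, num_candidates=100):
    """Return a $vectorSearch stage restricted to documents with one source_id."""
    return {
        "$vectorSearch": {
            "index": index_name,
            "path": path,
            "queryVector": list(query_vector),
            "numCandidates": num_candidates,
            "limit": top_k,
            # Pre-filter; the index must declare source_id as a "filter" field.
            "filter": {"source_id": {"$eq": source_id}},
        }
    }


stage = build_vector_search_stage(
    index_name="vector_index",   # hypothetical index name
    path="embedding",            # hypothetical embedding field
    query_vector=[0.1, 0.2, 0.3],
    source_id="session-a",       # hypothetical per-upload identifier
)
```

Keeping the stage construction in a pure function makes it easy to unit test that every query carries the same `source_id` that was written at ingest time.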
docs = [
    {"context": context, self.EMBEDDING_FIELD: embedding}
    for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
]
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Inspect Python version declarations relevant to zip(strict=True).
fd -HI '^pyproject\.toml$' -x sh -c 'printf "\n== %s ==\n" "$1"; grep -nE "requires-python|target-version" "$1" || true' sh {}
Repository: patchy631/ai-engineering-hub
Length of output: 3695
Add strict=True to prevent silent data truncation.
The zip() call on line 124 silently truncates to the shorter input if contexts and embeddings have mismatched counts. Because the collection has already been cleared at that point, any dropped documents are lost. The project requires Python ≥3.11, which supports zip(strict=True).
🐛 Proposed fix
docs = [
    {"context": context, self.EMBEDDING_FIELD: embedding}
-   for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
+   for context, embedding in zip(embeddata.contexts, embeddata.embeddings, strict=True)
]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
docs = [
    {"context": context, self.EMBEDDING_FIELD: embedding}
-   for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
+   for context, embedding in zip(embeddata.contexts, embeddata.embeddings, strict=True)
]
🧰 Tools
🪛 Ruff (0.15.17)
[warning] 124-124: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@audio-rag/rag_code.py` around lines 122 - 125, The zip function in the docs
list comprehension does not use the strict parameter, which can silently
truncate data if embeddata.contexts and embeddata.embeddings have mismatched
lengths. Since the project requires Python 3.11 or later, modify the zip call to
include strict=True as a parameter to ensure an error is raised if the two
iterables have different lengths, preventing silent data loss during the
collection clearing and embedding operation.
Source: Linters/SAST tools
transcripts = Transcribe(api_key=os.getenv("SPEECHMATICS_API_KEY")).transcribe_audio(file_path)
documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts]

embeddata = EmbedData()
embeddata.embed(documents)
Stop the pipeline when transcription produces no documents.
If the audio has no transcribed turns, this still proceeds into embedding/RAG setup and can return a query engine with no retrievable context.
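The guard can be tried in isolation with a sketch like the one below. Both `build_documents` and the stand-in `format_speaker` are hypothetical helpers written for this example. The PR has its own `format_speaker`, whose behavior is not shown here.

```python
def format_speaker(speaker):
    """Stand-in for the PR's helper; the real formatting may differ."""
    return f"Speaker {speaker}"


def build_documents(transcripts):
    """Turn speaker turns into documents, rejecting audio with no usable speech."""
    documents = [
        f"{format_speaker(t['speaker'])}: {t['text']}"
        for t in transcripts
        if t["text"].strip()  # skip turns that are empty or whitespace only
    ]
    if not documents:
        raise ValueError("No transcribable speech was found in the uploaded audio.")
    return documents


turns = [
    {"speaker": "S1", "text": "hello"},
    {"speaker": "S2", "text": "   "},
]
documents = build_documents(turns)
```

Raising here stops the pipeline before any embedding or database call is made, and the caller (for example the Streamlit app) can catch the `ValueError` and show the message to the user.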
🐛 Proposed fix
transcripts = Transcribe(api_key=os.getenv("SPEECHMATICS_API_KEY")).transcribe_audio(file_path)
- documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts]
+ documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts if t["text"].strip()]
+ if not documents:
+     raise ValueError("No transcribable speech was found in the uploaded audio.")
embeddata = EmbedData()
embeddata.embed(documents)
📝 Committable suggestion
transcripts = Transcribe(api_key=os.getenv("SPEECHMATICS_API_KEY")).transcribe_audio(file_path)
- documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts]
+ documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts if t["text"].strip()]
+ if not documents:
+     raise ValueError("No transcribable speech was found in the uploaded audio.")
embeddata = EmbedData()
embeddata.embed(documents)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@audio-rag/rag_code.py` around lines 252 - 256, The transcription pipeline
continues to embedding and RAG setup even when the transcripts are empty,
resulting in an unusable query engine. After creating the documents list from
the transcripts in the line that calls format_speaker, add a validation check to
ensure the documents list is not empty. If documents is empty, raise an
appropriate exception to stop the pipeline and prevent proceeding with the embed
method call on the EmbedData instance.
## 📬 Stay Updated with Our Newsletter!
**Get a FREE Data Science eBook** 📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! Stay in the loop with the latest tutorials, insights, and exclusive resources. [Subscribe now!](https://join.dailydoseofds.com)

[](https://join.dailydoseofds.com)
Verify the image URL references the correct repository.
The image link references patchy631/ai-engineering/blob/main/resources/..., but the PR is in patchy631/ai-engineering-hub. This path mismatch will likely result in a broken image (404) when rendered on GitHub, preventing the newsletter call-to-action from displaying correctly.
🔗 Correcting the image URL
-[](https://join.dailydoseofds.com)
+[](https://join.dailydoseofds.com)
Alternatively, verify the correct repository and path where this image resource actually resides.
📝 Committable suggestion
-[](https://join.dailydoseofds.com)
+[](https://join.dailydoseofds.com)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@audio-rag/README.md` at line 65, The image URL in the markdown link on line
65 references the incorrect repository path. Update the URL to replace the
repository reference from `patchy631/ai-engineering` to
`patchy631/ai-engineering-hub` to match the correct repository where this PR
resides. Ensure the full path to the image resource, including the
`resources/join_ddods.png` file, matches the actual location in the correct
repository so the image displays properly when rendered on GitHub.
Summary by CodeRabbit