
Add audio RAG pipeline#247

Open
namanvirk18 wants to merge 1 commit into
patchy631:mainfrom
namanvirk18:audio-rag

Conversation

@namanvirk18

@namanvirk18 namanvirk18 commented Jun 19, 2026

Contributor

Summary by CodeRabbit

  • New Features
    • Launched a new audio RAG application enabling users to upload audio files (mp3/wav/m4a) and ask questions via chat-based Q&A.
    • Integrated automatic audio transcription with speaker labels and diarization.
    • Added an in-app audio player and formatted transcript display for easy reference during conversations.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Contributor

📝 Walkthrough

A new audio-rag project is added, implementing a full RAG pipeline over audio. It transcribes audio with Speechmatics (speaker diarization), embeds speaker-turn documents via Voyage AI, stores and retrieves them from MongoDB Atlas Vector Search, streams answers from an OpenRouter LLM, and exposes the entire workflow through a Streamlit chat UI.

Changes

Audio RAG Application

| Layer | File(s) | Summary |
|---|---|---|
| Project config, env, and docs | audio-rag/pyproject.toml, audio-rag/.env.example, audio-rag/README.md | Defines runtime dependencies and uv settings in pyproject.toml, supplies .env.example placeholders for Speechmatics, Voyage AI, MongoDB, and OpenRouter, and documents setup/run steps in README.md. |
| Audio transcription and speaker formatting | audio-rag/rag_code.py (lines 1–31, 191–248) | Declares module constants and batch-iterate utility; implements Transcribe using the Speechmatics async batch API to group diarized word results into consecutive speaker turns; adds format_speaker to map S1/S2 tags to Speaker A/Speaker B labels. |
| Voyage AI embedding and MongoDB vector store | audio-rag/rag_code.py (lines 33–128) | Defines EmbedData wrapping the Voyage AI client for batch document and single-query embedding; defines MongoVDB with collection creation, vector search index setup with readiness polling, data clearing, and batched document ingestion. |
| Retriever, RAG, and pipeline orchestrator | audio-rag/rag_code.py (lines 130–189, 250–264) | Implements Retriever.search executing a MongoDB $vectorSearch aggregation; implements RAG with context-assembly prompt and OpenRouter LLM streaming; build_rag_pipeline chains transcription → embedding → ingestion → retrieval setup and returns a RAG instance with raw transcripts. |
| Streamlit app | audio-rag/app.py | Initializes session state and reset_chat; adds _render_transcript, _logo_data_uri, and _inject_styles helpers; implements render_header with logos and clear button; sidebar upload flow validates file size, keys by name-size, calls build_rag_pipeline in a spinner, and shows audio player and transcript; chat section renders history and streams assistant responses chunk-by-chunk. |
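
The speaker-turn grouping and label mapping summarized above can be illustrated with a minimal sketch. The flat `{"speaker", "text"}` word shape is a simplifying assumption for illustration, not the Speechmatics response schema, and both functions below are hypothetical stand-ins rather than the PR's actual implementation.

```python
from typing import Iterable


def group_speaker_turns(words: Iterable[dict]) -> list[dict]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: list[dict] = []
    for word in words:
        speaker, text = word["speaker"], word["text"]
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1]["text"] += " " + text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({"speaker": speaker, "text": text})
    return turns


def format_speaker(tag: str) -> str:
    """Map a tag such as "S1" to "Speaker A"; pass any other tag through."""
    number = tag[1:]
    if tag.startswith("S") and number.isdigit() and 1 <= int(number) <= 26:
        return f"Speaker {chr(ord('A') + int(number) - 1)}"
    return tag


words = [
    {"speaker": "S1", "text": "Hello"},
    {"speaker": "S1", "text": "there"},
    {"speaker": "S2", "text": "Hi"},
]
for turn in group_speaker_turns(words):
    print(f"{format_speaker(turn['speaker'])}: {turn['text']}")
```

Each resulting turn is one document for embedding, which keeps "who said what" questions answerable from a single retrieved context.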

Sequence Diagram(s)

sequenceDiagram
  actor User
  participant Streamlit as Streamlit App
  participant build_rag_pipeline
  participant Transcribe
  participant SpeechmaticsAPI
  participant EmbedData as EmbedData (Voyage AI)
  participant MongoVDB as MongoVDB (Atlas)
  participant RAG as RAG (OpenRouter)

  User->>Streamlit: upload audio file (mp3/wav/m4a)
  Streamlit->>build_rag_pipeline: build_rag_pipeline(tmp_file_path)
  build_rag_pipeline->>Transcribe: transcribe_audio(path)
  Transcribe->>SpeechmaticsAPI: async batch submit + poll
  SpeechmaticsAPI-->>Transcribe: diarized word results
  Transcribe-->>build_rag_pipeline: List[{speaker, text}]
  build_rag_pipeline->>EmbedData: embed(speaker_turn_docs)
  EmbedData-->>build_rag_pipeline: document vectors
  build_rag_pipeline->>MongoVDB: create_collection + ingest_data
  MongoVDB-->>build_rag_pipeline: ingestion complete
  build_rag_pipeline-->>Streamlit: RAG instance + transcripts
  Streamlit-->>User: audio player + transcript shown
  User->>Streamlit: enter chat prompt
  Streamlit->>RAG: query(prompt)
  RAG->>MongoVDB: $vectorSearch (via Retriever)
  MongoVDB-->>RAG: top-k transcript contexts
  RAG-->>Streamlit: streamed LLM delta chunks
  Streamlit-->>User: streamed answer rendered
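
The retrieval step in the diagram (`$vectorSearch` returning top-k transcript contexts) could look roughly like the aggregation pipeline built below. This is a sketch only: the helper name, the index name `vector_index`, the field path `embedding`, and the candidate multiplier are assumptions, since the PR's constants are not shown here. The `context` field name matches the documents the review quotes from `ingest_data`.

```python
from typing import Optional, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    index_name: str,
    path: str,
    limit: int = 5,
    num_candidates: Optional[int] = None,
) -> list[dict]:
    """Build a MongoDB aggregation pipeline for an Atlas $vectorSearch query."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if num_candidates is None:
        # Assumed heuristic: oversample candidates relative to the final limit.
        num_candidates = limit * 20
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": max(num_candidates, limit),
                "limit": limit,
            }
        },
        # Return only the transcript text and the similarity score.
        {
            "$project": {
                "_id": 0,
                "context": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline(
    [0.1, 0.2, 0.3], index_name="vector_index", path="embedding", limit=3
)
print(pipeline[0]["$vectorSearch"]["numCandidates"])
```

With a PyMongo collection, the result would be passed to `collection.aggregate(pipeline)`; that call is omitted so the sketch runs without a database.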

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A bunny hopped in, mic in paw,
Transcribed each speaker without a flaw.
Embeddings flew to Mongo's store,
OpenRouter answered, chunk by chunk galore.
"Who said what?"—the RAG replied,
With Atlas vectors and Voyage AI guide! 🎙️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add audio RAG pipeline' directly and clearly describes the main change: introducing a new audio-based RAG (Retrieval-Augmented Generation) system with all supporting components (transcription, embeddings, vector store, LLM integration, and UI). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (1)
audio-rag/README.md (1)

62-71: 💤 Low value

Minor style note: Excessive exclamation marks in the document.

The README contains several exclamation marks (approximately 4 across the section headers and contribution line). Consider toning this down slightly for a more professional tone.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/README.md` around lines 62 - 71, The README contains multiple
exclamation marks that create an overly enthusiastic tone, reducing
professionalism. Reduce the exclamation marks throughout the newsletter section
header "Stay Updated with Our Newsletter!", the subscription copy text that
includes "Stay in the loop with the latest tutorials, insights, and exclusive
resources", the call-to-action "Subscribe now!", and the Contribution section
text "Contributions are welcome!" by either removing them entirely or replacing
them with periods or no punctuation to achieve a more measured and professional
tone.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@audio-rag/app.py`:
- Around line 290-291: The file_key generation in the uploaded_file handling
(line 290) uses only name and size, which can be identical for different files,
causing incorrect pipeline result reuse. Replace the current file_key
construction that combines uploaded_file.name and uploaded_file.size with a more
unique identifier by computing a hash of the file contents (such as SHA256 or
MD5) and incorporating that into the file_key. This ensures different files,
even with the same name and size, will have distinct keys based on their actual
content. Also apply the same fix to the other location mentioned at line 297
where file_key is referenced.
- Around line 330-338: The streaming query loop that processes chunks from
st.session_state.query_engine.query(prompt) lacks error handling for mid-stream
failures (network or provider errors). Wrap the for loop that iterates over the
chunks in a try-except block to catch any exceptions that occur during
streaming, display an appropriate error message to the user using
message_placeholder, and ensure that the message state in
st.session_state.messages is handled consistently regardless of whether the
stream completes successfully or fails partway through.
- Around line 295-297: The uploaded filename from uploaded_file.name is
user-controlled and used directly in os.path.join() without sanitization, which
allows path traversal attacks where a crafted filename like
"../../../etc/passwd" could write files outside the intended temp_dir. Sanitize
the filename by applying os.path.basename() to uploaded_file.name before passing
it to os.path.join() to extract only the filename component and remove any
directory path traversal sequences.
- Around line 306-308: The exception handler that catches Exception as e and
uses st.error with the raw exception message {e} is exposing sensitive internal
details to end users. Replace the error message shown to users with a generic,
user-friendly message (like "An error occurred while processing your request.
Please try again.") and separately log the actual exception details (the full
error string from variable e) using a server-side logging mechanism that only
developers can access, not the end user interface.
- Around line 21-25: The reset_chat() function clears the processed_file_key but
the file uploader widget with static key "audio_uploader" retains the previous
file selection, causing the condition at line 289 to re-ingest the same file. To
fix this, add a session state counter variable that increments each time the
Clear button is pressed within reset_chat(), and use this counter to make the
file uploader widget key dynamic (e.g., "audio_uploader_{counter}") so it fully
resets its selection state and prevents re-processing the same file.

In `@audio-rag/rag_code.py`:
- Around line 118-127: The ingest_data method currently clears all documents
from the shared collection without any scope or filtering, which causes data
from different users/sessions to interfere with each other. Add a source_id or
session identifier field to each document in the docs list comprehension
alongside the context and embedding field, then modify the delete_many call to
filter by this source_id instead of clearing all documents unconditionally. This
ensures that each upload only removes its own documents from the collection.
Additionally, update the retriever's query logic (the method that uses the
collection for retrieval around lines 138-146) to filter results by the same
source_id field so users only see their own data.
- Around line 252-256: The transcription pipeline continues to embedding and RAG
setup even when the transcripts are empty, resulting in an unusable query
engine. After creating the documents list from the transcripts in the line that
calls format_speaker, add a validation check to ensure the documents list is not
empty. If documents is empty, raise an appropriate exception to stop the
pipeline and prevent proceeding with the embed method call on the EmbedData
instance.
- Around line 122-125: The zip function in the docs list comprehension does not
use the strict parameter, which can silently truncate data if embeddata.contexts
and embeddata.embeddings have mismatched lengths. Since the project requires
Python 3.11 or later, modify the zip call to include strict=True as a parameter
to ensure an error is raised if the two iterables have different lengths,
preventing silent data loss during the collection clearing and embedding
operation.
- Around line 88-90: In the code where you check for existing vector indexes and
return early after verifying the index name matches self.VECTOR_INDEX_NAME, add
a call to self._wait_for_index_ready() before returning to ensure the index is
queryable. Additionally, in the _wait_for_index_ready() method, replace the
silent return when timeout expires with raising a TimeoutError so that setup
failures are detected immediately at configuration time rather than allowing
non-ready indexes to proceed to ingestion or retrieval operations.

In `@audio-rag/README.md`:
- Line 65: The image URL in the markdown link on line 65 references the
incorrect repository path. Update the URL to replace the repository reference
from `patchy631/ai-engineering` to `patchy631/ai-engineering-hub` to match the
correct repository where this PR resides. Ensure the full path to the image
resource, including the `resources/join_ddods.png` file, matches the actual
location in the correct repository so the image displays properly when rendered
on GitHub.

---

Nitpick comments:
In `@audio-rag/README.md`:
- Around line 62-71: The README contains multiple exclamation marks that create
an overly enthusiastic tone, reducing professionalism. Reduce the exclamation
marks throughout the newsletter section header "Stay Updated with Our
Newsletter!", the subscription copy text that includes "Stay in the loop with
the latest tutorials, insights, and exclusive resources", the call-to-action
"Subscribe now!", and the Contribution section text "Contributions are welcome!"
by either removing them entirely or replacing them with periods or no
punctuation to achieve a more measured and professional tone.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 22378a34-fbfb-4f8f-b998-91fc28f1781e

📥 Commits

Reviewing files that changed from the base of the PR and between cfd74dc and 8d99d50.

⛔ Files ignored due to path filters (5)
  • audio-rag/assets/deep-seek.png is excluded by !**/*.png
  • audio-rag/assets/speechmatics_logo.png is excluded by !**/*.png
  • audio-rag/assets/voyageai-mongodb-logo.png is excluded by !**/*.png
  • audio-rag/podcast.mp3 is excluded by !**/*.mp3
  • audio-rag/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • audio-rag/.env.example
  • audio-rag/README.md
  • audio-rag/app.py
  • audio-rag/pyproject.toml
  • audio-rag/rag_code.py

Comment thread audio-rag/app.py
Comment on lines +21 to +25
def reset_chat():
    st.session_state.messages = []
    st.session_state.transcripts = None
    st.session_state.query_engine = None
    st.session_state.processed_file_key = None


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the audio-rag directory and examine the app.py file
find . -type f -name "app.py" | grep -i audio

Repository: patchy631/ai-engineering-hub

Length of output: 149


🏁 Script executed:

# Get file size and basic structure
wc -l audio-rag/app.py 2>/dev/null || echo "File not found at that path"

Repository: patchy631/ai-engineering-hub

Length of output: 93


🏁 Script executed:

# Read the relevant sections of the file
echo "=== Lines 1-30 (including reset_chat function) ==="
head -30 audio-rag/app.py | cat -n

echo ""
echo "=== Lines 270-310 (additional mentioned sections) ==="
sed -n '270,310p' audio-rag/app.py | cat -n

Repository: patchy631/ai-engineering-hub

Length of output: 2969


🏁 Script executed:

# Let me examine the full context around the file_uploader and the reset_chat usage
echo "=== Lines 15-35 (initialization and reset_chat definition) ==="
sed -n '15,35p' audio-rag/app.py | cat -n

echo ""
echo "=== Full sidebar section to understand context (lines 270-310) ==="
sed -n '270,310p' audio-rag/app.py | cat -n

Repository: patchy631/ai-engineering-hub

Length of output: 2895


🏁 Script executed:

# Search for where reset_chat() is called
echo "=== Searching for reset_chat() calls ==="
grep -n "reset_chat()" audio-rag/app.py

# Search for button interactions that might trigger resets
echo ""
echo "=== Searching for st.button or st.sidebar interactions ==="
grep -n "st.button\|st.column\|st.chat" audio-rag/app.py | head -20

Repository: patchy631/ai-engineering-hub

Length of output: 583


🌐 Web query:

Streamlit file_uploader widget key persistence session_state behavior rerun reset

💡 Result:

In Streamlit, the st.file_uploader widget has specific behaviors regarding key persistence, session state, and resets [1][2].

Key Persistence and Reset Behavior

When you provide a key to st.file_uploader, it stabilizes the widget's identity, allowing Streamlit to preserve the widget's value (the uploaded file) across standard script reruns [2]. However, the widget will reset, clearing the uploaded file, if any of the following occur:
  • The widget is removed from the script (e.g., hidden via conditional logic) [1].
  • The key itself is changed [3].
  • Specific parameters, such as type, accept_multiple_files, or max_upload_size, are modified, even if the key remains the same [2].
  • The user manually resets the widget via the UI [4].

Session State Interaction

  • Access: You can access the uploaded file via st.session_state[key] as a read-only value [2].
  • Modification Limitations: You cannot manually set or modify the value of an st.file_uploader via st.session_state (e.g., trying to force a file into the uploader programmatically) [5][6][7]. Attempting to do so will raise a StreamlitAPIException [6][7].
  • Persistence Strategy: Because the file uploader widget itself is ephemeral, the best practice for persisting uploaded data across reruns or navigation (e.g., in multipage apps) is to store the uploaded file object or its processed contents into a separate variable within st.session_state immediately after it is uploaded [1][8].

Handling Reruns

Streamlit reruns the entire script upon user interaction [1]. If an st.file_uploader is present, it will maintain its value across these reruns as long as the widget remains part of the execution flow [1][2].

If you encounter issues where the file seems to be re-processed or "re-uploaded" on every interaction, this is often because the processing logic is placed directly in the main execution flow; wrapping such logic in an on_change callback associated with the uploader or checking if the file is already processed in st.session_state can prevent redundant operations [4][9].

Citations:


🏁 Script executed:

# Check lines around where the Clear button is, to see the full context
echo "=== Lines 240-260 (Clear button and chat area) ==="
sed -n '240,260p' audio-rag/app.py | cat -n

# Check if there are any other interactions that might affect the flow
echo ""
echo "=== Lines 320-338 (chat input and response handling) ==="
sed -n '320,338p' audio-rag/app.py | cat -n

Repository: patchy631/ai-engineering-hub

Length of output: 2058


🏁 Script executed:

# Verify the exact line numbers for the additional references
echo "=== Lines 277-283 (file uploader declaration) ==="
sed -n '277,283p' audio-rag/app.py | cat -n

echo ""
echo "=== Lines 289-304 (file processing logic) ==="
sed -n '289,304p' audio-rag/app.py | cat -n

Repository: patchy631/ai-engineering-hub

Length of output: 1352


Clear button triggers unnecessary re-ingestion of the same audio file.

When the Clear button calls reset_chat(), it sets processed_file_key to None but the file uploader widget selection persists because its key remains static ("audio_uploader"). On the next rerun, uploaded_file still contains the previously uploaded file, causing the condition at line 289 to evaluate to True (since None != current file_key), which re-executes the expensive pipeline (transcription and MongoDB storage) for the same file.

💡 Suggested fix
 if "messages" not in st.session_state:
     st.session_state.messages = []
+if "uploader_nonce" not in st.session_state:
+    st.session_state.uploader_nonce = 0

 def reset_chat():
     st.session_state.messages = []
     st.session_state.transcripts = None
     st.session_state.query_engine = None
     st.session_state.processed_file_key = None
+    st.session_state.uploader_nonce += 1

 uploaded_file = st.file_uploader(
     "Choose your audio file",
     type=["mp3", "wav", "m4a"],
     accept_multiple_files=False,
     label_visibility="collapsed",
-    key="audio_uploader",
+    key=f"audio_uploader_{st.session_state.uploader_nonce}",
 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/app.py` around lines 21 - 25, The reset_chat() function clears the
processed_file_key but the file uploader widget with static key "audio_uploader"
retains the previous file selection, causing the condition at line 289 to
re-ingest the same file. To fix this, add a session state counter variable that
increments each time the Clear button is pressed within reset_chat(), and use
this counter to make the file uploader widget key dynamic (e.g.,
"audio_uploader_{counter}") so it fully resets its selection state and prevents
re-processing the same file.
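
To see why a changing widget key avoids the re-ingestion, the logic can be simulated without Streamlit. The sketch below is an illustration under stated assumptions: a plain dict stands in for `st.session_state`, another dict stands in for the values Streamlit keeps per widget key, and `should_run_pipeline` and `uploader_widget_key` are hypothetical helper names.

```python
from typing import Optional


def should_run_pipeline(state: dict, uploaded_key: Optional[str]) -> bool:
    """Mirror the app's check: process when a file is present and its key is new."""
    return uploaded_key is not None and state.get("processed_file_key") != uploaded_key


def uploader_widget_key(state: dict) -> str:
    """Widget key that changes whenever the nonce is bumped."""
    return f"audio_uploader_{state['uploader_nonce']}"


def reset_chat(state: dict) -> None:
    """Clear chat state and rotate the uploader key."""
    state["messages"] = []
    state["processed_file_key"] = None
    state["uploader_nonce"] += 1


state = {"messages": ["hi"], "processed_file_key": "podcast.mp3-1024", "uploader_nonce": 0}
# Stand-in for Streamlit's per-widget storage: the old key still holds the file.
widget_values = {uploader_widget_key(state): "podcast.mp3-1024"}

reset_chat(state)

# A static key would still see the old file and re-run the pipeline.
print(should_run_pipeline(state, widget_values["audio_uploader_0"]))
# The rotated key has no file attached, so nothing is re-processed.
print(should_run_pipeline(state, widget_values.get(uploader_widget_key(state))))
```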

Comment thread audio-rag/app.py
Comment on lines +290 to +291
file_key = f"{uploaded_file.name}-{uploaded_file.size}"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

name-size file key can reuse stale pipeline results.

Two different files can share the same name and size, causing transcript/query_engine reuse for the wrong audio.

🧩 Suggested fix
+import hashlib
 ...
-        file_key = f"{uploaded_file.name}-{uploaded_file.size}"
+        file_bytes = uploaded_file.getvalue()
+        file_hash = hashlib.sha256(file_bytes).hexdigest()
+        file_key = f"{uploaded_file.name}-{uploaded_file.size}-{file_hash}"
 ...
-                        f.write(uploaded_file.getvalue())
+                        f.write(file_bytes)

Also applies to: 297-297

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/app.py` around lines 290 - 291, The file_key generation in the
uploaded_file handling (line 290) uses only name and size, which can be
identical for different files, causing incorrect pipeline result reuse. Replace
the current file_key construction that combines uploaded_file.name and
uploaded_file.size with a more unique identifier by computing a hash of the file
contents (such as SHA256 or MD5) and incorporating that into the file_key. This
ensures different files, even with the same name and size, will have distinct
keys based on their actual content. Also apply the same fix to the other
location mentioned at line 297 where file_key is referenced.
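
A content-based key of the kind suggested above can be sketched in a few lines. `make_file_key` is a hypothetical helper name; in the app, the bytes would presumably come from `uploaded_file.getvalue()`, as in the suggested fix.

```python
import hashlib


def make_file_key(name: str, data: bytes) -> str:
    """Build a cache key from the file name, size, and SHA-256 of the content."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{name}-{len(data)}-{digest}"


# Same name and size, different content: the keys differ.
first = make_file_key("talk.mp3", b"aaaa")
second = make_file_key("talk.mp3", b"bbbb")
print(first != second)
```

Hashing reads the whole upload once, which is small next to the cost of transcription, and reusing the same bytes for the disk write avoids a second `getvalue()` call.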

Comment thread audio-rag/app.py
Comment on lines +295 to +297
file_path = os.path.join(temp_dir, uploaded_file.name)
with open(file_path, "wb") as f:
    f.write(uploaded_file.getvalue())


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Sanitize uploaded filename before writing to disk.

The uploaded filename is user-controlled and is joined directly into a filesystem path. A crafted filename can escape the temp dir (path traversal) or target unintended locations.

🔒 Suggested fix
-                    file_path = os.path.join(temp_dir, uploaded_file.name)
+                    safe_name = Path(uploaded_file.name).name
+                    file_path = os.path.join(temp_dir, safe_name)
                     with open(file_path, "wb") as f:
                         f.write(uploaded_file.getvalue())
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-file_path = os.path.join(temp_dir, uploaded_file.name)
-with open(file_path, "wb") as f:
-    f.write(uploaded_file.getvalue())
+safe_name = Path(uploaded_file.name).name
+file_path = os.path.join(temp_dir, safe_name)
+with open(file_path, "wb") as f:
+    f.write(uploaded_file.getvalue())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/app.py` around lines 295 - 297, The uploaded filename from
uploaded_file.name is user-controlled and used directly in os.path.join()
without sanitization, which allows path traversal attacks where a crafted
filename like "../../../etc/passwd" could write files outside the intended
temp_dir. Sanitize the filename by applying os.path.basename() to
uploaded_file.name before passing it to os.path.join() to extract only the
filename component and remove any directory path traversal sequences.
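
As a standalone illustration of the sanitization described above, the sketch below reduces an untrusted name to its final component and then confirms the result stays inside the target directory. `safe_upload_path` is a hypothetical helper, and the extra containment check and backslash handling are additions beyond what the review asks for.

```python
import os
import tempfile


def safe_upload_path(temp_dir: str, filename: str) -> str:
    """Return a path inside temp_dir for an untrusted upload filename."""
    # Treat backslashes as separators too, so Windows-style names are
    # reduced to their last component on any platform.
    name = os.path.basename(filename.replace("\\", "/"))
    if name in ("", ".", ".."):
        raise ValueError(f"unusable upload filename: {filename!r}")
    root = os.path.realpath(temp_dir)
    candidate = os.path.realpath(os.path.join(root, name))
    # Defense in depth: refuse anything that resolves outside temp_dir.
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError(f"upload path escapes temp dir: {filename!r}")
    return candidate


with tempfile.TemporaryDirectory() as temp_dir:
    path = safe_upload_path(temp_dir, "../../../etc/passwd")
    print(os.path.basename(path))
```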

Comment thread audio-rag/app.py
Comment on lines +306 to +308
except Exception as e:
    st.error(f"An error occurred: {e}")
    st.stop()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t expose raw exception text to end users.

Rendering str(e) directly in UI can leak internal/provider details. Show a generic error to users and log exception details server-side.

🛠 Suggested fix
+import logging
 ...
+logger = logging.getLogger(__name__)
 ...
-            except Exception as e:
-                st.error(f"An error occurred: {e}")
+            except Exception:
+                logger.exception("Failed to build RAG pipeline for uploaded audio")
+                st.error("Failed to process the audio file. Please try again.")
                 st.stop()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-except Exception as e:
-    st.error(f"An error occurred: {e}")
-    st.stop()
+except Exception:
+    logger.exception("Failed to build RAG pipeline for uploaded audio")
+    st.error("Failed to process the audio file. Please try again.")
+    st.stop()
🧰 Tools
🪛 Ruff (0.15.17)

[warning] 306-306: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/app.py` around lines 306 - 308, The exception handler that catches
Exception as e and uses st.error with the raw exception message {e} is exposing
sensitive internal details to end users. Replace the error message shown to
users with a generic, user-friendly message (like "An error occurred while
processing your request. Please try again.") and separately log the actual
exception details (the full error string from variable e) using a server-side
logging mechanism that only developers can access, not the end user interface.

Source: Linters/SAST tools
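
The split between what the user sees and what gets logged can be shown outside Streamlit. In this sketch, `run_with_safe_error` and the logger name `audio_rag` are hypothetical; the app would pass the returned message to `st.error` instead of printing it.

```python
import logging

logger = logging.getLogger("audio_rag")

GENERIC_ERROR = "Failed to process the audio file. Please try again."


def run_with_safe_error(step, *args):
    """Run step(*args); on failure, log the traceback and return a generic message.

    Returns (result, None) on success or (None, user_message) on failure.
    """
    try:
        return step(*args), None
    except Exception:  # broad on purpose: this is the UI boundary
        # logger.exception records the full traceback for developers only.
        logger.exception("Pipeline step %r failed", getattr(step, "__name__", step))
        return None, GENERIC_ERROR


def failing_step(path):
    # Simulated provider error whose text should never reach the UI.
    raise RuntimeError("auth failed for mongodb+srv://user:secret@cluster")


result, user_message = run_with_safe_error(failing_step, "podcast.mp3")
print(user_message)
```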

Comment thread audio-rag/app.py
Comment on lines +330 to +338
with st.chat_message("assistant"):
    message_placeholder = st.empty()
    full_response = ""
    for chunk in st.session_state.query_engine.query(prompt):
        full_response += chunk.delta or ""
        message_placeholder.markdown(full_response + "▌")
    message_placeholder.markdown(full_response)

st.session_state.messages.append({"role": "assistant", "content": full_response})


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle streaming failures in chat response path.

query_engine.query(prompt) can fail mid-stream (network/provider errors). Without local handling, the run crashes and leaves message state inconsistent.

🧯 Suggested fix
     with st.chat_message("assistant"):
         message_placeholder = st.empty()
         full_response = ""
-        for chunk in st.session_state.query_engine.query(prompt):
-            full_response += chunk.delta or ""
-            message_placeholder.markdown(full_response + "▌")
+        try:
+            for chunk in st.session_state.query_engine.query(prompt):
+                full_response += chunk.delta or ""
+                message_placeholder.markdown(full_response + "▌")
+        except Exception:
+            if not full_response:
+                full_response = "I hit an error while generating a response. Please try again."
         message_placeholder.markdown(full_response)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-with st.chat_message("assistant"):
-    message_placeholder = st.empty()
-    full_response = ""
-    for chunk in st.session_state.query_engine.query(prompt):
-        full_response += chunk.delta or ""
-        message_placeholder.markdown(full_response + "▌")
-    message_placeholder.markdown(full_response)
-st.session_state.messages.append({"role": "assistant", "content": full_response})
+with st.chat_message("assistant"):
+    message_placeholder = st.empty()
+    full_response = ""
+    try:
+        for chunk in st.session_state.query_engine.query(prompt):
+            full_response += chunk.delta or ""
+            message_placeholder.markdown(full_response + "▌")
+    except Exception:
+        if not full_response:
+            full_response = "I hit an error while generating a response. Please try again."
+    message_placeholder.markdown(full_response)
+st.session_state.messages.append({"role": "assistant", "content": full_response})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/app.py` around lines 330 - 338, The streaming query loop that
processes chunks from st.session_state.query_engine.query(prompt) lacks error
handling for mid-stream failures (network or provider errors). Wrap the for loop
that iterates over the chunks in a try-except block to catch any exceptions that
occur during streaming, display an appropriate error message to the user using
message_placeholder, and ensure that the message state in
st.session_state.messages is handled consistently regardless of whether the
stream completes successfully or fails partway through.
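
The same handling can be factored into a small function that is testable without Streamlit. This is a sketch under assumptions: `consume_stream` is a hypothetical helper, chunks are any objects with a `.delta` attribute (as in the loop quoted above), and `render` stands in for `message_placeholder.markdown`.

```python
from types import SimpleNamespace
from typing import Callable, Iterable

FALLBACK = "I hit an error while generating a response. Please try again."


def consume_stream(chunks: Iterable, render: Callable[[str], None]) -> tuple[str, bool]:
    """Accumulate streamed deltas, rendering progress; survive mid-stream errors.

    Returns (final_text, completed). On failure, any partial text is kept,
    and the fallback message is used only if nothing was received.
    """
    text = ""
    completed = True
    try:
        for chunk in chunks:
            text += chunk.delta or ""
            render(text + "▌")  # cursor marks a response still in progress
    except Exception:
        completed = False
        if not text:
            text = FALLBACK
    render(text)  # final render without the cursor, on success or failure
    return text, completed


def flaky_stream():
    yield SimpleNamespace(delta="Hel")
    yield SimpleNamespace(delta="lo")
    raise ConnectionError("provider dropped the connection")


frames: list[str] = []
final_text, completed = consume_stream(flaky_stream(), frames.append)
print(final_text, completed)
```

Because the function always returns a string, the caller can append exactly one assistant message to the history whether or not the stream completed.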

Comment thread audio-rag/rag_code.py
Comment on lines +88 to +90
existing = list(self.collection.list_search_indexes())
if any(idx["name"] == self.VECTOR_INDEX_NAME for idx in existing):
    return


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "rag_code.py" -type f

Repository: patchy631/ai-engineering-hub

Length of output: 311


🏁 Script executed:

cat -n audio-rag/rag_code.py | sed -n '80,125p'

Repository: patchy631/ai-engineering-hub

Length of output: 2214


Fail fast when the vector index is not queryable.

Line 90 returns immediately for existing indexes without verifying they are queryable, and Line 117 silently returns on timeout instead of raising an error. This allows non-ready indexes to proceed to ingestion/retrieval, causing failures to surface later during $vectorSearch operations instead of at setup time.

The fix requires two changes:

  1. Call self._wait_for_index_ready() after confirming an index exists (line 90)
  2. Raise TimeoutError in _wait_for_index_ready() when timeout expires (after line 116)
Proposed changes
         existing = list(self.collection.list_search_indexes())
         if any(idx["name"] == self.VECTOR_INDEX_NAME for idx in existing):
+            self._wait_for_index_ready()
             return
@@
     def _wait_for_index_ready(self, timeout=120):
         # Atlas builds search indexes asynchronously; poll until queryable.
         start = time.time()
         while time.time() - start < timeout:
             indexes = list(self.collection.list_search_indexes(self.VECTOR_INDEX_NAME))
             if indexes and indexes[0].get("queryable"):
                 return
             time.sleep(2)
+        raise TimeoutError(f"Vector search index {self.VECTOR_INDEX_NAME!r} was not queryable after {timeout}s")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/rag_code.py` around lines 88 - 90, In the code where you check for
existing vector indexes and return early after verifying the index name matches
self.VECTOR_INDEX_NAME, add a call to self._wait_for_index_ready() before
returning to ensure the index is queryable. Additionally, in the
_wait_for_index_ready() method, replace the silent return when timeout expires
with raising a TimeoutError so that setup failures are detected immediately at
configuration time rather than allowing non-ready indexes to proceed to
ingestion or retrieval operations.
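
The fail-fast polling requested here can be sketched as a standalone helper. This is a sketch under stated assumptions, not the PR's method: `wait_for_queryable` and its `list_indexes` callable are hypothetical, introduced so the loop can be exercised without an Atlas cluster. The `queryable` key is the one the reviewed code already reads from each index description.

```python
import time
from typing import Callable, Sequence


def wait_for_queryable(
    list_indexes: Callable[[], Sequence[dict]],
    index_name: str,
    timeout: float = 120.0,
    interval: float = 2.0,
) -> None:
    """Poll until the index reports queryable; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        indexes = list_indexes()
        if indexes and indexes[0].get("queryable"):
            return
        if time.monotonic() >= deadline:
            # Surface the problem at setup time instead of at the first query.
            raise TimeoutError(
                f"Vector search index {index_name!r} was not queryable after {timeout}s"
            )
        time.sleep(interval)
```

Inside `MongoVDB`, the callable would presumably be something like `lambda: list(self.collection.list_search_indexes(self.VECTOR_INDEX_NAME))`, called both after creating a new index and after finding an existing one.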

Comment thread audio-rag/rag_code.py
Comment on lines +118 to +127
    def ingest_data(self, embeddata):
        # Clear any previous data for this demo collection so re-uploads don't duplicate.
        self.collection.delete_many({})

        docs = [
            {"context": context, self.EMBEDDING_FIELD: embedding}
            for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
        ]
        for batch in batch_iterate(docs, INGEST_BATCH_SIZE):
            self.collection.insert_many(batch)


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Scope vector documents per upload/session before deleting or retrieving.

Every pipeline instance uses the same collection, but Line 120 clears all documents and the retriever has no per-upload filter. In the Streamlit flow, a second upload can replace the data behind an existing session’s query_engine, causing cross-user transcript leakage or wrong answers.

🔒 Suggested direction
+import uuid
@@
     def __init__(
         self,
         db_name=DB_NAME,
         collection_name=COLLECTION_NAME,
         vector_dim=VECTOR_DIM,
+        source_id=None,
     ):
         self.db_name = db_name
         self.collection_name = collection_name
         self.vector_dim = vector_dim
+        self.source_id = source_id or uuid.uuid4().hex
@@
                 "fields": [
                     {
                         "type": "vector",
                         "path": self.EMBEDDING_FIELD,
                         "numDimensions": self.vector_dim,
                         "similarity": "dotProduct",
-                    }
+                    },
+                    {"type": "filter", "path": "source_id"},
                 ]
@@
-        self.collection.delete_many({})
+        self.collection.delete_many({"source_id": self.source_id})
@@
-            {"context": context, self.EMBEDDING_FIELD: embedding}
+            {"context": context, "source_id": self.source_id, self.EMBEDDING_FIELD: embedding}
             for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
@@
                 "$vectorSearch": {
                     "index": self.vector_db.VECTOR_INDEX_NAME,
                     "path": self.vector_db.EMBEDDING_FIELD,
                     "queryVector": query_embedding,
                     "numCandidates": num_candidates,
                     "limit": top_k,
+                    "filter": {"source_id": self.vector_db.source_id},
                 }

If the index already exists, this also needs an index migration/recreate path so the source_id filter field is available.

Also applies to: 138-146

🧰 Tools
🪛 Ruff (0.15.17)

[warning] 124-124: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/rag_code.py` around lines 118 - 127, The ingest_data method
currently clears all documents from the shared collection without any scope or
filtering, which causes data from different users/sessions to interfere with
each other. Add a source_id or session identifier field to each document in the
docs list comprehension alongside the context and embedding field, then modify
the delete_many call to filter by this source_id instead of clearing all
documents unconditionally. This ensures that each upload only removes its own
documents from the collection. Additionally, update the retriever's query logic
(the method that uses the collection for retrieval around lines 138-146) to
filter results by the same source_id field so users only see their own data.

Comment thread audio-rag/rag_code.py
Comment on lines +122 to +125
        docs = [
            {"context": context, self.EMBEDDING_FIELD: embedding}
            for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
        ]


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect Python version declarations relevant to zip(strict=True).
fd -HI '^pyproject\.toml$' -x sh -c 'printf "\n== %s ==\n" "$1"; grep -nE "requires-python|target-version" "$1" || true' sh {}

Repository: patchy631/ai-engineering-hub

Length of output: 3695


Add strict=True to prevent silent data truncation.

The zip() call on Line 124 silently truncates to the shorter input when contexts and embeddings have mismatched counts. Because the collection has already been cleared by that point, the dropped documents are lost. The project requires Python ≥3.11, which supports zip(strict=True) (available since Python 3.10).

🐛 Proposed fix
         docs = [
             {"context": context, self.EMBEDDING_FIELD: embedding}
-            for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
+            for context, embedding in zip(embeddata.contexts, embeddata.embeddings, strict=True)
         ]
🧰 Tools
🪛 Ruff (0.15.17)

[warning] 124-124: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/rag_code.py` around lines 122 - 125, The zip function in the docs
list comprehension does not use the strict parameter, which can silently
truncate data if embeddata.contexts and embeddata.embeddings have mismatched
lengths. Since the project requires Python 3.11 or later, modify the zip call to
include strict=True as a parameter to ensure an error is raised if the two
iterables have different lengths, preventing silent data loss during the
collection clearing and embedding operation.

Source: Linters/SAST tools

Comment thread audio-rag/rag_code.py
Comment on lines +252 to +256
    transcripts = Transcribe(api_key=os.getenv("SPEECHMATICS_API_KEY")).transcribe_audio(file_path)
    documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts]

    embeddata = EmbedData()
    embeddata.embed(documents)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stop the pipeline when transcription produces no documents.

If the audio has no transcribed turns, this still proceeds into embedding/RAG setup and can return a query engine with no retrievable context.

🐛 Proposed fix
     transcripts = Transcribe(api_key=os.getenv("SPEECHMATICS_API_KEY")).transcribe_audio(file_path)
-    documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts]
+    documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts if t["text"].strip()]
+    if not documents:
+        raise ValueError("No transcribable speech was found in the uploaded audio.")
 
     embeddata = EmbedData()
     embeddata.embed(documents)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/rag_code.py` around lines 252 - 256, The transcription pipeline
continues to embedding and RAG setup even when the transcripts are empty,
resulting in an unusable query engine. After creating the documents list from
the transcripts in the line that calls format_speaker, add a validation check to
ensure the documents list is not empty. If documents is empty, raise an
appropriate exception to stop the pipeline and prevent proceeding with the embed
method call on the EmbedData instance.
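
The validation step can be sketched as a small function that is testable without the transcription service. `build_documents` is a hypothetical helper, and the `format_speaker` below is a stand-in written for this example (assuming tags such as `S1` and `S2`), not the implementation in `rag_code.py`.

```python
def format_speaker(tag: str) -> str:
    """Stand-in for the PR's helper: 'S1' -> 'Speaker A', 'S2' -> 'Speaker B'."""
    if tag.startswith("S") and tag[1:].isdigit() and 1 <= int(tag[1:]) <= 26:
        return f"Speaker {chr(ord('A') + int(tag[1:]) - 1)}"
    return tag  # unknown tags pass through unchanged


def build_documents(transcripts: list[dict]) -> list[str]:
    """Turn speaker turns into documents, refusing to continue with none."""
    documents = [
        f"{format_speaker(t['speaker'])}: {t['text']}"
        for t in transcripts
        if t["text"].strip()  # skip turns that contain only whitespace
    ]
    if not documents:
        raise ValueError("No transcribable speech was found in the uploaded audio.")
    return documents
```

Raising here means the Streamlit layer can catch a single `ValueError` and show it to the user, rather than handing back a query engine that has nothing to retrieve.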

Comment thread audio-rag/README.md
## 📬 Stay Updated with Our Newsletter!
**Get a FREE Data Science eBook** 📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! Stay in the loop with the latest tutorials, insights, and exclusive resources. [Subscribe now!](https://join.dailydoseofds.com)

[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Verify the image URL references the correct repository.

The image link references patchy631/ai-engineering/blob/main/resources/..., but the PR is in patchy631/ai-engineering-hub. This path mismatch will likely result in a broken image (404) when rendered on GitHub, preventing the newsletter call-to-action from displaying correctly.

🔗 Correcting the image URL
-[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)
+[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering-hub/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)

Alternatively, verify the correct repository and path where this image resource actually resides.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/README.md` at line 65, The image URL in the markdown link on line
65 references the incorrect repository path. Update the URL to replace the
repository reference from `patchy631/ai-engineering` to
`patchy631/ai-engineering-hub` to match the correct repository where this PR
resides. Ensure the full path to the image resource, including the
`resources/join_ddods.png` file, matches the actual location in the correct
repository so the image displays properly when rendered on GitHub.
