Add audio RAG pipeline by namanvirk18 · Pull Request #247 · patchy631/ai-engineering-hub

namanvirk18 · 2026-06-19T14:14:40Z

Summary by CodeRabbit

New Features
- Launched a new audio RAG application enabling users to upload audio files (mp3/wav/m4a) and ask questions via chat-based Q&A.
- Integrated automatic audio transcription with speaker labels and diarization.
- Added an in-app audio player and formatted transcript display for easy reference during conversations.

coderabbitai · 2026-06-19T14:15:10Z

📝 Walkthrough

A new audio-rag project is added, implementing a full RAG pipeline over audio. It transcribes audio with Speechmatics (speaker diarization), embeds speaker-turn documents via Voyage AI, stores and retrieves them from MongoDB Atlas Vector Search, streams answers from an OpenRouter LLM, and exposes the entire workflow through a Streamlit chat UI.

Changes

Audio RAG Application

| Layer / File(s) | Summary |
| --- | --- |
| Project config, env, and docs `audio-rag/pyproject.toml`, `audio-rag/.env.example`, `audio-rag/README.md` | Defines runtime dependencies and uv settings in `pyproject.toml`, supplies `.env.example` placeholders for Speechmatics, Voyage AI, MongoDB, and OpenRouter, and documents setup/run steps in `README.md`. |
| Audio transcription and speaker formatting `audio-rag/rag_code.py` (lines 1–31, 191–248) | Declares module constants and batch-iterate utility; implements `Transcribe` using the Speechmatics async batch API to group diarized word results into consecutive speaker turns; adds `format_speaker` to map `S1`/`S2` tags to `Speaker A`/`Speaker B` labels. |
| Voyage AI embedding and MongoDB vector store `audio-rag/rag_code.py` (lines 33–128) | Defines `EmbedData` wrapping the Voyage AI client for batch document and single-query embedding; defines `MongoVDB` with collection creation, vector search index setup with readiness polling, data clearing, and batched document ingestion. |
| Retriever, RAG, and pipeline orchestrator `audio-rag/rag_code.py` (lines 130–189, 250–264) | Implements `Retriever.search` executing a MongoDB `$vectorSearch` aggregation; implements `RAG` with context-assembly prompt and OpenRouter LLM streaming; `build_rag_pipeline` chains transcription → embedding → ingestion → retrieval setup and returns a `RAG` instance with raw transcripts. |
| Streamlit app `audio-rag/app.py` | Initializes session state and `reset_chat`; adds `_render_transcript`, `_logo_data_uri`, and `_inject_styles` helpers; implements `render_header` with logos and clear button; sidebar upload flow validates file size, keys by `name-size`, calls `build_rag_pipeline` in a spinner, and shows audio player and transcript; chat section renders history and streams assistant responses chunk-by-chunk. |

Sequence Diagram(s)

sequenceDiagram
  actor User
  participant Streamlit as Streamlit App
  participant build_rag_pipeline
  participant Transcribe
  participant SpeechmaticsAPI
  participant EmbedData as EmbedData (Voyage AI)
  participant MongoVDB as MongoVDB (Atlas)
  participant RAG as RAG (OpenRouter)

  User->>Streamlit: upload audio file (mp3/wav/m4a)
  Streamlit->>build_rag_pipeline: build_rag_pipeline(tmp_file_path)
  build_rag_pipeline->>Transcribe: transcribe_audio(path)
  Transcribe->>SpeechmaticsAPI: async batch submit + poll
  SpeechmaticsAPI-->>Transcribe: diarized word results
  Transcribe-->>build_rag_pipeline: List[{speaker, text}]
  build_rag_pipeline->>EmbedData: embed(speaker_turn_docs)
  EmbedData-->>build_rag_pipeline: document vectors
  build_rag_pipeline->>MongoVDB: create_collection + ingest_data
  MongoVDB-->>build_rag_pipeline: ingestion complete
  build_rag_pipeline-->>Streamlit: RAG instance + transcripts
  Streamlit-->>User: audio player + transcript shown
  User->>Streamlit: enter chat prompt
  Streamlit->>RAG: query(prompt)
  RAG->>MongoVDB: $vectorSearch (via Retriever)
  MongoVDB-->>RAG: top-k transcript contexts
  RAG-->>Streamlit: streamed LLM delta chunks
  Streamlit-->>User: streamed answer rendered

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A bunny hopped in, mic in paw,
Transcribed each speaker without a flaw.
Embeddings flew to Mongo's store,
OpenRouter answered, chunk by chunk galore.
"Who said what?"—the RAG replied,
With Atlas vectors and Voyage AI guide! 🎙️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add audio RAG pipeline' directly and clearly describes the main change: introducing a new audio-based RAG (Retrieval-Augmented Generation) system with all supporting components (transcription, embeddings, vector store, LLM integration, and UI). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 10

🧹 Nitpick comments (1)

audio-rag/README.md (1)
62-71: 💤 Low value

Minor style note: Excessive exclamation marks in the document.

The README contains several exclamation marks (approximately 4 across the section headers and contribution line). Consider toning this down slightly for a more professional tone.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@audio-rag/README.md` around lines 62 - 71, The README contains multiple
exclamation marks that create an overly enthusiastic tone, reducing
professionalism. Reduce the exclamation marks throughout the newsletter section
header "Stay Updated with Our Newsletter!", the subscription copy text that
includes "Stay in the loop with the latest tutorials, insights, and exclusive
resources", the call-to-action "Subscribe now!", and the Contribution section
text "Contributions are welcome!" by either removing them entirely or replacing
them with periods or no punctuation to achieve a more measured and professional
tone.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@audio-rag/app.py`:
- Around line 290-291: The file_key generation in the uploaded_file handling
(line 290) uses only name and size, which can be identical for different files,
causing incorrect pipeline result reuse. Replace the current file_key
construction that combines uploaded_file.name and uploaded_file.size with a more
unique identifier by computing a hash of the file contents (such as SHA256 or
MD5) and incorporating that into the file_key. This ensures different files,
even with the same name and size, will have distinct keys based on their actual
content. Also apply the same fix to the other location mentioned at line 297
where file_key is referenced.
- Around line 330-338: The streaming query loop that processes chunks from
st.session_state.query_engine.query(prompt) lacks error handling for mid-stream
failures (network or provider errors). Wrap the for loop that iterates over the
chunks in a try-except block to catch any exceptions that occur during
streaming, display an appropriate error message to the user using
message_placeholder, and ensure that the message state in
st.session_state.messages is handled consistently regardless of whether the
stream completes successfully or fails partway through.
- Around line 295-297: The uploaded filename from uploaded_file.name is
user-controlled and used directly in os.path.join() without sanitization, which
allows path traversal attacks where a crafted filename like
"../../../etc/passwd" could write files outside the intended temp_dir. Sanitize
the filename by applying os.path.basename() to uploaded_file.name before passing
it to os.path.join() to extract only the filename component and remove any
directory path traversal sequences.
- Around line 306-308: The exception handler that catches Exception as e and
uses st.error with the raw exception message {e} is exposing sensitive internal
details to end users. Replace the error message shown to users with a generic,
user-friendly message (like "An error occurred while processing your request.
Please try again.") and separately log the actual exception details (the full
error string from variable e) using a server-side logging mechanism that only
developers can access, not the end user interface.
- Around line 21-25: The reset_chat() function clears the processed_file_key but
the file uploader widget with static key "audio_uploader" retains the previous
file selection, causing the condition at line 289 to re-ingest the same file. To
fix this, add a session state counter variable that increments each time the
Clear button is pressed within reset_chat(), and use this counter to make the
file uploader widget key dynamic (e.g., "audio_uploader_{counter}") so it fully
resets its selection state and prevents re-processing the same file.

In `@audio-rag/rag_code.py`:
- Around line 118-127: The ingest_data method currently clears all documents
from the shared collection without any scope or filtering, which causes data
from different users/sessions to interfere with each other. Add a source_id or
session identifier field to each document in the docs list comprehension
alongside the context and embedding field, then modify the delete_many call to
filter by this source_id instead of clearing all documents unconditionally. This
ensures that each upload only removes its own documents from the collection.
Additionally, update the retriever's query logic (the method that uses the
collection for retrieval around lines 138-146) to filter results by the same
source_id field so users only see their own data.
- Around line 252-256: The transcription pipeline continues to embedding and RAG
setup even when the transcripts are empty, resulting in an unusable query
engine. After creating the documents list from the transcripts in the line that
calls format_speaker, add a validation check to ensure the documents list is not
empty. If documents is empty, raise an appropriate exception to stop the
pipeline and prevent proceeding with the embed method call on the EmbedData
instance.
- Around line 122-125: The zip function in the docs list comprehension does not
use the strict parameter, which can silently truncate data if embeddata.contexts
and embeddata.embeddings have mismatched lengths. Since the project requires
Python 3.11 or later, modify the zip call to include strict=True as a parameter
to ensure an error is raised if the two iterables have different lengths,
preventing silent data loss during the collection clearing and embedding
operation.
- Around line 88-90: In the code where you check for existing vector indexes and
return early after verifying the index name matches self.VECTOR_INDEX_NAME, add
a call to self._wait_for_index_ready() before returning to ensure the index is
queryable. Additionally, in the _wait_for_index_ready() method, replace the
silent return when timeout expires with raising a TimeoutError so that setup
failures are detected immediately at configuration time rather than allowing
non-ready indexes to proceed to ingestion or retrieval operations.

In `@audio-rag/README.md`:
- Line 65: The image URL in the markdown link on line 65 references the
incorrect repository path. Update the URL to replace the repository reference
from `patchy631/ai-engineering` to `patchy631/ai-engineering-hub` to match the
correct repository where this PR resides. Ensure the full path to the image
resource, including the `resources/join_ddods.png` file, matches the actual
location in the correct repository so the image displays properly when rendered
on GitHub.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 22378a34-fbfb-4f8f-b998-91fc28f1781e

📥 Commits

Reviewing files that changed from the base of the PR and between cfd74dc and 8d99d50.

⛔ Files ignored due to path filters (5)

audio-rag/assets/deep-seek.png is excluded by !**/*.png
audio-rag/assets/speechmatics_logo.png is excluded by !**/*.png
audio-rag/assets/voyageai-mongodb-logo.png is excluded by !**/*.png
audio-rag/podcast.mp3 is excluded by !**/*.mp3
audio-rag/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (5)

audio-rag/.env.example
audio-rag/README.md
audio-rag/app.py
audio-rag/pyproject.toml
audio-rag/rag_code.py

coderabbitai · 2026-06-19T14:22:29Z

+def reset_chat():
+    st.session_state.messages = []
+    st.session_state.transcripts = None
+    st.session_state.query_engine = None
+    st.session_state.processed_file_key = None


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the audio-rag directory and examine the app.py file
find . -type f -name "app.py" | grep -i audio

Repository: patchy631/ai-engineering-hub

Length of output: 149

🏁 Script executed:

# Get file size and basic structure
wc -l audio-rag/app.py 2>/dev/null || echo "File not found at that path"

Repository: patchy631/ai-engineering-hub

Length of output: 93

🏁 Script executed:

# Read the relevant sections of the file
echo "=== Lines 1-30 (including reset_chat function) ==="
head -30 audio-rag/app.py | cat -n
echo ""
echo "=== Lines 270-310 (additional mentioned sections) ==="
sed -n '270,310p' audio-rag/app.py | cat -n

Repository: patchy631/ai-engineering-hub

Length of output: 2969

🏁 Script executed:

# Let me examine the full context around the file_uploader and the reset_chat usage
echo "=== Lines 15-35 (initialization and reset_chat definition) ==="
sed -n '15,35p' audio-rag/app.py | cat -n
echo ""
echo "=== Full sidebar section to understand context (lines 270-310) ==="
sed -n '270,310p' audio-rag/app.py | cat -n

Repository: patchy631/ai-engineering-hub

Length of output: 2895

🏁 Script executed:

# Search for where reset_chat() is called
echo "=== Searching for reset_chat() calls ==="
grep -n "reset_chat()" audio-rag/app.py

# Search for button interactions that might trigger resets
echo ""
echo "=== Searching for st.button or st.sidebar interactions ==="
grep -n "st.button\|st.column\|st.chat" audio-rag/app.py | head -20

Repository: patchy631/ai-engineering-hub

Length of output: 583

🌐 Web query:

Streamlit file_uploader widget key persistence session_state behavior rerun reset

💡 Result:

In Streamlit, the st.file_uploader widget has specific behaviors regarding key persistence, session state, and resets [1][2].

Key Persistence and Reset Behavior

When you provide a key to st.file_uploader, it stabilizes the widget's identity, allowing Streamlit to preserve the widget's value (the uploaded file) across standard script reruns [2]. However, the widget will reset (clearing the uploaded file) if any of the following occur:
- The widget is removed from the script (e.g., hidden via conditional logic) [1].
- The key itself is changed [3].
- Specific parameters, such as type, accept_multiple_files, or max_upload_size, are modified, even if the key remains the same [2].
- The user manually resets the widget via the UI [4].

Session State Interaction

- Access: You can access the uploaded file via st.session_state[key] as a read-only value [2].
- Modification Limitations: You cannot manually set or modify the value of an st.file_uploader via st.session_state (e.g., trying to force a file into the uploader programmatically) [5][6][7]. Attempting to do so will raise a StreamlitAPIException [6][7].
- Persistence Strategy: Because the file uploader widget itself is ephemeral, the best practice for persisting uploaded data across reruns or navigation (e.g., in multipage apps) is to store the uploaded file object or its processed contents into a separate variable within st.session_state immediately after it is uploaded [1][8].

Handling Reruns

Streamlit reruns the entire script upon user interaction [1]. If an st.file_uploader is present, it will maintain its value across these reruns as long as the widget remains part of the execution flow [1][2].

If you encounter issues where the file seems to be re-processed or "re-uploaded" on every interaction, this is often because the processing logic is placed directly in the main execution flow; wrapping such logic in an on_change callback associated with the uploader or checking if the file is already processed in st.session_state can prevent redundant operations [4][9].

Citations:

1: https://github.com/streamlit/docs/blob/main/content/develop/concepts/architecture/widget-behavior.md

2: https://docs.streamlit.io/develop/api-reference/widgets/st.file_uploader

3: https://discuss.streamlit.io/t/does-switching-file-uploader-keys-reset-uploaded-files/119465

4: file_uploader triggering second time when we click on other buttons in the same page. streamlit/streamlit#3838

5: https://github.com/streamlit/docs/blob/main/content/develop/api-reference/caching-and-state/session_state.md

6: https://github.com/streamlit/docs/blob/main/content/develop/concepts/architecture/session-state.md

7: https://docs.streamlit.io/develop/api-reference/caching-and-state/st.session_state

8: https://discuss.streamlit.io/t/how-to-use-session-state-to-save-file-uploads-and-filters/36443

9: 🎬 : Improved file uploader and camera input to call its on_change handler only when necessary streamlit/streamlit#4270

🏁 Script executed:

# Check lines around where the Clear button is, to see the full context
echo "=== Lines 240-260 (Clear button and chat area) ==="
sed -n '240,260p' audio-rag/app.py | cat -n

# Check if there are any other interactions that might affect the flow
echo ""
echo "=== Lines 320-338 (chat input and response handling) ==="
sed -n '320,338p' audio-rag/app.py | cat -n

Repository: patchy631/ai-engineering-hub

Length of output: 2058

🏁 Script executed:

# Verify the exact line numbers for the additional references
echo "=== Lines 277-283 (file uploader declaration) ==="
sed -n '277,283p' audio-rag/app.py | cat -n
echo ""
echo "=== Lines 289-304 (file processing logic) ==="
sed -n '289,304p' audio-rag/app.py | cat -n

Repository: patchy631/ai-engineering-hub

Length of output: 1352

Clear button triggers unnecessary re-ingestion of the same audio file.

When the Clear button calls reset_chat(), it sets processed_file_key to None but the file uploader widget selection persists because its key remains static ("audio_uploader"). On the next rerun, uploaded_file still contains the previously uploaded file, causing the condition at line 289 to evaluate to True (since None != current file_key), which re-executes the expensive pipeline (transcription and MongoDB storage) for the same file.

💡 Suggested fix

 if "messages" not in st.session_state:
     st.session_state.messages = []
+if "uploader_nonce" not in st.session_state:
+    st.session_state.uploader_nonce = 0

 def reset_chat():
     st.session_state.messages = []
     st.session_state.transcripts = None
     st.session_state.query_engine = None
     st.session_state.processed_file_key = None
+    st.session_state.uploader_nonce += 1

 uploaded_file = st.file_uploader(
     "Choose your audio file",
     type=["mp3", "wav", "m4a"],
     accept_multiple_files=False,
     label_visibility="collapsed",
-    key="audio_uploader",
+    key=f"audio_uploader_{st.session_state.uploader_nonce}",
 )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/app.py` around lines 21 - 25, The reset_chat() function clears the processed_file_key but the file uploader widget with static key "audio_uploader" retains the previous file selection, causing the condition at line 289 to re-ingest the same file. To fix this, add a session state counter variable that increments each time the Clear button is pressed within reset_chat(), and use this counter to make the file uploader widget key dynamic (e.g., "audio_uploader_{counter}") so it fully resets its selection state and prevents re-processing the same file.
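
The nonce idea can be sketched without Streamlit by letting a plain dict stand in for `st.session_state`. This is only an illustration of the key-rotation logic: `init_state` and `uploader_key` are hypothetical helpers, and only `reset_chat` and the `uploader_nonce` field come from the suggestion above.

```python
def init_state(state: dict) -> None:
    """Seed the keys the app expects; `state` stands in for st.session_state."""
    state.setdefault("messages", [])
    state.setdefault("transcripts", None)
    state.setdefault("query_engine", None)
    state.setdefault("processed_file_key", None)
    state.setdefault("uploader_nonce", 0)


def reset_chat(state: dict) -> None:
    """Clear chat state and bump the nonce so the uploader gets a fresh key."""
    state["messages"] = []
    state["transcripts"] = None
    state["query_engine"] = None
    state["processed_file_key"] = None
    state["uploader_nonce"] += 1


def uploader_key(state: dict) -> str:
    """Widget key that changes after every reset (hypothetical helper)."""
    return f"audio_uploader_{state['uploader_nonce']}"


state: dict = {}
init_state(state)
key_before = uploader_key(state)
reset_chat(state)
key_after = uploader_key(state)
print(key_before, key_after)
```

Because the key differs after each reset, Streamlit would treat the uploader as a new widget with no file selected, so the ingestion condition is not re-triggered by the old upload.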

coderabbitai · 2026-06-19T14:22:29Z

+        file_key = f"{uploaded_file.name}-{uploaded_file.size}"
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

name-size file key can reuse stale pipeline results.

Two different files can share the same name and size, causing transcript/query_engine reuse for the wrong audio.

🧩 Suggested fix

+import hashlib
 ...
-        file_key = f"{uploaded_file.name}-{uploaded_file.size}"
+        file_bytes = uploaded_file.getvalue()
+        file_hash = hashlib.sha256(file_bytes).hexdigest()
+        file_key = f"{uploaded_file.name}-{uploaded_file.size}-{file_hash}"
 ...
-                        f.write(uploaded_file.getvalue())
+                        f.write(file_bytes)

Also applies to: 297-297

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/app.py` around lines 290 - 291, The file_key generation in the uploaded_file handling (line 290) uses only name and size, which can be identical for different files, causing incorrect pipeline result reuse. Replace the current file_key construction that combines uploaded_file.name and uploaded_file.size with a more unique identifier by computing a hash of the file contents (such as SHA256 or MD5) and incorporating that into the file_key. This ensures different files, even with the same name and size, will have distinct keys based on their actual content. Also apply the same fix to the other location mentioned at line 297 where file_key is referenced.
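
A standalone sketch of the content-hash key, assuming the upload is already available as `bytes`. `make_file_key` is a hypothetical helper name, not something defined in the PR.

```python
import hashlib


def make_file_key(name: str, data: bytes) -> str:
    """Build a cache key from the file name, size, and a SHA-256 content digest."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{name}-{len(data)}-{digest}"


# Two different payloads with the same name and size get distinct keys.
key_a = make_file_key("podcast.mp3", b"aaaa")
key_b = make_file_key("podcast.mp3", b"bbbb")
print(key_a != key_b)
```

Hashing reads the whole file once; reusing the same `bytes` object for the later write, as the suggested fix does, avoids a second `getvalue()` call.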

coderabbitai · 2026-06-19T14:22:29Z

+                    file_path = os.path.join(temp_dir, uploaded_file.name)
+                    with open(file_path, "wb") as f:
+                        f.write(uploaded_file.getvalue())


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Sanitize uploaded filename before writing to disk.

The uploaded filename is user-controlled and is joined directly into a filesystem path. A crafted filename can escape the temp dir (path traversal) or target unintended locations.

🔒 Suggested fix

-                    file_path = os.path.join(temp_dir, uploaded_file.name)
+                    safe_name = Path(uploaded_file.name).name
+                    file_path = os.path.join(temp_dir, safe_name)
                     with open(file_path, "wb") as f:
                         f.write(uploaded_file.getvalue())

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-                    file_path = os.path.join(temp_dir, uploaded_file.name)
-                    with open(file_path, "wb") as f:
-                        f.write(uploaded_file.getvalue())
+                    safe_name = Path(uploaded_file.name).name
+                    file_path = os.path.join(temp_dir, safe_name)
+                    with open(file_path, "wb") as f:
+                        f.write(uploaded_file.getvalue())

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/app.py` around lines 295 - 297, The uploaded filename from uploaded_file.name is user-controlled and used directly in os.path.join() without sanitization, which allows path traversal attacks where a crafted filename like "../../../etc/passwd" could write files outside the intended temp_dir. Sanitize the filename by applying os.path.basename() to uploaded_file.name before passing it to os.path.join() to extract only the filename component and remove any directory path traversal sequences.
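
One possible hardening, sketched under the assumption that stripping directory components and then verifying the resolved path is acceptable for this app. `safe_join` is a hypothetical helper; it goes slightly beyond the suggested `basename` call by also handling Windows-style separators and checking containment.

```python
import ntpath
import os
import posixpath
import tempfile


def safe_join(base_dir: str, user_filename: str) -> str:
    """Join a user-supplied filename onto base_dir, refusing to leave base_dir."""
    # Strip both POSIX and Windows directory components.
    name = ntpath.basename(posixpath.basename(user_filename))
    if name in ("", ".", ".."):
        raise ValueError(f"unusable filename: {user_filename!r}")
    candidate = os.path.realpath(os.path.join(base_dir, name))
    base = os.path.realpath(base_dir)
    # Defence in depth: confirm the resolved path is still under base_dir.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes base directory: {user_filename!r}")
    return candidate


with tempfile.TemporaryDirectory() as temp_dir:
    target = safe_join(temp_dir, "../../../etc/passwd")
    print(os.path.basename(target))
```

An alternative that sidesteps the question entirely is to ignore the client-supplied name and write to a generated filename, keeping only the original extension.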

coderabbitai · 2026-06-19T14:22:30Z

+            except Exception as e:
+                st.error(f"An error occurred: {e}")
+                st.stop()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t expose raw exception text to end users.

Rendering str(e) directly in UI can leak internal/provider details. Show a generic error to users and log exception details server-side.

🛠 Suggested fix

+import logging
 ...
+logger = logging.getLogger(__name__)
 ...
-            except Exception as e:
-                st.error(f"An error occurred: {e}")
+            except Exception:
+                logger.exception("Failed to build RAG pipeline for uploaded audio")
+                st.error("Failed to process the audio file. Please try again.")
                 st.stop()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-            except Exception as e:
-                st.error(f"An error occurred: {e}")
-                st.stop()
+            except Exception:
+                logger.exception("Failed to build RAG pipeline for uploaded audio")
+                st.error("Failed to process the audio file. Please try again.")
+                st.stop()

🧰 Tools

🪛 Ruff (0.15.17)

[warning] 306-306: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/app.py` around lines 306 - 308, The exception handler that catches Exception as e and uses st.error with the raw exception message {e} is exposing sensitive internal details to end users. Replace the error message shown to users with a generic, user-friendly message (like "An error occurred while processing your request. Please try again.") and separately log the actual exception details (the full error string from variable e) using a server-side logging mechanism that only developers can access, not the end user interface.

Source: Linters/SAST tools
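
The split between what is logged and what is shown can be sketched independently of Streamlit. Here `run_pipeline_safely`, `failing_build`, and the logger name are hypothetical; the generic message is the one from the suggested fix.

```python
import logging

logger = logging.getLogger("audio_rag.app")  # logger name is an assumption

GENERIC_ERROR = "Failed to process the audio file. Please try again."


def run_pipeline_safely(build, path):
    """Call build(path); return (result, user_message) without leaking exception text."""
    try:
        return build(path), None
    except Exception:  # broad on purpose at the UI boundary; narrow it if failure types are known
        # logger.exception records the message plus the full traceback server-side.
        logger.exception("Failed to build RAG pipeline for %s", path)
        return None, GENERIC_ERROR


def failing_build(path):
    # Simulated provider failure whose text should never reach the UI.
    raise RuntimeError("mongodb+srv://user:secret@cluster0 unreachable")


result, message = run_pipeline_safely(failing_build, "podcast.mp3")
print(message)
```

In the app, the returned message would be passed to `st.error(...)` before `st.stop()`, while the traceback stays in the server log.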

coderabbitai · 2026-06-19T14:22:30Z

+    with st.chat_message("assistant"):
+        message_placeholder = st.empty()
+        full_response = ""
+        for chunk in st.session_state.query_engine.query(prompt):
+            full_response += chunk.delta or ""
+            message_placeholder.markdown(full_response + "▌")
+        message_placeholder.markdown(full_response)
+
+    st.session_state.messages.append({"role": "assistant", "content": full_response})


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle streaming failures in chat response path.

query_engine.query(prompt) can fail mid-stream (network/provider errors). Without local handling, the run crashes and leaves message state inconsistent.

🧯 Suggested fix

     with st.chat_message("assistant"):
         message_placeholder = st.empty()
         full_response = ""
-        for chunk in st.session_state.query_engine.query(prompt):
-            full_response += chunk.delta or ""
-            message_placeholder.markdown(full_response + "▌")
+        try:
+            for chunk in st.session_state.query_engine.query(prompt):
+                full_response += chunk.delta or ""
+                message_placeholder.markdown(full_response + "▌")
+        except Exception:
+            if not full_response:
+                full_response = "I hit an error while generating a response. Please try again."
         message_placeholder.markdown(full_response)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    with st.chat_message("assistant"):
-        message_placeholder = st.empty()
-        full_response = ""
-        for chunk in st.session_state.query_engine.query(prompt):
-            full_response += chunk.delta or ""
-            message_placeholder.markdown(full_response + "▌")
-        message_placeholder.markdown(full_response)
-
-    st.session_state.messages.append({"role": "assistant", "content": full_response})
+    with st.chat_message("assistant"):
+        message_placeholder = st.empty()
+        full_response = ""
+        try:
+            for chunk in st.session_state.query_engine.query(prompt):
+                full_response += chunk.delta or ""
+                message_placeholder.markdown(full_response + "▌")
+        except Exception:
+            if not full_response:
+                full_response = "I hit an error while generating a response. Please try again."
+        message_placeholder.markdown(full_response)
+
+    st.session_state.messages.append({"role": "assistant", "content": full_response})

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/app.py` around lines 330 - 338, The streaming query loop that processes chunks from st.session_state.query_engine.query(prompt) lacks error handling for mid-stream failures (network or provider errors). Wrap the for loop that iterates over the chunks in a try-except block to catch any exceptions that occur during streaming, display an appropriate error message to the user using message_placeholder, and ensure that the message state in st.session_state.messages is handled consistently regardless of whether the stream completes successfully or fails partway through.
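
The mid-stream failure case can be exercised without a real provider. In this sketch `Chunk`, `collect_stream`, and `flaky_stream` are hypothetical stand-ins; the only thing taken from the PR is that each streamed chunk exposes a `delta` attribute that may be `None`.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Chunk:
    """Stand-in for a streamed LLM chunk exposing a `delta` attribute."""
    delta: Optional[str]


FALLBACK = "I hit an error while generating a response. Please try again."


def collect_stream(chunks: Iterable[Chunk]) -> Tuple[str, bool]:
    """Accumulate deltas; return (text, completed). Partial text is kept on failure."""
    text = ""
    try:
        for chunk in chunks:
            text += chunk.delta or ""
    except Exception:
        # Keep whatever arrived; fall back to a generic message if nothing did.
        return (text or FALLBACK), False
    return text, True


def flaky_stream() -> Iterator[Chunk]:
    yield Chunk("Speaker A said ")
    yield Chunk(None)
    raise ConnectionError("provider dropped the connection")


text, completed = collect_stream(flaky_stream())
print(repr(text), completed)
```

Returning a `completed` flag alongside the text lets the caller decide whether to store a truncated answer in the chat history or mark it as incomplete.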

coderabbitai · 2026-06-19T14:22:30Z

+        existing = list(self.collection.list_search_indexes())
+        if any(idx["name"] == self.VECTOR_INDEX_NAME for idx in existing):
+            return


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "rag_code.py" -type f

Repository: patchy631/ai-engineering-hub

Length of output: 311

🏁 Script executed:

cat -n audio-rag/rag_code.py | sed -n '80,125p'

Repository: patchy631/ai-engineering-hub

Length of output: 2214

Fail fast when the vector index is not queryable.

Line 90 returns immediately for existing indexes without verifying they are queryable, and Line 117 silently returns on timeout instead of raising an error. This allows non-ready indexes to proceed to ingestion/retrieval, causing failures to surface later during $vectorSearch operations instead of at setup time.

The fix requires two changes:

Call self._wait_for_index_ready() after confirming an index exists (line 90)

Raise TimeoutError in _wait_for_index_ready() when timeout expires (after line 116)

Proposed changes

         existing = list(self.collection.list_search_indexes())
         if any(idx["name"] == self.VECTOR_INDEX_NAME for idx in existing):
+            self._wait_for_index_ready()
             return
@@
     def _wait_for_index_ready(self, timeout=120):
         # Atlas builds search indexes asynchronously; poll until queryable.
         start = time.time()
         while time.time() - start < timeout:
             indexes = list(self.collection.list_search_indexes(self.VECTOR_INDEX_NAME))
             if indexes and indexes[0].get("queryable"):
                 return
             time.sleep(2)
+        raise TimeoutError(f"Vector search index {self.VECTOR_INDEX_NAME!r} was not queryable after {timeout}s")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/rag_code.py` around lines 88 - 90, In the code where you check for existing vector indexes and return early after verifying the index name matches self.VECTOR_INDEX_NAME, add a call to self._wait_for_index_ready() before returning to ensure the index is queryable. Additionally, in the _wait_for_index_ready() method, replace the silent return when timeout expires with raising a TimeoutError so that setup failures are detected immediately at configuration time rather than allowing non-ready indexes to proceed to ingestion or retrieval operations.

coderabbitai · 2026-06-19T14:22:30Z

+    def ingest_data(self, embeddata):
+        # Clear any previous data for this demo collection so re-uploads don't duplicate.
+        self.collection.delete_many({})
+
+        docs = [
+            {"context": context, self.EMBEDDING_FIELD: embedding}
+            for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
+        ]
+        for batch in batch_iterate(docs, INGEST_BATCH_SIZE):
+            self.collection.insert_many(batch)


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Scope vector documents per upload/session before deleting or retrieving.

Every pipeline instance uses the same collection, but Line 120 clears all documents and the retriever has no per-upload filter. In the Streamlit flow, a second upload can replace the data behind an existing session’s query_engine, causing cross-user transcript leakage or wrong answers.

🔒 Suggested direction

+import uuid
@@
     def __init__(
         self,
         db_name=DB_NAME,
         collection_name=COLLECTION_NAME,
         vector_dim=VECTOR_DIM,
+        source_id=None,
     ):
         self.db_name = db_name
         self.collection_name = collection_name
         self.vector_dim = vector_dim
+        self.source_id = source_id or uuid.uuid4().hex
@@
             "fields": [
                 {
                     "type": "vector",
                     "path": self.EMBEDDING_FIELD,
                     "numDimensions": self.vector_dim,
                     "similarity": "dotProduct",
-                }
+                },
+                {"type": "filter", "path": "source_id"},
             ]
@@
-        self.collection.delete_many({})
+        self.collection.delete_many({"source_id": self.source_id})
@@
-            {"context": context, self.EMBEDDING_FIELD: embedding}
+            {"context": context, "source_id": self.source_id, self.EMBEDDING_FIELD: embedding}
             for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
@@
             "$vectorSearch": {
                 "index": self.vector_db.VECTOR_INDEX_NAME,
                 "path": self.vector_db.EMBEDDING_FIELD,
                 "queryVector": query_embedding,
                 "numCandidates": num_candidates,
                 "limit": top_k,
+                "filter": {"source_id": self.vector_db.source_id},
             }

If the index already exists, this also needs an index migration/recreate path so the source_id filter field is available.

Also applies to: 138-146

🧰 Tools

🪛 Ruff (0.15.17)

[warning] 124-124: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/rag_code.py` around lines 118 - 127, The ingest_data method currently clears all documents from the shared collection without any scope or filtering, which causes data from different users/sessions to interfere with each other. Add a source_id or session identifier field to each document in the docs list comprehension alongside the context and embedding field, then modify the delete_many call to filter by this source_id instead of clearing all documents unconditionally. This ensures that each upload only removes its own documents from the collection. Additionally, update the retriever's query logic (the method that uses the collection for retrieval around lines 138-146) to filter results by the same source_id field so users only see their own data.

coderabbitai · 2026-06-19T14:22:30Z

+        docs = [
+            {"context": context, self.EMBEDDING_FIELD: embedding}
+            for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
+        ]


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect Python version declarations relevant to zip(strict=True).
fd -HI '^pyproject\.toml$' -x sh -c 'printf "\n== %s ==\n" "$1"; grep -nE "requires-python|target-version" "$1" || true' sh {}

Repository: patchy631/ai-engineering-hub

Length of output: 3695

Add strict=True to prevent silent data truncation.

Line 124 silently truncates if embeddings and contexts have mismatched counts, which risks data loss since the collection is already cleared. The project requires Python ≥3.11, which supports zip(strict=True).

🐛 Proposed fix

         docs = [
             {"context": context, self.EMBEDDING_FIELD: embedding}
-            for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
+            for context, embedding in zip(embeddata.contexts, embeddata.embeddings, strict=True)
         ]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        docs = [
-            {"context": context, self.EMBEDDING_FIELD: embedding}
-            for context, embedding in zip(embeddata.contexts, embeddata.embeddings)
-        ]
+        docs = [
+            {"context": context, self.EMBEDDING_FIELD: embedding}
+            for context, embedding in zip(embeddata.contexts, embeddata.embeddings, strict=True)
+        ]

🧰 Tools

🪛 Ruff (0.15.17)

[warning] 124-124: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/rag_code.py` around lines 122 - 125, The zip function in the docs list comprehension does not use the strict parameter, which can silently truncate data if embeddata.contexts and embeddata.embeddings have mismatched lengths. Since the project requires Python 3.11 or later, modify the zip call to include strict=True as a parameter to ensure an error is raised if the two iterables have different lengths, preventing silent data loss during the collection clearing and embedding operation.

Source: Linters/SAST tools

coderabbitai · 2026-06-19T14:22:30Z

+    transcripts = Transcribe(api_key=os.getenv("SPEECHMATICS_API_KEY")).transcribe_audio(file_path)
+    documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts]
+
+    embeddata = EmbedData()
+    embeddata.embed(documents)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stop the pipeline when transcription produces no documents.

If the audio has no transcribed turns, this still proceeds into embedding/RAG setup and can return a query engine with no retrievable context.

🐛 Proposed fix

     transcripts = Transcribe(api_key=os.getenv("SPEECHMATICS_API_KEY")).transcribe_audio(file_path)
-    documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts]
+    documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts if t["text"].strip()]
+    if not documents:
+        raise ValueError("No transcribable speech was found in the uploaded audio.")

     embeddata = EmbedData()
     embeddata.embed(documents)
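The empty-transcript guard can be isolated into a small helper for testing. In this sketch `build_documents` is a hypothetical name, and `format_speaker` is a simplified stand-in for the PR's helper (assumed to map `S1`/`S2` to `Speaker A`/`Speaker B`), so the real implementation may differ.

```python
from typing import Dict, List


def format_speaker(tag: str) -> str:
    """Stand-in for the PR's helper: 'S1' -> 'Speaker A', 'S2' -> 'Speaker B'."""
    if tag.startswith("S") and tag[1:].isdigit():
        n = int(tag[1:])
        if 1 <= n <= 26:
            return f"Speaker {chr(ord('A') + n - 1)}"
    return tag  # leave unknown tags unchanged


def build_documents(transcripts: List[Dict[str, str]]) -> List[str]:
    """Format speaker turns, skipping blank ones; fail if nothing remains."""
    documents = [
        f"{format_speaker(t['speaker'])}: {t['text']}"
        for t in transcripts
        if t["text"].strip()
    ]
    if not documents:
        raise ValueError("No transcribable speech was found in the uploaded audio.")
    return documents
```

Raising before `EmbedData().embed(...)` keeps the failure at the point where the cause is obvious, rather than returning a query engine with nothing to retrieve.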

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    transcripts = Transcribe(api_key=os.getenv("SPEECHMATICS_API_KEY")).transcribe_audio(file_path)
-    documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts]
-    embeddata = EmbedData()
-    embeddata.embed(documents)
+    transcripts = Transcribe(api_key=os.getenv("SPEECHMATICS_API_KEY")).transcribe_audio(file_path)
+    documents = [f"{format_speaker(t['speaker'])}: {t['text']}" for t in transcripts if t["text"].strip()]
+    if not documents:
+        raise ValueError("No transcribable speech was found in the uploaded audio.")
+    embeddata = EmbedData()
+    embeddata.embed(documents)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/rag_code.py` around lines 252 - 256, The transcription pipeline continues to embedding and RAG setup even when the transcripts are empty, resulting in an unusable query engine. After creating the documents list from the transcripts in the line that calls format_speaker, add a validation check to ensure the documents list is not empty. If documents is empty, raise an appropriate exception to stop the pipeline and prevent proceeding with the embed method call on the EmbedData instance.

coderabbitai · 2026-06-19T14:22:30Z

+## 📬 Stay Updated with Our Newsletter!
+**Get a FREE Data Science eBook** 📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! Stay in the loop with the latest tutorials, insights, and exclusive resources. [Subscribe now!](https://join.dailydoseofds.com)
+
+[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Verify the image URL references the correct repository.

The image link references patchy631/ai-engineering/blob/main/resources/..., but the PR is in patchy631/ai-engineering-hub. This path mismatch will likely result in a broken image (404) when rendered on GitHub, preventing the newsletter call-to-action from displaying correctly.

🔗 Correcting the image URL

-[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)
+[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering-hub/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)

Alternatively, verify the correct repository and path where this image resource actually resides.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)
+[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering-hub/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-rag/README.md` at line 65, The image URL in the markdown link on line 65 references the incorrect repository path. Update the URL to replace the repository reference from `patchy631/ai-engineering` to `patchy631/ai-engineering-hub` to match the correct repository where this PR resides. Ensure the full path to the image resource, including the `resources/join_ddods.png` file, matches the actual location in the correct repository so the image displays properly when rendered on GitHub.

Add audio RAG pipeline

8d99d50

coderabbitai Bot reviewed Jun 19, 2026
