A production-ready FAQ chatbot that combines semantic search (Gemini embeddings + FAISS) with lexical fuzzy matching and a clean UI in Streamlit (default) or Gradio. It supports model management (upload/retrain), fallback capture, and structured logging.
- Hybrid retrieval:
  - Semantic: Google Gemini `text-embedding-004` via `google-generativeai` (a minimal embedding sketch follows this list)
  - Lexical: RapidFuzz `token_set_ratio`
  - Fusion: `combined = α * semantic + (1-α) * lexical` with a configurable threshold
- Two UIs:
  - Streamlit app (`app.py`) – polished, turn-based chat UI with avatars/timestamps
  - Gradio app (`gradio_app.py`) – two tabs: Chat & Model Management
- Model management: Upload CSVs to retrain (merge) or train new (replace); FAISS index rebuilt on demand
- Fallback capture: Unanswered questions are appended to `fallback.csv` and downloadable from the UI
- Robust logging: Daily-rotating logs to `logs/daily.log` and aggregate `logs/app.log`
- Index persistence: FAISS artifacts in `.faiss/` auto-loaded on startup; reindex only when needed
- Python 3.11 + uv compatible (fast & reproducible envs)
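For orientation, here is a minimal sketch of the semantic step: embed FAQ questions with `google-generativeai` and index them in FAISS. The real logic lives in `utility.py`; the batching, `task_type`, and variable names below are illustrative assumptions.

```python
# Illustrative sketch: embed FAQ questions with Gemini and index them in FAISS.
# The actual implementation is in utility.py; names here are assumptions.
import os
import numpy as np
import faiss
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def embed(texts: list[str], model: str = "models/text-embedding-004") -> np.ndarray:
    # embed_content accepts a list of strings and returns one vector per item
    result = genai.embed_content(model=model, content=texts, task_type="retrieval_document")
    vecs = np.asarray(result["embedding"], dtype="float32")
    faiss.normalize_L2(vecs)  # unit-length vectors so inner product ≈ cosine
    return vecs

questions = ["How do I download my account statement?", "Where do I find the CVV2?"]
vectors = embed(questions)
index = faiss.IndexFlatIP(vectors.shape[1])  # inner-product index
index.add(vectors)
```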
AskMe-FAQ-Bot/
├─ .faiss/ # Persisted FAISS index & metadata
│ ├─ index_<hash>.bin
│ └─ meta_<hash>.parquet
├─ logs/
│ ├─ app.log # Aggregate log (append)
│ └─ daily.log # Rotating daily log
├─ Sample Data/
│ ├─ bank_faq.csv
│ └─ data_science_faq.csv
├─ .env # Environment variables (see below)
├─ app.py # Streamlit UI (default app)
├─ gradio_app.py # Gradio UI (optional)
├─ basic_faq.csv # Base FAQ KB (question_id, question, answer)
├─ fallback.csv # Auto-appended unanswered questions
├─ utility.py # Embedding, FAISS, hybrid search, fallback
├─ preprocessing.py # Sanitization & normalization
├─ logger.py # Logging setup (console + files)
├─ config.py # Config & paths
├─ pyproject.toml # Project metadata & dependencies
└─ uv.lock # Resolved lockfile (if using uv)
- Python: 3.11
- Package manager: `uv` (recommended) or `pip`
- Google Generative AI access + API key

You can use `pip` if you prefer; `uv` is just faster.
# 1) Create & activate a virtual environment
uv venv .venv
# Windows PowerShell:
# .venv\Scripts\Activate.ps1
# macOS/Linux:
# source .venv/bin/activate
# 2) Install dependencies
uv pip install -e .
# or (if you want to lock & sync)
# uv pip sync pyproject.toml

Create `.env` in the project root with at least:
GEMINI_API_KEY=YOUR_API_KEY_HERE

Note: The code automatically normalizes Gemini model ids to include the `models/` prefix, so both `text-embedding-004` and `models/text-embedding-004` work.
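A minimal version of that normalization might look like the following; the project's actual helper may differ in name and location.

```python
# Illustrative helper: ensure a Gemini model id carries the "models/" prefix.
def normalize_model_id(model_id: str) -> str:
    return model_id if model_id.startswith("models/") else f"models/{model_id}"

assert normalize_model_id("text-embedding-004") == "models/text-embedding-004"
assert normalize_model_id("models/text-embedding-004") == "models/text-embedding-004"
```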
Run the Streamlit app:

`streamlit run app.py`

- Chat tab: ask questions; hybrid retrieval returns the best match.
- Model Management tab: upload a CSV and choose Retrain (merge) or Train New (replace).
- Fallbacks: download `fallback.csv` from the sidebar.

Or run the Gradio app:

`python gradio_app.py`

- Two tabs (Chatbot Interface & Model Management) with progress output.
`basic_faq.csv` (and any uploaded CSV) must include these columns:

| column | description |
|---|---|
| `question_id` | unique identifier for each FAQ |
| `question` | the FAQ question text |
| `answer` | the corresponding answer text |
Example:
question_id,question,answer
q001,How do I download my account statement?,In the app/netbanking, go to Accounts > Statements...
q002,Where do I find the CVV2?,The CVV2 is the 3-digit code on the back of your card...
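A simple validation pass can reject uploads that are missing required columns before retraining. This is only a sketch; the app's real sanitization lives in `preprocessing.py`.

```python
# Illustrative upload check; preprocessing.py performs the real sanitization.
import pandas as pd

REQUIRED_COLUMNS = {"question_id", "question", "answer"}

def load_faq_csv(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"CSV is missing required columns: {sorted(missing)}")
    # Drop empty rows and duplicate ids before (re)training
    return df.dropna(subset=["question", "answer"]).drop_duplicates("question_id")
```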
- Exact normalized match: fast lookup on a lowercased/alphanumeric map
- Semantic search: Gemini embeddings → FAISS IP on L2-normalized vectors (≈ cosine)
- Lexical fuzzy: RapidFuzz token-set ratio in [0,1]
- Score fusion: `combined = α * semantic + (1-α) * lexical` (see the sketch after this list)
- Thresholding: if `combined >= THRESHOLD`, return the answer; otherwise record a fallback and reply politely (via Gemini or static text)
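Putting the pieces together, here is a hedged sketch of the fusion and thresholding step. The `ALPHA` and `THRESHOLD` values, function names, and the `embed_fn` parameter are illustrative; the project's real implementation is in `utility.py`.

```python
# Hedged sketch of hybrid fusion of semantic (FAISS) and lexical (RapidFuzz) scores.
from rapidfuzz import fuzz

ALPHA, THRESHOLD = 0.7, 0.6  # illustrative values; the app's threshold is configurable

def hybrid_search(query, index, questions, answers, embed_fn):
    qvec = embed_fn([query])                  # e.g. the embed() sketch above
    sem_scores, ids = index.search(qvec, 5)   # inner product ≈ cosine similarity
    best_score, best_id = -1.0, None
    for sem, i in zip(sem_scores[0], ids[0]):
        if i < 0:                             # fewer than k vectors in the index
            continue
        lex = fuzz.token_set_ratio(query, questions[i]) / 100.0  # scale to [0, 1]
        combined = ALPHA * float(sem) + (1 - ALPHA) * lex
        if combined > best_score:
            best_score, best_id = combined, int(i)
    if best_id is not None and best_score >= THRESHOLD:
        return answers[best_id], best_score
    return None, best_score                   # caller records the fallback
```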
- On startup, the app first tries to load `.faiss/index_*.bin` & `meta_*.parquet` (a simplified sketch follows this list).
- If no index exists and `basic_faq.csv` is available, it builds a fresh index.
- If the embedding model changes (dimension mismatch), the index is rebuilt automatically.
- Re-training from the UI updates `basic_faq.csv` and rebuilds FAISS.
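The load-then-build behavior could be sketched roughly as follows. File naming (the `<hash>` scheme), rebuild triggers, and the `embed_fn` parameter are simplified assumptions here; see `utility.py` and `config.py` for the real logic.

```python
# Simplified load-or-build sketch for the persisted FAISS artifacts.
from pathlib import Path
import faiss
import pandas as pd

FAISS_DIR = Path(".faiss")

def load_or_build_index(embed_fn, csv_path="basic_faq.csv"):
    index_files = sorted(FAISS_DIR.glob("index_*.bin"))
    meta_files = sorted(FAISS_DIR.glob("meta_*.parquet"))
    if index_files and meta_files:
        # Fast path: reuse persisted artifacts from a previous run
        return faiss.read_index(str(index_files[-1])), pd.read_parquet(meta_files[-1])
    # No artifacts yet: build a fresh index from the base knowledge base
    df = pd.read_csv(csv_path)
    vectors = embed_fn(df["question"].tolist())
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    FAISS_DIR.mkdir(exist_ok=True)
    faiss.write_index(index, str(FAISS_DIR / "index_fresh.bin"))
    df.to_parquet(FAISS_DIR / "meta_fresh.parquet")
    return index, df
```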
- Logs to console and to files under `logs/`: `logs/app.log` (aggregate append) and `logs/daily.log` (rotated at midnight, 14 backups); a minimal setup sketch follows below.
- Typical entries include: startup, indexing phases, query latencies, training steps, and errors with stack traces.
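A minimal logging setup matching this description (console, aggregate file, and a midnight-rotating daily file with 14 backups) might look like this; `logger.py` may differ in formatting and handler details.

```python
# Minimal sketch of the logging setup; logger.py may differ in detail.
import logging
import os
from logging.handlers import TimedRotatingFileHandler

def setup_logger(name: str = "askme") -> logging.Logger:
    os.makedirs("logs", exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    handlers = [
        logging.StreamHandler(),                                   # console
        logging.FileHandler("logs/app.log"),                       # aggregate, append mode
        TimedRotatingFileHandler("logs/daily.log", when="midnight", backupCount=14),
    ]
    for handler in handlers:
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```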
- Ensure `basic_faq.csv` exists (you can copy from Sample Data or use the generator).
- Start Streamlit: `streamlit run app.py`
- Ask: `download account statement` → should match the canonical Q “How do I download my account statement?”
- Try paraphrases; hybrid retrieval should handle non-exact wording.
- Invalid model name: Ensure your `.env` is set and readable. The app normalizes model ids; both `text-embedding-004` and `models/text-embedding-004` are fine.
- Index rebuilds on every start: The app tries to load an existing index before building one. If you still see rebuilds, check file permissions on `.faiss/` and confirm the embedding model hasn’t changed.
- FAISS dimension mismatch: Happens if you switch embedding models. The index is rebuilt automatically from `basic_faq.csv`.
- No answers found: Check `THRESHOLD` and whether your paraphrase is too far from the dataset. New questions go into `fallback.csv` for curation.
- Never log raw API keys.
- The CSVs may contain sensitive info; treat repository access accordingly.
- For production: consider secrets managers, role-based access, HTTPS termination, and a managed vector store if needed.
- Multi-tenant or per-department FAQ banks
- Admin UI for editing Q&A and curating fallbacks inline
- Citation/highlighting (which FAQ matched and why)
- Batch eval harness (precision/recall) for new datasets
- Dockerfile + CI/CD templates
- Google Generative AI (`google-generativeai`)
- FAISS (Facebook/Meta AI)
- Streamlit / Gradio
- RapidFuzz
Choose and add a license (e.g., MIT) if this project will be shared publicly.