LLM-Powered Booking Analytics & QA System

This project provides a comprehensive solution for processing hotel booking data, extracting business insights, and enabling retrieval-augmented question answering (RAG) through an LLM-powered API. The system combines data analytics, vector-based retrieval with FAISS, and a lightweight language model to answer questions about hotel bookings.

Overview

The project is designed to:

Process hotel booking records: Clean and preprocess raw data.
Extract analytics: Generate insights such as revenue trends, cancellation rates, geographical distribution of bookings, and booking lead time.
Answer questions: Use vector-based retrieval and a pre-trained language model (DistilGPT2) to answer queries regarding booking data.
Expose functionality via REST API: FastAPI endpoints provide access to analytics, Q&A, and system health checks.

Project Structure

.
├── data
│   └── hotel_bookings.csv
├── images
├── notes
│   └── bookinganalytics.pdf
├── app.py
├── embeddings.npy
├── faiss_index.bin
├── report.docx
└── hotel_bookings_preprocessed.csv

data/: Contains the raw hotel bookings dataset.
images/: Folder to store any generated or reference images.
notes/: Documentation or project notes (e.g., PDF reports).
app.py: The main FastAPI application.
embeddings.npy & faiss_index.bin: Artifacts for the FAISS vector store.
report.docx: Short report explaining implementation choices & challenges.
hotel_bookings_preprocessed.csv: Preprocessed dataset ready for analytics.

Installation & Setup

Follow these steps to get the project running on your machine:

Clone the repository:

git clone https://github.com/yourusername/llm-booking-analytics.git
cd llm-booking-analytics

Set up the virtual environment and install dependencies:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Prepare the dataset:
- Ensure that the data/hotel_bookings.csv file is available.
- Run your preprocessing script (e.g., via a Jupyter Notebook) to generate hotel_bookings_preprocessed.csv.

Start the FastAPI server:

uvicorn app:app --host 0.0.0.0 --port 8000 --reload

For more detailed instructions, refer to the FastAPI documentation.

Data Collection & Preprocessing

Dataset: A sample hotel bookings dataset (CSV) is used. Other datasets also work, provided they contain the fields the analytics rely on (arrival date, cancellation flag, country, lead time, and price).
Preprocessing:
- Handle missing values and format inconsistencies.
- Convert date fields (e.g., arrival_date) into datetime objects.
- Use appropriate data types for numerical fields (e.g., int8, float32).
Storage: Preprocessed data is saved as hotel_bookings_preprocessed.csv and loaded in the application for analytics and QA.
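
The preprocessing steps above could look roughly like the following pandas sketch. It is not the project's actual notebook code: the column names (arrival_date, lead_time, adr, is_canceled, country), the inline sample data, and the fill policy are all assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny inline sample standing in for data/hotel_bookings.csv.
# Column names are assumptions, not confirmed by the project.
RAW_CSV = """arrival_date,lead_time,adr,is_canceled,country
2017-07-01,10,100.5,0,PRT
2017-07-15,45,,1,
not-a-date,3,80.0,0,ESP
2017-08-02,120,150.0,0,FRA
"""


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Parse dates; values that cannot be parsed become NaT and are dropped.
    df["arrival_date"] = pd.to_datetime(df["arrival_date"], errors="coerce")
    df = df.dropna(subset=["arrival_date"]).copy()
    # Fill remaining gaps with simple defaults (one possible policy).
    df["adr"] = df["adr"].fillna(df["adr"].median())
    df["country"] = df["country"].fillna("Unknown")
    # Downcast numeric columns to compact dtypes.
    df["is_canceled"] = df["is_canceled"].astype("int8")
    df["lead_time"] = df["lead_time"].astype("int16")
    df["adr"] = df["adr"].astype("float32")
    return df.reset_index(drop=True)


raw = pd.read_csv(io.StringIO(RAW_CSV))
clean = preprocess(raw)
# In the project, the result would then be written out, for example:
# clean.to_csv("hotel_bookings_preprocessed.csv", index=False)
```

Note that a CSV file does not store dtypes, so the application would need to re-apply the date parsing and downcasting when it loads the preprocessed file.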

Analytics & Reporting

The system computes various analytics from the preprocessed data:

Revenue Trends: Aggregated monthly revenue calculated using the arrival date.
Cancellation Rate: Percentage of total bookings that were canceled.
Geographical Distribution: Top 5 countries based on the number of bookings.
Booking Lead Time Distribution: Insights into the lead times for bookings.
Additional Analytics: Easily extendable with further metrics if needed.

Analytics are computed with pandas and NumPy; the analysis itself is documented in notes/bookinganalytics.pdf.
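
As a rough illustration of the metrics listed above, the sketch below computes them on a four-row sample. The column names and the presence of a ready-made revenue column are assumptions; the project may derive revenue differently (for example from the nightly rate and length of stay).

```python
import pandas as pd

# Hypothetical sample; column names (arrival_date, is_canceled, country,
# lead_time, revenue) are assumptions about the preprocessed CSV.
df = pd.DataFrame(
    {
        "arrival_date": pd.to_datetime(
            ["2017-07-03", "2017-07-20", "2017-08-05", "2017-08-09"]
        ),
        "is_canceled": [0, 1, 0, 0],
        "country": ["PRT", "GBR", "PRT", "FRA"],
        "lead_time": [5, 40, 90, 12],
        "revenue": [200.0, 0.0, 450.0, 300.0],
    }
)


def compute_analytics(df: pd.DataFrame) -> dict:
    # Share of bookings flagged as canceled, as a percentage.
    cancellation_rate = round(float(df["is_canceled"].mean() * 100), 2)
    # Sum revenue per calendar month, keyed by the month-end date.
    monthly = df.groupby(df["arrival_date"].dt.to_period("M"))["revenue"].sum()
    monthly_trend = {
        period.end_time.strftime("%Y-%m-%d"): round(float(total), 2)
        for period, total in monthly.items()
    }
    # Five most frequent countries by booking count.
    counts = df["country"].value_counts().head(5)
    top_countries = {str(code): int(n) for code, n in counts.items()}
    # Simple summary of the lead time in days.
    lead_time = {
        "mean": float(df["lead_time"].mean()),
        "median": float(df["lead_time"].median()),
        "max": int(df["lead_time"].max()),
    }
    return {
        "cancellation_rate": cancellation_rate,
        "monthly_revenue_trend": monthly_trend,
        "top_countries": top_countries,
        "lead_time": lead_time,
    }


analytics = compute_analytics(df)
```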

Retrieval-Augmented Question Answering (RAG)

The RAG system integrates several components:

Embeddings:
- Uses SentenceTransformer (paraphrase-MiniLM-L6-v2) to compute vector embeddings for each booking record.
- The embeddings are saved to embeddings.npy and indexed with FAISS; the index is persisted as faiss_index.bin.
Question Answering:
- For a given user question, the system computes its embedding and retrieves the top k similar booking records.
- The retrieved records form the context for the LLM prompt.
- Uses DistilGPT2 to generate an answer based on the context.

How It Works

Embedding Calculation: Each booking record is converted to a text string and embedded.
Vector Store: FAISS is used to quickly search and retrieve similar records.
Prompt Formation: Retrieved records are concatenated to form a context prompt.
LLM Inference: The language model generates a response based on the provided prompt.
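
The retrieval and prompt-formation steps can be sketched without downloading any model. The example below deliberately swaps in stand-ins: a toy bag-of-words embedding replaces SentenceTransformer, and a brute-force NumPy inner-product search replaces the FAISS index. The record fields, the helper names, and the prompt template are hypothetical, and the final generation step with DistilGPT2 is left out.

```python
import numpy as np

# Hypothetical booking records; field names are assumptions.
records = [
    {"hotel": "Resort", "country": "PRT", "month": "July", "adr": 120.0},
    {"hotel": "City", "country": "GBR", "month": "August", "adr": 95.5},
    {"hotel": "City", "country": "FRA", "month": "July", "adr": 110.0},
]


def record_to_text(record: dict) -> str:
    # Flatten one record into a "key value key value ..." string.
    return " ".join(f"{key} {value}" for key, value in record.items())


def build_vocab(texts: list) -> dict:
    # Map every token seen in the corpus to a vector position.
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def toy_embed(text: str, vocab: dict) -> np.ndarray:
    # Stand-in for a sentence embedding: normalised bag-of-words counts.
    vec = np.zeros(len(vocab), dtype="float32")
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def top_k(query_vec: np.ndarray, matrix: np.ndarray, k: int) -> np.ndarray:
    # Exhaustive inner-product search over all stored vectors.
    scores = matrix @ query_vec
    return np.argsort(-scores, kind="stable")[:k]


def build_prompt(question: str, indices) -> str:
    # Concatenate the retrieved records into the context for the LLM.
    context = "\n".join(record_to_text(records[i]) for i in indices)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


texts = [record_to_text(r) for r in records]
vocab = build_vocab(texts)
matrix = np.stack([toy_embed(t, vocab) for t in texts])

question = "bookings from country GBR in August"
indices = top_k(toy_embed(question, vocab), matrix, k=2)
prompt = build_prompt(question, indices)
```

In the real pipeline, the prompt would be passed to the language model, and the quality of the answer depends heavily on which records the retrieval step returns.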

API Endpoints

The project exposes three main REST API endpoints:

POST `/analytics`

Description: Returns the computed analytics including cancellation rate, revenue trends, and top booking countries.

Example Response:

{
    "cancellation_rate": 12.34,
    "monthly_revenue_trend": {
        "2017-07-31": 123456.78,
        "2017-08-31": 234567.89
    },
    "top_countries": {
        "USA": 1500,
        "GBR": 900,
        "FRA": 800
    }
}

POST `/ask`

Description: Accepts a natural language question about the booking data and returns an answer generated by the LLM.

Request Body:

{
    "question": "What is the average revenue per booking in July 2017?"
}

Example Response:

{
    "question": "What is the average revenue per booking in July 2017?",
    "answer": "Based on the retrieved records, the average revenue per booking in July 2017 was approximately $XXX.XX."
}

GET `/health`

Description: Health check endpoint that returns the status of the system along with details like the FAISS index size and model status.

Example Response:

{
    "status": "ok",
    "faiss_index_size": 10000,
    "model_loaded": true
}
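
A minimal client for the three endpoints might look like the sketch below, using only the standard library. The base URL assumes the server was started locally on port 8000 as described in Installation & Setup; the helper names build_request and call are hypothetical.

```python
import json
import urllib.request

# Assumption: the server runs locally, started with uvicorn on port 8000.
BASE_URL = "http://localhost:8000"


def build_request(path, method="GET", payload=None):
    # Encode the optional JSON body and set the matching content type.
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def call(request, timeout=30.0):
    # Send the request and decode the JSON response body.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


health_request = build_request("/health")
analytics_request = build_request("/analytics", method="POST")
ask_request = build_request(
    "/ask",
    method="POST",
    payload={"question": "What is the average revenue per booking in July 2017?"},
)

# With the server running, each request can be sent like this:
# print(call(health_request))
```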

Example Queries

The following sample queries show the request format for each endpoint and the shape of the expected response. The figures in the answers are illustrative placeholders, not values computed from the dataset.

Sample Test Queries for `/analytics` Endpoint

Query:

Request: POST /analytics

Expected Answer:

{
  "cancellation_rate": 12.34,
  "monthly_revenue_trend": {
    "2017-07-31": 50000.00,
    "2017-08-31": 52000.00,
    "2017-09-30": 48000.00
  },
  "top_countries": {
    "USA": 1500,
    "GBR": 1200,
    "DEU": 1000,
    "FRA": 800,
    "ITA": 600
  }
}

Explanation:
- The cancellation rate is 12.34% of total bookings.
- Monthly revenue trends show totals for specific months.
- Top countries are listed with the count of bookings.

Sample Test Queries for `/ask` Endpoint

Query:

Request: POST /ask

{
  "question": "Show me total revenue for July 2017."
}

Expected Answer:

{
  "question": "Show me total revenue for July 2017.",
  "answer": "The total revenue for July 2017 is $50,000.00."
}

Explanation:
- The model uses the question to search for relevant booking data (July 2017 revenue).

Query:

Request: POST /ask

{
  "question": "Which locations had the highest booking cancellations?"
}

Expected Answer:

{
  "question": "Which locations had the highest booking cancellations?",
  "answer": "The locations with the highest booking cancellations are: USA (500 cancellations), GBR (400 cancellations), and DEU (350 cancellations)."
}

Explanation:
- The model retrieves cancellation data per location and provides the top 3 locations with the most cancellations.

Query:

Request: POST /ask

{
  "question": "What is the average price of a hotel booking?"
}

Expected Answer:

{
  "question": "What is the average price of a hotel booking?",
  "answer": "The average price of a hotel booking is $150.00 per night."
}

Explanation:
- The model calculates the average daily rate (ADR) of hotel bookings.

Query:

Request: POST /ask

{
  "question": "How many bookings were made in 2019?"
}

Expected Answer:

{
  "question": "How many bookings were made in 2019?",
  "answer": "A total of 12,000 bookings were made in 2019."
}

Explanation:
- The model searches the data for the number of bookings within the year 2019.

Query:

Request: POST /ask

{
  "question": "What was the booking lead time distribution?"
}

Expected Answer:

{
  "question": "What was the booking lead time distribution?",
  "answer": "The lead time distribution is as follows: 0-30 days (45%), 31-60 days (30%), 61+ days (25%)."
}

Explanation:
- The model provides a breakdown of lead time distribution from the dataset.

Performance Evaluation

Accuracy: The accuracy of Q&A responses can be evaluated by comparing generated answers with expected outcomes for a set of test queries.
Response Time: Middleware in app.py logs the processing time for each API request.
Optimization: Ensure that the FAISS index is pre-built and stored to speed up retrieval. Use batching and efficient preprocessing where applicable.
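
One simple way to score Q&A accuracy, as described in the first item above, is to check whether each generated answer contains an expected key phrase. The sketch below is an assumption about how such a check could be written; evaluate, stub_answer, and the test cases are hypothetical, and the stub stands in for a real call to the /ask endpoint.

```python
def evaluate(answer_fn, test_cases):
    """Return the fraction of questions whose answer contains the expected phrase."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected in test_cases:
        answer = answer_fn(question)
        # Case-insensitive substring match; a deliberately lenient criterion.
        if expected.lower() in answer.lower():
            hits += 1
    return hits / len(test_cases)


def stub_answer(question):
    # Stand-in for the real system; replace with a call to POST /ask.
    canned = {
        "Which country has the most bookings?": "Most bookings come from PRT.",
        "What is the cancellation rate?": "I am not sure.",
    }
    return canned.get(question, "")


test_cases = [
    ("Which country has the most bookings?", "PRT"),
    ("What is the cancellation rate?", "12.34"),
]
accuracy = evaluate(stub_answer, test_cases)
```

Substring matching is crude: it cannot tell a correct number embedded in a wrong sentence from a correct answer, so it is best treated as a first filter before manual review.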

Evaluation Metrics

API Response Time: Measured via the X-Process-Time header.
Retrieval Speed: Search time grows with the number of vectors in the FAISS index (reported as faiss_index_size by /health), so track retrieval latency as the index grows.
LLM Response Quality: Validate through sample queries and user feedback.
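
The X-Process-Time header mentioned above can be produced by a small timing middleware. The following is a sketch written as plain ASGI middleware so that it runs without any web framework installed; it is not the code from app.py, and the class name and the stand-in demo_app are hypothetical.

```python
import asyncio
import time


class ProcessTimeMiddleware:
    """ASGI middleware that adds an X-Process-Time header (seconds)."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Pass through anything that is not an HTTP request.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()

        async def send_with_timing(message):
            # Headers can only be set on the response-start message.
            if message["type"] == "http.response.start":
                elapsed = time.perf_counter() - start
                headers = list(message.get("headers", []))
                headers.append((b"x-process-time", f"{elapsed:.6f}".encode()))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_timing)


async def demo_app(scope, receive, send):
    # Stand-in ASGI app used only to exercise the middleware.
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def run_demo():
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    wrapped = ProcessTimeMiddleware(demo_app)
    scope = {"type": "http", "method": "GET", "path": "/health"}
    asyncio.run(wrapped(scope, receive, send))
    return sent


messages = run_demo()
```

Because FastAPI applications are ASGI applications, a class like this could be registered with app.add_middleware(ProcessTimeMiddleware). The measured time covers the work done up to the start of the response.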

Deployment

To deploy the system:

Local Deployment: Run the FastAPI app locally using Uvicorn.
Production Deployment: Consider deploying with a production-ready ASGI server (e.g., Gunicorn with Uvicorn workers) behind a reverse proxy.
Containerization: Package the solution with Docker for ease of deployment.
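
For the production option above, Gunicorn reads its settings from a Python file. The fragment below is a sketch of such a file; the file name, port, worker cap, and timeout are assumptions to adjust for the target machine, and the worker class path should be checked against the installed Uvicorn version.

```python
# gunicorn.conf.py (hypothetical file, not part of the repository listing)
import multiprocessing

# Listen on all interfaces, matching the port used with uvicorn locally.
bind = "0.0.0.0:8000"

# Run the ASGI app through Uvicorn's Gunicorn worker class.
worker_class = "uvicorn.workers.UvicornWorker"

# Each worker process loads its own copy of the model and FAISS index,
# so the worker count is capped to keep memory use in check.
workers = min(4, multiprocessing.cpu_count())

# Allow slow LLM generations to finish before a worker is restarted.
timeout = 120
```

With this file in the project root, the server could be started with gunicorn -c gunicorn.conf.py app:app.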

Future Enhancements

Real-time Data Updates: Integrate with a database (SQLite, PostgreSQL) to update analytics as new data arrives.
Query History Tracking: Implement a logging mechanism to track user queries.
Additional Endpoints: For example, a dedicated endpoint for retrieving historical query logs.
Enhanced Analytics: Add more metrics and visualizations based on user needs.

Bonus Features

Health Check Enhancements: Extend the /health endpoint to verify connectivity with external services.
User Authentication: Secure API endpoints with authentication and authorization.

Troubleshooting & Tips

Preprocessed Data Not Found:
Ensure that hotel_bookings_preprocessed.csv is generated and placed in the project root before starting the server.
Dependency Issues:
Verify that all dependencies listed in requirements.txt are installed in your virtual environment.
Performance Bottlenecks:
For large datasets, consider increasing the chunk size when reading CSVs and optimizing the FAISS index building process.
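
Chunked CSV reading, as suggested in the last tip, could be sketched as follows. The column names and dtypes are assumptions, and the inline sample stands in for the real file; load_in_chunks is a hypothetical helper.

```python
import io

import pandas as pd

# Inline sample standing in for a large CSV file (assumed columns).
SAMPLE_CSV = """lead_time,adr,country
10,100.5,PRT
45,90.0,GBR
3,80.0,ESP
120,150.0,FRA
7,60.25,PRT
"""


def load_in_chunks(source, chunksize=50_000):
    # Read the file piece by piece with compact dtypes, then combine.
    dtypes = {"lead_time": "int16", "adr": "float32"}
    chunks = []
    for chunk in pd.read_csv(source, chunksize=chunksize, dtype=dtypes):
        # Per-chunk filtering or aggregation could happen here.
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)


bookings = load_in_chunks(io.StringIO(SAMPLE_CSV), chunksize=2)
```

A larger chunksize means fewer iterations but more memory per chunk; the right value depends on the available RAM and the width of the dataset.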

Common Issues

FAISS Index Not Loading: Check that faiss_index.bin and embeddings.npy exist. If not, the application will automatically compute and save these.
LLM Response Latency: If responses are slow, adjust the generation parameters. Lowering max_length shortens generation time; temperature changes the randomness of the output, not its speed.
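
The load-or-compute behaviour described for the FAISS artifacts follows a common caching pattern, sketched below for the embeddings file only. This is an illustration under assumptions, not the code from app.py: load_or_build_embeddings and fake_build are hypothetical, and a random matrix stands in for real embeddings. The index file could be handled the same way with faiss.read_index and faiss.write_index.

```python
import os
import tempfile

import numpy as np


def load_or_build_embeddings(path, build_fn):
    """Return (embeddings, loaded_from_disk), computing and saving on a miss."""
    if os.path.exists(path):
        return np.load(path), True
    embeddings = np.asarray(build_fn(), dtype="float32")
    np.save(path, embeddings)
    return embeddings, False


build_calls = []


def fake_build():
    # Stand-in for the expensive embedding step; records that it ran.
    build_calls.append(1)
    rng = np.random.default_rng(0)
    return rng.random((3, 4))


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "embeddings.npy")
    first, first_cached = load_or_build_embeddings(cache_path, fake_build)
    second, second_cached = load_or_build_embeddings(cache_path, fake_build)
```

The second call reads the saved file instead of recomputing, which is why deleting a stale embeddings.npy forces a rebuild on the next start.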

Contact

For questions or feedback, please reach out via:

GitHub: @arindal1
LinkedIn: Arindal Char

References & Useful Links

FastAPI Documentation: FastAPI
Hugging Face Transformers: Transformers
Datasets: Sample Hotel Bookings Dataset
