LLM-Powered Booking Analytics & QA System

This project provides a comprehensive solution for processing hotel booking data, extracting business insights, and enabling retrieval-augmented question answering (RAG) through an LLM-powered API. The system combines data analytics, vector-based retrieval with FAISS, and a lightweight language model to answer questions about hotel bookings.


Table of Contents

  • Overview
  • Project Structure
  • Installation & Setup
  • Data Collection & Preprocessing
  • Analytics & Reporting
  • Retrieval-Augmented Question Answering (RAG)
  • API Endpoints
  • Example Queries
  • Performance Evaluation
  • Deployment
  • Future Enhancements
  • Troubleshooting & Tips
  • Contact

Overview

The project is designed to:

  • Process hotel booking records: Clean and preprocess raw data.
  • Extract analytics: Generate insights such as revenue trends, cancellation rates, geographical distribution of bookings, and booking lead time.
  • Answer questions: Use vector-based retrieval and a pre-trained language model (DistilGPT2) to answer queries regarding booking data.
  • Expose functionality via REST API: FastAPI endpoints provide access to analytics, Q&A, and system health checks.

Project Structure

.
├── data
│   └── hotel_bookings.csv
├── images
├── notes
│   └── bookinganalytics.pdf
├── app.py
├── embeddings.npy
├── faiss_index.bin
├── report.docx
└── hotel_bookings_preprocessed.csv
  • data/: Contains the raw hotel bookings dataset.
  • images/: Folder to store any generated or reference images.
  • notes/: Documentation or project notes (e.g., PDF reports).
  • app.py: The main FastAPI application.
  • embeddings.npy & faiss_index.bin: Artifacts for the FAISS vector store.
  • report.docx: Short report explaining implementation choices & challenges.
  • hotel_bookings_preprocessed.csv: Preprocessed dataset ready for analytics.

Installation & Setup

Follow these steps to get the project running on your machine:

  1. Clone the repository:

    git clone https://github.com/yourusername/llm-booking-analytics.git
    cd llm-booking-analytics
  2. Set up the virtual environment and install dependencies:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
  3. Prepare the dataset:

    • Ensure that the data/hotel_bookings.csv file is available.
    • Run your preprocessing script (e.g., via a Jupyter Notebook) to generate hotel_bookings_preprocessed.csv.
  4. Start the FastAPI server:

    uvicorn app:app --host 0.0.0.0 --port 8000 --reload

For more detailed instructions, refer to the FastAPI documentation.


Data Collection & Preprocessing

  • Dataset: A sample hotel bookings dataset (CSV) is used. You may also use other relevant datasets as long as they contain required fields.
  • Preprocessing:
    • Handle missing values and format inconsistencies.
    • Convert date fields (e.g., arrival_date) into datetime objects.
    • Use appropriate data types for numerical fields (e.g., int8, float32).
  • Storage: Preprocessed data is saved as hotel_bookings_preprocessed.csv and loaded in the application for analytics and QA.
More on Data Cleaning
  • Missing Values: Checked and handled during preprocessing.
  • Data Types: Fields such as is_canceled, lead_time, adr, etc., are cast to optimal types for performance.
  • Aggregation: Data is grouped (e.g., by month) to compute analytics like revenue trends.
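
As a rough illustration of these steps, the sketch below shows how the preprocessing could look with pandas. The column names (arrival_date_year, arrival_date_month, arrival_date_day_of_month, country, children) follow the common public hotel bookings dataset and are assumptions; adapt them to your CSV.

    import pandas as pd

    # Load the raw dataset (path taken from the project structure).
    df = pd.read_csv("data/hotel_bookings.csv")

    # Handle missing values (assumed columns from the public dataset).
    df["country"] = df["country"].fillna("UNK")
    df["children"] = df["children"].fillna(0)

    # Build a single datetime column from the split arrival-date fields.
    df["arrival_date"] = pd.to_datetime(
        df["arrival_date_year"].astype(str)
        + "-" + df["arrival_date_month"]
        + "-" + df["arrival_date_day_of_month"].astype(str),
        format="%Y-%B-%d",
    )

    # Cast numeric fields to compact types for faster analytics.
    df["is_canceled"] = df["is_canceled"].astype("int8")
    df["lead_time"] = df["lead_time"].astype("int16")
    df["adr"] = df["adr"].astype("float32")

    # Persist the cleaned dataset for the API to load at startup.
    df.to_csv("hotel_bookings_preprocessed.csv", index=False)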

Analytics & Reporting

The system computes various analytics from the preprocessed data:

  • Revenue Trends: Aggregated monthly revenue calculated using the arrival date.
  • Cancellation Rate: Percentage of total bookings that were canceled.
  • Geographical Distribution: Top 5 countries based on the number of bookings.
  • Booking Lead Time Distribution: Insights into the lead times for bookings.
  • Additional Analytics: Easily extendable with further metrics if needed.

Analytics are computed with pandas and NumPy; a detailed walkthrough is provided in notes/bookinganalytics.pdf.
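
A minimal sketch of how these metrics could be computed from the preprocessed CSV is shown below. The stay-length and revenue formulas (ADR × total nights, cancellations excluded) are assumptions, not necessarily the exact logic in app.py.

    import pandas as pd

    df = pd.read_csv("hotel_bookings_preprocessed.csv", parse_dates=["arrival_date"])

    # Cancellation rate: percentage of all bookings that were canceled.
    cancellation_rate = round(df["is_canceled"].mean() * 100, 2)

    # Monthly revenue trend: ADR x total nights, summed per arrival month (non-canceled only).
    active = df[df["is_canceled"] == 0].copy()
    active["revenue"] = active["adr"] * (
        active["stays_in_week_nights"] + active["stays_in_weekend_nights"]
    )
    monthly_revenue = active.groupby(active["arrival_date"].dt.to_period("M"))["revenue"].sum()

    # Geographical distribution: top 5 countries by number of bookings.
    top_countries = df["country"].value_counts().head(5).to_dict()

    # Lead time distribution in coarse buckets.
    lead_buckets = pd.cut(df["lead_time"], bins=[0, 30, 60, float("inf")],
                          labels=["0-30", "31-60", "61+"], include_lowest=True)
    lead_distribution = lead_buckets.value_counts(normalize=True).round(2).to_dict()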


Retrieval-Augmented Question Answering (RAG)

The RAG system integrates several components:

  • Embeddings:

    • Uses SentenceTransformer (paraphrase-MiniLM-L6-v2) to compute vector embeddings for each booking record.
    • These embeddings are stored in a FAISS index (faiss_index.bin and embeddings.npy).
  • Question Answering:

    • For a given user question, the system computes its embedding and retrieves the top k similar booking records.
    • The retrieved records form the context for the LLM prompt.
    • Uses DistilGPT2 to generate an answer based on the context.
How It Works
  1. Embedding Calculation: Each booking record is converted to a text string and embedded.
  2. Vector Store: FAISS is used to quickly search and retrieve similar records.
  3. Prompt Formation: Retrieved records are concatenated to form a context prompt.
  4. LLM Inference: The language model generates a response based on the provided prompt.
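
The end-to-end flow might look roughly like the sketch below, using the libraries named above (sentence-transformers, FAISS, Hugging Face transformers). The prompt template and generation parameters are assumptions, and the actual app.py may differ.

    import faiss
    import pandas as pd
    from sentence_transformers import SentenceTransformer
    from transformers import pipeline

    # Artifacts produced at preprocessing time (paths from the project structure).
    df = pd.read_csv("hotel_bookings_preprocessed.csv")
    index = faiss.read_index("faiss_index.bin")
    embedder = SentenceTransformer("paraphrase-MiniLM-L6-v2")
    generator = pipeline("text-generation", model="distilgpt2")

    def answer(question: str, k: int = 5) -> str:
        # 1. Embed the question with the same model used for the records.
        q_vec = embedder.encode([question]).astype("float32")

        # 2. Retrieve the top-k most similar booking records from FAISS.
        _, ids = index.search(q_vec, k)
        context = "\n".join(df.iloc[i].to_json() for i in ids[0])

        # 3. Form the context prompt and 4. generate an answer with DistilGPT2.
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        out = generator(prompt, max_new_tokens=64, do_sample=False)
        return out[0]["generated_text"][len(prompt):].strip()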

API Endpoints

The project exposes three main REST API endpoints:

POST /analytics

  • Description: Returns the computed analytics including cancellation rate, revenue trends, and top booking countries.

  • Example Response:

    {
        "cancellation_rate": 12.34,
        "monthly_revenue_trend": {
            "2017-07-31": 123456.78,
            "2017-08-31": 234567.89
        },
        "top_countries": {
            "USA": 1500,
            "GBR": 900,
            "FRA": 800
        }
    }

POST /ask

  • Description: Accepts a natural language question about the booking data and returns an answer generated by the LLM.

  • Request Body:

    {
        "question": "What is the average revenue per booking in July 2017?"
    }
  • Example Response:

    {
        "question": "What is the average revenue per booking in July 2017?",
        "answer": "Based on the retrieved records, the average revenue per booking in July 2017 was approximately $XXX.XX."
    }
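
For a quick client-side check (assuming the server is running locally on port 8000), the endpoint can be called with requests:

    import requests

    resp = requests.post(
        "http://localhost:8000/ask",
        json={"question": "What is the average revenue per booking in July 2017?"},
    )
    print(resp.json()["answer"])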

GET /health

  • Description: Health check endpoint that returns the status of the system along with details like the FAISS index size and model status.

  • Example Response:

    {
        "status": "ok",
        "faiss_index_size": 10000,
        "model_loaded": true
    }
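
A simplified sketch of how these three endpoints could be wired together in app.py is shown below. The compute_analytics and generate_answer helpers are placeholders standing in for the analytics and RAG logic described above, not the actual implementation.

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="LLM-Powered Booking Analytics & QA System")

    class AskRequest(BaseModel):
        question: str

    def compute_analytics() -> dict:
        # Placeholder: aggregate the preprocessed DataFrame (see Analytics & Reporting).
        return {"cancellation_rate": 0.0, "monthly_revenue_trend": {}, "top_countries": {}}

    def generate_answer(question: str) -> str:
        # Placeholder: FAISS retrieval + DistilGPT2 generation (see the RAG section).
        return "..."

    @app.post("/analytics")
    def analytics():
        return compute_analytics()

    @app.post("/ask")
    def ask(body: AskRequest):
        return {"question": body.question, "answer": generate_answer(body.question)}

    @app.get("/health")
    def health():
        return {"status": "ok", "faiss_index_size": 0, "model_loaded": True}
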
Example Queries

Here are some sample test queries for your API, along with their expected answers based on the data and functionality of the system:


Sample Test Queries for /analytics Endpoint

  1. Query:
    • Request: POST /analytics
    • Expected Answer:
      {
        "cancellation_rate": 12.34,
        "monthly_revenue_trend": {
          "2017-07-31": 50000.00,
          "2017-08-31": 52000.00,
          "2017-09-30": 48000.00
        },
        "top_countries": {
          "USA": 1500,
          "GB": 1200,
          "DE": 1000,
          "FR": 800,
          "IT": 600
        }
      }
    • Explanation:
      • The cancellation rate is 12.34% of total bookings.
      • Monthly revenue trends show totals for specific months.
      • Top countries are listed with the count of bookings.

Sample Test Queries for /ask Endpoint

  1. Query:
    • Request: POST /ask
      {
        "question": "Show me total revenue for July 2017."
      }
    • Expected Answer:
      {
        "question": "Show me total revenue for July 2017.",
        "answer": "The total revenue for July 2017 is $50,000.00."
      }
    • Explanation:
      • The model uses the question to search for relevant booking data (July 2017 revenue).

  2. Query:
    • Request: POST /ask
      {
        "question": "Which locations had the highest booking cancellations?"
      }
    • Expected Answer:
      {
        "question": "Which locations had the highest booking cancellations?",
        "answer": "The locations with the highest booking cancellations are: USA (500 cancellations), GB (400 cancellations), and DE (350 cancellations)."
      }
    • Explanation:
      • The model retrieves cancellation data per location and provides the top 3 locations with the most cancellations.

  3. Query:
    • Request: POST /ask
      {
        "question": "What is the average price of a hotel booking?"
      }
    • Expected Answer:
      {
        "question": "What is the average price of a hotel booking?",
        "answer": "The average price of a hotel booking is $150.00 per night."
      }
    • Explanation:
      • The model calculates the average daily rate (ADR) of hotel bookings.

  4. Query:
    • Request: POST /ask
      {
        "question": "How many bookings were made in 2019?"
      }
    • Expected Answer:
      {
        "question": "How many bookings were made in 2019?",
        "answer": "A total of 12,000 bookings were made in 2019."
      }
    • Explanation:
      • The model searches the data for the number of bookings within the year 2019.

  5. Query:
    • Request: POST /ask
      {
        "question": "What was the booking lead time distribution?"
      }
    • Expected Answer:
      {
        "question": "What was the booking lead time distribution?",
        "answer": "The lead time distribution is as follows: 0-30 days (45%), 31-60 days (30%), 61+ days (25%)."
      }
    • Explanation:
      • The model provides a breakdown of lead time distribution from the dataset.

Performance Evaluation

  • Accuracy: The accuracy of Q&A responses can be evaluated by comparing generated answers with expected outcomes for a set of test queries.
  • Response Time: Middleware in app.py logs the processing time for each API request.
  • Optimization: Ensure that the FAISS index is pre-built and stored to speed up retrieval. Use batching and efficient preprocessing where applicable.
Evaluation Metrics
  • API Response Time: Measured via the X-Process-Time header.
  • Retrieval Speed: Track how retrieval latency scales with the number of vectors in the FAISS index (reported by /health).
  • LLM Response Quality: Validate through sample queries and user feedback.
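
The X-Process-Time header mentioned above is typically added with a small FastAPI middleware; a minimal sketch of that pattern (the handler name is illustrative):

    import time
    from fastapi import FastAPI, Request

    app = FastAPI()

    @app.middleware("http")
    async def add_process_time_header(request: Request, call_next):
        # Time every request and expose the duration to clients via a response header.
        start = time.perf_counter()
        response = await call_next(request)
        response.headers["X-Process-Time"] = f"{time.perf_counter() - start:.4f}"
        return response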

Deployment

To deploy the system:

  1. Local Deployment: Run the FastAPI app locally using Uvicorn.
  2. Production Deployment: Consider deploying with a production-ready ASGI server (e.g., Gunicorn with Uvicorn workers) behind a reverse proxy.
  3. Containerization: Package the solution with Docker for ease of deployment.

Future Enhancements

  • Real-time Data Updates: Integrate with a database (SQLite, PostgreSQL) to update analytics as new data arrives.
  • Query History Tracking: Implement a logging mechanism to track user queries.
  • Additional Endpoints: For example, a dedicated endpoint for retrieving historical query logs.
  • Enhanced Analytics: Add more metrics and visualizations based on user needs.
Bonus Features
  • Health Check Enhancements: Extend the /health endpoint to verify connectivity with external services.
  • User Authentication: Secure API endpoints with authentication and authorization.

Troubleshooting & Tips

  • Preprocessed Data Not Found:
    Ensure that hotel_bookings_preprocessed.csv is generated and placed in the project root before starting the server.

  • Dependency Issues:
    Verify that all dependencies listed in requirements.txt are installed in your virtual environment.

  • Performance Bottlenecks:
    For large datasets, consider increasing the chunk size when reading CSVs and optimizing the FAISS index building process.

Common Issues
  • FAISS Index Not Loading: Check that faiss_index.bin and embeddings.npy exist. If not, the application will automatically compute and save these.
  • LLM Response Latency: Fine-tune LLM generation parameters (e.g., max_length, temperature) if responses are slow.
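
The load-or-build behaviour described above could look roughly like this sketch; the way records are flattened into text strings is an assumption:

    import os

    import faiss
    import numpy as np
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    def load_or_build_index(df: pd.DataFrame,
                            index_path: str = "faiss_index.bin",
                            emb_path: str = "embeddings.npy") -> faiss.Index:
        """Reuse saved artifacts when present; otherwise embed every record and rebuild."""
        if os.path.exists(index_path) and os.path.exists(emb_path):
            return faiss.read_index(index_path)

        embedder = SentenceTransformer("paraphrase-MiniLM-L6-v2")
        texts = df.astype(str).agg(" ".join, axis=1).tolist()  # one text string per record
        embeddings = embedder.encode(texts, show_progress_bar=True).astype("float32")

        index = faiss.IndexFlatL2(embeddings.shape[1])
        index.add(embeddings)

        np.save(emb_path, embeddings)
        faiss.write_index(index, index_path)
        return index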

Contact

For questions or feedback, please reach out via:


Happy Coding 🚀
