This project provides a comprehensive solution for processing hotel booking data, extracting business insights, and enabling retrieval-augmented question answering (RAG) through an LLM-powered API. The system combines data analytics, vector-based retrieval with FAISS, and a lightweight language model to answer questions about hotel bookings.
- LLM-Powered Booking Analytics & QA System
The project is designed to:
- Process hotel booking records: Clean and preprocess raw data.
- Extract analytics: Generate insights such as revenue trends, cancellation rates, geographical distribution of bookings, and booking lead time.
- Answer questions: Use vector-based retrieval and a pre-trained language model (DistilGPT2) to answer queries regarding booking data.
- Expose functionality via REST API: FastAPI endpoints provide access to analytics, Q&A, and system health checks.
.
├── data
│   └── hotel_bookings.csv
├── images
├── notes
│   └── bookinganalytics.pdf
├── app.py
├── embeddings.npy
├── faiss_index.bin
├── report.docx
└── hotel_bookings_preprocessed.csv
- data/: Contains the raw hotel bookings dataset.
- images/: Folder to store any generated or reference images.
- notes/: Documentation or project notes (e.g., PDF reports).
- app.py: The main FastAPI application.
- embeddings.npy & faiss_index.bin: Artifacts for the FAISS vector store.
- report.docx: Short report explaining implementation choices & challenges.
- hotel_bookings_preprocessed.csv: Preprocessed dataset ready for analytics.
Follow these steps to get the project running on your machine:
- Clone the repository:
    git clone https://github.com/yourusername/llm-booking-analytics.git
    cd llm-booking-analytics
- Set up the virtual environment and install dependencies:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
- Prepare the dataset:
  - Ensure that the data/hotel_bookings.csv file is available.
  - Run your preprocessing script (e.g., via a Jupyter Notebook) to generate hotel_bookings_preprocessed.csv.
- Start the FastAPI server:
    uvicorn app:app --host 0.0.0.0 --port 8000 --reload
For more detailed instructions, refer to the FastAPI documentation.
- Dataset: A sample hotel bookings dataset (CSV) is used. You may also use other relevant datasets as long as they contain required fields.
- Preprocessing:
  - Handle missing values and format inconsistencies.
  - Convert date fields (e.g., arrival_date) into datetime objects.
  - Use appropriate data types for numerical fields (e.g., int8, float32).
- Storage: Preprocessed data is saved as hotel_bookings_preprocessed.csv and loaded in the application for analytics and QA.
More on Data Cleaning
- Missing Values: Checked and handled during preprocessing.
- Data Types: Fields such as is_canceled, lead_time, adr, etc., are cast to optimal types for performance.
- Aggregation: Data is grouped (e.g., by month) to compute analytics like revenue trends.
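The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual preprocessing script: the function name preprocess, the fill values, and the country column are assumptions, and lead_time is cast to int16 here because lead times can exceed the int8 range.

```python
import io

import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a raw bookings frame. Column names and fill values are assumptions."""
    out = df.copy()

    # Missing values: fill numeric gaps with 0 and text gaps with a sentinel.
    out["adr"] = out["adr"].fillna(0.0)
    out["country"] = out["country"].fillna("Unknown")

    # Dates: parse strings into datetimes; unparseable values become NaT
    # and those rows are dropped.
    out["arrival_date"] = pd.to_datetime(out["arrival_date"], errors="coerce")
    out = out.dropna(subset=["arrival_date"])

    # Compact dtypes: int8 for the 0/1 flag, int16 for lead_time, float32 for adr.
    out = out.astype({"is_canceled": "int8", "lead_time": "int16", "adr": "float32"})
    return out.reset_index(drop=True)


# Tiny in-memory sample standing in for data/hotel_bookings.csv.
raw = pd.read_csv(io.StringIO(
    "is_canceled,lead_time,adr,country,arrival_date\n"
    "0,12,80.5,PRT,2017-07-01\n"
    "1,200,,GBR,2017-07-15\n"
    "0,3,120.0,,not-a-date\n"
))
clean = preprocess(raw)
print(clean.dtypes)
```

Calling clean.to_csv("hotel_bookings_preprocessed.csv", index=False) would then write the file the application loads. Note that CSV does not preserve dtypes, so the casts need to be reapplied (or passed via dtype= and parse_dates=) when the file is read back.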
The system computes various analytics from the preprocessed data:
- Revenue Trends: Aggregated monthly revenue calculated using the arrival date.
- Cancellation Rate: Percentage of total bookings that were canceled.
- Geographical Distribution: Top 5 countries based on the number of bookings.
- Booking Lead Time Distribution: Insights into the lead times for bookings.
- Additional Analytics: Easily extendable with further metrics if needed.
Analytics are computed using libraries such as pandas and NumPy and are documented in notes/bookinganalytics.pdf.
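As a rough sketch of how these metrics might be computed with pandas: the function name compute_analytics, the total_nights column, and the revenue definition (adr multiplied by nights, with canceled bookings included) are all assumptions made for illustration. Month keys are emitted as YYYY-MM here, which may differ from the format the real endpoint returns.

```python
import pandas as pd


def compute_analytics(df: pd.DataFrame) -> dict:
    """Compute summary metrics from a preprocessed bookings frame."""
    # Assumption: revenue per booking = nightly rate (adr) * nights stayed.
    revenue = df["adr"] * df["total_nights"]

    # Group revenue by the calendar month of arrival.
    monthly = revenue.groupby(df["arrival_date"].dt.to_period("M")).sum()

    top5 = df["country"].value_counts().head(5)
    return {
        # is_canceled is 0/1, so its mean is the canceled fraction.
        "cancellation_rate": round(float(df["is_canceled"].mean() * 100), 2),
        "monthly_revenue_trend": {str(m): round(float(v), 2) for m, v in monthly.items()},
        "top_countries": {country: int(n) for country, n in top5.items()},
        "lead_time_summary": {
            "median": float(df["lead_time"].median()),
            "p90": float(df["lead_time"].quantile(0.9)),
        },
    }


# Made-up sample rows, only to show the shape of the result.
sample = pd.DataFrame({
    "arrival_date": pd.to_datetime(["2017-07-01", "2017-07-20", "2017-08-05", "2017-08-09"]),
    "adr": [100.0, 50.0, 80.0, 120.0],
    "total_nights": [2, 1, 3, 1],
    "is_canceled": [0, 1, 0, 0],
    "lead_time": [10, 40, 5, 90],
    "country": ["PRT", "GBR", "PRT", "FRA"],
})
result = compute_analytics(sample)
print(result)
```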
The RAG system integrates several components:
- Embeddings:
  - Uses SentenceTransformer (paraphrase-MiniLM-L6-v2) to compute vector embeddings for each booking record.
  - These embeddings are stored in a FAISS index (faiss_index.bin and embeddings.npy).
- Question Answering:
  - For a given user question, the system computes its embedding and retrieves the top k similar booking records.
  - The retrieved records form the context for the LLM prompt.
  - Uses DistilGPT2 to generate an answer based on the context.
How It Works
- Embedding Calculation: Each booking record is converted to a text string and embedded.
- Vector Store: FAISS is used to quickly search and retrieve similar records.
- Prompt Formation: Retrieved records are concatenated to form a context prompt.
- LLM Inference: The language model generates a response based on the provided prompt.
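The four steps above could be wired together along these lines. This is a sketch under stated assumptions, not the code in app.py: the helper names (record_to_text, build_prompt, answer), the prompt layout, the choice of an exact L2 index (faiss.IndexFlatL2), and the generation settings are all illustrative. The heavy libraries are imported inside answer so the two text helpers can be tried without them installed.

```python
from typing import List, Sequence


def record_to_text(record: dict) -> str:
    """Flatten one booking record into a 'key: value' string for embedding."""
    return ", ".join(f"{key}: {value}" for key, value in record.items())


def build_prompt(question: str, retrieved: Sequence[str]) -> str:
    """Concatenate retrieved records into a context block followed by the question."""
    context = "\n".join(f"- {text}" for text in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(question: str, texts: List[str], k: int = 3) -> str:
    """Embed, retrieve the k nearest records, and generate an answer."""
    # Imported lazily: these packages are large and slow to load.
    import faiss
    from sentence_transformers import SentenceTransformer
    from transformers import pipeline

    encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")
    vectors = encoder.encode(texts, convert_to_numpy=True).astype("float32")

    # Assumption: a flat (exact) L2 index; the real index type may differ.
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)

    query = encoder.encode([question], convert_to_numpy=True).astype("float32")
    _, ids = index.search(query, min(k, len(texts)))

    prompt = build_prompt(question, [texts[i] for i in ids[0]])
    generator = pipeline("text-generation", model="distilgpt2")
    generated = generator(prompt, max_new_tokens=50, do_sample=False)[0]["generated_text"]

    # The pipeline returns prompt + continuation; keep only the continuation.
    return generated[len(prompt):].strip()


print(build_prompt("Which country booked most?", [record_to_text({"country": "PRT", "adr": 80.5})]))
```

In a running service the encoder, index, and generator would normally be created once at startup rather than on every call. Keep in mind that a model as small as DistilGPT2 has a short context window, so only a handful of records fit in the prompt.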
The project exposes three main REST API endpoints:
- POST /analytics
  - Description: Returns the computed analytics including cancellation rate, revenue trends, and top booking countries.
  - Example Response:
    { "cancellation_rate": 12.34, "monthly_revenue_trend": { "2017-07-31": 123456.78, "2017-08-31": 234567.89 }, "top_countries": { "USA": 1500, "GBR": 900, "FRA": 800 } }
- POST /ask
  - Description: Accepts a natural language question about the booking data and returns an answer generated by the LLM.
  - Request Body:
    { "question": "What is the average revenue per booking in July 2017?" }
  - Example Response:
    { "question": "What is the average revenue per booking in July 2017?", "answer": "Based on the retrieved records, the average revenue per booking in July 2017 was approximately $XXX.XX." }
- /health
  - Description: Health check endpoint that returns the status of the system along with details like the FAISS index size and model status.
  - Example Response:
    { "status": "ok", "faiss_index_size": 10000, "model_loaded": true }
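For trying the Q&A endpoint from a script, a small client using only the standard library might look like this. The base URL assumes the server was started locally on port 8000 as in the setup steps; the helper names build_ask_request and ask are made up for this example.

```python
import json
import urllib.request

# Assumption: the API is running locally on port 8000.
BASE_URL = "http://localhost:8000"


def build_ask_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /ask request with a JSON body of the form {"question": ...}."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """Send the question to the running server and decode the JSON reply."""
    request = build_ask_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no server; calling ask(...) does.
req = build_ask_request("What is the average revenue per booking in July 2017?")
print(req.get_method(), req.full_url)
```

With the server up, ask("Show me total revenue for July 2017.") should return a dictionary with the question and answer keys shown in the example response.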
Example Queries
Here are some sample test queries for your API, along with their expected answers based on the data and functionality of the system:
- Query:
  - Request: POST /analytics
  - Expected Answer:
    { "cancellation_rate": 12.34, "monthly_revenue_trend": { "2017-07-31": 50000.00, "2017-08-31": 52000.00, "2017-09-30": 48000.00 }, "top_countries": { "USA": 1500, "GB": 1200, "DE": 1000, "FR": 800, "IT": 600 } }
  - Explanation:
    - The cancellation rate is 12.34% of total bookings.
    - Monthly revenue trends show totals for specific months.
    - Top countries are listed with the count of bookings.
- Query:
  - Request: POST /ask
    { "question": "Show me total revenue for July 2017." }
  - Expected Answer:
    { "question": "Show me total revenue for July 2017.", "answer": "The total revenue for July 2017 is $50,000.00." }
  - Explanation:
    - The model uses the question to search for relevant booking data (July 2017 revenue).
- Query:
  - Request: POST /ask
    { "question": "Which locations had the highest booking cancellations?" }
  - Expected Answer:
    { "question": "Which locations had the highest booking cancellations?", "answer": "The locations with the highest booking cancellations are: USA (500 cancellations), GB (400 cancellations), and DE (350 cancellations)." }
  - Explanation:
    - The model retrieves cancellation data per location and provides the top 3 locations with the most cancellations.
- Query:
  - Request: POST /ask
    { "question": "What is the average price of a hotel booking?" }
  - Expected Answer:
    { "question": "What is the average price of a hotel booking?", "answer": "The average price of a hotel booking is $150.00 per night." }
  - Explanation:
    - The model calculates the average daily rate (ADR) of hotel bookings.
- Query:
  - Request: POST /ask
    { "question": "How many bookings were made in 2019?" }
  - Expected Answer:
    { "question": "How many bookings were made in 2019?", "answer": "A total of 12,000 bookings were made in 2019." }
  - Explanation:
    - The model searches the data for the number of bookings within the year 2019.
- Query:
  - Request: POST /ask
    { "question": "What was the booking lead time distribution?" }
  - Expected Answer:
    { "question": "What was the booking lead time distribution?", "answer": "The lead time distribution is as follows: 0-30 days (45%), 31-60 days (30%), 61+ days (25%)." }
  - Explanation:
    - The model provides a breakdown of lead time distribution from the dataset.
- Accuracy: The accuracy of Q&A responses can be evaluated by comparing generated answers with expected outcomes for a set of test queries.
- Response Time: Middleware in app.py logs the processing time for each API request.
- Optimization: Ensure that the FAISS index is pre-built and stored to speed up retrieval. Use batching and efficient preprocessing where applicable.
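One way to record per-request processing time is a small ASGI middleware that stamps an X-Process-Time header on each response. The version below is a sketch written as plain ASGI, so it runs without FastAPI installed; the class name ProcessTimeMiddleware is hypothetical, and the middleware in app.py may well use FastAPI's @app.middleware("http") decorator instead. The value measured is the time until the response headers are sent.

```python
import asyncio
import time


class ProcessTimeMiddleware:
    """Plain ASGI middleware that adds an X-Process-Time header (in seconds)."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Only HTTP requests get timed; other scope types pass through untouched.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()

        async def send_with_timing(message):
            # Headers can only be added on the response-start message.
            if message["type"] == "http.response.start":
                elapsed = time.perf_counter() - start
                headers = list(message.get("headers", []))
                headers.append((b"x-process-time", f"{elapsed:.6f}".encode("ascii")))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_timing)


async def demo_app(scope, receive, send):
    """Stand-in ASGI app that returns a fixed body."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def run_demo():
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await ProcessTimeMiddleware(demo_app)({"type": "http"}, receive, send)
    return sent


messages = asyncio.run(run_demo())
print(dict(messages[0]["headers"]))
```

In a FastAPI application, a class like this can be registered with app.add_middleware(ProcessTimeMiddleware).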
Evaluation Metrics
- API Response Time: Measured via the X-Process-Time header.
- Retrieval Speed: Assessed relative to the number of vectors in the FAISS index, which indicates how well retrieval scales.
- LLM Response Quality: Validate through sample queries and user feedback.
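Answer accuracy over a set of test queries could be scored with a simple harness like the one below. It is only a sketch: the containment check (the expected value must appear in the generated answer after normalization) is one possible scoring rule, and the names normalize, contains_expected, and accuracy are invented for this example. The fake_ask function stands in for a real call to POST /ask.

```python
import re
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def contains_expected(answer: str, expected: str) -> bool:
    """True if the expected value appears in the answer on token boundaries."""
    # Padding with spaces stops "50 000" from matching inside "150 000".
    return f" {normalize(expected)} " in f" {normalize(answer)} "


def accuracy(ask: Callable[[str], str], cases: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (question, expected) pairs whose answer contains the expected value."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(contains_expected(ask(question), expected) for question, expected in cases)
    return hits / len(cases)


def fake_ask(question: str) -> str:
    """Stand-in for the API: canned answers keyed by question."""
    canned = {
        "Show me total revenue for July 2017.": "The total revenue for July 2017 is $50,000.00.",
        "How many bookings were made in 2019?": "I am not sure.",
    }
    return canned[question]


score = accuracy(fake_ask, [
    ("Show me total revenue for July 2017.", "$50,000.00"),
    ("How many bookings were made in 2019?", "12,000"),
])
print(score)
```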
To deploy the system:
- Local Deployment: Run the FastAPI app locally using Uvicorn.
- Production Deployment: Consider deploying with a production-ready ASGI server (e.g., Gunicorn with Uvicorn workers) behind a reverse proxy.
- Containerization: Package the solution with Docker for ease of deployment.
- Real-time Data Updates: Integrate with a database (SQLite, PostgreSQL) to update analytics as new data arrives.
- Query History Tracking: Implement a logging mechanism to track user queries.
- Additional Endpoints: For example, a dedicated endpoint for retrieving historical query logs.
- Enhanced Analytics: Add more metrics and visualizations based on user needs.
Bonus Features
- Health Check Enhancements: Extend the /health endpoint to verify connectivity with external services.
- User Authentication: Secure API endpoints with authentication and authorization.
- Preprocessed Data Not Found: Ensure that hotel_bookings_preprocessed.csv is generated and placed in the project root before starting the server.
- Dependency Issues: Verify that all dependencies listed in requirements.txt are installed in your virtual environment.
- Performance Bottlenecks: For large datasets, consider increasing the chunk size when reading CSVs and optimizing the FAISS index building process.
Common Issues
- FAISS Index Not Loading: Check that faiss_index.bin and embeddings.npy exist. If not, the application will automatically compute and save these.
- LLM Response Latency: Adjust LLM generation parameters (e.g., max_length, temperature) if responses are slow. Reducing max_length has the most direct effect on latency, since fewer tokens are generated.
For questions or feedback, please reach out via:
- GitHub: @arindal1
- LinkedIn: Arindal Char
- FastAPI Documentation: FastAPI
- Hugging Face Transformers: Transformers
- Datasets: Sample Hotel Bookings Dataset