This project provides an AI-based status monitoring framework, extended to process and evaluate performance on public datasets. It analyzes poses in images to detect potential falls, analyzes corresponding audio files to identify help keywords, determines an overall urgency level, and evaluates the performance of each component.
This version uses:
- Kaggle Fall Detection Dataset (`uttejkumarkandagatla/fall-detection-dataset`) for evaluating the visual fall detection component.
- Kaggle Speech Emotion Recognition for Emergency Calls Dataset (`anuvagoyal/speech-emotion-recognition-for-emergency-calls`) for evaluating the auditory keyword detection component.
Core Features:
- Watching Module: Uses MediaPipe for pose estimation on images to detect fall poses (`watching/fall_detector.py`); see the pose sketch after this list.
- Listening Module: Uses SpeechRecognition to process audio files, perform speech-to-text, and identify preset keywords (`listening/keyword_detector.py`); see the keyword sketch below.
- Decision Logic Module: Combines detection results (placeholder, not the focus of evaluation) (`decision_logic/simple_logic.py`).
- Dataset Utilities: Functions for loading Kaggle datasets, parsing labels, and visualizing results (`utils/dataset_utils.py`).
- Evaluation Utilities: Functions for calculating standard classification metrics (Accuracy, Precision, Recall, F1-Score) (`utils/evaluation_utils.py`).
- Main Program: Takes dataset paths, runs evaluations on specified splits, calculates metrics, and saves detailed results (`main.py`).
- Report Generation: Script to generate confusion matrix plots and a summary report from evaluation results (`generate_report.py`).
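To make the Watching Module concrete, the snippet below sketches fall detection on a single image with MediaPipe pose estimation. The torso-angle heuristic, the threshold, and the `looks_like_fall` helper are illustrative assumptions for this README, not the actual logic in `watching/fall_detector.py`:

```python
# Sketch of a MediaPipe-based fall check on a single image.
# The torso-angle heuristic and threshold below are hypothetical.
import math

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose


def looks_like_fall(image_path):
    """Return True if the detected torso is closer to horizontal than vertical."""
    image = cv2.imread(image_path)
    if image is None:
        return False
    with mp_pose.Pose(static_image_mode=True) as pose:
        results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if not results.pose_landmarks:
        return False  # no person detected

    lm = results.pose_landmarks.landmark
    shoulder = lm[mp_pose.PoseLandmark.LEFT_SHOULDER]
    hip = lm[mp_pose.PoseLandmark.LEFT_HIP]
    # Angle of the shoulder-hip segment relative to vertical (image y grows downward).
    torso_angle = math.degrees(
        math.atan2(abs(shoulder.x - hip.x), abs(shoulder.y - hip.y) + 1e-6)
    )
    return torso_angle > 60  # hypothetical threshold: torso nearly horizontal


if __name__ == "__main__":
    print(looks_like_fall("example.jpg"))
```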
Please Note: This is a proof-of-concept demonstrating dataset integration and evaluation. The models used (MediaPipe pose for static images, Google Speech Recognition) are basic examples. Their performance on these datasets, as shown in the evaluation, reflects their inherent limitations. More sophisticated models and techniques are required for real-world application.
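Similarly, here is a minimal sketch of the Listening Module's speech-to-text plus keyword-matching approach using SpeechRecognition. The keyword set and the `detect_keywords` helper are hypothetical; `listening/keyword_detector.py` contains the real implementation:

```python
# Sketch of keyword spotting via speech-to-text on an audio file.
import speech_recognition as sr

KEYWORDS = {"help", "emergency", "fall"}  # hypothetical keyword set


def detect_keywords(audio_path):
    """Transcribe an audio file and return any target keywords found in it."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    try:
        text = recognizer.recognize_google(audio).lower()
    except (sr.UnknownValueError, sr.RequestError):
        return set()  # speech not understood, or the API was unreachable
    return {word for word in KEYWORDS if word in text}


if __name__ == "__main__":
    print(detect_keywords("example.wav"))
```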
Project Structure:
```
persistent_watching_listening/
├── watching/                 # Visual processing module
│   ├── __init__.py
│   └── fall_detector.py      # Fall detector (processes images)
├── listening/                # Auditory processing module
│   ├── __init__.py
│   └── keyword_detector.py   # Keyword detector (processes audio files)
├── decision_logic/           # Decision logic module
│   ├── __init__.py
│   └── simple_logic.py       # Simple decision logic
├── utils/                    # Utility classes/functions
│   ├── __init__.py
│   ├── dataset_utils.py      # Kaggle dataset loading/parsing utilities
│   └── evaluation_utils.py   # Metrics calculation utilities
├── output/                   # Default directory for results
│   ├── fall_detection_results_val.csv
│   ├── keyword_detection_results.csv
│   ├── fall_detection_confusion_matrix.png
│   ├── keyword_detection_confusion_matrix.png
│   ├── evaluation_report.md
│   └── fall_viz/             # Optional fall detection visualizations
│       └── ...
├── main.py                   # Main program entry point (runs evaluation)
├── generate_report.py        # Script to generate the report from CSV results
├── requirements.txt          # Python dependency list (updated)
└── README.md                 # This document
```
Environment Requirements:
- Python 3.8 or higher
- macOS / Linux / Windows (tested in a sandboxed Linux environment)
- Internet connection (required for `kagglehub` downloads and the Google Speech Recognition API)
Installation Steps:
- Clone or Download Project: Extract or clone the project files to your local machine.
- Create Virtual Environment (Recommended): Open a terminal, navigate to the project root directory, and execute:
  ```
  python3 -m venv venv
  source venv/bin/activate    # macOS/Linux
  # venv\Scripts\activate     # Windows
  ```
- Install Dependencies: In the activated virtual environment, run:
  ```
  pip install -r requirements.txt
  ```
  Note: This installs `opencv-python`, `mediapipe`, `numpy`, `SpeechRecognition`, `pandas`, `kagglehub`, `matplotlib`, `seaborn`, and `scikit-learn`.
- Download Datasets (Automatic via `kagglehub`): The first time you run `main.py` or `utils/dataset_utils.py`, the `kagglehub` library will attempt to download the required datasets automatically. You may need to authenticate with Kaggle if you haven't used `kagglehub` before (follow its prompts or configure `~/.kaggle/kaggle.json`). The datasets are typically downloaded to `~/.cache/kagglehub/datasets/`; a manual download sketch follows this list.
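If you prefer to fetch the datasets manually, or want to confirm where `kagglehub` places them, a minimal download sketch looks like this (same dataset handles as listed above; `utils/dataset_utils.py` may wrap these calls differently):

```python
# Manual dataset download sketch using kagglehub.
import kagglehub

fall_path = kagglehub.dataset_download("uttejkumarkandagatla/fall-detection-dataset")
speech_path = kagglehub.dataset_download("anuvagoyal/speech-emotion-recognition-for-emergency-calls")

# Each call returns the local path of the downloaded (and cached) dataset.
print("Fall detection dataset:", fall_path)
print("Speech emotion dataset:", speech_path)
```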
In the project root directory, with the virtual environment activated, run the main evaluation program using the following command:
```
python main.py [--fall_dataset_path <path>] [--speech_dataset_path <path>] [--output_dir <path>] [--split <train|val>] [--max_samples <N>] [--visualize]
```
Command Line Arguments:
- `--fall_dataset_path` (Optional): Path to the root of the Kaggle Fall Detection dataset. Defaults to the typical `kagglehub` cache location.
- `--speech_dataset_path` (Optional): Path to the root of the Kaggle Speech Emotion dataset. Defaults to the typical `kagglehub` cache location.
- `--output_dir` (Optional): Directory in which to save evaluation results (CSV files, visualizations). Defaults to `./output`.
- `--split` (Optional): Which split of the fall detection dataset to evaluate (`train` or `val`). Defaults to `val`.
- `--max_samples` (Optional): Maximum number of samples to process for each evaluation (fall and keyword). Useful for quick testing. Processes all samples if omitted.
- `--visualize` (Optional): If included, saves annotated images with pose and detection results for the fall detection evaluation to `<output_dir>/fall_viz/`.
Example (Evaluate 50 samples from validation split, save visualizations):
```
python main.py --split val --max_samples 50 --visualize
```
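For reference, the flags above could be declared in `main.py` roughly as follows. This is a sketch of the interface only; defaults and help text in the actual parser may differ:

```python
# Sketch of the command line interface described above, using argparse.
import argparse

parser = argparse.ArgumentParser(description="Run fall and keyword detection evaluations.")
parser.add_argument("--fall_dataset_path", default=None,
                    help="Root of the Kaggle Fall Detection dataset (defaults to the kagglehub cache).")
parser.add_argument("--speech_dataset_path", default=None,
                    help="Root of the Kaggle Speech Emotion dataset (defaults to the kagglehub cache).")
parser.add_argument("--output_dir", default="./output",
                    help="Directory for CSV results and visualizations.")
parser.add_argument("--split", choices=["train", "val"], default="val",
                    help="Fall detection dataset split to evaluate.")
parser.add_argument("--max_samples", type=int, default=None,
                    help="Maximum samples per evaluation; all samples if omitted.")
parser.add_argument("--visualize", action="store_true",
                    help="Save annotated fall detection images to <output_dir>/fall_viz/.")

args = parser.parse_args()
print(args)
```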
The program will:
- Load the specified split of the fall detection dataset and the full speech emotion dataset.
- Initialize the `FallDetector` and `KeywordDetector`.
- Iterate through the fall detection data (up to `max_samples`):
  - Process each image.
  - Compare the prediction to the ground-truth label.
  - Optionally save a visualization.
- Calculate and print fall detection metrics (Accuracy, Precision, Recall, F1).
- Iterate through the speech emotion data (up to `max_samples`):
  - Process each audio file.
  - Compare predicted keyword presence to ground truth.
- Calculate and print keyword detection metrics (based on presence/absence); a sketch of this metric arithmetic follows the list.
- Save detailed results for both evaluations to CSV files in the output directory.
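The metric arithmetic is standard; a minimal sketch is shown below. `utils/evaluation_utils.py` may instead rely on scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`; the `classification_metrics` helper here is hypothetical:

```python
# Sketch of accuracy/precision/recall/F1 computed from boolean labels and predictions.
def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    accuracy = (tp + tn) / max(tp + fp + tn + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Example: the fall detection counts from the sample run below (TP=7, FP=0, FN=3)
# give precision 1.0, recall 0.7, and F1 = 2 * 1.0 * 0.7 / 1.7 ≈ 0.8235.
```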
After running the evaluation (`main.py`), you can generate a summary report with confusion matrix plots:
```
python generate_report.py
```
This script reads the CSV files generated by `main.py` from the `./output` directory (or the directory specified via `--output_dir`, if you used that flag) and creates:
- `evaluation_report.md`: A markdown file summarizing the results with embedded confusion matrices.
- `fall_detection_confusion_matrix.png`: Confusion matrix plot for fall detection.
- `keyword_detection_confusion_matrix.png`: Confusion matrix plot for keyword detection (presence/absence).
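The plotting step can be sketched with pandas, seaborn, and matplotlib as below. The CSV column names `ground_truth` and `prediction` are assumptions for illustration; check the headers that `main.py` actually writes before reusing this:

```python
# Sketch of the confusion matrix plot generation for the fall detection results.
import matplotlib
matplotlib.use("Agg")  # headless rendering, e.g. in the sandbox environment
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix

df = pd.read_csv("output/fall_detection_results_val.csv")
# Column names and label order are assumed for this sketch.
cm = confusion_matrix(df["ground_truth"], df["prediction"])

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["No Fall", "Fall"], yticklabels=["No Fall", "Fall"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Fall Detection Confusion Matrix")
plt.savefig("output/fall_detection_confusion_matrix.png", bbox_inches="tight")
```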
Example Results:
(Note: These are results from a small sample run and may vary. Run the evaluation on the full dataset for more reliable metrics.)
Fall Detection (Validation Split):
- TP: 7, FP: 0, TN: 0, FN: 3
- Accuracy: 0.7000
- Precision: 1.0000
- Recall: 0.7000
- F1 Score: 0.8235
Keyword Detection (Presence/Absence):
- TP: 4, FP: 4, TN: 1, FN: 1 (Recognition Failures: 0)
- Accuracy: 0.5000
- Precision: 0.5000
- Recall: 0.8000
- F1 Score: 0.6154
See the generated `evaluation_report.md` and associated PNG images in the output directory for more details.
Known Limitations:
- Model Simplicity: The current models (MediaPipe static-image pose estimation, basic keyword spotting via ASR) are very basic and have limited accuracy, as reflected in the evaluation metrics. More advanced models (e.g., video-based action recognition, dedicated keyword spotting models) are needed for better performance.
- Keyword Detection Evaluation: The current keyword evaluation only checks for the presence or absence of any target keyword, not the specific keyword accuracy or word error rate, which would be more informative.
- Dataset Limitations: The speech dataset contains emotional speech but not necessarily specific distress calls like "help me I've fallen". The fall dataset contains images, limiting the evaluation of temporal fall detection methods.
- Real-world Applicability: This setup is purely for demonstrating dataset processing and evaluation, not for deployment.
Potential Future Extensions:
- Integrate video-based action recognition models for fall detection.
- Use dedicated keyword spotting or more robust ASR models.
- Implement more sophisticated evaluation metrics (e.g., Word Error Rate for ASR).
- Explore other relevant datasets.
- Combine visual and auditory information using multi-modal models.