RedditImageDownloader is a lightweight Python tool for batch-downloading Reddit-hosted images, tailored for AI/ML workflows like training generative models.
- Batch Download - Fetch media from Reddit posts using asyncpraw, an asynchronous Python Reddit API wrapper.
- MD5 Deduplication - Built-in MD5 hash checking to auto-remove reposts and duplicates across multiple subreddits, ensuring dataset quality.
- Async/Await Download Flow - High-speed, non-blocking downloads to handle large datasets efficiently.
- Simple CLI - Powered by argparse for flexible dataset crawling without writing code.
- Docker-Ready - Easily containerized for reproducible and environment-independent dataset crawling.
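The MD5 deduplication feature listed above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function names `md5_of_file` and `remove_duplicates` are hypothetical, and the sketch assumes duplicates are detected by hashing file contents after download and keeping the first copy encountered.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_duplicates(directory: Path) -> list[Path]:
    """Delete files whose content matches an earlier file; return the removed paths.

    Hypothetical helper: walks the directory recursively, so duplicates
    saved under different subreddit or user folders are also caught.
    """
    seen: dict[str, Path] = {}
    removed: list[Path] = []
    # Sort so the choice of which copy is kept is deterministic.
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        file_hash = md5_of_file(path)
        if file_hash in seen:
            path.unlink()
            removed.append(path)
        else:
            seen[file_hash] = path
    return removed
```

Hashing the file contents rather than comparing file names means that a repost saved under a different name or in a different folder is still recognized as a duplicate, provided the bytes are identical.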
- Clone the repository:
git clone https://github.com/kaledgar/RedditImageDownloader
- Create an authorized Reddit application, read about the Reddit API, and obtain the necessary credentials: client ID, client secret, username, password, and user agent. Store these credentials in a JSON file named credentials.json in your local clone of the repository.
{
"username": "your reddit username",
"password": "pw to your reddit account",
"user_agent": "anything here",
"client_secret": "client secret of reddit app you create",
"client_id": "app id, see below for details"
}
- Customize the constants.py file if needed, adjusting default file paths or other constants according to your preferences.
- Install the required dependencies:
# Install requirements
pip install -r requirements.txt
# Check possible arguments
python3 -m reddit_image_downloader -h
# Run module with your custom arguments
python3 -m reddit_image_downloader -rd -u example_user -d '/mnt/d/downloads'
The last command runs the script, downloads media from the listed users, and saves each user's media in a separate directory.
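Before a run, it can help to confirm that credentials.json contains every field shown in the setup steps. The following is a minimal sketch of such a check, not part of the tool itself: `load_credentials` and `REQUIRED_KEYS` are hypothetical names, and the key list is taken from the example JSON file above.

```python
import json
from pathlib import Path

# Keys taken from the example credentials.json in the setup steps.
REQUIRED_KEYS = {"username", "password", "user_agent", "client_secret", "client_id"}


def load_credentials(path: Path) -> dict:
    """Load the credentials JSON and fail early if a key is missing or empty."""
    with path.open(encoding="utf-8") as handle:
        credentials = json.load(handle)
    if not isinstance(credentials, dict):
        raise ValueError("credentials file must contain a JSON object")
    missing = sorted(key for key in REQUIRED_KEYS if not credentials.get(key))
    if missing:
        raise ValueError(f"credentials file is missing: {', '.join(missing)}")
    return credentials
```

The key names match keyword arguments accepted by `asyncpraw.Reddit`, so the returned dictionary could in principle be passed as `asyncpraw.Reddit(**credentials)`; whether the tool builds its client this way is an assumption.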
To use the "Reddit Image Downloader" with Docker, follow these steps:
- Adjust the Dockerfile to your preferences
# build docker image
docker build -t reddit-image-downloader .
# run
docker run -v /your/local/directory:/app/downloads reddit-image-downloader
To use pre-commit during development, run:
python3 -m venv .venv
source .venv/bin/activate
pip install pre-commit
pre-commit install
.pre-commit-config.yaml stores the pre-commit configuration.
In the authorized Reddit application settings:
- Windows - Run the script in a PowerShell admin session
- Linux - Run the script with sudo