Science Study Scraper is a tool that automatically finds and downloads scientific studies from open-access repositories on any scientific or medical topic. It is part of https://bio-hacking.blog and is used as a research tool.
- Multi-Database Search: Search across PubMed, PMC, Europe PMC, bioRxiv, ScienceDirect, DOAJ, Semantic Scholar, and Google Scholar
- Automatic PDF Downloads: Download full-text PDFs when available
- Content Extraction: Create PDF documents from article content when direct PDFs are unavailable
- Interactive HTML Reports: Generate beautiful, interactive reports of your search results
- Flexible Query Building: Refine searches with additional terms and filters
- Query Management: Save and load previous search queries
- Test Mode: Try out the scraper with limited downloads before a full run
- Python 3.7 or higher
- Required packages: requests, beautifulsoup4, pandas, reportlab
- Clone this repository:

  git clone https://github.com/Ark0N/ScienceStudiesScraper.git
  cd science-study-scraper

- Install required packages:

  cd science_scraper
  pip install -r requirements.txt
Run a search with a specific query:
python main.py --query "diabetes treatment"
Use the test option to download only one study per database:
python main.py --test --query "nmn"
Combine options to refine a search:
python main.py --query "cancer immunotherapy" --terms "clinical trial" "systematic review" --max-results 50 --databases pubmed pmc --delay 2
| Option | Description |
|---|---|
| --query, -q | Main search query (required if no saved query) |
| --terms, -t | Additional search terms to refine results |
| --output, -o | Output directory for downloaded studies (default: 'studies') |
| --max-results, -m | Maximum number of results to retrieve (default: unlimited) |
| --delay, -d | Delay between requests in seconds (default: 1) |
| --databases | Databases to search (choices: pubmed, pmc, europepmc, biorxiv, sciencedirect, doaj, semanticscholar, googlescholar, all) |
| --test | Test mode: only download one study per database |
| --save-query | Save the current query for future use |
| --load-saved | Load the previously saved query |
Quick Test Run:
python main.py --query "alzheimer's disease" --test
Comprehensive Research:
python main.py --query "COVID-19" --terms "treatment" "vaccine" "long COVID" --databases all --save-query
Follow-up Research:
python main.py --load-saved --max-results 100
The Science Study Scraper generates several types of output:
- PDFs: Downloaded and generated PDFs are stored in the studies/pdfs directory
- CSV Data: Detailed study information in CSV format
- JSON Data: Complete study data in JSON format
- HTML Report: Interactive web report with filtering and search capabilities
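As an illustration, the JSON output can be post-processed with a few lines of Python. The file path and field names used below (studies/studies.json, year, title) are assumptions about the output schema, not documented guarantees; inspect your own output directory first:

```python
import json
from pathlib import Path

def recent_studies(json_path, min_year=2020):
    """Filter study records by publication year.

    The field names ('year', 'title') are assumptions about the
    scraper's JSON schema -- verify them against your actual output.
    """
    records = json.loads(Path(json_path).read_text())
    return [r for r in records if int(r.get("year") or 0) >= min_year]

# Usage (the path is an assumption -- check your --output directory):
# for study in recent_studies("studies/studies.json"):
#     print(study["title"])
```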

science_scraper/
├── main.py              # Main script
├── downloader.py        # Core downloader class
├── requirements.txt     # Required packages
├── database/            # Database-specific modules
│   ├── __init__.py
│   ├── pubmed.py        # PubMed search module
│   ├── pmc.py           # PMC search module
│   └── ...              # Other database modules
├── utils/               # Utility modules
│   ├── __init__.py
│   ├── pdf_generator.py # PDF generation utilities
│   └── html_report.py   # HTML report generation
└── studies/             # Output directory
    └── pdfs/            # Downloaded PDFs
You can extend the scraper by:
- Adding new database modules in the database/ directory
- Modifying the PDF generation in utils/pdf_generator.py
- Customizing the HTML report in utils/html_report.py
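A new database module might look like the sketch below. The function names, return format, and endpoint are assumptions for illustration, not the scraper's actual interface; mirror an existing module such as database/pubmed.py when adding a real one:

```python
# Hypothetical sketch of a new module for the database/ directory.
# The endpoint, function names, and return format are illustrative
# assumptions -- follow the pattern of the existing modules instead.
from urllib.parse import urlencode

BASE_URL = "https://api.example-repository.org/search"  # placeholder endpoint

def build_search_url(query, extra_terms=None, max_results=20):
    """Combine the main query and any refinement terms into a request URL."""
    terms = [query] + list(extra_terms or [])
    params = {"q": " AND ".join(terms), "limit": max_results}
    return f"{BASE_URL}?{urlencode(params)}"

def search(query, extra_terms=None, max_results=20):
    """Return a list of study dicts. A real module would fetch and parse
    the response here (e.g. with requests + beautifulsoup4), respecting
    the configured request delay."""
    url = build_search_url(query, extra_terms, max_results)
    # ... perform the HTTP request and parse the results ...
    return []
```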
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This tool is designed for legitimate research purposes only. Please:
- Respect the terms of service of all scientific databases
- Use reasonable request delays to avoid overloading servers
- Only download open access content that you have the right to access
- Use the obtained papers for lawful research purposes
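The delay guidance above can also be enforced programmatically. Below is a minimal sketch of a throttle that keeps consecutive requests at least a fixed interval apart; the class is illustrative and not part of the scraper itself:

```python
import time

class RequestThrottle:
    """Enforce a minimum delay between consecutive HTTP requests."""

    def __init__(self, delay_seconds=1.0):
        self.delay_seconds = delay_seconds
        self._last_request = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough so requests are delay_seconds apart."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.delay_seconds:
            time.sleep(self.delay_seconds - elapsed)
        self._last_request = time.monotonic()

# Usage: call throttle.wait() immediately before each database request.
# throttle = RequestThrottle(delay_seconds=2.0)
```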
This project is licensed under the MIT License - see the LICENSE file for details.
This project was made possible thanks to:
- Open-source community - For the libraries and frameworks that form the foundation of this project
- Claude 3.7 Sonnet - For assistance with code development, debugging, and documentation
- Scientific publishing platforms - For making research accessible and inspiring this tool
- Friends and colleagues - For your support throughout the development process
Created with ❤️ for researchers and scientists worldwide.