Science Study Scraper

License: MIT | Python 3.7+

Science Study Scraper is a powerful tool that automatically finds and downloads scientific studies on any scientific or medical topic from open access repositories. It is part of https://bio-hacking.blog, where it serves as a research tool.

🔍 Features

  • Multi-Database Search: Search across PubMed, PMC, Europe PMC, bioRxiv, ScienceDirect, DOAJ, Semantic Scholar, and Google Scholar
  • Automatic PDF Downloads: Download full-text PDFs when available
  • Content Extraction: Create PDF documents from article content when direct PDFs are unavailable
  • Interactive HTML Reports: Generate beautiful, interactive reports of your search results
  • Flexible Query Building: Refine searches with additional terms and filters
  • Query Management: Save and load previous search queries
  • Test Mode: Try out the scraper with limited downloads before a full run

📋 Requirements

  • Python 3.7 or higher
  • Required packages: requests, beautifulsoup4, pandas, reportlab (see the sample requirements.txt below)
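
For reference, a requirements file for these packages would look like the sketch below. It is unpinned and illustrative; the repository's own requirements.txt is authoritative.

# requirements.txt -- unpinned package list; check the repository's own file
requests
beautifulsoup4
pandas
reportlab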

🚀 Installation

  1. Clone this repository:

    git clone https://github.com/Ark0N/ScienceStudiesScraper.git
    cd ScienceStudiesScraper
  2. Install required packages:

    cd science_scraper
    pip install -r requirements.txt

📚 Usage

Basic Usage

Run a search with a specific query:

python main.py --query "diabetes treatment"

Use the test option to download only one study per database:

python main.py --test --query "nmn"

Advanced Options

python main.py --query "cancer immunotherapy" --terms "clinical trial" "systematic review" --max-results 50 --databases pubmed pmc --delay 2

Command Line Options

Option             Description
--query, -q        Main search query (required if no saved query)
--terms, -t        Additional search terms to refine results
--output, -o       Output directory for downloaded studies (default: 'studies')
--max-results, -m  Maximum number of results to retrieve (default: unlimited)
--delay, -d        Delay between requests in seconds (default: 1)
--databases        Databases to search (choices: pubmed, pmc, europepmc, biorxiv, sciencedirect, doaj, semanticscholar, googlescholar, all)
--test             Test mode: only download one study per database
--save-query       Save the current query for future use
--load-saved       Load the previously saved query

Example Workflows

Quick Test Run:

python main.py --query "alzheimer's disease" --test

Comprehensive Research:

python main.py --query "COVID-19" --terms "treatment" "vaccine" "long COVID" --databases all --save-query

Follow-up Research:

python main.py --load-saved --max-results 100
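
The scraper can also be driven from another Python script by shelling out to the CLI, which avoids relying on any internal API. A minimal sketch, assuming it is run from the science_scraper directory:

import subprocess

# Re-run the previously saved query with a results cap, exactly as on the
# command line. Assumes the working directory is science_scraper.
subprocess.run(
    ["python", "main.py", "--load-saved", "--max-results", "100"],
    check=True,  # raise CalledProcessError if the scraper exits with an error
)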


📊 Output

The Science Study Scraper generates several types of output:

  1. PDFs: Downloaded and generated PDFs are stored in the studies/pdfs directory
  2. CSV Data: Detailed study information in CSV format (see the loading snippet after this list)
  3. JSON Data: Complete study data in JSON format
  4. HTML Report: Interactive web report with filtering and search capabilities
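
Because pandas is already a dependency, the exported data is easy to inspect programmatically. The file name below is an assumption (the actual name depends on your run and output settings), so check the output directory first:

import pandas as pd

# The CSV name is illustrative -- check the studies/ directory for the file
# your run actually produced.
df = pd.read_csv("studies/results.csv")
print(df.head())            # first few study records
print(df.columns.tolist())  # which metadata fields were exported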

📂 Project Structure

science_scraper/
├── main.py                  # Main script
├── downloader.py            # Core downloader class
├── requirements.txt         # Required packages
├── database/                # Database-specific modules
│   ├── __init__.py
│   ├── pubmed.py            # PubMed search module
│   ├── pmc.py               # PMC search module
│   └── ...                  # Other database modules
├── utils/                   # Utility modules
│   ├── __init__.py
│   ├── pdf_generator.py     # PDF generation utilities
│   └── html_report.py       # HTML report generation
└── studies/                 # Output directory
    └── pdfs/                # Downloaded PDFs

🛠️ Customization

You can extend the scraper by:

  1. Adding new database modules in the database/ directory (a starter sketch follows this list)
  2. Modifying the PDF generation in utils/pdf_generator.py
  3. Customizing the HTML report in utils/html_report.py
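
As a starting point for the first item, here is a minimal sketch of a hypothetical new database module. The search() signature, the returned dict fields, and the endpoint are illustrative assumptions; mirror an existing module such as database/pubmed.py for the interface the rest of the scraper actually expects.

# database/examplearchive.py -- hypothetical module; the interface shown here
# is an assumption, so copy the structure of database/pubmed.py instead.
import time
import requests

API_URL = "https://api.example-archive.org/search"  # placeholder endpoint

def search(query, max_results=None, delay=1):
    """Return a list of study metadata dicts for the given query."""
    results = []
    params = {"q": query, "page": 1}
    while max_results is None or len(results) < max_results:
        response = requests.get(API_URL, params=params, timeout=30)
        response.raise_for_status()
        records = response.json().get("records", [])
        if not records:
            break  # no more pages
        for record in records:
            results.append({
                "title": record.get("title"),
                "authors": record.get("authors"),
                "doi": record.get("doi"),
                "pdf_url": record.get("pdf_url"),  # consumed by the PDF download step
            })
        params["page"] += 1
        time.sleep(delay)  # honor the configured request delay
    return results if max_results is None else results[:max_results]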

📝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

🔒 Ethical Use Guidelines

This tool is designed for legitimate research purposes only. Please:

  • Respect the terms of service of all scientific databases
  • Use reasonable request delays to avoid overloading servers
  • Only download open access content that you have the right to access
  • Use the obtained papers for lawful research purposes

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgements

This project was made possible thanks to:

  • Open-source community - For the libraries and frameworks that form the foundation of this project
  • Claude AI 3.7 Sonnet - For assistance with code development, debugging, and documentation
  • Scientific publishing platforms - For making research accessible and inspiring this tool
  • Friends and colleagues - For your support throughout the development process

Created with ❤️ for researchers and scientists worldwide.
