This Java-based web scraper extracts metadata and PDF links from NIPS (NeurIPS) conference papers. It stores the data in a CSV file and downloads the PDFs to local directories by year.
- Scrapes paper metadata (title, authors, year, PDF link) from multiple years of NIPS.
- Downloads PDFs into year-specific folders.
- Stores metadata in papers_metadata.csv.
- Retry mechanism with exponential backoff for network errors.
- Progress bar for individual paper downloads and overall progress.
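The metadata extraction can be sketched with Jsoup. In the real scraper the `Document` would come from `Jsoup.connect(url).get()`; here a string is parsed so the sketch runs offline. The `h4.paper-title` selector is an assumption — the actual NeurIPS page markup may use different tags and classes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetadataSketch {
    // Pull a paper title out of proceedings-page HTML.
    // "h4.paper-title" is an assumed selector, not the verified NeurIPS markup.
    public static String extractTitle(String html) {
        Document doc = Jsoup.parse(html);
        Element title = doc.selectFirst("h4.paper-title");
        return title == null ? "" : title.text();
    }
}
```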
- Java 8 or higher
- Jsoup for HTML parsing
- Apache HttpClient for HTTP requests
- Clone this repository:
git clone https://github.com/yourusername/nips-papers-scraper.git
- Add the required dependencies (Jsoup and Apache HttpClient) to your project.
- Compile and run the Scraper class.
- The scraper will:
  - Scrape metadata and download PDFs from NIPS papers.
  - Store metadata in papers_metadata.csv.
  - Create year-specific directories to save the PDFs.
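The year-specific layout above can be sketched as a small path helper. The sanitization scheme below is an assumption for illustration, not the project's exact naming rule:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PdfPathSketch {
    // Build a year-specific save path such as 2017/attention_is_all_you_need.pdf.
    // The title-sanitization rules here are assumed, not taken from the project.
    public static Path pdfPath(int year, String title) {
        String safe = title.toLowerCase()
                .replaceAll("[^a-z0-9]+", "_")  // collapse non-alphanumerics to _
                .replaceAll("^_|_$", "");       // trim leading/trailing underscores
        return Paths.get(String.valueOf(year), safe + ".pdf");
    }
}
```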
- CSV format: "Title", "Year", "Authors", "PDF Link"
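A row in that format might be assembled as below, with RFC 4180-style quoting so commas inside author lists and quotes inside titles stay intact. The helper names are hypothetical:

```java
public class CsvRowSketch {
    // Quote one field per RFC 4180: wrap in quotes, double any embedded quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join fields into one line matching the "Title","Year","Authors","PDF Link" layout.
    public static String row(String title, String year, String authors, String pdfLink) {
        return String.join(",", quote(title), quote(year), quote(authors), quote(pdfLink));
    }
}
```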
Progress bars are shown for each paper download and overall progress. Updates are printed in the terminal during the download process.
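A terminal progress bar of the kind described can be rendered as a plain string; printing it with a leading carriage return (`"\r"`) redraws it in place. This is a generic sketch, not the project's exact rendering:

```java
public class ProgressBarSketch {
    // Render a fixed-width text bar, e.g. [#####-----] 50%.
    public static String render(long done, long total, int width) {
        int filled = total == 0 ? width : (int) (width * done / total);
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < width; i++) sb.append(i < filled ? '#' : '-');
        int pct = total == 0 ? 100 : (int) (100 * done / total);
        return sb.append("] ").append(pct).append('%').toString();
    }
}
```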
- Thread pool size: adjust the thread pool size (newFixedThreadPool(10)) for more or fewer threads.
- CSV path: change the CSV_FILE_PATH constant to customize the CSV location.
- Retries and timeouts: adjust the retry count and timeouts with constants like MAX_RETRIES and TIMEOUT.
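The retry-with-exponential-backoff behavior those constants control can be sketched as a generic wrapper. The constant values and the doubling schedule below are assumptions, not the project's exact configuration:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Assumed values mirroring the README's MAX_RETRIES / TIMEOUT-style knobs.
    static final int MAX_RETRIES = 3;
    static final long BASE_DELAY_MS = 200;

    // Run a task; on failure, retry after BASE, 2*BASE, 4*BASE, ... milliseconds.
    public static <T> T withRetries(Callable<T> task) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < MAX_RETRIES) {
                    Thread.sleep(BASE_DELAY_MS << attempt); // exponential backoff
                }
            }
        }
        throw last; // all attempts failed
    }
}
```

In the scraper itself, each download task submitted to the thread pool would be wrapped this way so transient network errors are retried rather than failing the run.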
Published by basim-12