This Python script scrapes links, images, titles, and email addresses from a webpage and saves the extracted data to an Excel file.
- Extracts inner and outer links from the webpage.
- Downloads images from the webpage and saves them locally.
- Retrieves titles and emails from the webpage.
- Generates an Excel file containing the extracted data.
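As an illustration of the inner/outer link split listed above, here is a minimal sketch. It is not the script's own code: the script relies on BeautifulSoup, while this example uses only the standard library so it runs without extra packages. The names `LinkCollector` and `split_links` are hypothetical, and treating "inner" as "same host as the page" is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_links(html, page_url):
    """Return (inner, outer) lists of absolute http(s) links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc
    inner, outer = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        # Assumption: a link is "inner" when it points at the page's own host.
        (inner if parsed.netloc == page_host else outer).append(absolute)
    return inner, outer


sample = (
    '<a href="/about">About</a> '
    '<a href="https://other.org/x">X</a> '
    '<a href="mailto:a@b.c">Mail</a>'
)
inner, outer = split_links(sample, "https://example.com/blog/post")
print(inner)
print(outer)
```

Comparing hosts exactly means that `www.example.com` and `example.com` count as different sites; a real scraper may want to normalize them first.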
- Python 3.x
- Selenium
- Requests
- BeautifulSoup4
- Pandas
- ChromeDriver (for Selenium WebDriver)
- Clone or download the repository.
- Install the required Python packages using pip:
    pip install -r requirements.txt
  Make sure to run this command from the directory where the requirements.txt file is located.
- Use the ChromeDriver included in the Git repository, or download the latest ChromeDriver executable and place it in your system PATH, then specify the path to it in the script.
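Before running the scraper, it can help to confirm that a ChromeDriver executable is actually reachable. The following is a minimal sketch using only the standard library; `find_chromedriver` is a hypothetical helper, not part of the script, and the executable name `chromedriver` is assumed.

```python
import os
import shutil


def find_chromedriver(explicit_path=None):
    """Return a usable ChromeDriver path, or None if none is found."""
    # 1. An explicitly configured path wins if it points at a real file.
    if explicit_path and os.path.isfile(explicit_path):
        return os.path.abspath(explicit_path)
    # 2. Otherwise search the directories listed in PATH.
    return shutil.which("chromedriver")


path = find_chromedriver()
print(path or "ChromeDriver not found; check PATH or pass its location explicitly")
```

The returned path is what you would then supply wherever the script expects the ChromeDriver location.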
- Instantiate the webScraper class with the URL of the webpage you want to scrape.
    url = "https://example.com"
    bot = webScraper(url)
- Run all scraping functions using the runAllFunctions() method.
    bot.runAllFunctions()
- Generate the Excel file containing the scraped data using the makeExcelSheet() method.
    bot.makeExcelSheet()
url = "https://www.monolithai.com/blog/4-ways-ai-is-changing-the-packaging-industry"
bot = webScraper(url)
bot.runAllFunctions()
bot.makeExcelSheet()
If you run the snippet above (which is included in the Python file by default), you get:
- An Excel file named monolithai.xlsx containing the scraped data.
- A directory named monolithai containing the images saved from the webpage.
An example Excel file and images directory are included in the repository.
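The output names in this example match the site's domain. How the script derives them is not documented here; the sketch below shows one plausible approach, assuming the name is the first host label after an optional "www". The function `output_name` and the fallback value "output" are hypothetical.

```python
from urllib.parse import urlparse


def output_name(url):
    """Derive a short name such as 'monolithai' from a URL's host."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]  # drop a leading "www"
    # Assumption: use the first remaining label. A subdomain such as
    # "blog.example.com" would therefore yield "blog".
    return labels[0] if labels and labels[0] else "output"


url = "https://www.monolithai.com/blog/4-ways-ai-is-changing-the-packaging-industry"
name = output_name(url)
print(f"{name}.xlsx")
```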
Note: Please make sure to have proper permissions to create directories and write files in the script execution directory.
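To check these permissions up front, a small pre-flight test along the following lines may be useful. This is a sketch, not part of the script; `ensure_writable_dir` is a hypothetical helper.

```python
import os


def ensure_writable_dir(path):
    """Create path if needed and report whether files can be written there."""
    try:
        os.makedirs(path, exist_ok=True)
    except OSError:
        # For example: permission denied, or a regular file already has that name.
        return False
    # Writing into a directory needs both write and search (execute) permission.
    return os.access(path, os.W_OK | os.X_OK)


if ensure_writable_dir("."):
    print("Current directory is writable")
else:
    print("Cannot write to the current directory; run the script elsewhere")
```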
- Abhai Matta