| 1 | +**Title: Web Scraping Books and Creating a DataFrame** |
| 2 | + |
| 3 | +## Introduction |
| 4 | +This Python script scrapes the "Mystery" category of books.toscrape.com and loads the results into a pandas DataFrame for further manipulation and preprocessing. It uses `requests` to fetch each page, `BeautifulSoup` to parse the HTML, and `pandas` to hold the extracted data. |
| 5 | + |
| 6 | +## Requirements |
| 7 | +- Python 3.x |
| 8 | +- requests library |
| 9 | +- BeautifulSoup library (installed as `beautifulsoup4`, imported as `bs4`) |
| 10 | +- pandas library |
| 11 | + |
| 12 | +## Installation |
| 13 | +1. Ensure you have Python 3.x installed on your system. If not, download it from the official Python website (https://www.python.org/downloads/) and install it. |
| 14 | +2. Install the required libraries by running the following commands in your terminal or command prompt: |
| 15 | +``` |
| 16 | +pip install requests |
| 17 | +pip install beautifulsoup4 |
| 18 | +pip install pandas |
| 19 | +``` |
| 20 | + |
| 21 | +## How to Use |
| 22 | +1. Clone or download the script from the GitHub repository (provide GitHub repository link here). |
| 23 | +2. Open the script using your favorite Python IDE or text editor. |
| 24 | +3. Modify the `url` variable in the script to point to the starting page of the "Mystery" books category you want to scrape. |
| 25 | +4. Run the script. It will scrape data from multiple pages of the category and store it in a DataFrame. |
| 26 | +5. The resulting DataFrame will contain information about book titles, prices, and star ratings. |
| 27 | + |
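Steps 3 and 4 rely on moving from the starting page to the pages that follow it. A minimal sketch of that step, assuming each category page links to the next one with a relative href such as `page-2.html` (an assumption about the site's markup, not something the script states). The helper name `resolve_next` is hypothetical.

```python
from urllib.parse import urljoin

# Starting page of the category (the same URL used in the usage example).
start_url = "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html"


def resolve_next(current_url, next_href):
    """Turn a relative "next" href into an absolute URL.

    Returns None when there is no next link, which signals the last page.
    """
    if not next_href:
        return None
    return urljoin(current_url, next_href)


# "page-2.html" is an assumed href value, used here only for illustration.
print(resolve_next(start_url, "page-2.html"))
# → https://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html
```

Resolving the link against the current page's URL keeps the scraper inside the category, whichever starting page `url` points to.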
| 28 | +## Usage Example |
| 29 | +```python |
| 30 | +import requests |
| 31 | +from bs4 import BeautifulSoup |
| 32 | +import pandas as pd |
| 33 | +from urllib.parse import urljoin |
| 34 | + |
| 35 | +# Fetch a page and return its parsed HTML |
| 36 | +def scrape_url(url): |
| 37 | +    response = requests.get(url, timeout=10) |
| 38 | +    response.raise_for_status()  # fail early on HTTP errors |
| 39 | +    return BeautifulSoup(response.content, 'html.parser') |
| 40 | + |
| 41 | +# Starting URL for the "Mystery" category |
| 42 | +url = 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html' |
| 43 | + |
| 44 | +# Extract data from every page of the category |
| 45 | +data1 = [] |
| 46 | +while url: |
| 47 | +    soup = scrape_url(url) |
| 48 | +    articles = soup.find_all('article', class_='product_pod') |
| 49 | + |
| 50 | +    for article in articles: |
| 51 | +        # The link's title attribute holds the full, untruncated title |
| 52 | +        title = article.h3.a['title'] |
| 53 | +        # Search within the article so each book gets its own price |
| 54 | +        price = article.find('p', class_='price_color').get_text(strip=True) |
| 55 | +        # The rating is the second class, e.g. "star-rating Three" |
| 56 | +        star_element = article.find('p', class_='star-rating') |
| 57 | +        star = star_element['class'][1] if star_element else None |
| 58 | +        data1.append({"title": title, "Price": price, "Star": star}) |
| 59 | + |
| 60 | +    # Follow the "next" link until the last page is reached |
| 61 | +    next_link = soup.select_one('li.next a') |
| 62 | +    url = urljoin(url, next_link['href']) if next_link else None |
| 63 | + |
| 64 | +# Store the data in a DataFrame for easy manipulation and preprocessing |
| 65 | +df = pd.DataFrame(data1) |
| 66 | +``` |
| 67 | + |
| 68 | +## Output |
| 69 | +The script produces a DataFrame with one row per scraped book and three columns: `title` (the book title), `Price` (the price as text, including the currency symbol), and `Star` (the star rating as a word, such as `Three`). |
| 70 | + |
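Because the prices and star ratings arrive as text, a typical first preprocessing step is to convert them to numbers. The following is a minimal sketch of that conversion, not part of the original script: the helper names `parse_price` and `star_to_int` are hypothetical, and the sample row is made up for illustration.

```python
# Hypothetical helpers for illustration; they are not part of the script.
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text):
    """Keep only digits and the decimal point, then convert to float.

    This drops the currency symbol and any stray encoding characters.
    """
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return float(cleaned)


def star_to_int(word):
    """Map a rating word such as "Three" to 3; unknown values give None."""
    return STAR_WORDS.get(word)


# Made-up row in the same shape as the scraped records.
row = {"title": "Example Book", "Price": "£51.77", "Star": "Three"}
print(parse_price(row["Price"]), star_to_int(row["Star"]))
# → 51.77 3
```

With the DataFrame from the script, the same helpers could be applied column-wise, for example `df["Price"].map(parse_price)` and `df["Star"].map(star_to_int)`.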
| 71 | + |
| 72 | +## Author |
| 73 | +[@Hk669]. |
| 74 | + |
| 75 | +Feel free to use and modify this script as per your requirements. If you encounter any issues or have suggestions for improvements, please don't hesitate to create an issue or pull request on the GitHub repository. Happy scraping and data analysis! |