Commit b411f67 ("added readme", committed Jul 24, 2023; parent 9cfe5fa)

1 file changed: +75 −0 lines
Web scraping for book names/README.md
# Web Scraping Books and Creating DataFrame

## Introduction

This Python script scrapes a website listing books in the "Mystery" category and loads the results into a DataFrame for further manipulation and preprocessing. It uses the `requests`, `BeautifulSoup`, and `pandas` libraries for web scraping and data handling.

## Requirements

- Python 3.x
- `requests`
- `beautifulsoup4`
- `pandas`

## Installation

1. Ensure Python 3.x is installed on your system. If not, download it from the official Python website (https://www.python.org/downloads/) and install it.
2. Install the required libraries by running the following command in your terminal or command prompt:

```
pip install requests beautifulsoup4 pandas
```
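To confirm the installation succeeded before running the scraper, a quick version check can be run (a minimal sanity check, not part of the script itself):

```python
# Sanity check: import the three dependencies and print their versions
import requests
import bs4
import pandas

print(requests.__version__)
print(bs4.__version__)
print(pandas.__version__)
```

If any of these imports fails with `ModuleNotFoundError`, re-run the `pip install` step above.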

## How to Use

1. Clone or download the script from the GitHub repository (provide GitHub repository link here).
2. Open the script in your preferred Python IDE or text editor.
3. Modify the `url` variable in the script to point to the first page of the "Mystery" category you want to scrape.
4. Run the script. It will scrape data from every page of the category and store it in a DataFrame.
5. The resulting DataFrame contains each book's title, price, and star rating.

## Usage Example

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# A helper that fetches a URL and returns the parsed HTML
def scrape_url(url):
    response = requests.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')

# Starting URL for the "Mystery" category
base = 'https://books.toscrape.com/catalogue/category/books/mystery_3/'
url = base + 'index.html'

# Extract the data, following the "next" link through every page of the category
data1 = []
while url:
    soup = scrape_url(url)
    ol = soup.find('ol')
    articles = ol.find_all('article', class_='product_pod')

    for article in articles:
        # The <h3> text is truncated, so read the full title from the link's title attribute
        title = article.find('h3').find('a')['title']
        # Look up the price within this article, not the whole page
        price_element = article.find('p', class_='price_color')
        price = price_element.get_text(strip=True)
        # The rating is encoded as a class name, e.g. <p class="star-rating Three">
        star_element = article.find('p', class_='star-rating')
        star = star_element['class'][1] if star_element else None
        data1.append({"Title": title, "Price": price, "Star": star})

    # Move to the next page of the category, if there is one
    next_li = soup.find('li', class_='next')
    url = base + next_li.find('a')['href'] if next_li else None

# Store the data in a DataFrame to easily manipulate and preprocess it
df = pd.DataFrame(data1)
print(df.head())
```

## Output

The script produces a DataFrame with one row per book in the "Mystery" category, containing its title, price, and star rating.
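For downstream analysis, the scraped string columns can be converted to numeric types. A minimal sketch, assuming the `Price` and `Star` fields produced by the script (the sample rows below are hypothetical):

```python
import pandas as pd

# Hypothetical rows in the same shape the scraper produces
df = pd.DataFrame([
    {"Title": "Book A", "Price": "£51.77", "Star": "Three"},
    {"Title": "Book B", "Price": "£22.65", "Star": "Five"},
])

# Strip the currency symbol and convert prices to floats
df["Price"] = df["Price"].str.lstrip("£").astype(float)

# Map the star-rating class names to integers
stars = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
df["Star"] = df["Star"].map(stars)

print(df.dtypes)
```

With numeric columns in place, operations such as `df["Price"].mean()` or sorting by rating work directly.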

## Author

[@Hk669]

Feel free to use and modify this script to suit your needs. If you encounter any issues or have suggestions for improvements, please create an issue or pull request on the GitHub repository. Happy scraping and data analysis!
