Commit 18b417f

committed Nov 21, 2023
Second assignment
1 parent 63b7a97 commit 18b417f

File tree

1 file changed (+201, -0 lines)
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3b577099",
   "metadata": {},
   "source": [
    "This is the second assignment of the Noisebridge Python Class! ([Noisebridge Wiki](https://www.noisebridge.net/wiki/PyClass) | [Github](https://github.com/audiodude/PythonClass))\n",
    "\n",
    "Here, we'd like to apply what we've learned about web scraping, SQL databases and the Pandas library. We will attempt to scrape song data from two different sources and build a data set from the data we retrieve. Let's get started."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5eedb731",
   "metadata": {},
   "source": [
    "## Data sources\n",
    "\n",
    "For this assignment, we will use the Billboard Hot 100 chart as well as the Spotify Weekly Top Songs Global chart.\n",
    "\n",
    "[Spotify](https://charts.spotify.com/home)\n",
    "\n",
    "[Billboard](https://www.billboard.com/charts/hot-100/)\n",
    "\n",
    "Using the Python [requests](https://requests.readthedocs.io/en/latest/) library and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), first write a program that scrapes each of these sites and stores the following data:\n",
    "\n",
    "- Song name\n",
    "- Artist name\n",
    "- Chart position\n",
    "- Date scraped\n",
    "\n",
    "For \"date scraped\" you can use the Python `time` library to get a UNIX timestamp:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b5224149",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1699323570\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "print(int(time.time()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d8eb8f7",
   "metadata": {},
   "source": [
    "Remember the following when web scraping:\n",
    "\n",
    "- Use the requests library to GET the source code of the site at the given URL\n",
    "- Pass the HTML code of the site to Beautiful Soup to create a queryable model of the page\n",
    "- Use the developer tools of your web browser to find the elements/classes/ids that the page uses to wrap the data\n",
    "  - Hint: If you can't find an exact classname that represents a piece of data, try iterating over all wrappers of that data\n",
    "- Use Beautiful Soup _selectors_ to grab the elements that you found and get the data inside"
   ]
  },
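  {
   "cell_type": "markdown",
   "id": "f0e1d2c3",
   "metadata": {},
   "source": [
    "As a rough sketch (an illustration, not the assignment's solution), the steps above might fit together like this. The `row_selector` parameter is a hypothetical placeholder: inspect each chart with your browser's developer tools to find the real selectors, and pull out the artist name the same way you pull out the title."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3d4e5f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "\n",
    "def scrape_chart(url, row_selector):\n",
    "    # GET the page source and parse it into a queryable soup\n",
    "    html = requests.get(url).text\n",
    "    soup = BeautifulSoup(html, 'html.parser')\n",
    "    songs = []\n",
    "    # row_selector is a CSS selector you found via the developer tools\n",
    "    for rank, row in enumerate(soup.select(row_selector), start=1):\n",
    "        songs.append({\n",
    "            'title': row.get_text(strip=True),\n",
    "            'rank': rank,\n",
    "            'scraped_on': int(time.time()),\n",
    "        })\n",
    "    return songs"
   ]
  },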
  {
   "cell_type": "markdown",
   "id": "d6168bf0",
   "metadata": {},
   "source": [
    "You should structure your code using a function that returns each result, so that you can call this function in later parts of this assignment."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a04db799",
   "metadata": {},
   "source": [
    "## Storing in SQL\n",
    "\n",
    "Now create a SQLite database using the schema we provide below. You can load the schema into a database called `songs.sqlite3` using the following code. The database file is stored next to this Python notebook in the JupyterHub."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7d2d49ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "from contextlib import closing\n",
    "\n",
    "with closing(sqlite3.connect('songs.sqlite3')) as db:\n",
    "    with closing(db.cursor()) as cursor:\n",
    "        cursor.execute('''\n",
    "            CREATE TABLE IF NOT EXISTS songs (\n",
    "                id INTEGER PRIMARY KEY,\n",
    "                title VARCHAR(255),\n",
    "                artist VARCHAR(255),\n",
    "                billboard_rank INTEGER,\n",
    "                spotify_rank INTEGER,\n",
    "                scraped_on TIMESTAMP\n",
    "            )\n",
    "        ''')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd5b3ab2",
   "metadata": {},
   "source": [
    "## Saving the songs to the sqlite database\n",
    "\n",
    "Once you've scraped the songs, you should write code that inserts a row into the sqlite database for each song. Remember to use placeholders so that you don't open your code up to SQL injection attacks."
   ]
  },
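  {
   "cell_type": "markdown",
   "id": "aa11bb22",
   "metadata": {},
   "source": [
    "As a minimal sketch, an insert with `?` placeholders might look like the following. It assumes a hypothetical list of song dicts (like the ones returned by your scraping function) and only fills the Billboard columns; adapt it to your own data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb22cc33",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "from contextlib import closing\n",
    "\n",
    "\n",
    "def save_songs(songs):\n",
    "    with closing(sqlite3.connect('songs.sqlite3')) as db:\n",
    "        with closing(db.cursor()) as cursor:\n",
    "            # The '?' placeholders let sqlite3 escape each value,\n",
    "            # which prevents SQL injection\n",
    "            cursor.executemany(\n",
    "                'INSERT INTO songs (title, artist, billboard_rank, scraped_on)'\n",
    "                ' VALUES (?, ?, ?, ?)',\n",
    "                [(s['title'], s['artist'], s['rank'], s['scraped_on'])\n",
    "                 for s in songs])\n",
    "        db.commit()"
   ]
  },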
  {
   "cell_type": "markdown",
   "id": "bee5074c",
   "metadata": {},
   "source": [
    "## Creating a Pandas dataframe from the results\n",
    "\n",
    "Just like the above, except this time create a Pandas dataframe from the scraping results."
   ]
  },
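  {
   "cell_type": "markdown",
   "id": "cc33dd44",
   "metadata": {},
   "source": [
    "For example (with made-up data), a list of song dicts can be passed straight to the `pandas.DataFrame` constructor, which builds one column per dict key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd44ee55",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical scraping results; use your real ones instead\n",
    "songs = [\n",
    "    {'title': 'Song A', 'artist': 'Artist A', 'rank': 1},\n",
    "    {'title': 'Song B', 'artist': 'Artist B', 'rank': 2},\n",
    "]\n",
    "df = pd.DataFrame(songs)\n",
    "print(df)"
   ]
  },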
  {
   "cell_type": "markdown",
   "id": "0ac63f22",
   "metadata": {},
   "source": [
    "## Submitting\n",
    "\n",
    "To submit the assignment, follow these steps:\n",
    "\n",
    "1. In your Jupyter Notebook (this website), go to File -> Download as -> Notebook (.ipynb)\n",
    "1. Save that file somewhere on your computer where you can re-upload it.\n",
    "1. Follow this link to Dropbox: https://www.dropbox.com/request/Lk6adhuFEwSxxGX5tBX6 . The password is 'nbhack'.\n",
    "1. Upload your file in that folder. Only the course organizers (currently only Travis B) have access. No one else will see your submission, not even other students or yourself.\n",
    "1. If you realize you made a mistake and would like to submit again, that's fine; just change the file name before uploading again.\n",
    "\n",
    "If you'd like to receive feedback on your assignment, you can type your email address here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "a5ee2a2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "email = 'you@example.com'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3feb9e6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here. This structure is just a suggestion to get you started; feel free to replace it.\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def get_song(soup):\n",
    "    pass\n",
    "\n",
    "def scrape_billboard():\n",
    "    pass\n",
    "\n",
    "def scrape_spotify():\n",
    "    pass"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
