Second assignment

audiodude · audiodude · commit 18b417f38c39 · 2023-11-20T19:04:44.000-08:00
diff --git a/assignments/series_2/02-web-scraping/02-assignment.ipynb b/assignments/series_2/02-web-scraping/02-assignment.ipynb
@@ -0,0 +1,201 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "3b577099",
+   "metadata": {},
+   "source": [
+    "This is the second assignment of the Noisebridge Python Class! ([Noisebridge Wiki](https://www.noisebridge.net/wiki/PyClass) | [Github](https://github.com/audiodude/PythonClass))\n",
+    "\n",
+    "Here, we'd like to apply what we've learned about Web Scraping, SQL databases and the Pandas library. We will attempt to scrape song data from two different sources and provide a data set from the data we retrieve. Let's get started."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5eedb731",
+   "metadata": {},
+   "source": [
+    "## Data sources\n",
+    "\n",
+    "For this assignment, we will use the Billboard Hot 100 chart as well as the Spotify Weekly Top Songs Global chart.\n",
+    "\n",
+    "[Spotify](https://charts.spotify.com/home)\n",
+    "\n",
+    "[Billboard](https://www.billboard.com/charts/hot-100/)\n",
+    "\n",
+    "Using the Python [requests](https://requests.readthedocs.io/en/latest/) library and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), first write a program that scrapes each of these sites and stores the following data:\n",
+    "\n",
+    "- Song name\n",
+    "- Artist name\n",
+    "- Chart position\n",
+    "- Date scraped\n",
+    "\n",
+    "For \"date scraped\" you can use the Python time library to get a UNIX timestamp:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "b5224149",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1699323570\n"
+     ]
+    }
+   ],
+   "source": [
+    "import time\n",
+    "print(int(time.time()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4d8eb8f7",
+   "metadata": {},
+   "source": [
+    "Remember the following when web scraping:\n",
+    "\n",
+    "- Use the requests library to GET the source code of the site at the given URL\n",
+    "- Pass the HTML code of the site to Beautiful Soup to create a queryable model of the page\n",
+    "- Use the developer tools of your web browser to find the elements/classes/ids that the page uses to wrap the data\n",
+    "    - Hint: If you can't find an exact classname that represents a piece of a data, try iterating over all wrappers of that data\n",
+    "- Use Beautiful Soup _selectors_ to grab the elements that you found and get the data inside"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d6168bf0",
+   "metadata": {},
+   "source": [
+    "You should structure your code using a function which returns each result, so that you can call this function in later parts of this assignment."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a04db799",
+   "metadata": {},
+   "source": [
+    "## Storing in SQL\n",
+    "\n",
+    "Now create a sqlite database using the schema we provide below. You can load the schema into a database called `songs.sqlite3` using the following code. The database file is stored next to this Python notebook in the Jupyterhub."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "7d2d49ac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sqlite3\n",
+    "from contextlib import closing\n",
+    "\n",
+    "with closing(sqlite3.connect('songs.sqlite3')) as db:\n",
+    "    with closing(db.cursor()) as cursor:\n",
+    "        cursor.execute('''\n",
+    "            CREATE TABLE IF NOT EXISTS songs (\n",
+    "               id INTEGER PRIMARY KEY,\n",
+    "               title VARCHAR(255),\n",
+    "               artist VARCHAR(255),\n",
+    "               billboard_rank INTEGER,\n",
+    "               spotify_rank INTEGER,\n",
+    "               scraped_on TIMESTAMP\n",
+    "            )\n",
+    "        ''')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd5b3ab2",
+   "metadata": {},
+   "source": [
+    "## Saving the songs to the sqlite database\n",
+    "\n",
+    "Once you've scraped the songs, you should write code that inserts a row into the sqlite database for each song. Remember to use placeholders to prevent opening your code up to SQL Injection attacks."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bee5074c",
+   "metadata": {},
+   "source": [
+    "## Creating a Pandas dataframe from the results\n",
+    "\n",
+    "Just like the above, except this time create a Pandas dataframe from the scraping results."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0ac63f22",
+   "metadata": {},
+   "source": [
+    "## Submitting\n",
+    "To submit the assignment, follow these steps:\n",
+    "\n",
+    "1. In your Jupyter Notebook (this website), go to File -> Download as -> Notebook (.ipynb)\n",
+    "1. Save that file somewhere on your computer where you can re-upload it.\n",
+    "1. Follow this link to Dropbox: https://www.dropbox.com/request/Lk6adhuFEwSxxGX5tBX6 . The password is 'nbhack'.\n",
+    "1. Upload your file in that folder. Only the course organizers (currently only Travis B) have access. No one else will see your submission, not even other students or yourself.\n",
+    "1. If you realized you made a mistake and would like to submit again, that's fine, just try to change the file name before uploading again.\n",
+    "\n",
+    "If you'd like to receive feedback on your assignment, you can type your email address here:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "a5ee2a2b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "email = 'you@example.com'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3feb9e6e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here. This structure is just a suggestion to get you started, feel free to replace it.\n",
+    "import requests\n",
+    "from bs4 import BeautifulSoup\n",
+    "\n",
+    "def get_song(soup):\n",
+    "    pass\n",
+    "\n",
+    "def scrape_billboard():\n",
+    "    pass\n",
+    "\n",
+    "def scrape_spotify():\n",
+    "    pass"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}