
Commit 1fa5ece

Add cmoffitt_elizfitz

1 parent be92c7d commit 1fa5ece

File tree: 4 files changed, +238 −0 lines changed

.DS_Store (binary file, 0 bytes; not shown)

cmoffitt_elizfitz/README.txt

Class: CS41
Date: March 11, 2020
Project Partners: Elizabeth Fitzgerald & Christopher Moffitt
Google Drive URL for Presentation: https://drive.google.com/file/d/1G11pi4g7jpeK87NpjaQiVwEEr8vu6Rvx/view

=========================================
Requirements:
-------------------------
Download the code as is, then run the main Python script, wallscraper.py, with the name of a subreddit (e.g. wallpapers) as a command-line argument.
ex:
python wallscraper.py wallpapers

For the periodic-running extension (more involved):
Save the wallpapers.plist file to the ~/Library/LaunchAgents folder.
Then enter the following in Terminal to load and start the launchd job:
$ launchctl load ~/Library/LaunchAgents/wallpapers.plist
$ launchctl start wallpapers
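
To stop the job later:
$ launchctl stop wallpapers
$ launchctl unload ~/Library/LaunchAgents/wallpapers.plist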

=========================================
Technical Details:
-------------------------
This project performs several tasks: it scrapes data, downloads that data, avoids downloading the same image twice, provides a command-line interface, and runs itself periodically. Each of these tasks was made up of smaller parts that came together to make the whole.
-------------------------
Task 1: Scraping Data
-------------------------
(A) Better familiarize ourselves with JSON objects and how to work with them
--

(B) Write the query code to collect the JSON objects from Reddit
--

(C) Build a class for Reddit posts that stores the most relevant JSON information as attributes
-- To do this, we had to organize the collected JSON data into a single, neat dictionary in the __init__ function. The dictionary only collects certain attributes (those in the attr list). With these attributes in mind, the function goes through each post characteristic scraped from the JSON data and stores only the desired attributes in a dictionary. If a post lacks a particular attribute, that attribute is assigned the value None.
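
A minimal sketch of this pattern, with illustrative names (dict.get supplies the same None default as a try/except):

    attrs = ["title", "score", "url"]
    post = {k: raw_post["data"].get(k) for k in attrs}  # missing keys become None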

-------------------------
Task 2: Downloading Data
-------------------------
(A) Implement the download function in the RedditPost class (moderate)
-- The download function only runs on posts whose URLs contain the string ".jpg" or ".png".
-- The download function sorts images into different folders based on their size, and titles them in the format "wallpapers/[image size]/[title].png".
-- When creating this function, we had to handle the case where the path to a folder does not exist yet. The call os.makedirs(path) accounts for this.
-- Once the path is known to exist, we use the requests package to fetch the content at the post's url and write it to the resulting filepath.
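
A minimal sketch of the download step, assuming url holds the post's image URL (the path and filename here are illustrative):

    import os
    import requests

    path = "wallpapers/1920x1080/"
    os.makedirs(path, exist_ok=True)        # create the folder if it is missing
    img_data = requests.get(url).content    # raw bytes of the image
    with open(os.path.join(path, "title.png"), "wb") as f:
        f.write(img_data)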

(B) Better familiarize ourselves with magic methods and implement the __str__(self) method
-- This method simply allowed us to print post data more cleanly by printing the object itself. It was fairly simple, and only required basic string concatenation.

(C) Test downloading one image
-- We started by downloading just the first collected image post, to confirm that it went to the right folder.

(D) Download all images generated by the initial query
-- Once we downloaded one image correctly, we simply ran all the RedditPost objects through a for loop in the main function.

-------------------------
Task 3: Wallpaper Deduplication
-------------------------
(A) Keep track of previously seen images
-- To keep track of previously seen images across different runs of the program, we used the pickle package.
-- We created a list of seen wallpapers and saved it to the project folder as a pickle file. We then add every new post's downloaded content to the list.

(B) Check new images against those already seen
-- With this file, we can load the list any time we save a new post, scan it to make sure there is no matching content, then append the new wallpaper's content to the list before dumping it back to the pickle file.
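
A minimal sketch of the load/check/dump cycle, assuming the pickle file already exists and img_data holds the new image's bytes:

    import pickle

    with open("seen_wallpapers.pickle", "rb") as f:
        seen_wallpapers = pickle.load(f)

    if img_data not in seen_wallpapers:         # skip duplicates
        seen_wallpapers.append(img_data)
        with open("seen_wallpapers.pickle", "wb") as f:
            pickle.dump(seen_wallpapers, f)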

-------------------------
Task 4: Implementing Command-Line Utility
-------------------------
(A) Allow the user to specify which subreddit's posts to download through command-line arguments
-- We did this by importing the sys package.
-- We then just had to pass sys.argv[1] to our query function.
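
A minimal sketch, where sys.argv[1] is the first argument after the script name:

    import sys

    subreddit = sys.argv[1]   # e.g. "wallpapers" from: python wallscraper.py wallpapers
    json_data = query(subreddit)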
-------------------------
Task 5: Configuring the script to run automatically
-------------------------
(A) Work with macOS LaunchAgents to have our wallscraper script run every hour
-- Create a new .plist file for wallscraper in LaunchAgents and set the start interval to every hour (see wallpapers.plist below).
-- Use Terminal to load and start the launch agent.
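
To confirm the agent is loaded, you can run:
$ launchctl list | grep wallpapers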

cmoffitt_elizfitz/wallpapers.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>wallpapers</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/python</string>
        <string>/Users/cmoffitt/cs41-env/Assignments/FinalProject/lab-5/wallscraper.py</string>
        <string>wallpapers</string>
    </array>
    <!-- Run every hour -->
    <key>StartInterval</key>
    <integer>3600</integer> <!-- seconds -->
</dict>
</plist>

cmoffitt_elizfitz/wallscraper.py

#!/usr/bin/env python3
"""
Reddit Wallscraper
Course: CS 41
Name: Chris Moffit and Elizabeth Fitzgerald
SUNet: cmoffitt and elizfitz

Scrapes image posts from a given subreddit and downloads them into
size-sorted folders, skipping images that have been downloaded before.
"""
import requests
import sys
import re
import os
import pickle


# Uses the requests module to query Reddit for the JSON feed of a subreddit
def query(subreddit):
    URL_START = "https://reddit.com/r/"
    URL_END = ".json"
    url = URL_START + subreddit + URL_END
    print(url)
    headers = {'User-Agent': "Wallscraper Script by @cmoffitt"}

    # Make request and catch exceptions
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print("HTTP Error:", errh)
        sys.exit(1)
    except requests.exceptions.ConnectionError:
        print("Error Connecting: No internet connection")
        sys.exit(1)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error:", errt)
        sys.exit(1)
    except requests.exceptions.RequestException as err:
        print("Oops, something else went wrong:", err)
        sys.exit(1)

    # Capture the JSON dict object of the subreddit if the response was successful
    print(r)
    if r.ok:
        json_data = r.json()
    else:
        print("The server did not return a successful response. Please try again.")
        sys.exit(1)

    # Check that this is a valid subreddit
    if not isValidSubreddit(json_data):
        print("Not a valid subreddit. Please try again.")
        sys.exit(1)

    return json_data

# Class defining one Reddit post
class RedditPost:
    # Initializes one Reddit post as a dictionary storing certain attributes from the JSON post object
    def __init__(self, data):
        post_data = data
        attr = ["subreddit", "is_self", "ups", "post_hint", "title", "downs", "score", "url", "domain", "permalink", "created_utc", "num_comments", "preview", "name", "over_18"]

        # Store only the desired attributes; missing attributes default to None
        attributes = {}
        for k in attr:
            try:
                attributes[k] = post_data["data"][k]
            except KeyError:
                attributes[k] = None

        self.data = attributes

    # Downloads the post image to a file on the computer, preventing duplicate image downloading
    def download(self):
        # Only download if the post actually links to an image
        if self.data["url"] and (".jpg" in self.data["url"] or ".png" in self.data["url"]):
            # Format the correct name and path for the file
            name = re.sub(r'\[.*\]', '', self.data["title"])
            name = re.sub(" ", "", name)
            name = re.sub(r'[^a-zA-Z0-9]', "", name)
            path = "wallpapers/" + str(self.data["preview"]["images"][0]["source"]["width"]) + "x" + str(self.data["preview"]["images"][0]["source"]["height"]) + "/"
            filename = name + ".png"

            if not os.path.exists(path):
                os.makedirs(path)

            # Unique content of this image, saved so duplicates can be detected later
            img_data = requests.get(self.data["url"]).content

            # Load the list of previously seen wallpapers, starting fresh if the pickle file does not exist yet
            if os.path.exists("seen_wallpapers.pickle"):
                with open("seen_wallpapers.pickle", 'rb') as f:
                    seen_wallpapers = pickle.load(f)
            else:
                seen_wallpapers = []

            # Record the new image and save it, unless it has been seen before
            if img_data not in seen_wallpapers:
                seen_wallpapers.append(img_data)
                with open("seen_wallpapers.pickle", 'wb') as f:
                    pickle.dump(seen_wallpapers, f)
                # Save image to file
                with open(os.path.join(path, filename), 'wb') as temp_file:
                    temp_file.write(img_data)

    # Formats the post as "RedditPost(title (score): url)"
    def __str__(self):
        return "RedditPost(" + self.data["title"] + " (" + str(self.data["score"]) + "): " + self.data["url"] + ")"

# Checks for a valid subreddit by making sure the JSON dict object is properly filled with contents
def isValidSubreddit(json_data):
    if json_data['data']['dist'] == 0:
        return False
    else:
        return True


def main(subreddit):
    q = query(subreddit)

    children = q['data']['children']
    postCount = 0   # To confirm we have all 25 "posts"
    scoreCount = 0  # To count posts with a score above 500

    RedditPosts = [RedditPost(post) for post in children]

    for post in RedditPosts:
        postCount += 1
        if post.data["score"] is not None and post.data["score"] > 500:
            scoreCount += 1
        post.download()

    print("There were " + str(postCount) + " posts.")
    print(str(scoreCount) + " of those posts had a score over 500.")


if __name__ == '__main__':
    main(sys.argv[1])
