
Commit 54a4fb0

as far as i can tell, these aren't changing squat (#149)
Resolving conflicts on an unapproved PR; just going to merge
1 parent 3d7b501 commit 54a4fb0

19 files changed: +5327 −5327 lines changed

Lines changed: 55 additions & 55 deletions

@@ -1,55 +1,55 @@
(The 55 removed lines and the 55 added lines are identical; the unchanged file content follows.)
import sys
import os
from pathlib import Path

# Put the ancestor directory that holds the shared `common` module on the import path.
p = Path(__file__).resolve().parents[5]
sys.path.insert(1, str(p))
from common import list_pdf_v2

"""
SETUP HOW-TO:
Step 1: Set webpage to the page you want to scrape.
Step 2: Click the links that lead to the files and copy their paths.
        For example, http://www.beverlyhills.org/cbhfiles/storage/files/long_num/file.pdf would become /cbhfiles/storage/files/long_num/
        **NOTE:** Ensure that the files all share the same path; otherwise remove a level until they match.
        Also ensure that the domain stays the same (I've seen some sites use an AWS bucket for one file and on-site storage for another).
        Verify* on the page that the href to the file contains the domain; if it doesn't, add the domain to the `domain` setting.
Step 3: If the domain is not in the href, set domain_included to False; otherwise set it to True.
Step 4: If you set domain_included to False, you need to add the domain, from the http(s) scheme through the top-level domain (TLD: .com, .edu, etc.);
        otherwise, you can leave it blank.
Step 5: Set sleep_time to the desired integer. Best practice is to set it to the crawl-delay in the website's `robots.txt`.
        Most departments do not seem to specify a crawl-delay, so leave it at 5 if there isn't one.
Step 6: (Only applies to list_pdf_v3) If there are any documents that you *don't* want to scrape from the page,
        list the words that are **unique** to them.
Step 7: "debug" will make the scraper more verbose, but is generally unhelpful to the average user. Leave it False unless you're having issues.
        "csv_dir" is better explained in the readme.

\* Verify this in your browser's developer pane using "select element" (a.k.a. node select).

EXAMPLE CONFIG:
configs = {
    "webpage": "http://www.beverlyhills.org/departments/policedepartment/crimeinformation/crimestatistics/web.jsp",
    "web_path": "/cbhfiles/storage/files/",
    "domain_included": False,
    "domain": "http://www.beverlyhills.org",
    "sleep_time": 5,
    "non_important": ["emergency", "training", "guidelines"],
    "debug": False,
    "csv_dir": "/csv/",
}
"""

configs = {
    "webpage": "",
    "web_path": "",
    "domain_included": False,
    "domain": "",
    "sleep_time": 5,
    "debug": False,
    "csv_dir": "/csv/",
}

save_dir = "./data/"

list_pdf_v2(configs, save_dir)
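
For Step 5 above, here is a minimal sketch of how a site's crawl-delay can be looked up with Python's standard `urllib.robotparser` before picking `sleep_time`. This is not part of the repo's scraper code; the Beverly Hills URL is only the example domain from the config above.

```python
# Look up the crawl-delay requested in robots.txt (Step 5); fall back to 5 if none is set.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.beverlyhills.org/robots.txt")
rp.read()

delay = rp.crawl_delay("*")  # None when the site doesn't specify a crawl-delay
sleep_time = int(delay) if delay else 5
print(sleep_time)
```
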
Lines changed: 54 additions & 54 deletions

@@ -1,54 +1,54 @@
(The 54 removed lines and the 54 added lines are identical; the unchanged file content follows.)
import sys
import os
from pathlib import Path

# Put the ancestor directory that holds the shared `common` module on the import path.
p = Path(__file__).resolve().parents[5]
sys.path.insert(1, str(p))
from common import list_pdf_v3

"""
SETUP HOW-TO:
Step 1: Set webpage to the page you want to scrape.
Step 2: Click the links that lead to the files and copy their paths.
        For example, http://www.beverlyhills.org/cbhfiles/storage/files/long_num/file.pdf would become /cbhfiles/storage/files/long_num/
        **NOTE:** Ensure that the files all share the same path; otherwise remove a level until they match.
        Also ensure that the domain stays the same (I've seen some sites use an AWS bucket for one file and on-site storage for another).
        Verify* on the page that the href to the file contains the domain; if it doesn't, add the domain to the `domain` setting.
Step 3: If the domain is not in the href, set domain_included to False; otherwise set it to True.
Step 4: If you set domain_included to False, you need to add the domain, from the http(s) scheme through the top-level domain (TLD: .com, .edu, etc.);
        otherwise, you can leave it blank.
Step 5: Set sleep_time to the desired integer. Best practice is to set it to the crawl-delay in the website's `robots.txt`.
        Most departments do not seem to specify a crawl-delay, so leave it at 5 if there isn't one.
Step 6: (Only applies to list_pdf_v3) If there are any documents that you *don't* want to scrape from the page,
        list the words that are **unique** to them.
Step 7: "debug" will make the scraper more verbose, but is generally unhelpful to the average user. Leave it False unless you're having issues.
        "csv_dir" is better explained in the readme.
Step 8: If you don't like where the scraper is saving the data, you can change this path, either by replacing it completely or by adding subfolders; both are supported.

EXAMPLE CONFIG:
configs = {
    "webpage": "http://www.beverlyhills.org/departments/policedepartment/crimeinformation/crimestatistics/web.jsp",
    "web_path": "/cbhfiles/storage/files/",
    "domain_included": False,
    "domain": "http://www.beverlyhills.org",
    "sleep_time": 5,
    "non_important": ["emergency", "training", "guidelines"],
    "debug": False,
    "csv_dir": "/csv/",
}
"""

configs = {
    "webpage": "",
    "web_path": "",
    "domain_included": False,
    "domain": "",
    "sleep_time": 5,
    "non_important": [],
    "debug": False,
    "csv_dir": "/csv/",
}

save_dir = "./data/"

list_pdf_v3(configs, save_dir)
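
Step 6 above describes skipping documents whose links contain keywords unique to them. As an illustration only (not the actual `list_pdf_v3` implementation), a keyword filter over a list of hrefs could look like the following; the example hrefs are made up.

```python
# Illustration of the "non_important" idea: drop any link whose URL contains a listed keyword.
non_important = ["emergency", "training", "guidelines"]

hrefs = [
    "/cbhfiles/storage/files/123/annual_crime_report.pdf",  # keep
    "/cbhfiles/storage/files/456/training_bulletin.pdf",    # skip ("training")
]

wanted = [
    href for href in hrefs
    if not any(word in href.lower() for word in non_important)
]
print(wanted)  # only the annual crime report remains
```
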
Lines changed: 13 additions & 13 deletions

@@ -1,13 +1,13 @@
(The 13 removed lines and the 13 added lines are identical; the unchanged file content follows.)
# Setup

Within `configs.py`:
1. Set `url` to the target URL.
1. Set `department_code` to the first few letters of the URL, in all capitals. For example, the `department_code` of `https://hsupd.crimegraphics.com/2013/default.aspx` would be `HSUPD`.
1. `list_header` shouldn't need any changes, as it just translates the columns into our `Fields`.

# Module

The `crimegraphics_scraper` module requires two arguments: the `configs` and the `save_dir`. Should you want performance stats, add `stats=True` as an argument.

# Info

The scripts should likely be run daily. They will only save the data if the hash (generated from the table) is different. Otherwise, they will simply exit.
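
The Info section above says data is only saved when the hash generated from the table changes. A rough sketch of that idea follows; the function name, hash-file layout, and arguments are hypothetical and are not taken from the repo's `crimegraphics_scraper` code.

```python
# Sketch of "save only when the table hash changes": compare a SHA-256 of the
# scraped table against the hash recorded on the previous run.
import hashlib
from pathlib import Path


def save_if_changed(table_html: str, save_dir: str, name: str = "HSUPD") -> bool:
    """Write the table and its hash only when the table differs from the last run."""
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)

    new_hash = hashlib.sha256(table_html.encode("utf-8")).hexdigest()
    hash_file = out / f"{name}.hash"

    if hash_file.exists() and hash_file.read_text() == new_hash:
        return False  # nothing changed since the last run; simply exit

    (out / f"{name}.html").write_text(table_html)
    hash_file.write_text(new_hash)
    return True
```
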
