## Setup
To develop and test locally, you'll need **Python>=3.9** and **Spark**.
### JRE
Spark requires a 64-bit Java JRE (v8, 11, or 17 for Spark 3.5.7). Install this first. If you have an Apple Silicon device, Azul Zulu JRE is recommended for native architecture support. Ensure that either `java` is on your `$PATH` or the `$JAVA_HOME` env var points to your JRE.
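
A quick way to check which JRE Spark will pick up (a sketch; the Zulu path below is an example install location, not a requirement):

```bash
# confirm a 64-bit JRE is visible
java -version

# or point Spark at a specific JRE, e.g. an Azul Zulu 17 install on macOS
export JAVA_HOME="/Library/Java/JavaVirtualMachines/zulu-17.jre/Contents/Home"
export PATH="$JAVA_HOME/bin:$PATH"
```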
### Python dependencies
Assuming you have Python already set up and a venv activated, install the `cc-pyspark` dependencies:
```
pip install -r requirements.txt
```
#### If you want to query the columnar index:
In addition to the above, [install S3 support libraries](#installation-of-s3-support-libraries) so that Spark can load the columnar index from S3.
### Spark
There are two ways to obtain Spark:
* manual installation / preinstallation
* as a pip package with `pip install`
#### For simple development or to get started quickly, the `pip install` route is recommended:
```bash
pip install pyspark==3.5.7
```
This will install v3.5.7 of [the PySpark python package](https://spark.apache.org/docs/latest/api/python/getting_started/index.html), which includes a local/client-only version of Spark and also adds `spark-submit` and `pyspark` to your `$PATH`.
> If you need to interact with a remote Spark cluster, use a version of PySpark that matches the cluster version.
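
For example, if your cluster runs Spark 3.3.2 (a hypothetical version, for illustration only):

```bash
pip install pyspark==3.3.2
```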
#### If Spark is already installed or if you want full tooling to configure a local Spark cluster:
Install Spark (see the [Spark documentation](https://spark.apache.org/docs/latest/) for guidance). Then ensure that `spark-submit` and `pyspark` are on your `$PATH`, or prefix them with `$SPARK_HOME/bin` when running, e.g. `$SPARK_HOME/bin/spark-submit`.
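
A minimal sketch for a manual installation (the unpack location below is an assumption; adjust it to wherever you extracted Spark):

```bash
export SPARK_HOME=/opt/spark-3.5.7-bin-hadoop3
export PATH="$SPARK_HOME/bin:$PATH"
spark-submit --version
```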
> Note: The PySpark package is required if you want to run the tests in `test/`.
## Compatibility and Requirements
Tested with Spark 3.2.3, 3.3.2, 3.4.1, 3.5.5 in combination with Python 3.8, 3.9, 3.10, 3.12 and 3.13. See the branch [python-2.7](/commoncrawl/cc-pyspark/tree/python-2.7) if you want to run the job on Python 2.7 and older Spark versions.
## Get Sample Data
CC-PySpark reads the list of input files from a manifest file.
### Running locally
Spark jobs can be started using `spark-submit` (see [Setup](#setup) above if you have a manual installation of Spark):
```
spark-submit ./server_count.py \
--num_output_partitions 1 --log_level WARN \
./input/test_warc.txt servernames
```
This will count web server names sent in HTTP response headers for the sample WARC data.
The output table can be accessed via SparkSQL, e.g. from the interactive `pyspark` shell.
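
A minimal sketch (assuming the default warehouse location `./spark-warehouse`, the `servernames` table from the run above, and the `(key, val)` output schema used by the example jobs):

```
pyspark
>>> df = spark.read.parquet("spark-warehouse/servernames")
>>> for row in df.sort(df.val.desc()).take(10): print(row)
```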
If the `.py` file for the job you want to debug is runnable (i.e. if it has an `if __name__ == "__main__":` line), you can bypass `spark-submit` and run it directly as a Python script:
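
For example, reusing the server count job and sample input from above (a sketch; it assumes PySpark was installed via pip so that `import pyspark` works without further setup):

```bash
python ./server_count.py --num_output_partitions 1 --log_level WARN ./input/test_warc.txt servernames
```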
Spark will complain if the output directory exists - you may want to add a preprocessing step that deletes the appropriate subdirectory under `spark-warehouse` before each run, e.g. `rm -rf spark-warehouse/servernames`.
> If you have manually installed Spark, you'll need to ensure the pyspark package is on your PYTHONPATH:
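
For instance (a sketch; the py4j archive name under `$SPARK_HOME/python/lib` depends on your Spark release):

```bash
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"
```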
Note that the `run_job` code is still invoked by the Spark Java binary behind the scenes, which normally prevents a debugger from attaching. To debug the `run_job` internals, it's recommended to set up a unit test and debug that; see `test/test_sitemaps_from_robotstxt` for examples of single and batched job tests.
### Running in Spark cluster over large amounts of data
All examples show the available command-line options if called with the parameter `--help` or `-h`, e.g.
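
For instance, using the server count job from above:

```
spark-submit ./server_count.py --help
```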
Below is an example call to count words in 10 WARC records hosted under the `.is` top-level domain, using the `--packages` option:
```
spark-submit \
--packages org.apache.hadoop:hadoop-aws:3.3.2 \
./cc_index_word_count.py \
--input_base_url s3://commoncrawl/ \
```

Some differences between the warcio and FastWARC APIs are hidden from the user.
However, it's recommended that you carefully verify that your custom job implementation works in combination with FastWARC. There are subtle differences between the warcio and FastWARC APIs, including the underlying classes (WARC/HTTP headers and stream implementations). In addition, FastWARC does not support legacy ARC files and does not automatically decode HTTP content and transfer encodings (see [Resiliparse HTTP Tools](https://resiliparse.chatnoir.eu/en/latest/man/parse/http.html#read-chunked-http-payloads)). While content and transfer encodings are already decoded in Common Crawl WARC files, this may not be the case for WARC files from other sources. See also [WARC 1.1 specification, http/https response records](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#http-and-https-schemes).
## Running the Tests
To run the tests in `test/` you will need to add `.` and `test` to the PYTHONPATH:
```bash
PYTHONPATH=$PYTHONPATH:.:test pytest -v test
```
or if you have a manual installation of Spark:
```bash
PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:.:test pytest -v test
```
## Credits
Examples are originally ported from Stephen Merity's [cc-mrjob](https://github.com/commoncrawl/cc-mrjob/) with the following changes and upgrades: