
Commit 7f50c49

Add JSON import loading API-compatible drop-in JSON modules

- try to load the most performant module first:
  - "orjson" (most performant drop-in replacement, cf. #41)
- if loading fails fall back to:
  - "ujson" ("UltraJSON", proved since the beginning of cc-pyspark)
  - "json" (Python Standard Library)
1 parent 30c913a commit 7f50c49

4 files changed: +15 -4 lines changed

json_importer.py

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+"""Import JSON modules with drop-in compatible API,
+trying modules with faster JSON parsers first: orjson, ujson, json
+Cf. https://github.com/commoncrawl/cc-pyspark/issues/41
+"""
+
+try:
+    import orjson as json
+except ImportError:
+    try:
+        import ujson as json
+    except ImportError:
+        import json
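
The fallback chain above can also be written as a loop. The following is a minimal sketch, not part of this commit: `load_json_module` and `dumps_text` are hypothetical helper names introduced for illustration. It also shows one place where the modules are not fully interchangeable: `orjson.dumps` returns `bytes`, whereas `ujson.dumps` and `json.dumps` return `str`, so callers that serialize may want to normalize the result. `loads` accepts a `str` in all three modules.

```python
import importlib


def load_json_module(candidates=("orjson", "ujson", "json")):
    """Return the first importable module from `candidates` (hypothetical helper)."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module not installed: try the next candidate.
            continue
    raise ImportError("none of the JSON modules could be imported: %r" % (candidates,))


# Same preference order as json_importer.py: orjson, then ujson, then json.
json = load_json_module()


def dumps_text(obj):
    """Serialize `obj` and always return str (hypothetical helper).

    orjson.dumps returns bytes; ujson.dumps and json.dumps return str.
    """
    out = json.dumps(obj)
    return out.decode("utf-8") if isinstance(out, bytes) else out


# Parsing works the same way regardless of which module was loaded.
record = json.loads('{"server": "nginx", "count": 3}')
```

Jobs that only call `json.loads` are unaffected by the difference in `dumps`; the wrapper matters only where serialized output is written as text.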

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@ botocore
 boto3
 requests
 ujson
+orjson
 warcio
 
 # for link extraction and webgraph construction also:

server_count.py

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,5 @@
-import ujson as json
-
 from sparkcc import CCSparkJob
+from json_importer import json
 
 
 class ServerCountJob(CCSparkJob):

wat_extract_links.py

Lines changed: 1 addition & 2 deletions
@@ -2,13 +2,12 @@
 import os
 import re
 
-import ujson as json
-
 from urllib.parse import urljoin, urlparse
 
 from pyspark.sql.types import StructType, StructField, StringType
 
 from sparkcc import CCSparkJob
+from json_importer import json
 
 
 class ExtractLinksJob(CCSparkJob):
