Commit 566831a
Support local path (working outside CCF dev server) for querying the index with DuckDb (#14)
1 parent 25af34c commit 566831a

File tree: 3 files changed, +96 −12 lines changed

Makefile

Lines changed: 6 additions & 0 deletions

```diff
@@ -44,6 +44,12 @@ duck_ccf_local_files: build
 	@echo "warning! only works on Common Crawl Foundation's development machine"
 	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="ccf_local_files"
 
+duck_local_files: build
+ifndef LOCAL_DIR
+	$(error LOCAL_DIR is required. Usage: make duck_local_files LOCAL_DIR=/path/to/data)
+endif
+	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="local_files $(LOCAL_DIR)"
+
 duck_cloudfront: build
 	@echo "warning! this might take 1-10 minutes"
 	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="cloudfront"
```
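The `ifndef` guard above makes the target fail fast when `LOCAL_DIR` is missing. A minimal sketch of the same required-variable check in plain POSIX sh (path hypothetical; the mvn command is only echoed here, not executed):

```shell
# Sketch of the Makefile's ifndef guard: abort with a usage message when
# LOCAL_DIR is empty, otherwise show the command make would run.
LOCAL_DIR="${LOCAL_DIR:-/path/to/data}"   # hypothetical default so the sketch runs standalone
if [ -z "$LOCAL_DIR" ]; then
    echo "LOCAL_DIR is required. Usage: make duck_local_files LOCAL_DIR=/path/to/data" >&2
    exit 1
fi
echo "mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args=\"local_files $LOCAL_DIR\""
```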

README.md

Lines changed: 41 additions & 2 deletions

````diff
@@ -791,11 +791,50 @@ The program then writes that one record into a local Parquet file, does a second
 ### Bonus: download a full crawl index and query with DuckDB
 
 If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run
+All of these scripts run the same SQL query and should return the same record (written as a parquet file).
+
+```shell
+mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
+aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ 'crawl=CC-MAIN-2024-22/subset=warc'
+```
+
+> [!IMPORTANT]
+> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
+
+If, by any other chance, you don't have access through the AWS CLI:
+
+```shell
+mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
+cd 'crawl=CC-MAIN-2024-22/subset=warc'
+
+wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/cc-index-table.paths.gz
+gunzip cc-index-table.paths.gz
+
+grep 'subset=warc' cc-index-table.paths | \
+  awk '{print "https://data.commoncrawl.org/" $1, $1}' | \
+  xargs -n 2 -P 10 sh -c '
+    echo "Downloading: $2"
+    mkdir -p "$(dirname "$2")" &&
+    wget -O "$2" "$1"
+  ' _
+
+rm cc-index-table.paths
+cd -
+```
 
+The structure should be something like this:
 ```shell
-aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ .'
+tree my_data
+my_data
+└── crawl=CC-MAIN-2024-22
+    └── subset=warc
+        ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
+        ├── part-00001-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
+        ├── part-00002-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
 ```
 
+Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
+
 > [!IMPORTANT]
 > If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
 
@@ -821,7 +860,7 @@ We make more datasets available than just the ones discussed in this Whirlwind T
 
 Common Crawl regularly releases Web Graphs which are graphs describing the structure and connectivity of the web as captured in the crawl releases. We provide two levels of graph: host-level and domain-level. Both are available to download [from our website](https://commoncrawl.org/web-graphs).
 
-The host-level graph describes links between pages on the web at the level of hostnames (e.g. `en.wikipedia.org`). The domain-level graph aggregates this information in the host-level graph, describing links at the pay-level domain (PLD) level (based on the public suffix list maintained on [publicsuffix.org](publicsuffix.org)). The PLD is the subdomain directly under the top-level domain (TLD): e.g. for `en.wikipedia.org`, the TLD would be `.org` and the PLD would be `wikipedia.org`.
+The host-level graph describes links between pages on the web at the level of hostnames (e.g. `en.wikipedia.org`). The domain-level graph aggregates this information in the host-level graph, describing links at the pay-level domain (PLD) level (based on the public suffix list maintained on [publicsuffix.org](https://publicsuffix.org)). The PLD is the subdomain directly under the top-level domain (TLD): e.g. for `en.wikipedia.org`, the TLD would be `.org` and the PLD would be `wikipedia.org`.
 
 As an example, let's look at the [Web Graph release for March, April and May 2025](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-mar-apr-may/index.html). This page provides links to download data associated with the host- and domain-level graph for those months. The key files needed to construct the graphs are the files containing the vertices or nodes (the hosts or domains), and the files containing the edges (the links between the hosts/domains). These are currently the top two links in each of the tables.
````
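The `local_files` mode simply walks the given directory recursively and collects every `*.parquet` file it finds, failing if there are none. A self-contained shell sketch of that discovery step (directory layout and file names hypothetical, built in a temp dir so the sketch runs standalone):

```shell
# Recreate the expected crawl=/subset= layout, then discover parquet files
# the way the local_files mode does: recursive walk, filter on .parquet,
# error out if nothing matched.
dir=$(mktemp -d)
mkdir -p "$dir/crawl=CC-MAIN-2024-22/subset=warc"
touch "$dir/crawl=CC-MAIN-2024-22/subset=warc/part-00000-sketch.c000.gz.parquet"

files=$(find "$dir" -type f -name '*.parquet')
if [ -z "$files" ]; then
    echo "No parquet files found in: $dir" >&2
    exit 1
fi
echo "$files"
rm -rf "$dir"
```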

src/main/java/org/commoncrawl/whirlwind/Duck.java

Lines changed: 49 additions & 10 deletions

```diff
@@ -39,7 +39,7 @@ public class Duck {
 	private static final DateTimeFormatter TIMESTAMP_FORMATTER = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
 
 	public enum Algorithm {
-		CCF_LOCAL_FILES("ccf_local_files"), CLOUDFRONT("cloudfront");
+		CCF_LOCAL_FILES("ccf_local_files"), CLOUDFRONT("cloudfront"), LOCAL_FILES("local_files");
 
 		private final String name;
 
@@ -114,8 +114,13 @@ public static void printRowAsKvList(ResultSet rs, PrintStream out) throws SQLExc
 	/**
 	 * Gets the list of parquet files to query based on the algorithm.
 	 */
-	public static List<String> getFiles(Algorithm algo, String crawl) throws IOException {
+	public static List<String> getFiles(Algorithm algo, String crawl, String localPrefix) throws IOException {
 		switch (algo) {
+		case LOCAL_FILES: {
+			Path indexPath = Path.of(localPrefix);
+			return getLocalParquetFiles(indexPath);
+		}
+
 		case CCF_LOCAL_FILES: {
 			Path indexPath = Path.of("/home/cc-pds/commoncrawl/cc-index/table/cc-main/warc", "crawl=" + crawl,
 					"subset=warc");
@@ -143,6 +148,23 @@ public static List<String> getFiles(Algorithm algo, String crawl) throws IOExcep
 		}
 	}
 
+	private static List<String> getLocalParquetFiles(Path indexPath) throws IOException {
+		if (!Files.isDirectory(indexPath)) {
+			System.err.println("Directory not found: " + indexPath);
+			System.exit(1);
+		}
+
+		List<String> files = Files.walk(indexPath).map(Path::toString).filter(string -> string.endsWith(".parquet"))
+				.collect(Collectors.toList());
+
+		if (files.isEmpty()) {
+			System.err.println("No parquet files found in: " + indexPath);
+			System.exit(1);
+		}
+
+		return files;
+	}
+
 	private static List<String> getLocalParquetFiles(Path indexPath, String prefix, String crawl) throws IOException {
 		if (!Files.isDirectory(indexPath)) {
 			printIndexDownloadAdvice(prefix, crawl);
@@ -190,6 +212,7 @@ private static ResultSet executeWithRetry(Statement stmt, String sql) throws SQL
 	public static void main(String[] args) {
 		String crawl = "CC-MAIN-2024-22";
 		Algorithm algo = Algorithm.CLOUDFRONT;
+		String localPrefix = "/home/cc-pds/commoncrawl/cc-index/table/cc-main/warc";
 
 		if (args.length > 0) {
 			if ("help".equalsIgnoreCase(args[0]) || "--help".equals(args[0]) || "-h".equals(args[0])) {
@@ -201,20 +224,30 @@ public static void main(String[] args) {
 			System.out.println("Using algorithm: " + algo.getName());
 		}
 
+		if (algo == Algorithm.LOCAL_FILES) {
+			if (args.length < 2) {
+				System.err.println("Error: local_files algorithm requires a directory argument.");
+				printUsage();
+				System.exit(1);
+			}
+			localPrefix = args[1];
+		}
+
 		try {
-			run(algo, crawl);
+			run(algo, crawl, localPrefix);
 		} catch (Exception e) {
 			System.err.println("Error: " + e.getMessage());
 			printUsage();
 			System.exit(1);
 		}
 	}
 
-	public static void run(Algorithm algo, String crawl) throws IOException, SQLException, InterruptedException {
+	public static void run(Algorithm algo, String crawl, String localPrefix)
+			throws IOException, SQLException, InterruptedException {
 		// Ensure stdout uses UTF-8
 		PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
 
-		List<String> files = getFiles(algo, crawl);
+		List<String> files = getFiles(algo, crawl, localPrefix);
 		String filesList = files.stream().map(f -> "'" + f + "'").collect(Collectors.joining(", "));
 
 		// Use in-memory DuckDB
@@ -230,15 +263,16 @@ public static void run(Algorithm algo, String crawl) throws IOException, SQLExce
 
 		// Count total records
 		out.printf("Total records for crawl: %s%n", crawl);
-		try (ResultSet rs = executeWithRetry(stmt, "SELECT COUNT(*) as cnt FROM ccindex")) {
+		try (ResultSet rs = executeWithRetry(stmt,
+				"SELECT COUNT(*) as cnt FROM ccindex " + "WHERE subset = 'warc' AND crawl = '" + crawl + "'")) {
 			if (rs.next()) {
 				out.println(rs.getLong("cnt"));
 			}
 		}
 
 		// Query for our specific row
-		String selectQuery = "" + "SELECT * FROM ccindex WHERE subset = 'warc' " + "AND crawl = 'CC-MAIN-2024-22' "
-				+ "AND url_host_tld = 'org' " + "AND url_host_registered_domain = 'wikipedia.org' "
+		String selectQuery = "SELECT * FROM ccindex WHERE subset = 'warc' AND crawl = '" + crawl + "' "
+				+ "AND url_host_tld = 'org' AND url_host_registered_domain = 'wikipedia.org' "
 				+ "AND url = 'https://an.wikipedia.org/wiki/Escopete'";
 
 		out.println("Our one row:");
@@ -305,14 +339,19 @@ private static void printResultSet(ResultSet rs, PrintStream out) throws SQLExce
 	}
 
 	private static void printUsage() {
-		System.err.println("Usage: Duck [algorithm]");
+		System.err.println("Usage: Duck [algorithm] [local-directory]");
 		System.err.println();
 		System.err.println("Query Common Crawl index using DuckDB.");
 		System.err.println();
 		System.err.println("Algorithms:");
-		System.err.println("  ccf_local_files  Use local parquet files from /home/cc-pds/commoncrawl/...");
+		System.err.println("  local_files      Use local parquet files (from specified local directory)");
+		System.err.println(
+				"  ccf_local_files  Use local parquet files (default: /home/cc-pds/commoncrawl/cc-index/table/cc-main/warc)");
 		System.err.println("  cloudfront       Use CloudFront URLs (requires <crawl>.warc.paths.gz file)");
 		System.err.println();
+		System.err.println("Arguments:");
+		System.err.println("  local-directory  Local directory prefix for 'local_files' algorithm");
+		System.err.println();
 		System.err.println("Options:");
 		System.err.println("  help, --help, -h  Show this help message");
 	}
```
