Commit 0ecf48f

1 parent a7d5101 commit 0ecf48f

8 files changed: +728 -2 lines changed

README.md

Lines changed: 96 additions & 2 deletions
@@ -1,2 +1,96 @@
1-
# wayback-go
2-
A wayback machine site downloader
1+
# Wayback Go Downloader
2+
3+
A command-line tool to download websites from the Wayback Machine, rewritten in Go.
4+
5+
## Overview
6+
7+
This program is a Go port of the Ruby-based `wayback-machine-downloader` by hartator (available at [https://github.com/hartator/wayback-machine-downloader](https://github.com/hartator/wayback-machine-downloader)). It downloads archived copies of a given URL from the Internet Archive's Wayback Machine and saves them locally.
8+
9+
## Features
10+
11+
* **Download Entire Websites:** Recursively downloads all files associated with a given URL from the Wayback Machine.
12+
* **Exact URL Download:** Option to download only the exact URL provided, without following links.
13+
* **Timestamp Filtering:** Specify `from` and `to` timestamps to download snapshots within a particular date range.
14+
* **Regex Filtering:** Include or exclude URLs based on regular expressions.
15+
* **All Timestamps:** Download all available timestamps for each file, not just the latest.
16+
* **Concurrency:** Runs multiple downloads concurrently (see `--threads`) for faster downloads.
17+
* **List Only Mode:** Preview the list of files that would be downloaded in JSON format without actually downloading them.
18+
* **Error Handling:** By default, only snapshots archived with HTTP status 200 are requested; an option downloads all files, including those that returned an error.
19+
20+
## Installation
21+
22+
To install `wayback-go`, you need Go installed on your system (Go 1.16 or later is recommended).
23+
24+
1. **Clone the repository:**
25+
```bash
26+
git clone https://github.com/your-username/wayback-go.git # Replace with actual repo URL
27+
cd wayback-go
28+
```
29+
2. **Build the executable:**
30+
```bash
31+
go build -o wayback-go
32+
```
33+
3. **Move to your PATH (optional):**
34+
```bash
35+
sudo mv wayback-go /usr/local/bin/
36+
```
37+
38+
## Usage
39+
40+
```bash
41+
./wayback-go --url <URL> [options]
42+
```
43+
44+
### Options:
45+
46+
* `--url <URL>`: The base URL to download from the Wayback Machine (required).
47+
* `--exact-url`: Download only the exact URL.
48+
* `--dir <directory>`: Directory to save the downloaded files (defaults to `websites/<domain>`).
49+
* `--all-timestamps`: Download all available timestamps for each file.
50+
* `--from <timestamp>`: Download only snapshots captured at or after this timestamp, in `YYYYMMDDhhmmss` form (e.g., `20060102150405`).
51+
* `--to <timestamp>`: Download only snapshots captured at or before this timestamp, in `YYYYMMDDhhmmss` form (e.g., `20060102150405`).
52+
* `--only <regex>`: Only download URLs matching this regex filter.
53+
* `--exclude <regex>`: Exclude URLs matching this regex filter.
54+
* `--all`: Download all files, including those archived with a non-200 status code.
55+
* `--max-pages <number>`: Maximum number of snapshot pages to retrieve from the Wayback Machine API (default: 100).
56+
* `--threads <number>`: Number of concurrent download threads (default: 1).
57+
* `--list`: List file URLs in JSON format without downloading anything.
58+
59+
### Examples:
60+
61+
1. **Download a website:**
62+
```bash
63+
./wayback-go --url https://example.com
64+
```
65+
2. **Download only a specific URL:**
66+
```bash
67+
./wayback-go --url https://example.com/page.html --exact-url
68+
```
69+
3. **Download with a specific output directory:**
70+
```bash
71+
./wayback-go --url https://example.com --dir my_archive
72+
```
73+
4. **Download snapshots within a specific date range:**
74+
```bash
75+
./wayback-go --url https://example.com --from 20200101000000 --to 20201231235959
76+
```
77+
5. **List files in JSON format:**
78+
```bash
79+
./wayback-go --url https://example.com --list
80+
```
81+
6. **Download with 5 concurrent threads:**
82+
```bash
83+
./wayback-go --url https://example.com --threads 5
84+
```
85+
7. **Only download CSS files:**
86+
```bash
87+
./wayback-go --url https://example.com --only "\.css$"
88+
```
89+
90+
## Contributing
91+
92+
Contributions are welcome! Please feel free to open issues or submit pull requests.
93+
94+
## License
95+
96+
This project is licensed under the MIT License. See the `LICENSE` file for details.

archive.go

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
1+
package main
2+
3+
import (
4+
"encoding/json"
5+
"fmt"
6+
"io/ioutil"
7+
"net/http"
8+
"net/url"
9+
"strconv"
10+
)
11+
12+
// getRawListFromAPI fetches a raw list of snapshots from the Wayback Machine CDX API.
13+
func (d *Downloader) getRawListFromAPI(targetURL string, pageIndex int) ([]FileRemoteInfo, error) {
14+
requestURL, err := url.Parse("https://web.archive.org/cdx/search/xd")
15+
if err != nil {
16+
return nil, fmt.Errorf("error parsing base URL: %w", err)
17+
}
18+
19+
params := url.Values{}
20+
params.Add("output", "json")
21+
params.Add("url", targetURL)
22+
23+
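// Request only the timestamp and original URL fields, collapse captures
// that share a content digest, and ask for an uncompressed response.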
24+
params.Add("fl", "timestamp,original")
25+
params.Add("collapse", "digest")
26+
params.Add("gzip", "false")
27+
28+
if !d.All {
29+
params.Add("filter", "statuscode:200")
30+
}
31+
32+
if d.FromTimestamp != 0 {
33+
params.Add("from", strconv.Itoa(d.FromTimestamp))
34+
}
35+
if d.ToTimestamp != 0 {
36+
params.Add("to", strconv.Itoa(d.ToTimestamp))
37+
}
38+
39+
if pageIndex != -1 {
40+
params.Add("page", strconv.Itoa(pageIndex))
41+
}
42+
43+
requestURL.RawQuery = params.Encode()
44+
45+
resp, err := http.Get(requestURL.String())
46+
if err != nil {
47+
return nil, fmt.Errorf("error making API request: %w", err)
48+
}
49+
defer resp.Body.Close()
50+
51+
body, err := ioutil.ReadAll(resp.Body)
52+
if err != nil {
53+
return nil, fmt.Errorf("error reading API response: %w", err)
54+
}
55+
56+
var rawJSON [][]string
57+
err = json.Unmarshal(body, &rawJSON)
58+
if err != nil {
59+
// A parse failure (for example an empty body or a non-JSON error page) is
// treated as "no snapshots"; the error itself is discarded.
60+
return []FileRemoteInfo{}, nil
61+
}
62+
63+
if len(rawJSON) > 0 && len(rawJSON[0]) == 2 && rawJSON[0][0] == "timestamp" && rawJSON[0][1] == "original" {
64+
rawJSON = rawJSON[1:] // Remove header row
65+
}
66+
67+
var snapshots []FileRemoteInfo
68+
for _, item := range rawJSON {
69+
if len(item) == 2 {
70+
snapshots = append(snapshots, FileRemoteInfo{
71+
Timestamp: item[0],
72+
FileURL: item[1],
73+
})
74+
}
75+
}
76+
77+
return snapshots, nil
78+
}
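
The response handling above can be tried in isolation. The sketch below parses the same `[timestamp, original]` row format and drops the header row, but, unlike the committed code, it returns a decoding error to the caller instead of discarding it. `parseCDXRows` and `snapshot` are hypothetical stand-ins for the commit's own function and its `FileRemoteInfo` type, and the sample payload is made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// snapshot stands in for the commit's FileRemoteInfo type (assumption).
type snapshot struct {
	Timestamp string
	FileURL   string
}

// parseCDXRows is a hypothetical helper. It decodes a JSON array of
// [timestamp, original] rows and skips the optional header row.
func parseCDXRows(body []byte) ([]snapshot, error) {
	if len(bytes.TrimSpace(body)) == 0 {
		return nil, nil // an empty body is treated as "no snapshots"
	}
	var rows [][]string
	if err := json.Unmarshal(body, &rows); err != nil {
		return nil, fmt.Errorf("decoding CDX response: %w", err)
	}
	if len(rows) > 0 && len(rows[0]) == 2 &&
		rows[0][0] == "timestamp" && rows[0][1] == "original" {
		rows = rows[1:] // remove header row
	}
	snaps := make([]snapshot, 0, len(rows))
	for _, r := range rows {
		if len(r) == 2 {
			snaps = append(snaps, snapshot{Timestamp: r[0], FileURL: r[1]})
		}
	}
	return snaps, nil
}

func main() {
	sample := []byte(`[["timestamp","original"],
		["20200101000000","http://example.com/"],
		["20200615120000","http://example.com/about.html"]]`)
	snaps, err := parseCDXRows(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range snaps {
		fmt.Println(s.Timestamp, s.FileURL)
	}
}
```

Surfacing the error is a design choice: it lets the caller tell "no captures" apart from an HTML error page or a truncated response, which the committed version reports identically.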
