Commits
dcb5603
Add simplistic Dockerfile
Fusl Mar 5, 2019
17507d8
added libffi-dev libressl-dev packages
Fusl Jan 4, 2021
927a68b
added patch package
Fusl Jan 4, 2021
59da795
lock python version to 3.7
Fusl Jan 4, 2021
33cd061
Merge branch 'master' into feature/dockerfile
acrois Jul 20, 2021
172479e
Parameterize dockerfile, separate volume for data vs installation dir…
acrois Jul 21, 2021
6fb46f7
Add Docker usage to README.md
acrois Jul 22, 2021
067dfdd
Add docker logs usage, attach to container, pause and resume crawl
acrois Jul 22, 2021
9fc98eb
Add method to access running container (not PID 1)
acrois Jul 22, 2021
cbaf124
Update gs-server container name in documentation, update documentatio…
acrois Aug 4, 2021
e4fb82b
Update README.md
acrois Aug 6, 2021
ce419ad
Quick start usage, network isolation, remove need for --dir parameter
acrois Aug 7, 2021
e66de97
Merge branch 'ArchiveTeam:master' into feature/dockerfile
acrois Nov 29, 2021
ded3f49
Update Dockerfile
acrois Aug 14, 2022
f233c3f
Merge branch 'ArchiveTeam:master' into feature/dockerfile
acrois Aug 14, 2022
b434f39
Set executable bit in entrypoint.sh, rename grab-network to gs-networ…
acrois Aug 15, 2022
74feab8
Executable bit
acrois Aug 15, 2022
84d0236
Rename grab-network to gs-network, update documentation for single co…
acrois Aug 15, 2022
ec4c5d4
Executable bit
acrois Aug 15, 2022
6570b0c
Merge branch 'feature/dockerfile' of https://github.com/acrois/grab-s…
acrois Aug 15, 2022
44ae2fd
Add .gitattributes for LF preservation on Windows, update Python and …
acrois Aug 15, 2022
de68a6e
Use su-exec for step-down from root to grab-site user, update Docker …
acrois Aug 15, 2022
d4fbb98
Update documentation for more consistent Docker first-time run, adjus…
acrois Aug 15, 2022
c48f129
Adjust pip installation parameters
acrois Aug 15, 2022
a6210db
Update README to be easier to follow
acrois Aug 15, 2022
e8f82d4
Additional formatting and context to Docker README
acrois Aug 15, 2022
7e20f21
Document Debian 11 Docker daemon setup, configuration option usage an…
acrois Aug 16, 2022
b7df3bf
Merge branch 'ArchiveTeam:master' into feature/dockerfile
acrois May 21, 2023
534f7dc
Merge branch 'ArchiveTeam:master' into feature/dockerfile
acrois Jan 8, 2024
3ee0d2f
feat: Docker build and release
acrois Jan 9, 2024
3 changes: 3 additions & 0 deletions .dockerignore
@@ -0,0 +1,3 @@
__pycache__
Dockerfile
data
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
__pycache__
data
53 changes: 53 additions & 0 deletions Dockerfile
@@ -0,0 +1,53 @@
ARG PYTHON_VERSION=3.7
ARG ALPINE_VERSION=3.13

FROM python:${PYTHON_VERSION}-alpine${ALPINE_VERSION}

WORKDIR /app
VOLUME [ "/data" ]

ENV GRAB_SITE_INTERFACE=0.0.0.0
ENV GRAB_SITE_PORT=29000
ENV GRAB_SITE_HOST=127.0.0.1
EXPOSE 29000

RUN apk add --no-cache \
git \
gcc \
libxml2-dev \
musl-dev \
libxslt-dev \
g++ \
re2-dev \
libffi-dev \
openssl-dev \
patch \
cargo \
&& ln -s /usr/include/libxml2/libxml /usr/include/libxml \
&& addgroup -S grab-site \
&& adduser -S -G grab-site grab-site \
&& chown -R grab-site:grab-site $(pwd) \
&& mkdir -p /data \
&& chown -R grab-site:grab-site /data

USER grab-site:grab-site
ENV PATH="/app:$PATH"
ENTRYPOINT [ "entrypoint.sh" ]
CMD [ "gs-server" ]

# TODO: resolve dependencies before loading library code to take advantage of build caching
# setup.py requires libgrabsite/__init__.py (__version__ property) to work

COPY --chown=grab-site:grab-site . .

RUN pip install . \
&& chmod +x entrypoint.sh

WORKDIR /data

# docker build -t grab-site:latest .
# docker run --rm -it --entrypoint sh grab-site:latest
# docker network create -d bridge grab-network
# docker run --net=grab-network --name=gs-server -d -p 29000:29000 --restart=unless-stopped grab-site:latest
# docker run --net=grab-network --rm -d -e GRAB_SITE_HOST=gs-server -v ./data:/data:rw grab-site:latest grab-site https://www.example.com/
# docker run --net=grab-network --rm -d -e GRAB_SITE_HOST=gs-server -v C:\projects\grab-site\data:/data:rw grab-site:latest grab-site https://www.example.com/
150 changes: 150 additions & 0 deletions README.md
@@ -42,6 +42,7 @@ please [file an issue](https://github.com/ArchiveTeam/grab-site/issues) - thank
- [Upgrade an existing install](#upgrade-an-existing-install)
- [Usage](#usage)
- [`grab-site` options, ordered by importance](#grab-site-options-ordered-by-importance)
- [Docker](#docker)
- [Warnings](#warnings)
- [Tips for specific websites](#tips-for-specific-websites)
- [Changing ignores during the crawl](#changing-ignores-during-the-crawl)
@@ -372,6 +373,155 @@ Options can come before or after the URL.

* `--help`: print help text.

### Docker

grab-site and gs-server can both run in Docker containers. To install Docker, see [Get Docker](https://docs.docker.com/get-docker/).

#### Quick Start

```
# Build application & environment
docker build -t grab-site:latest .

# Create a network for inter-node communication and name resolution on Linux hosts
docker network create -d bridge grab-network

# Start gs-server dashboard
docker run --net=grab-network --name=gs-server -d -p 29000:29000 --restart=unless-stopped grab-site:latest

# Create a grab-site instance that crawls a website and stores it in ./data/example.com
# Ensure that ./data exists and container has write privileges on volume
docker run --net=grab-network --rm -d -e GRAB_SITE_HOST=gs-server -v ./data:/data:rw grab-site:latest grab-site https://www.example.com/ --delay=100-250 --concurrency=4 --no-offsite-links
```

#### Docker Build

To build the application, including all dependencies, run:

```
docker build -t grab-site:latest .
```

##### Simple container access

You can use the system's shell to inspect files and run the programs:

```
docker run --rm -it --entrypoint sh grab-site:latest
```
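
Because the image's entrypoint simply executes whatever arguments it receives, you can also run a single command in a throwaway container without opening a shell. The sketch below wraps that in a small helper; the name `gs_run` is an assumption made for illustration and is not part of grab-site.

```shell
# Hypothetical helper (not part of grab-site): run one command in a
# temporary container that is removed as soon as the command exits.
gs_run() {
    docker run --rm grab-site:latest "$@"
}

# Example: print grab-site's help text.
#   gs_run grab-site --help
```

This is handy for quick checks, since nothing is left behind after the command finishes.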

#### Docker Network

grab-site and gs-server communicate with each other over a dedicated Docker network, isolated from containers that are not attached to it.

The following command creates a Docker network called "grab-network" that our gs-server and grab-site instances use to find and talk to each other.

```
docker network create -d bridge grab-network
```

We will use this "grab-network" later in `docker run` commands with the `--net="grab-network"` flag.
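
Running `docker network create` a second time with the same name fails because the network already exists, so a setup script may want to check first. A minimal sketch, assuming a helper named `ensure_network` (a hypothetical name, not something grab-site provides):

```shell
# Hypothetical helper: create the bridge network only if it is missing.
ensure_network() {
    name="$1"
    # `docker network inspect` exits non-zero when the network does not exist.
    if docker network inspect "$name" >/dev/null 2>&1; then
        echo "network $name already exists"
    else
        docker network create -d bridge "$name" >/dev/null
        echo "network $name created"
    fi
}

# Example:
#   ensure_network grab-network
```

This keeps the setup step safe to re-run.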

#### Run gs-server on Docker

Run a container named "gs-server" to host the dashboard and for grab-site instances to connect to:

```
docker run --net=grab-network --name=gs-server -d -p 29000:29000 --restart=unless-stopped grab-site:latest
```

The `-p` parameter forwards host port 29000 to container port 29000, so the dashboard is available at [http://localhost:29000](http://localhost:29000).
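
If host port 29000 is already taken, or you would rather the dashboard not be reachable from other machines, you can change the host side of the `-p` mapping. A minimal sketch, assuming a helper named `start_gs_server` (hypothetical) and that binding to 127.0.0.1 on the host suits your setup:

```shell
# Hypothetical helper: start gs-server with a configurable host port,
# published on the host's loopback interface only.
start_gs_server() {
    host_port="${1:-29000}"   # defaults to the same port as the container
    docker run --net=grab-network --name=gs-server -d \
        -p "127.0.0.1:${host_port}:29000" \
        --restart=unless-stopped grab-site:latest
}

# Example: serve the dashboard on host port 8080.
#   start_gs_server 8080
```

Only the host side changes here. grab-site containers on grab-network still reach the server by its container name on port 29000.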

##### View logs

To tail the gs-server instance:

```
docker logs -f gs-server
```

##### Attach to process

You can attach local STDIN/STDOUT/STDERR to your running gs-server instance:

```
docker attach gs-server
```

You can exit by pressing CTRL-p followed by CTRL-q, as documented [here](https://docs.docker.com/engine/reference/commandline/attach/).

##### Access container

To enter a container with a running process (either gs-server or grab-site):

```
docker exec -it gs-server sh
```

#### Run grab-site on Docker

The following commands download example.com into a local directory named "data". The volume path syntax differs between host operating systems, so please review your paths before running any command!

##### Linux host

From any common shell:
```
docker run --net=grab-network --rm -d -e GRAB_SITE_HOST=gs-server -v ./data:/data:rw grab-site:latest grab-site https://www.example.com/
```

##### Windows host

From a PowerShell window:
```
docker run --net=grab-network --rm -d -e GRAB_SITE_HOST=gs-server -v C:\projects\grab-site\data:/data:rw grab-site:latest grab-site https://www.example.com/
```

Note: Windows offers several ways to share files with containers. This example uses the legacy full-path volume sharing, which can be slower if you are using WSL2.

##### grab-site processes

When you start a container with `docker run -d`, Docker prints the container's unique ID. You can use this ID to follow the logs.

To name your container, pass `--name` to the `docker run` command for grab-site. If you omit it, Docker assigns a random name, which appears in the container list.

Show all containers:

```
docker ps -a
```
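
A predictable name makes the later `docker logs`, `docker attach` and `docker pause` commands easier to type. The sketch below derives one from the URL being crawled; the helper `container_name_for_url` and the `gs-` prefix are assumptions for illustration only, and URLs that embed credentials are not handled.

```shell
# Hypothetical helper: turn a URL into a valid container name.
# Docker container names may contain only letters, digits, '_', '.' and '-'.
container_name_for_url() {
    # Drop the scheme, then the path/query/fragment, then any port.
    host=$(printf '%s' "$1" | sed \
        -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' \
        -e 's|[/?#].*$||' \
        -e 's|:.*$||')
    # Replace any remaining disallowed character with '-'.
    printf 'gs-%s\n' "$(printf '%s' "$host" | tr -c 'a-zA-Z0-9_.-' '-')"
}

# Example:
#   url=https://www.example.com/
#   docker run --net=grab-network --rm -d --name="$(container_name_for_url "$url")" \
#       -e GRAB_SITE_HOST=gs-server -v ./data:/data:rw grab-site:latest grab-site "$url"
```

With a name assigned this way, the commands in the following sections accept that name in place of the container ID.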

##### View logs

Pass the container ID or name to `docker logs`:

```
docker logs -f ead9034470ed
```

##### Attach to process

You can attach local STDIN/STDOUT/STDERR to your running grab-site instance:

```
docker attach ead9034470ed
```

You can exit by pressing CTRL-p followed by CTRL-q, as documented [here](https://docs.docker.com/engine/reference/commandline/attach/).

##### Pause a crawl

You can pause a crawl by using [docker pause](https://docs.docker.com/engine/reference/commandline/pause/):

```
docker pause ead9034470ed
```

Resume it using [docker unpause](https://docs.docker.com/engine/reference/commandline/unpause/):

```
docker unpause ead9034470ed
```
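
Since pausing and resuming take the same container ID or name, a script can flip between the two states by first asking Docker whether the container is paused. A sketch, with `toggle_pause` being a made-up helper name rather than anything shipped with grab-site:

```shell
# Hypothetical helper: pause a running crawl, or resume a paused one.
toggle_pause() {
    container="$1"
    # `docker inspect` prints "true" or "false" for the .State.Paused field.
    if [ "$(docker inspect -f '{{.State.Paused}}' "$container")" = "true" ]; then
        docker unpause "$container"
    else
        docker pause "$container"
    fi
}

# Example:
#   toggle_pause ead9034470ed
```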

### Warnings

If you pay no attention to your crawls, a crawl may head down some infinite bot
6 changes: 6 additions & 0 deletions entrypoint.sh
@@ -0,0 +1,6 @@
#!/usr/bin/env sh
set -eax

# TODO set docker default parameters (if not set)

exec "$@"