- Crawls https://www.nytimes3xbfgragh.onion/ (a fetch sketch follows this list).
- The search UI will be available at http://localhost:8082.
- An .xlsx file is created at /your_folder_path/html/www.nytimes3xbfgragh.onion.xlsx, listing the analyzed URLs and the URLs still to retrieve (/your_folder_path is the folder mounted in the docker run command below)
- Pages are saved as id.html files in /your_folder_path/html/www.nytimes3xbfgragh.onion/ (their ids are displayed on the search UI results pages)
- The website to crawl and the crawler to use are configured in python/Dockerfile and python/settings.py
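For reference, a minimal sketch of how a page can be fetched through the Tor SOCKS proxy started by docker-compose below, assuming the proxy listens on 127.0.0.1:9050 and that requests is installed with SOCKS support (pip install requests[socks]):

```python
# Minimal sketch: fetch one page through the Tor SOCKS proxy.
# Assumes the proxy from docker-compose listens on 127.0.0.1:9050.
import requests

# socks5h (not socks5) makes Tor resolve the hostname itself,
# which is required for .onion addresses.
PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

resp = requests.get(
    "https://www.nytimes3xbfgragh.onion/", proxies=PROXIES, timeout=60
)
print(resp.status_code, len(resp.text))
```

The socks5h scheme matters: with plain socks5 the hostname is resolved locally, which fails for .onion addresses.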
Start the search UI, the Tor SOCKS proxy, and the web crawler:
git clone https://github.com/Amallyn/torcrawler.git torcrawler
cd torcrawler
docker-compose up -d
cd python
docker build -t nytcrawler .
docker run -v /your_folder_path:/var/www --network host -it --rm --name crawl-nytonion nytcrawler
How to stop the crawler:
Press Ctrl+C.
How to restart the crawler:
docker run -v /your_folder_path:/var/www --network host -it --rm --name crawl-nytonion nytcrawler
How to stop the search UI and the tor socks proxy:
docker-compose down
Development stack (the sketches below illustrate how some of the pieces fit together):
- docker-compose
- Docker
- Manticore
- Python
- Requests
- BeautifulSoup
- PyMySQL
- Main flow diagram added (https://github.com/Amallyn/torcrawler/blob/master/flux.jpeg); drawing it revealed that db_replace_into was missing from the code 🙏 (a possible implementation is sketched below)
- Most classes and files are documented; life cycles can also be found in each class doc, usage, and main()
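To show how Requests and BeautifulSoup fit together in this stack, a short sketch that parses a previously saved page and collects its outgoing links (the file name is illustrative):

```python
# Sketch: parse a saved page with BeautifulSoup and collect its links,
# the kind of work Requests + BeautifulSoup do in this stack.
# The file name is illustrative; crawled pages are saved as id.html.
from bs4 import BeautifulSoup

with open("42.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

title = soup.title.string if soup.title else ""
links = [a["href"] for a in soup.find_all("a", href=True)]
print(title, len(links), "links found")
```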
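And a hedged sketch of what the missing db_replace_into could look like with PyMySQL against Manticore, which speaks the MySQL protocol (the port, index name, and columns are assumptions, not the project's actual schema):

```python
# Hedged sketch of a possible db_replace_into: upsert a crawled page into
# a Manticore real-time index over the MySQL protocol (default port 9306).
# The index name "pages" and its columns are assumptions.
import pymysql

def db_replace_into(doc_id, url, title, content):
    conn = pymysql.connect(host="127.0.0.1", port=9306)
    try:
        with conn.cursor() as cur:
            # REPLACE INTO is Manticore's insert-or-overwrite for RT indexes.
            cur.execute(
                "REPLACE INTO pages (id, url, title, content) "
                "VALUES (%s, %s, %s, %s)",
                (doc_id, url, title, content),
            )
        conn.commit()
    finally:
        conn.close()
```

On the query side, the same index can then be searched with a plain SELECT ... WHERE MATCH('words') statement.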
Classes & files
- python/settings.py
- All settings, such as the www path and the website URL to crawl, are defined here
- TODO: add proxy settings
- Life cycle: see main()
- WeightedLink
- A URL with its weight/priority, date, and notes (a minimal frontier sketch follows this class list)
- link.py
- Life cycle: see main()
- CrawlWorkbook
- Excel .xlsx file tracking the CrawlFrontier's progress
- workbook.py
- Life cycle: see main()
- SearchEngine
- Search engine backed by a Manticore database
- search.py
- Life cycle: see main()
- CrawlFrontier
- TODO: optimize with the Boost C++ library
- frontier.py
- Life cycle: see main()
- Crawler
- crawler.py
- TODO: repair auto resume (likely in the frontier)
- TODO: optimize with the Boost C++ library
- Life cycle: see main()
- NytPage(WebPage)
- Page model for the New York Times crawler
- nytcrawler.py
- TODO: refactor/include in the Frontier, or later in a Middleware or Backend
- Life cycle: see main()
- NytCrawler(GenericCrawler)
- nytcrawler.py
- TODO: refactor/include in the Frontier, or later in a Middleware or Backend
- Life cycle: see main()
- optimize.py
- Optimizes results
- Checks the .xlsx file
- Re-parses the downloaded www/html/www.nytimes3xbfgragh.onion/*.html files
- Life cycle: see main()
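As referenced in the WeightedLink entry above, a minimal sketch of how a WeightedLink record and a priority-ordered frontier could interact (field names follow the class doc; the heap-based ordering is an assumption, not the project's actual implementation):

```python
# Hedged sketch of a WeightedLink-style record and a tiny priority
# frontier. Field names follow the class doc (url, weight/priority,
# date, notes); the heapq ordering is an assumption.
import heapq
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(order=True)
class WeightedLink:
    weight: int                       # lower value = fetched first here
    url: str = field(compare=False)
    date: datetime = field(default_factory=datetime.utcnow, compare=False)
    notes: str = field(default="", compare=False)

frontier = []
heapq.heappush(frontier, WeightedLink(2, "https://www.nytimes3xbfgragh.onion/section/world"))
heapq.heappush(frontier, WeightedLink(1, "https://www.nytimes3xbfgragh.onion/"))

next_link = heapq.heappop(frontier)   # lowest weight first
print(next_link.url)
```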
- Auto resume if the .xlsx file is present
- crc32 replaced by crc32(sha256(url)) (see the sketch after this list)
- Use Boost C++ library optimization
- Ignore lists
- Regexps to ignore pages
- Weighted/prioritized URLs supported, though not yet in full effect
- Use Dockerfile from https://github.com/dperson/torproxy
- Automatic IP change inspired by https://github.com/FrackingAnalysis/PyTorStemPrivoxy (a stem-based sketch follows this list)
- TODO: check for DNS leaks / add Pi-hole or a DNS mirror
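The crc32(sha256(url)) scheme above can be reproduced in a few lines; this sketch shows the idea, not necessarily the project's exact byte handling:

```python
# Sketch: derive an id as crc32(sha256(url)). Whether the project feeds
# crc32 the raw digest or its hex form is an assumption; the point is
# that the sha256 step spreads similar URLs apart before the 32-bit
# reduction.
import hashlib
import zlib

def url_id(url: str) -> int:
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return zlib.crc32(digest)

print(url_id("https://www.nytimes3xbfgragh.onion/"))
```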
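The automatic IP change mentioned above typically works by sending Tor's NEWNYM signal over the control port; a hedged sketch with stem, assuming the control port is reachable on 9051 with cookie or empty-password authentication:

```python
# Hedged sketch: request a fresh Tor circuit (and usually a new exit IP)
# the way PyTorStemPrivoxy-style tools do, via Tor's control port.
# Assumes the control port is exposed on 9051.
from stem import Signal
from stem.control import Controller

def change_identity():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()          # cookie / empty-password auth
        controller.signal(Signal.NEWNYM)   # ask Tor for a new circuit

change_identity()
```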
- Roadmap: Middleware and backend
- No Frontera integration for now
- Code cleanup, variable checks
- Refactor as needed for specific crawling
- Python tests
- Documentation
- Tor v3 URLs
- Debian
- Rewrite the wget step to use SOCKS5
- Crawling over SOCKS5 could also be done with curl instead of wget, e.g. curl --socks5-hostname 127.0.0.1:9050 https://www.nytimes3xbfgragh.onion/ (--socks5-hostname resolves the .onion hostname through Tor, which plain --socks5 does not)
- Rewrite/patch Scrapy to support socks5h
- Use Frontera as the crawl frontier (cf. the Frontera Request example)