- Crawls https://www.nytimes3xbfgragh.onion/ (a fetch sketch follows this list).
- The search UI will be available at http://localhost:8082.
- An .xlsx file is created at /your_folder_path/html/www.nytimes3xbfgragh.onion.xlsx, listing the analyzed URLs and the URLs still to retrieve (/your_folder_path is the folder mounted in the docker run command below)
- Pages are saved as id.html files in /your_folder_path/html/www.nytimes3xbfgragh.onion/ (their ids are displayed on the search UI results pages)
- The website to crawl and the crawler to use are configured in python/Dockerfile and python/settings.py
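For reference, a minimal sketch of how a page can be fetched through the Tor SOCKS proxy started by docker-compose below, assuming the proxy listens on 127.0.0.1:9050 and that requests is installed with SOCKS support (pip install requests[socks]):

```python
# Minimal sketch: fetch one page through the Tor SOCKS proxy.
# Assumes the proxy from docker-compose listens on 127.0.0.1:9050.
import requests

# socks5h (not socks5) makes Tor resolve the hostname itself,
# which is required for .onion addresses.
PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

resp = requests.get(
    "https://www.nytimes3xbfgragh.onion/", proxies=PROXIES, timeout=60
)
print(resp.status_code, len(resp.text))
```

The socks5h scheme matters: with plain socks5 the hostname is resolved locally, which fails for .onion addresses.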
Start the search UI, the Tor SOCKS proxy, and the web crawler:
git clone https://github.com/Amallyn/torcrawler.git torcrawler
cd torcrawler
docker-compose up -d
cd python
docker build -t nytcrawler .
docker run -v /your_folder_path:/var/www --network host -it --rm --name crawl-nytonion nytcrawler
How to stop the crawler:
Press Ctrl+C.
How to restart the crawler:
docker run -v /your_folder_path:/var/www --network host -it --rm --name crawl-nytonion nytcrawler
How to stop the search UI and the tor socks proxy:
docker-compose down
Development stack (the sketches below illustrate how some of the pieces fit together):
- docker-compose
- Docker
- Manticore
- Python
- Requests
- BeautifulSoup
- PyMySQL
- Main flow diagram added (https://github.com/Amallyn/torcrawler/blob/master/flux.jpeg); drawing it revealed that db_replace_into was missing from the code 🙏 (a possible implementation is sketched below)
- Most classes and files are documented; life cycles can also be found in each class doc, usage, and main()
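To show how Requests and BeautifulSoup fit together in this stack, a short sketch that parses a previously saved page and collects its outgoing links (the file name is illustrative):

```python
# Sketch: parse a saved page with BeautifulSoup and collect its links,
# the kind of work Requests + BeautifulSoup do in this stack.
# The file name is illustrative; crawled pages are saved as id.html.
from bs4 import BeautifulSoup

with open("42.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

title = soup.title.string if soup.title else ""
links = [a["href"] for a in soup.find_all("a", href=True)]
print(title, len(links), "links found")
```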
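And a hedged sketch of what the missing db_replace_into could look like with PyMySQL against Manticore, which speaks the MySQL protocol (the port, index name, and columns are assumptions, not the project's actual schema):

```python
# Hedged sketch of a possible db_replace_into: upsert a crawled page into
# a Manticore real-time index over the MySQL protocol (default port 9306).
# The index name "pages" and its columns are assumptions.
import pymysql

def db_replace_into(doc_id, url, title, content):
    conn = pymysql.connect(host="127.0.0.1", port=9306)
    try:
        with conn.cursor() as cur:
            # REPLACE INTO is Manticore's insert-or-overwrite for RT indexes.
            cur.execute(
                "REPLACE INTO pages (id, url, title, content) "
                "VALUES (%s, %s, %s, %s)",
                (doc_id, url, title, content),
            )
        conn.commit()
    finally:
        conn.close()
```

On the query side, the same index can then be searched with a plain SELECT ... WHERE MATCH('words') statement.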
Classes & files
- python/settings.py
- All settings, such as the www path and the website URL to crawl, are defined here
- TODO: add proxy settings
- Life cycle: see main()
- WeightedLink
- A URL with its weight/priority, date, and notes (a minimal frontier sketch follows this class list)
- link.py
- Life cycle: see main()
- CrawlWorkbook
- Excel .xlsx file tracking the CrawlFrontier's progress
- workbook.py
- Life cycle: see main()
- SearchEngine
- Search engine backed by a Manticore database
- search.py
- Life cycle: see main()
- CrawlFrontier
- TODO: optimize with the Boost C++ library
- frontier.py
- Life cycle: see main()
- Crawler
- crawler.py
- TODO: repair auto resume (likely in the frontier)
- TODO: optimize with the Boost C++ library
- Life cycle: see main()
- NytPage(WebPage)
- Page model for the New York Times crawler
- nytcrawler.py
- TODO: refactor/include in the Frontier, or later in a Middleware or Backend
- Life cycle: see main()
- NytCrawler(GenericCrawler)
- nytcrawler.py
- TODO: refactor/include in the Frontier, or later in a Middleware or Backend
- Life cycle: see main()
- optimize.py
- Optimizes results
- Checks the .xlsx file
- Re-parses the downloaded www/html/www.nytimes3xbfgragh.onion/*.html files
- Life cycle: see main()
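As referenced in the WeightedLink entry above, a minimal sketch of how a WeightedLink record and a priority-ordered frontier could interact (field names follow the class doc; the heap-based ordering is an assumption, not the project's actual implementation):

```python
# Hedged sketch of a WeightedLink-style record and a tiny priority
# frontier. Field names follow the class doc (url, weight/priority,
# date, notes); the heapq ordering is an assumption.
import heapq
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(order=True)
class WeightedLink:
    weight: int                       # lower value = fetched first here
    url: str = field(compare=False)
    date: datetime = field(default_factory=datetime.utcnow, compare=False)
    notes: str = field(default="", compare=False)

frontier = []
heapq.heappush(frontier, WeightedLink(2, "https://www.nytimes3xbfgragh.onion/section/world"))
heapq.heappush(frontier, WeightedLink(1, "https://www.nytimes3xbfgragh.onion/"))

next_link = heapq.heappop(frontier)   # lowest weight first
print(next_link.url)
```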
- Auto resume if the .xlsx file is present
- crc32 replaced by crc32(sha256(url)) (see the sketch after this list)
- Use Boost C++ library optimization
- Ignore lists
- Regexps to ignore pages
- Weighted/prioritized URLs supported, though not yet in full effect
- Use Dockerfile from https://github.com/dperson/torproxy
- Automatic IP change inspired by https://github.com/FrackingAnalysis/PyTorStemPrivoxy (a stem-based sketch follows this list)
- TODO: check for DNS leaks / add Pi-hole or a DNS mirror
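The crc32(sha256(url)) scheme above can be reproduced in a few lines; this sketch shows the idea, not necessarily the project's exact byte handling:

```python
# Sketch: derive an id as crc32(sha256(url)). Whether the project feeds
# crc32 the raw digest or its hex form is an assumption; the point is
# that the sha256 step spreads similar URLs apart before the 32-bit
# reduction.
import hashlib
import zlib

def url_id(url: str) -> int:
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return zlib.crc32(digest)

print(url_id("https://www.nytimes3xbfgragh.onion/"))
```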
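The automatic IP change mentioned above typically works by sending Tor's NEWNYM signal over the control port; a hedged sketch with stem, assuming the control port is reachable on 9051 with cookie or empty-password authentication:

```python
# Hedged sketch: request a fresh Tor circuit (and usually a new exit IP)
# the way PyTorStemPrivoxy-style tools do, via Tor's control port.
# Assumes the control port is exposed on 9051.
from stem import Signal
from stem.control import Controller

def change_identity():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()          # cookie / empty-password auth
        controller.signal(Signal.NEWNYM)   # ask Tor for a new circuit

change_identity()
```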
- Roadmap: Middleware and backend
- No Frontera integration for now
- Code cleanup, variable checks
- Refactor as needed for specific crawling
- Python tests
- Documentation
- Tor v3 URLs
- Debian
- Rewrite the wget step to use SOCKS5
- Crawling over SOCKS5 could also be done with curl instead of wget, e.g. curl --socks5-hostname 127.0.0.1:9050 https://www.nytimes3xbfgragh.onion/ (--socks5-hostname resolves the .onion hostname through Tor, which plain --socks5 does not)
- Rewrite/patch Scrapy to support socks5h
- Use Frontera as the crawl frontier (cf. the Frontera Request example)