Amallyn/torcrawler

Crawler compatible with socks5h proxy

Start the search UI, the Tor SOCKS proxy and the web crawler:

  git clone https://github.com/Amallyn/torcrawler.git torcrawler
  cd torcrawler
  docker-compose up -d
  cd python
  docker build -t nytcrawler .
  docker run -v /your_folder_path:/var/www --network host -it --rm --name crawl-nytonion nytcrawler

How to stop the crawler:

  Ctrl+C (in the terminal where the crawler container is running)

How to restart the crawler:

  docker run -v /your_folder_path:/var/www --network host -it --rm --name crawl-nytonion nytcrawler

How to stop the search UI and the Tor SOCKS proxy:

  docker-compose down

Development stack:

  • docker-compose
  • docker
  • manticore
  • python
  • requests
  • beautifulsoup
  • pymysql
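
As an illustration of how requests can reach .onion sites through the Tor SOCKS proxy, here is a minimal sketch. It assumes Tor listens on 127.0.0.1:9050 and that requests is installed with SOCKS support (requests[socks]); the helper names build_proxies, is_onion and fetch are hypothetical and are not taken from this repository. The socks5h scheme matters because it asks the proxy to resolve hostnames, which .onion addresses need.

```python
from urllib.parse import urlsplit

# Assumption: Tor's SOCKS port on the local host (9050 is Tor's usual default).
TOR_SOCKS_HOST = "127.0.0.1"
TOR_SOCKS_PORT = 9050


def build_proxies(host: str = TOR_SOCKS_HOST, port: int = TOR_SOCKS_PORT) -> dict:
    """Return a requests-style proxies mapping that uses the socks5h scheme.

    With socks5h the proxy resolves hostnames instead of the local resolver.
    """
    proxy_url = f"socks5h://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def is_onion(url: str) -> bool:
    """Return True if the URL's host ends in .onion."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".onion")


def fetch(url: str, timeout: float = 30.0) -> str:
    """Fetch a page through the Tor proxy and return its body as text."""
    import requests  # imported lazily so the helpers above work without it

    response = requests.get(url, proxies=build_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling fetch("https://www.nytimes3xbfgragh.onion/") would only work while the Tor proxy container is up.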

Life cycle

Classes & files

  • python/settings.py
    • Defines all settings, such as the www path and the URL of the website to crawl
    • To do: add proxy settings
    • Life cycle: see main()
  • WeightedLink
    • A URL with its weight/priority, date and notes
    • link.py
    • Life cycle: see main()
  • CrawlWorkbook
    • Excel .xlsx file that records the CrawlFrontier progress
    • workbook.py
    • Life cycle: see main()
  • SearchEngine
    • Search engine backed by a Manticore database
    • search.py
    • Life cycle: see main()
  • CrawlFrontier
    • To do: optimize with the Boost C++ library
    • frontier.py
    • Life cycle: see main()
  • Crawler
    • crawler.py
    • To do: repair auto resume, likely in the frontier
    • To do: optimize with the Boost C++ library
    • Life cycle: see main()
  • NytPage(WebPage)
    • New York Times Crawler
    • nytcrawler.py
    • Refactor/include in Frontier or later in Middleware or Backend
    • Life cycle: see main()
  • NytCrawler(GenericCrawler)
    • nytcrawler.py
    • Refactor/include in Frontier or later in Middleware or Backend
    • Life cycle: see main()
  • optimize.py
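
To make the relationship between a weighted link and a crawl frontier concrete, here is a minimal sketch of a priority-ordered frontier. SketchLink and SketchFrontier are hypothetical stand-ins, not the project's WeightedLink and CrawlFrontier classes. The sketch also assumes that a lower weight means a higher priority, which the project may define differently.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SketchLink:
    """Illustrative link record: url, weight/priority, date and notes."""
    url: str
    weight: int = 0
    date: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    notes: str = ""


class SketchFrontier:
    """Illustrative frontier: pops the lowest-weight unseen URL first."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seen: set = set()
        # The counter breaks ties so links of equal weight pop in insertion order.
        self._counter = itertools.count()

    def add(self, link: SketchLink) -> bool:
        """Queue a link; return False if its URL was already queued."""
        if link.url in self._seen:
            return False
        self._seen.add(link.url)
        heapq.heappush(self._heap, (link.weight, next(self._counter), link))
        return True

    def pop(self) -> Optional[SketchLink]:
        """Return the next link to crawl, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

A crawler loop built on this would call pop() until it returns None and add() every link found on each fetched page.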

Notes

  • The crawl resumes automatically if the .xlsx file is present
  • crc32 replaced by crc32(sha256(url))
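
The crc32(sha256(url)) note can be sketched as follows. This is an assumption about the details: the README does not say whether the CRC is taken over the raw SHA-256 digest or its hex string, nor which text encoding is used. The sketch uses the raw digest and UTF-8, and url_key is a hypothetical name.

```python
import hashlib
import zlib


def url_key(url: str) -> int:
    """Return crc32(sha256(url)) as an unsigned 32-bit integer.

    Assumptions: the URL is encoded as UTF-8 and the CRC is computed over
    the raw 32-byte SHA-256 digest, not its hexadecimal form.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return zlib.crc32(digest)
```

The result fits in 32 bits, so it can serve as a compact numeric id, at the cost of possible collisions between different URLs.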

To do

  • Use boost C++ library optimization
  • Ignore lists
  • Regexp to ignore pages
  • Weighted/prioritized URLs are supported but not yet fully in effect
  • Use Dockerfile from https://github.com/dperson/torproxy
  • Auto change IP inspired by https://github.com/FrackingAnalysis/PyTorStemPrivoxy
  • Check for DNS leaks / add Pi-hole or a DNS mirror
  • Roadmap: Middleware and backend
  • No Frontera integration for now
  • Code cleanup, variable checks
  • Refactor as needed for specific crawling
  • Python tests
  • Documentation
  • Tor v3 urls
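
For the "ignore lists" and "regexp to ignore pages" items above, one possible shape is sketched below. The patterns and the should_ignore name are hypothetical examples, not rules defined by this project.

```python
import re

# Hypothetical ignore patterns: binary/media files and login pages.
IGNORE_PATTERNS = [
    r"\.(?:jpg|jpeg|png|gif|pdf|zip)(?:\?.*)?$",
    r"/login\b",
]

# Compile once so each URL check only runs the matchers.
_COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in IGNORE_PATTERNS]


def should_ignore(url: str, patterns=_COMPILED) -> bool:
    """Return True if any ignore pattern matches somewhere in the URL."""
    return any(pattern.search(url) for pattern in patterns)
```

A crawler could call should_ignore(url) before queueing a link, so ignored pages never enter the frontier.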

Tested on:

  • Debian

Alternatives:

  • Rewrite wget using socks5
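  • Crawling through SOCKS5 could be done with curl instead of wget, e.g.: curl --socks5-hostname 127.0.0.1:9050 https://www.nytimes3xbfgragh.onion/ (--socks5-hostname lets the proxy resolve the hostname, which .onion addresses require; plain --socks5 resolves it locally)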
  • Rewrite scrapy to support socks5h
  • Use Frontera as crawler frontier (cf. Frontera Request example)

About

Tor crawler - searching with a Manticore Database
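
Since Manticore accepts MySQL-protocol clients, a search from Python with pymysql might look like the sketch below. It assumes Manticore listens on its default MySQL-protocol port 9306 on localhost and uses a hypothetical table named "pages"; the function names are illustrative and not taken from search.py.

```python
# Assumptions: Manticore's default MySQL-protocol port and a hypothetical table.
MANTICORE_HOST = "127.0.0.1"
MANTICORE_PORT = 9306
TABLE = "pages"


def build_search_query(terms: str, limit: int = 10) -> tuple:
    """Return (sql, params) for a full-text MATCH query.

    The search terms are passed as a parameter so the client escapes them;
    Manticore's own full-text operators inside the terms are left untouched.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    sql = f"SELECT * FROM {TABLE} WHERE MATCH(%s) LIMIT {int(limit)}"
    return sql, (terms,)


def search(terms: str, limit: int = 10) -> list:
    """Run the query against Manticore and return the rows as dictionaries."""
    import pymysql  # imported lazily so build_search_query works without it

    sql, params = build_search_query(terms, limit)
    connection = pymysql.connect(
        host=MANTICORE_HOST,
        port=MANTICORE_PORT,
        user="",
        cursorclass=pymysql.cursors.DictCursor,
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute(sql, params)
            return list(cursor.fetchall())
    finally:
        connection.close()
```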
