
DistributedWebCrawler

The system consists of a master process and a number of workers. Given a seed URL, the master crawls all linked pages up to a given depth n, displays the connectivity graph, and computes the PageRank of each page.
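The master and workers talk over Pyro5, discovering each other through the name server started below. A minimal sketch of that pattern, assuming hypothetical names (CrawlWorker, crawler.worker); the actual classes in worker.py and master.py may differ:

    # worker.py (sketch): expose a crawl service via the Pyro5 name server
    import Pyro5.api
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    @Pyro5.api.expose
    class CrawlWorker:
        def crawl(self, url):
            """Fetch one page and return the absolute URLs it links to."""
            resp = requests.get(url, timeout=10)
            soup = BeautifulSoup(resp.text, "html.parser")
            return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    daemon = Pyro5.api.Daemon()
    uri = daemon.register(CrawlWorker)
    # In practice each worker would register under a unique name.
    Pyro5.api.locate_ns().register("crawler.worker", uri)
    daemon.requestLoop()

    # master side (sketch): look a worker up by name and delegate one page
    #   worker = Pyro5.api.Proxy("PYRONAME:crawler.worker")
    #   links = worker.crawl("https://iiit.ac.in")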

Required libraries

  • BeautifulSoup4
  • Requests
  • Pyro5
  • pyvis
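If the virtual environment below is missing or incomplete, the libraries can be installed with pip (assuming pyvis is the graph-rendering package):

    pip install beautifulsoup4 requests Pyro5 pyvis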

Usage

To activate the virtual environment with the above libraries, execute in each terminal: source project_env/bin/activate

To start the name server, execute in a terminal: pyro5-ns

To start workers, execute the following in as many separate terminals as required: python3 worker.py

To start the master, execute the following in a separate terminal: python3 master.py

Currently supported queries:

  • seed url_of_root_page n
    • Scrapes the URLs present in url_of_root_page up to depth n using BFS (see the sketch after this list). E.g. seed https://iiit.ac.in 2 scrapes all URLs up to depth 2, with https://iiit.ac.in itself at depth 1.
  • graph [-s]
    • Saves the graph to the specified file and opens it in a browser. Use the -s flag if you need to view the graph settings. E.g. graph -s
  • update url
    • Updates the adjacency list of the URL. E.g. update https://iiit.ac.in
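For reference, here is a minimal sketch of what the seed query computes, assuming requests and BeautifulSoup for fetching; fetch_links, bfs_crawl, and pagerank are hypothetical names for illustration, not the repository's actual functions:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    from collections import deque

    def fetch_links(url):
        """Return the absolute URLs linked from a page (as in the worker sketch)."""
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    def bfs_crawl(seed_url, max_depth):
        """BFS from seed_url; the seed itself counts as depth 1."""
        adjacency = {}                        # url -> list of outgoing links
        visited = {seed_url}
        queue = deque([(seed_url, 1)])
        while queue:
            url, depth = queue.popleft()
            try:
                links = fetch_links(url)
            except requests.RequestException:
                links = []                    # skip unreachable pages
            adjacency[url] = links
            if depth < max_depth:
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        queue.append((link, depth + 1))
        return adjacency

    def pagerank(adjacency, damping=0.85, iterations=50):
        """Plain power iteration; dangling links simply leak rank in this sketch."""
        pages = list(adjacency)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, links in adjacency.items():
                out = [l for l in links if l in rank]
                for link in out:
                    new[link] += damping * rank[page] / len(out)
            rank = new
        return rank

The adjacency dict here plays the role of the connectivity graph that the graph query renders, and pagerank is a textbook power iteration over it.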

Use Ctrl-D to exit.

Team SpaceBar

  • Adarsh Dharmadevan
  • Apoorva Thirupathi
  • Gadela Keshav
  • Guru Ravi Shanker
  • Joseph Cherukara

About

A distributed web crawler built as a distributed systems course project.
