This repository was archived by the owner on Apr 27, 2022. It is now read-only.

A repository for ATDS course semester project at ECE, NTUA, 2022


kitsorfan/ATDS2022

Latest commit: 5ece688 · Apr 26, 2022 · 42 commits

Advanced Topics in Database Systems

Exercises in Python/SQL, semester project for Advanced Topics in Database Systems course at ECE⚡, NTUA🎓, academic year 2021-2022

Python Spark SQL Hadoop Ubuntu Server


📋Description

The dataset used for this project is the Full MovieLens Dataset.

The project consists of two main parts:

  1. Implement and test the 5 requested queries using both the RDD API and Spark SQL
  2. Analyze the performance of the reduce-side join and map-side join implementations
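To make the two join strategies named above concrete, here is a minimal pure-Python sketch of the ideas behind them. It is not the project's Spark code: the function names `reduce_side_join` and `map_side_join` and the sample `ratings` and `movies` records are hypothetical, chosen only for illustration. A reduce-side (repartition) join tags each record with its source table, groups by key, and pairs records up per key. A map-side (broadcast) join builds a lookup table from the small input and probes it while scanning the large one, so no grouping step is needed.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Repartition join: tag records by source, group by key, pair per key."""
    # "Map" phase: tag every (key, value) record with the table it came from.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]

    # "Shuffle" phase: bring all records with the same key together.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)

    # "Reduce" phase: per key, emit every (left value, right value) pair.
    joined = []
    for key, values in groups.items():
        lefts = [v for tag, v in values if tag == "L"]
        rights = [v for tag, v in values if tag == "R"]
        joined.extend((key, (l, r)) for l in lefts for r in rights)
    return joined


def map_side_join(big, small):
    """Broadcast join: hash the small table once, probe it from the big one."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # Each record of the big table is joined locally; no grouping is required.
    return [(k, (v, s)) for k, v in big for s in lookup.get(k, [])]


# Hypothetical sample data: (movie_id, rating) and (movie_id, title).
ratings = [(1, 4.0), (2, 3.5), (1, 5.0), (3, 2.0)]
movies = [(1, "Toy Story"), (2, "Jumanji")]

reduce_result = sorted(reduce_side_join(ratings, movies))
map_result = sorted(map_side_join(ratings, movies))
```

Both functions return the same inner-join result. The difference is in the data movement: the first groups all records of both inputs by key, while the second assumes the small input fits in memory on every worker.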

Details:

  • We used 3 VMs for our cluster (1 NameNode, 2 DataNodes)
  • Dataset formats used: CSV, DataFrame, Parquet

Project Goals

  • get familiar with the Spark API
  • evaluate the performance of a list of queries
  • compare different MapReduce-style join algorithms in Spark
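One simple way to evaluate the performance of a query is to run it several times and report the median wall-clock time. The sketch below assumes each query is wrapped in a zero-argument callable. The helper name `time_query` and the stand-in workload are hypothetical and are not taken from the project's code.

```python
import statistics
import time


def time_query(run_query, repeats=3):
    """Run a zero-argument callable `repeats` times; return the median seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; in practice this would trigger a Spark action.
elapsed = time_query(lambda: sum(range(100_000)), repeats=5)
print(f"median time: {elapsed:.6f} s")
```

The median is less sensitive than the mean to a slow first run, which is common when caches are cold.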

The project's assignment and report are written in Greek.

👔Team Members

Name - GitHub            Email
Stylianos Kandylakis     gmail
Kitsos Orfanopoulos      protonmail
Christos Tsoufis         gmail

🖥Specifications of VM

OS                          CPUs   RAM    Disk space
Ubuntu 16.04 LTS (Xenial)   2      2GB    30GB

🔗Sources