diff --git a/dataeng/README.md b/dataeng/README.md
index 3df8bdd1..dac11c47 100644
--- a/dataeng/README.md
+++ b/dataeng/README.md
@@ -1,187 +1,149 @@
-### Prerequisites
-* Python 3.7 or greater
-* Docker 19.03 or greater
-* Git 2.28 or greater
-* Postgres 13 or greater
-
-## Level 1
-
-### Files definitions:
-
-- src_data - Path with source data needed to be processed.
-- processed_data - Path with output processed data.
-- user_id.jpg - User image file, for example, 0001.jpg. Could be several for different users in source data path.
-- user_id.csv - User info file, for example, 0001.csv. Could be several for different users in source data path.
-
-User csv file contains next columns:
-
-1. first_name - User first name
-2. last_name - User last name
-3. birthts - User birthdate timestamp in milliseconds UTC
-
-Test csv and img files could be found in the [02-src-data](./02-src-data) folder
-
-**For example:**
-
-```text
-first_name, last_name, birthts
-Ivan, Ivanov, 946674000000
-```
+# README
+
+### Solution for the Data Engineering internship task
+#### There are two working modes
+##### First working mode
+Read all of the users' CSV files, check the previously processed output file, filter out duplicate user_ids and update the processed output file with the new users.
+##### Second working mode
+Available only when a processed output file already exists: the client can pick one of the users in the database and change that user's first_name and last_name.
+##### Code structure
+The code is divided into a main part and four functions:
+1. A simple function that reads a single CSV file and returns its header row and data row as lists.
+2. A function that concatenates the data from all source CSV files.
+3. A function that writes the combined data into a CSV file.
+4. A function that checks whether a processed output file already exists.
+
+The main function reads the selected working mode and dispatches to the corresponding functionality; a minimal sketch of the first mode's merge step is shown below.
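+
+The first working mode boils down to concatenating the existing output with the freshly read rows and dropping duplicate user_ids. The snippet below is an illustrative, self-contained sketch of that step with a simplified column set and made-up sample rows; `main.py` further down performs the same `pd.concat(...).drop_duplicates(...)` call on the full output columns.
+```python
+import pandas as pd
+
+# Previously processed output (one already known user) ...
+existing = pd.DataFrame({'user_id': [1], 'first_name': ['Ivan'], 'last_name': ['Ivanov']})
+# ... and freshly read source rows (user 1 again plus a new user 2).
+new_rows = pd.DataFrame({'user_id': [1, 2], 'first_name': ['Ivan', 'Petr'], 'last_name': ['Ivanov', 'Petrov']})
+
+# Concatenate and drop duplicate user_ids; rows already present in the output
+# win because drop_duplicates keeps the first occurrence it sees.
+merged = pd.concat([existing, new_rows]).drop_duplicates(subset=['user_id']).reset_index(drop=True)
+print(merged)  # user 1 appears once, user 2 is appended
+```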
+
+## SQL Answers
+### 1. Rewrite the SQL without subquery:
+```sql
+SELECT u.id
+FROM users AS u
+LEFT JOIN departments AS d
+    ON d.user_id = u.id
+    AND d.department_id = 1
+WHERE d.user_id IS NULL;
+```
+The LEFT JOIN restricted to department_id = 1 together with the IS NULL filter keeps exactly the users that have no row in that department, which is what the original NOT IN subquery selected.
-
-### Data processing description
-
-1. Read csv file
-2. Match images for each user
-3. Combine data from CSV and image path
-4. Update processed_data/output.csv CSV file and add new data. Important we can update data for previously processed
-   user. In output CSV and DB we should not duplicate records. Output CSV file format: user_id, first_name,
-   last_name, birthts, img_path
-
-## Task
-
-Implement a script to process files from the `src_data` folder.
-
-## Results delivery format
-
-Results should be implemented as a python script with demo data. Also should be
-provided the README.md file with the description of your solution.
-
-## Level 2
-The same as **Level 1** with the following extras.
-
-## Results delivery format
-
-Results should be implemented as a service. The service should periodically read source data and process it.
-Also, the service should implement web server with next endpoints:
-- **GET** /data - get all records from DB in JSON format. Need to implement filtering by: is_image_exists = True/False, user min_age and max_age in years.
-- **POST** /data - manually run data processing in src_data
-
-Should be provided the README.md file with the description of your solution.
-
-## Level 3
-The same as **Level 2** but with next differences.
-
-### Files definitions:
-Source data and processed data should store in Minio.
-Minio service already defined in [docker-compose](./01-docker-compose/docker-compose.yml) file.
-
-### Data processing description
-
-1. Read csv file
-2. Match images for each user
-3. Combine data from CSV and image path
-4. Update processed_data/output.csv CSV file and add new data. Important we can update data for previously processed
-   user. In output CSV and DB we should not to duplicate records. Output CSV file format: user_id, first_name,
-   last_name, birthts, img_path
-5. Write this combined data to DB. Record should contain next columns: id, user_id, first_name, last_name, birthdate, img_path. id - autoincrement unique record id.
-Postgres DB service already defined in [docker-compose](./01-docker-compose/docker-compose.yml)
-
-## Results delivery format
-
-Results should be implemented as a service. The service should periodically read source data and process it.
-Also, the service should implement web server with next endpoints:
-- **GET** /data - get all records from DB in JSON format. Need to implement filtering by: is_image_exists = True/False, user min_age and max_age in years.
-- **POST** /data - manually run data processing in src_data
-
-The solution should work in docker-compose. As base template can be taken [docker-compose](./01-docker-compose/docker-compose.yml) file.
-
-**As a solution, you should implement one of the levels. You don't need to implement all of them, just choose the one you can solve.**
-## Coding Tasks for Data Engineers
-The following tasks cover different sections to check candidate's basic knowledge in SQL, Algorithms and Linux shell.
-
-### SQL
-1. Rewrite this SQL without subquery:
-```sql
-SELECT id
-FROM users
-WHERE id NOT IN (
-    SELECT user_id
-    FROM departments
-    WHERE department_id = 1
-);
-```
+### 2. Write a SQL query to find all duplicate lastnames in a table named **user**
+```sql
+SELECT lastname, COUNT(lastname) AS occurrences
+FROM "user"
+GROUP BY lastname
+HAVING COUNT(lastname) > 1;
+```
-2. Write a SQL query to find all duplicate lastnames in a table named **user**
-```text
-+----+-----------+-----------
-| id | firstname | lastname |
-+----+-----------+-----------
-| 1 | Ivan | Sidorov |
-| 2 | Alexandr | Ivanov |
-| 3 | Petr | Petrov |
-| 4 | Stepan | Ivanov |
-+----+-----------+----------+
-```
+### 3. Write a SQL query to get a username from the **user** table with the second highest salary from the **salary** table. Show the username and its salary in the result.
+```sql
+SELECT u.username, s.salary
+FROM salary AS s
+JOIN "user" AS u
+    ON s.user_id = u.id
+ORDER BY s.salary DESC
+LIMIT 1 OFFSET 1;
+```
+`OFFSET 1` skips the single highest salary; the sample data contains no ties, otherwise `DENSE_RANK()` would be the safer choice.
-3. Write a SQL query to get a username from the **user** table with the second highest salary from **salary** tables. Show the username and it's salary in the result.
-```sql
-+---------+--------+
-| user_id | salary |
-+----+--------+----+
-| 1 | 1000 |
-| 2 | 1100 |
-| 3 | 900 |
-| 4 | 1200 |
-+---------+--------+
-```
-```sql
-+---------+--------+
-| id | username |
-+----+--------+----+
-| 1 | Alex |
-| 2 | Maria |
-| 3 | Bob |
-| 4 | Sean |
-+---------+-------+
-```
+## Algorithms & Data Structures
-### Algorithms and Data Structures
-1. Optimise execution time of this Python code snippet:
-```
-def count_connections(list1: list, list2: list) -> int:
-    count = 0
-
-    for i in list1:
-        for j in list2:
-            if i == j:
-                count += 1
-
-    return count
-```
+### 1. Optimization of the Python code snippet:
+```python
+from collections import Counter
+
+
+def count_connections(list1: list, list2: list) -> int:
+    # Count matches per distinct shared value instead of comparing every pair,
+    # bringing the nested O(len(list1) * len(list2)) loop down to linear time.
+    counter1 = Counter(list1)
+    counter2 = Counter(list2)
+    intersections = set(list1).intersection(list2)
+    total = 0
+    for value in intersections:
+        total += counter1[value] * counter2[value]
+    return total
+```
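+The rewrite can be sanity-checked against the original quadratic implementation; the snippet below is a small self-contained check on a made-up input (both functions are repeated here so it runs on its own).
+```python
+from collections import Counter
+
+
+def count_connections_naive(list1: list, list2: list) -> int:
+    # The original O(n * m) version from the task description.
+    count = 0
+    for i in list1:
+        for j in list2:
+            if i == j:
+                count += 1
+    return count
+
+
+def count_connections(list1: list, list2: list) -> int:
+    # The optimised O(n + m) version from the answer above.
+    counter1 = Counter(list1)
+    counter2 = Counter(list2)
+    total = 0
+    for value in set(list1).intersection(list2):
+        total += counter1[value] * counter2[value]
+    return total
+
+
+# The value 2 occurs twice in each list, so there are 2 * 2 = 4 matching pairs.
+assert count_connections([1, 2, 2, 3], [2, 2, 4]) == count_connections_naive([1, 2, 2, 3], [2, 2, 4]) == 4
+```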
+### 2. Given a string `s`, find the length of the longest substring without repeating characters. Analyze your solution and please provide Space and Time complexities.
+```python
+def findLongestSubstring(string):
+    if len(string) == 0:
+        return 0
+    n = len(string)
+    # Starting point of the current substring.
+    st = 0
+    # Length of the longest substring found so far.
+    maxlen = 0
+    # Starting index of that longest substring.
+    start = 0
+    # Last seen position of each visited character.
+    pos = {string[0]: 0}
+    for i in range(1, n):
+        if string[i] not in pos:
+            # First occurrence of this character: just record its position.
+            pos[string[i]] = i
+        else:
+            # The character was seen before; check whether that occurrence
+            # lies inside the current substring.
+            if pos[string[i]] >= st:
+                # Close the current substring and update the best one so far.
+                currlen = i - st
+                if maxlen < currlen:
+                    maxlen = currlen
+                    start = st
+                # The next substring starts right after the previous
+                # occurrence of the repeated character.
+                st = pos[string[i]] + 1
+            # Update the last seen position of the character.
+            pos[string[i]] = i
+    # Account for the substring that runs to the end of the string
+    # (length n - st), so inputs like "au" are handled correctly.
+    if maxlen < n - st:
+        maxlen = n - st
+        start = st
+    # The longest substring without repeating characters is
+    # string[start : start + maxlen]; its length is the required answer.
+    return string[start: start + maxlen]
+```
+The function returns the longest substring itself (or 0 for an empty input), so the requested length is simply `len()` of the result.
+
+**Time Complexity:** O(n)
+**Auxiliary Space:** O(n)
-
-2. Given a string `s`, find the length of the longest substring without repeating characters.
-   Analyze your solution and please provide Space and Time complexities.
-
-**Example 1**
-```text
-Input: s = "abcabcbb"
-Output: 3
-Explanation: The answer is "abc", with the length of 3.
-```
-**Example 2**
-```text
-Input: s = "bbbbb"
-Output: 1
-Explanation: The answer is "b", with the length of 1.
-```
-**Example 3**
-```text
-Input: s = "pwwkew"
-Output: 3
-Explanation: The answer is "wke", with the length of 3.
-Notice that the answer must be a substring, "pwke" is a subsequence and not a substring.
-```
-**Example 3**
-```text
-Input: s = ""
-Output: 0
-```
+### 3. Given a sorted array of distinct integers and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.
+```python
+def binary_search(arr: list, low, high, target):
+    # Targets outside the array's range map straight to the two ends.
+    if target < arr[0]:
+        return 0
+    elif target > arr[-1]:
+        return len(arr)
+    if high >= low:
+        mid = (high + low) // 2
+        # The target is present at the middle itself.
+        if arr[mid] == target:
+            return mid
+        # The target is smaller than the middle element, so it can only
+        # be in the left subarray.
+        elif arr[mid] > target:
+            return binary_search(arr, low, mid - 1, target)
+        # Otherwise it can only be in the right subarray.
+        else:
+            return binary_search(arr, mid + 1, high, target)
+    else:
+        # The target is not present; low (== high + 1) is its insertion index.
+        return high + 1
+```
+A linear scan gives the same answer in O(n) instead of O(log n):
+```python
+def linear_search(list1: list, target):
+    # Return the index of the first element >= target, which is the position
+    # of the target if it is present or its insertion point otherwise.
+    for i in range(len(list1)):
+        if target <= list1[i]:
+            return i
+    # The target is greater than every element, so it belongs at the end.
+    return len(list1)
+```
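+Usage sketch for both functions (they are assumed to be defined as above; the sample array comes from the task's example):
+```python
+nums = [1, 3, 5, 6]
+
+print(binary_search(nums, 0, len(nums) - 1, 5))  # 2, since 5 is found at index 2
+print(binary_search(nums, 0, len(nums) - 1, 2))  # 1, since 2 would be inserted at index 1
+print(linear_search(nums, 7))                    # 4, since 7 would be appended at the end
+```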
-3. Given a sorted array of distinct integers and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.
-**Example:**
-```text
-Input: nums = [1,3,5,6], target = 5
-Output: 2
-```
-### Linux Shell
-1. List processes listening on ports 80 and 443
-2. List process environment variables by given PID
-3. Launch a python program `my_program.py` through CLI in the background. How would you close it after some period of time?
+## Linux Administration
+### 1. List processes listening on ports 80 and 443
+```bash
+# -t TCP sockets, -n numeric addresses, -l listening only, -p owning process (needs root)
+sudo netstat -tnlp | grep :443
+sudo netstat -tnlp | grep :80
+```
+### 2. List process environment variables by given PID
+```bash
+# Entries in /proc/<PID>/environ are NUL-separated; tr prints one variable per line.
+cat /proc/[process ID]/environ | tr '\0' '\n'
+```
+### 3. Launch a python program `my_program.py` through CLI in the background. How would you close it after some period of time?
+```bash
+nohup python my_program.py &    # keeps running after the shell session ends
+ps -ef | grep my_program.py     # later: look up its PID
+kill [PID]                      # terminate it (use kill -9 only if it ignores SIGTERM)
+```
diff --git a/dataeng/main.py b/dataeng/main.py
new file mode 100644
index 00000000..45e8c5a5
--- /dev/null
+++ b/dataeng/main.py
@@ -0,0 +1,112 @@
+# Import necessary libraries
+import csv
+import os
+from os import listdir
+from os.path import isfile, join
+import pandas as pd
+from pathlib import Path
+
+# Fixed path variables
+home = str(Path.home())
+wd = join(home, 'internship/dataeng')
+src_wd = join(wd, '02-src-data')
+prc_wd = join(wd, 'processed_data')
+out_wd = join(prc_wd, 'output.csv')
+
+# Working-mode menu
+wor_mode = "press 1 for reading the source files and updating the output file \n" \
+           "press 2 for editing the processed data \n" \
+           "press any other key to quit"
+
+
+def main():
+    print(wor_mode)
+    wor_sel = input()
+
+    if wor_sel == '1':
+        df = check_processed_file(prc_wd)
+        if not df.empty:
+            # An output file already exists: merge it with the freshly read source
+            # rows and drop duplicate user_ids (existing records take precedence,
+            # since drop_duplicates keeps the first occurrence).
+            headers, data_rows = read_all_csv(src_wd)
+            ndf = pd.DataFrame(data_rows, columns=headers)
+            ndf['user_id'] = ndf['user_id'].astype('int64')
+            filtered_df = pd.concat([df, ndf]).drop_duplicates(subset=['user_id']).reset_index(drop=True)
+            write_df_csv(out_wd, headers, data_rows, filtered_df)
+        else:
+            # First run: write everything read from the source folder.
+            headers, data_rows = read_all_csv(src_wd)
+            write_df_csv(out_wd, headers, data_rows)
+
+    elif wor_sel == '2':
+        df = check_processed_file(prc_wd)
+        if not df.empty:
+            try:
+                user_id = int(input("Please enter the user id to edit: "))
+                user_idx = df[df['user_id'] == user_id].index.values[0]
+                user_in = input("New first name: ")
+                df.at[user_idx, "first_name"] = user_in
+                user_in = input("New last name: ")
+                # The leading space matches the unstripped CSV header " last_name".
+                df.at[user_idx, " last_name"] = user_in
+                print(df)
+                write_df_csv(out_wd, dataframe=df)
+            except (ValueError, IndexError):
+                print("Please enter the id number of an existing user")
+        else:
+            print("No database available yet")
+    return None
+
+
+def read_csv_simple(file_dir):
+    # Read one source CSV and return its header row and its single data row.
+    with open(file_dir, encoding='utf-8') as csv_file:
+        data = list(csv.reader(csv_file, delimiter=','))
+    return data[0], data[1]
+
+
+def read_all_csv(src_wd):
+    headers = []
+    data_rows = []
+    onlyfiles = [f for f in listdir(src_wd) if isfile(join(src_wd, f))]
+
+    for file_path in onlyfiles:
+        fp = join(src_wd, file_path)
+        # Split the extension from the path and normalise it to lowercase.
+        ext = os.path.splitext(fp)[-1].lower()
+        fp_wo_ext = os.path.splitext(fp)[0].lower()
+        # Now we can simply use == to check for equality, no need for wildcards.
+        if ext == ".csv":
+            headers, data = read_csv_simple(fp)
+            # The user id is the four-digit file name, e.g. .../0001.csv -> 0001.
+            user_id = f"{fp_wo_ext[-4:]}"
+            img_fp = f"{fp_wo_ext}.png"
+            data.insert(0, f"{user_id}")
+            data.append(f"{img_fp}")
+            data_rows.append(data)
+    headers.insert(0, 'user_id')
+    headers.append('img_path')
+    return headers, data_rows
+
+
+def write_df_csv(out_wd, headers=None, data=None, dataframe=pd.DataFrame()):
+    # Write either the given DataFrame or a new one built from headers and data.
+    if not dataframe.empty:
+        dataframe.to_csv(out_wd, index=False, encoding="utf-8")
+    else:
+        df = pd.DataFrame(data, columns=headers)
+        df.to_csv(out_wd, index=False, encoding="utf-8")
+    return True
+
+
+def check_processed_file(prc_wd):
+    # Return the previously processed output as a DataFrame, or an empty
+    # DataFrame (creating the processed folder) if there is none yet.
+    if os.path.exists(out_wd):
+        return pd.read_csv(out_wd, encoding="utf-8")
+    try:
+        os.mkdir(prc_wd)
+    except FileExistsError:
+        print("processed folder exists")
+    return pd.DataFrame()
+
+
+if __name__ == '__main__':
+    main()
diff --git a/dataeng/optimization.py b/dataeng/optimization.py
new file mode 100644
index 00000000..ed16523e
--- /dev/null
+++ b/dataeng/optimization.py
@@ -0,0 +1,129 @@
+from collections import Counter
+import numpy as np
+
+
+# Two random demo lists whose values fall in {0.0, 1.0, 2.0, 3.0}.
+list1 = list(np.round(np.random.rand(100)*3))
+list2 = list(np.round(np.random.rand(100)*3))
+
+
+def count_connections(list1: list, list2: list) -> int:
+    # Count matches per distinct shared value instead of comparing every pair.
+    counter1 = Counter(list1)
+    counter2 = Counter(list2)
+    intersections = set(list1).intersection(list2)
+    total = 0
+    for value in intersections:
+        total += counter1[value] * counter2[value]
+    return total
+
+
+def findLongestSubstring(string):
+    if len(string) == 0:
+        return 0
+    n = len(string)
+    # Starting point of the current substring.
+    st = 0
+    # Length of the longest substring found so far.
+    maxlen = 0
+    # Starting index of that longest substring.
+    start = 0
+    # Last seen position of each visited character.
+    pos = {string[0]: 0}
+    for i in range(1, n):
+        if string[i] not in pos:
+            # First occurrence of this character: just record its position.
+            pos[string[i]] = i
+        else:
+            # The character was seen before; check whether that occurrence
+            # lies inside the current substring.
+            if pos[string[i]] >= st:
+                # Close the current substring and update the best one so far.
+                currlen = i - st
+                if maxlen < currlen:
+                    maxlen = currlen
+                    start = st
+                # The next substring starts right after the previous
+                # occurrence of the repeated character.
+                st = pos[string[i]] + 1
+            # Update the last seen position of the character.
+            pos[string[i]] = i
+    # Account for the substring that runs to the end of the string
+    # (length n - st), so inputs like "au" are handled correctly.
+    if maxlen < n - st:
+        maxlen = n - st
+        start = st
+    # The longest substring without repeating characters is
+    # string[start : start + maxlen]; its length is the required answer.
+    return string[start: start + maxlen]
+
+
+string = "abcabcbb"
+print(findLongestSubstring(string))  # abc
+string = "bbbbb"
+print(findLongestSubstring(string))  # b
+string = "pwwkew"
+print(findLongestSubstring(string))  # wke
+string = ""
+print(findLongestSubstring(string))  # 0
+
+
+def linear_search(list1: list, target):
+    # Return the index of the first element >= target, which is the position
+    # of the target if it is present or its insertion point otherwise.
+    for i in range(len(list1)):
+        if target <= list1[i]:
+            return i
+    # The target is greater than every element, so it belongs at the end.
+    return len(list1)
+
+
+# Returns the index of target in arr if present,
+# otherwise the index at which it should be inserted.
+def binary_search(arr, low, high, target):
+    # Targets outside the array's range map straight to the two ends.
+    if target < arr[0]:
+        return 0
+    elif target > arr[-1]:
+        return len(arr)
+    if high >= low:
+        mid = (high + low) // 2
+        # The target is present at the middle itself.
+        if arr[mid] == target:
+            return mid
+        # The target is smaller than the middle element, so it can only
+        # be in the left subarray.
+        elif arr[mid] > target:
+            return binary_search(arr, low, mid - 1, target)
+        # Otherwise it can only be in the right subarray.
+        else:
+            return binary_search(arr, mid + 1, high, target)
+    else:
+        # The target is not present; low (== high + 1) is its insertion index.
+        return high + 1
+
+
+# Test array
+arr = [1, 2, 3, 4, 5, 6, 7, 10, 15]
+
+result = binary_search(arr, 0, len(arr) - 1, 4.5)
+idx = linear_search(arr, 2)
+print(idx, result)  # 1 4  (2 is at index 1; 4.5 would be inserted at index 4)