provectus · phoenixfury · Oct 10, 2021 · Oct 10, 2021
diff --git a/dataeng/README.md b/dataeng/README.md
@@ -1,187 +1,149 @@
-### Prerequisites
-* Python 3.7 or greater
-* Docker 19.03 or greater
-* Git 2.28 or greater
-* Postgres 13 or greater
-
-## Level 1
-
-### Files definitions:
-
-- src_data - Path with source data needed to be processed.
-- processed_data - Path with output processed data.
-- user_id.jpg - User image file, for example, 0001.jpg. Could be several for different users in source data path.
-- user_id.csv - User info file, for example, 0001.csv. Could be several for different users in source data path.
-
-User csv file contains next columns:
-
-1. first_name - User first name
-2. last_name - User last name
-3. birthts - User birthdate timestamp in milliseconds UTC
-
-Test csv and img files could be found in the [02-src-data](./02-src-data) folder
-
-**For example:**
-
-```text
-first_name, last_name, birthts
-Ivan, Ivanov, 946674000000
+# Read me
+
+### This is the solution to the Dataeng internship task
+#### There are 2 working modes
+##### First working mode:
+The script reads all the users' CSV files and checks the previously processed output file. It then filters out duplicate user_ids and updates the processed output file with the new users.
+##### Second working mode:
+This mode is available only if a processed output file already exists. The client can edit one of the users in the database by changing the first_name and last_name of the selected user.
+##### Code structure
+The code is divided into a main part and 4 functions:
+1. A simple function that reads a single CSV file into a list
+2. A function that concatenates all the data from the CSV files
+3. A function that writes data into a CSV file
+4. A function that checks whether a processed output file already exists
+
+The main function takes the selected working mode and then directs the program to the required functionality.
+
+## SQL Answers
+### 1.  Rewrite the SQL without subquery:
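+```sql
+-- Anti-join: keep only users that have no matching row in department 1.
+SELECT usertable.id
+  FROM users AS usertable
+    LEFT JOIN departments AS deptable
+      ON deptable.user_id = usertable.id
+     AND deptable.department_id = 1
+WHERE deptable.user_id IS NULL;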
 ```
-
-### Data processing description
-
-1. Read csv file
-2. Match images for each user
-3. Combine data from CSV and image path
-4. Update processed_data/output.csv CSV file and add new data. Important we can update data for previously processed
-   user. In output CSV and DB we should not duplicate records. Output CSV file format: user_id, first_name,
-   last_name, birthts, img_path
-
-## Task
-
-Implement a script to process files from the `src_data` folder.
-
-## Results delivery format
-
-Results should be implemented as a python script with demo data. Also should be
-provided the README.md file with the description of your solution.
-
-## Level 2
-The same as **Level 1** with the following extras.
-
-## Results delivery format
-
-Results should be implemented as a service. The service should periodically read source data and process it.
-Also, the service should implement web server with next endpoints:
-- **GET**  /data - get all records from DB in JSON format. Need to implement filtering by: is_image_exists = True/False, user min_age and max_age in years.
-- **POST** /data - manually run data processing in src_data
-
-Should be provided the README.md file with the description of your solution.
-
-## Level 3
-The same as **Level 2** but with next differences.
-
-### Files definitions:
-Source data and processed data should store in Minio. Minio service already defined in [docker-compose](./01-docker-compose/docker-compose.yml) file.
-
-### Data processing description
-
-1. Read csv file
-2. Match images for each user
-3. Combine data from CSV and image path
-4. Update processed_data/output.csv CSV file and add new data. Important we can update data for previously processed
-   user. In output CSV and DB we should not to duplicate records. Output CSV file format: user_id, first_name,
-   last_name, birthts, img_path
-5. Write this combined data to DB. Record should contain next columns: id, user_id, first_name, last_name, birthdate, img_path. id - autoincrement unique record id.
-Postgres DB service already defined in [docker-compose](./01-docker-compose/docker-compose.yml)
-
-## Results delivery format
-
-Results should be implemented as a service. The service should periodically read source data and process it.
-Also, the service should implement web server with next endpoints:
-- **GET**  /data - get all records from DB in JSON format. Need to implement filtering by: is_image_exists = True/False, user min_age and max_age in years.
-- **POST** /data - manually run data processing in src_data
-
-The solution should work in docker-compose. As base template can be taken [docker-compose](./01-docker-compose/docker-compose.yml) file.
-
-**As a solution, you should implement one of the levels. You don't need to implement all of them, just choose the one you can solve.** 
-## Coding Tasks for Data Engineers
-The following tasks cover different sections to check candidate's basic knowledge in SQL, Algorithms and Linux shell. 
-
-### SQL
-1. Rewrite this SQL without subquery:
-```sql
-SELECT id
-FROM users
-WHERE id NOT IN (
-	SELECT user_id
-	FROM departments
-	WHERE department_id = 1
-);
+### 2.  Write a SQL query to find all duplicate lastnames in a table named  **user**
+```sql
+-- "user" is quoted because it is a reserved word in PostgreSQL.
+SELECT lastname, COUNT(lastname)
+FROM "user"
+GROUP BY lastname
+HAVING COUNT(lastname) > 1;
 ```
-2. Write a SQL query to find all duplicate lastnames in a table named **user**
-```text
-+----+-----------+-----------
-| id | firstname | lastname |
-+----+-----------+-----------
-| 1  | Ivan      | Sidorov  |
-| 2  | Alexandr  | Ivanov   |
-| 3  | Petr      | Petrov   |
-| 4  | Stepan    | Ivanov   |
-+----+-----------+----------+
+### 3. Write a SQL query to get a username from the **user** table with the second highest salary from the **salary** table. Show the username and its salary in the result.
+```sql
+-- "user" is quoted because it is a reserved word in PostgreSQL.
+SELECT USERTABLE.username, SALARYTABLE.salary
+  FROM salary AS SALARYTABLE
+       INNER JOIN "user" AS USERTABLE
+          ON SALARYTABLE.user_id = USERTABLE.id
+ ORDER
+    BY SALARYTABLE.salary DESC
+LIMIT 1 OFFSET 1;
 ```
-3. Write a SQL query to get a username from the **user** table with the second highest salary from **salary** tables. Show the username and it's salary in the result.
-```sql
-+---------+--------+
-| user_id | salary |
-+----+--------+----+
-| 1       | 1000   |
-| 2       | 1100   |
-| 3       | 900    |
-| 4       | 1200   |
-+---------+--------+
-```
-```sql
-+---------+--------+
-| id | username    |
-+----+--------+----+
-| 1  | Alex       |
-| 2  | Maria      |
-| 3  | Bob        |
-| 4  | Sean       |
-+---------+-------+
+## Algorithms & Data Structures
+### 1.  Optimization of the Python code snippet:
+```python
+from collections import Counter
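+def count_connections(list1: list, list2: list) -> int:
+    # Count occurrences of each value once per list: O(len(list1) + len(list2)).
+    counter1 = Counter(list1)
+    counter2 = Counter(list2)
+    # Each shared value contributes (count in list1) * (count in list2) matching pairs.
+    total = 0
+    for value in counter1.keys() & counter2.keys():
+        total += counter1[value] * counter2[value]
+    return total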
 ```
-### Algorithms and Data Structures
-1. Optimise execution time of this Python code snippet:
+### 2.  Given a string  `s`, find the length of the longest substring without repeating characters. Analyze your solution and please provide Space and Time complexities.
+```python
+def findLongestSubstring(string):
+    if len(string) == 0:
+        return 0
+    n = len(string)
+    # Starting index of the current substring.
+    st = 0
+    # Maximum length of a substring without repeating characters.
+    maxlen = 0
+    # Hash map storing the last occurrence of each visited character.
+    pos = {}
+    # The last occurrence of the first character is index 0.
+    pos[string[0]] = 0
+    for i in range(1, n):
+        # If this character was last seen at or after the start of the
+        # current substring, it repeats: close the current substring.
+        if string[i] in pos and pos[string[i]] >= st:
+            # Update maxlen with the length of the current substring.
+            maxlen = max(maxlen, i - st)
+            # The next substring starts right after the previous
+            # occurrence of the current character to avoid its repetition.
+            st = pos[string[i]] + 1
+        # Update the last occurrence of the current character.
+        pos[string[i]] = i
+    # Compare the length of the last substring with maxlen.
+    maxlen = max(maxlen, n - st)
+    # The task asks for the length, so return maxlen rather than the substring.
+    return maxlen
 ```
-def count_connections(list1: list, list2: list) -> int:
-  count = 0
-
-  for i in list1:
-    for j in list2:
-      if i == j:
-        count += 1
-
-  return count
+**Time Complexity:** O(n)
+**Auxiliary Space:** O(n)
+### 3.  Given a sorted array of distinct integers and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.
+```python
+def binary_search(arr: list, low, high, target):
+    if target < arr[0]:
+        return 0
+    elif target > arr[-1]:
+        return len(arr)
+    # Check base case
+    if high >= low:
+        mid = (high + low) // 2
+        # If element is present at the middle itself
+        if arr[mid] == target:
+            return mid
+        # If element is smaller than mid, then it can only
+        # be present in left subarray
+        elif arr[mid] > target:
+            return binary_search(arr, low, mid - 1, target)
+        # Else the element can only be present in right subarray
+        else:
+            return binary_search(arr, mid + 1, high, target)
+    else:
+        # Element is not present in the array, return the index where it should've been
+        return high + 1
 ```
-
-2. Given a string `s`, find the length of the longest substring without repeating characters.
-   Analyze your solution and please provide Space and Time complexities.
-
-**Example 1**
-```text
-Input: s = "abcabcbb"
-Output: 3
-Explanation: The answer is "abc", with the length of 3.
+```python
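+def linear_search(list1: list, target):
+    # Return the first index whose value is >= target: that is either the
+    # position of the target or the position where it would be inserted.
+    for i in range(len(list1)):
+        if target <= list1[i]:
+            return i
+    # The target is larger than every element, so it belongs at the end.
+    return len(list1)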
 ```
-**Example 2**
-```text
-Input: s = "bbbbb"
-Output: 1
-Explanation: The answer is "b", with the length of 1.
+## Linux Administration
+### 1.  List processes listening on ports 80 and 443
+```bash
+sudo netstat -tnlp | grep :443
+sudo netstat -tnlp | grep :80
 ```
-**Example 3**
-```text
-Input: s = "pwwkew"
-Output: 3
-Explanation: The answer is "wke", with the length of 3.
-Notice that the answer must be a substring, "pwke" is a subsequence and not a substring.
+### 2.  List process environment variables by given PID
+```bash
+cat /proc/[process ID]/environ | tr '\0' '\n'
 ```
-**Example 3**
-```text
-Input: s = ""
-Output: 0
+### 3.  Launch a python program  `my_program.py`  through CLI in the background. How would you close it after some period of time?
+```bash
+nohup ./my_program.py &
+ps -ef | grep my_program.py
+kill -9 [PID]
 ```
 
-3. Given a sorted array of distinct integers and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.
 
-**Example:**
-```text
-Input: nums = [1,3,5,6], target = 5
-Output: 2
-```
 
-### Linux Shell
-1. List processes listening on ports 80 and 443
-2. List process environment variables by given PID
-3. Launch a python program `my_program.py` through CLI in the background. How would you close it after some period of time?