Open

Task #40
310 changes: 136 additions & 174 deletions dataeng/README.md
@@ -1,187 +1,149 @@
### Prerequisites
* Python 3.7 or greater
* Docker 19.03 or greater
* Git 2.28 or greater
* Postgres 13 or greater

## Level 1

### File definitions:

- src_data - Path to the source data to be processed.
- processed_data - Path to the processed output data.
- user_id.jpg - User image file, for example, 0001.jpg. The source data path may contain several, one per user.
- user_id.csv - User info file, for example, 0001.csv. The source data path may contain several, one per user.

The user CSV file contains the following columns:

1. first_name - User first name
2. last_name - User last name
3. birthts - User birthdate timestamp in milliseconds UTC

Test CSV and image files can be found in the [02-src-data](./02-src-data) folder.

**For example:**

```text
first_name, last_name, birthts
Ivan, Ivanov, 946674000000
# Read me

### Solution for the Dataeng Internship task
#### There are two working modes
##### First working mode:
Reads all of the users' CSV files, checks the previously processed output file, filters out duplicate user_ids, and then updates the processed output file with the new users.
##### Second working mode:
Available only if a processed output file already exists. The client can edit one of the users in the database by changing the first_name and last_name of the selected user.
##### Code structure
The code is divided into a main part and 4 functions:
1. A function that reads a single CSV file and returns a list
2. A function that concatenates the data from all the CSV files
3. A function that writes data into a CSV file
4. A function that checks whether a processed output file already exists
The main function takes the selected working mode and dispatches to the corresponding functionality.

## SQL Answers
### 1. Rewrite the SQL without subquery:
```sql
SELECT usertable.id
FROM users AS usertable
LEFT JOIN departments AS deptable
  ON deptable.user_id = usertable.id
 AND deptable.department_id = 1
WHERE deptable.user_id IS NULL;
```

### Data processing description

1. Read csv file
2. Match images for each user
3. Combine data from CSV and image path
4. Update the processed_data/output.csv CSV file with the new data. Important: data for a previously processed
user can be updated, so records must not be duplicated in the output CSV or the DB. Output CSV file format: user_id, first_name,
last_name, birthts, img_path
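
The steps above can be sketched roughly as follows. This is a minimal illustration, not the submitted solution: the function name `merge_user_records` and the choice to key records by `user_id` in a dict are assumptions made for the example. It also assumes one data row per user CSV and that the image sits next to the CSV with a `.jpg` extension.

```python
import csv
from pathlib import Path


def merge_user_records(existing: dict, src_dir: Path) -> dict:
    """Return a copy of `existing` updated with one record per user CSV in src_dir.

    Hypothetical helper: records are keyed by user_id, so a user that was
    processed before is overwritten instead of being duplicated.
    """
    merged = dict(existing)
    for csv_path in sorted(src_dir.glob("*.csv")):
        user_id = csv_path.stem  # "0001.csv" -> "0001"
        with csv_path.open(newline="") as handle:
            # skipinitialspace handles the "a, b, c" style used in the example file.
            reader = csv.DictReader(handle, skipinitialspace=True)
            row = next(reader, None)
        if row is None:
            continue  # CSV without a data row: nothing to merge
        img_path = csv_path.with_suffix(".jpg")
        merged[user_id] = {
            "user_id": user_id,
            "first_name": row["first_name"].strip(),
            "last_name": row["last_name"].strip(),
            "birthts": row["birthts"].strip(),
            # Empty string when no matching image exists.
            "img_path": str(img_path) if img_path.exists() else "",
        }
    return merged
```

Keying by `user_id` means an update and an insert go through the same code path, which is one simple way to satisfy the "no duplicate records" requirement.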

## Task

Implement a script to process files from the `src_data` folder.

## Results delivery format

Results should be implemented as a Python script with demo data. A README.md file
describing your solution should also be provided.

## Level 2
The same as **Level 1** with the following extras.

## Results delivery format

Results should be implemented as a service. The service should periodically read source data and process it.
The service should also implement a web server with the following endpoints:
- **GET** /data - get all records from the DB in JSON format. Filtering must be supported by: is_image_exists = True/False, and user min_age and max_age in years.
- **POST** /data - manually run data processing in src_data
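
The filtering required by **GET** /data could look roughly like the sketch below. The helper names `age_in_years` and `filter_records` are hypothetical, and the record layout (a dict with `birthts` in milliseconds UTC and an `img_path` that is empty when no image exists) is an assumption taken from the output CSV format, not a description of the actual service.

```python
from datetime import datetime, timezone


def age_in_years(birthts_ms: int, now: datetime) -> int:
    """Full years between a birth timestamp (milliseconds, UTC) and `now`."""
    birth = datetime.fromtimestamp(birthts_ms / 1000, tz=timezone.utc)
    years = now.year - birth.year
    # Subtract one year if this year's birthday has not happened yet.
    if (now.month, now.day) < (birth.month, birth.day):
        years -= 1
    return years


def filter_records(records, now, is_image_exists=None, min_age=None, max_age=None):
    """Apply the optional filters; a filter set to None is ignored."""
    result = []
    for record in records:
        has_image = bool(record["img_path"])
        age = age_in_years(int(record["birthts"]), now)
        if is_image_exists is not None and has_image != is_image_exists:
            continue
        if min_age is not None and age < min_age:
            continue
        if max_age is not None and age > max_age:
            continue
        result.append(record)
    return result
```

Passing `now` explicitly keeps the age calculation deterministic, which makes the filter easy to test independently of the web framework.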

A README.md file describing your solution should also be provided.

## Level 3
The same as **Level 2** but with the following differences.

### File definitions:
Source data and processed data should be stored in Minio. The Minio service is already defined in the [docker-compose](./01-docker-compose/docker-compose.yml) file.

### Data processing description

1. Read csv file
2. Match images for each user
3. Combine data from CSV and image path
4. Update the processed_data/output.csv CSV file with the new data. Important: data for a previously processed
user can be updated, so records must not be duplicated in the output CSV or the DB. Output CSV file format: user_id, first_name,
last_name, birthts, img_path
5. Write the combined data to the DB. Each record should contain the following columns: id, user_id, first_name, last_name, birthdate, img_path. id is an auto-incremented unique record id.
The Postgres DB service is already defined in [docker-compose](./01-docker-compose/docker-compose.yml)
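
One way to keep the DB free of duplicates is an upsert keyed on `user_id`. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in so that it runs without a server; it is not the Postgres setup from the docker-compose file. The table name `users` and the helper names are assumptions. Postgres supports the same `INSERT ... ON CONFLICT ... DO UPDATE` form, but the placeholder style and the auto-increment column definition would differ with a Postgres driver.

```python
import sqlite3
from datetime import datetime, timezone


def create_users_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) users table with a unique user_id."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS users (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id TEXT NOT NULL UNIQUE,
            first_name TEXT,
            last_name TEXT,
            birthdate TEXT,
            img_path TEXT
        )
        """
    )


def upsert_user(conn, user_id, first_name, last_name, birthts_ms, img_path):
    """Insert a user, or update the existing row with the same user_id."""
    # birthts is in milliseconds UTC; store it as an ISO date string.
    birthdate = datetime.fromtimestamp(birthts_ms / 1000, tz=timezone.utc).date().isoformat()
    conn.execute(
        """
        INSERT INTO users (user_id, first_name, last_name, birthdate, img_path)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (user_id) DO UPDATE SET
            first_name = excluded.first_name,
            last_name = excluded.last_name,
            birthdate = excluded.birthdate,
            img_path = excluded.img_path
        """,
        (user_id, first_name, last_name, birthdate, img_path),
    )
    conn.commit()
```

Because the uniqueness constraint lives in the database, repeated processing runs cannot create a second row for the same user even if the script is started twice.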

## Results delivery format

Results should be implemented as a service. The service should periodically read source data and process it.
The service should also implement a web server with the following endpoints:
- **GET** /data - get all records from the DB in JSON format. Filtering must be supported by: is_image_exists = True/False, and user min_age and max_age in years.
- **POST** /data - manually run data processing in src_data

The solution should work in docker-compose. The [docker-compose](./01-docker-compose/docker-compose.yml) file can be used as a base template.

**As a solution, you should implement one of the levels. You don't need to implement all of them, just choose the one you can solve.**
## Coding Tasks for Data Engineers
The following tasks cover different sections to check a candidate's basic knowledge of SQL, algorithms and the Linux shell.

### SQL
1. Rewrite this SQL without subquery:
```sql
SELECT id
FROM users
WHERE id NOT IN (
SELECT user_id
FROM departments
WHERE department_id = 1
);
### 2. Write a SQL query to find all duplicate lastnames in a table named **user**
```sql
SELECT lastname, COUNT(lastname)
FROM "user"
GROUP BY lastname
HAVING COUNT(lastname) > 1;
```
2. Write a SQL query to find all duplicate lastnames in a table named **user**
```text
+----+-----------+----------+
| id | firstname | lastname |
+----+-----------+----------+
| 1  | Ivan      | Sidorov  |
| 2  | Alexandr  | Ivanov   |
| 3  | Petr      | Petrov   |
| 4  | Stepan    | Ivanov   |
+----+-----------+----------+
### 3. Write a SQL query to get a username from the **user** table with the second highest salary from the **salary** table. Show the username and its salary in the result.
```sql
SELECT usertable.username, salarytable.salary
FROM salary AS salarytable
INNER JOIN "user" AS usertable
  ON salarytable.user_id = usertable.id
ORDER BY salarytable.salary DESC
LIMIT 1 OFFSET 1;
```
3. Write a SQL query to get a username from the **user** table with the second highest salary from the **salary** table. Show the username and its salary in the result.
```sql
+---------+--------+
| user_id | salary |
+---------+--------+
| 1       | 1000   |
| 2       | 1100   |
| 3       | 900    |
| 4       | 1200   |
+---------+--------+
```
```sql
+----+----------+
| id | username |
+----+----------+
| 1  | Alex     |
| 2  | Maria    |
| 3  | Bob      |
| 4  | Sean     |
+----+----------+
## Algorithms & Data Structures
### 1. Optimization of the Python code snippet:
```python
from collections import Counter


def count_connections(list1: list, list2: list) -> int:
    counter1 = Counter(list1)
    counter2 = Counter(list2)
    # Only values present in both lists contribute to the count.
    intersections = set(list1).intersection(list2)
    total = 0
    for i in intersections:
        # Each shared value yields (count in list1) * (count in list2) matching pairs.
        total += counter1[i] * counter2[i]
    return total
```
### Algorithms and Data Structures
1. Optimise execution time of this Python code snippet:
### 2. Given a string `s`, find the length of the longest substring without repeating characters. Analyze your solution and please provide Space and Time complexities.
```python
def findLongestSubstring(string):
    if len(string) == 0:
        return 0
    n = len(string)
    # Starting point of the current substring.
    st = 0
    # Length of the longest substring without repeating characters.
    maxlen = 0
    # Starting index of the longest substring.
    start = 0
    # Hash map storing the last occurrence of each visited character.
    pos = {}
    # The last occurrence of the first character is index 0.
    pos[string[0]] = 0
    for i in range(1, n):
        # If this character is not in the hash map, this is its
        # first occurrence, so store it.
        if string[i] not in pos:
            pos[string[i]] = i
        else:
            # The character has a previous occurrence; check whether it is
            # at or after the starting point of the current substring.
            if pos[string[i]] >= st:
                # Find the length of the current substring and
                # update maxlen and start accordingly.
                currlen = i - st
                if maxlen < currlen:
                    maxlen = currlen
                    start = st
                # The next substring starts after the last occurrence
                # of the current character to avoid repeating it.
                st = pos[string[i]] + 1
            # Update the last occurrence of the current character.
            pos[string[i]] = i
    # Compare the length of the last substring, string[st:n], with maxlen
    # and update maxlen and start accordingly.
    if maxlen < n - st:
        maxlen = n - st
        start = st
    # The longest substring without repeating characters is
    # string[start:start + maxlen]; the task asks for its length.
    return maxlen
```
def count_connections(list1: list, list2: list) -> int:
    count = 0

    for i in list1:
        for j in list2:
            if i == j:
                count += 1

    return count
**Time Complexity:** O(n)
**Auxiliary Space:** O(n)
### 3. Given a sorted array of distinct integers and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.
```python
def binary_search(arr: list, low, high, target):
    # Targets outside the array's range are inserted at either end.
    if target < arr[0]:
        return 0
    elif target > arr[-1]:
        return len(arr)
    # Check base case
    if high >= low:
        mid = (high + low) // 2
        # If element is present at the middle itself
        if arr[mid] == target:
            return mid
        # If element is smaller than mid, then it can only
        # be present in left subarray
        elif arr[mid] > target:
            return binary_search(arr, low, mid - 1, target)
        # Else the element can only be present in right subarray
        else:
            return binary_search(arr, mid + 1, high, target)
    else:
        # Element is not present in the array, return the index where it should've been
        return high + 1
```

2. Given a string `s`, find the length of the longest substring without repeating characters.
Analyze your solution and please provide Space and Time complexities.

**Example 1**
```text
Input: s = "abcabcbb"
Output: 3
Explanation: The answer is "abc", with the length of 3.
```python
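def linear_search(list1: list, target):
    if target < list1[0]:
        return 0
    for i in range(len(list1)):
        # The first element that is >= target marks either the match
        # or the position where the target would be inserted.
        if target == list1[i]:
            return i
        elif target < list1[i]:
            return i
    # The target is larger than every element, so it goes at the end.
    return len(list1)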
```
**Example 2**
```text
Input: s = "bbbbb"
Output: 1
Explanation: The answer is "b", with the length of 1.
## Linux Administration
### 1. List processes listening on ports 80 and 443
```bash
# Match the port exactly so that, for example, :8080 is not reported for :80.
sudo netstat -tnlp | grep -E ':(80|443)[[:space:]]'
```
**Example 3**
```text
Input: s = "pwwkew"
Output: 3
Explanation: The answer is "wke", with the length of 3.
Notice that the answer must be a substring, "pwke" is a subsequence and not a substring.
### 2. List process environment variables by given PID
```bash
cat /proc/[process ID]/environ | tr '\0' '\n'
```
**Example 4**
```text
Input: s = ""
Output: 0
### 3. Launch a python program `my_program.py` through CLI in the background. How would you close it after some period of time?
```bash
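# Run through the interpreter so the script needs no shebang or execute bit.
nohup python3 my_program.py &
ps -ef | grep my_program.py
# kill sends SIGTERM by default; fall back to kill -9 only if the process does not exit.
kill [PID]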
```

3. Given a sorted array of distinct integers and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.

**Example:**
```text
Input: nums = [1,3,5,6], target = 5
Output: 2
```
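
For comparison, Python's standard library already covers this case: for a sorted list of distinct integers, `bisect.bisect_left` returns the index of the target if it is present and the insertion index otherwise. The wrapper name `search_insert` below is only illustrative.

```python
from bisect import bisect_left


def search_insert(nums: list, target: int) -> int:
    """Index of target in sorted `nums`, or the index where it would be inserted."""
    # bisect_left performs a binary search, so this runs in O(log n) time.
    return bisect_left(nums, target)
```

For the example above, `search_insert([1, 3, 5, 6], 5)` returns `2`.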

### Linux Shell
1. List processes listening on ports 80 and 443
2. List process environment variables by given PID
3. Launch a python program `my_program.py` through CLI in the background. How would you close it after some period of time?