feat: parallelization #352

Open
wants to merge 22 commits into base: develop
Commits (22)
2599712 feat: handle parallel pagination for scrape list (RohitR311, Jan 14, 2025)
0c655fc feat: add support to handle parallel pagination (RohitR311, Jan 15, 2025)
641e182 feat: add parallel scraping logic for click next pagination (RohitR311, Jan 15, 2025)
352447b feat: add parallel scraping support for scroll pagination types (RohitR311, Jan 15, 2025)
4ffbcba feat: rm parallelism for paginations except click next (RohitR311, Jan 15, 2025)
4c0eb30 feat: estimate number of items per page (RohitR311, Jan 16, 2025)
bd25950 chore: include all files from src dir (RohitR311, Jan 17, 2025)
8360d3d feat: add worker pool to support parallelism for click next (RohitR311, Jan 17, 2025)
38539d5 feat: add worker pool for parallelization (RohitR311, Jan 17, 2025)
e0b52c1 feat: add worker types (RohitR311, Jan 17, 2025)
b2e8332 feat: rm worker pool logic (RohitR311, Jan 17, 2025)
7d0339a feat: rm worker pool logic (RohitR311, Jan 17, 2025)
c4c77e6 Merge branch 'parallelization' of https://github.com/RohitR311/maxun … (RohitR311, Jan 17, 2025)
5c6f478 feat: add kafka config (RohitR311, Jan 18, 2025)
5becc84 feat: add kafka manager to create topics (RohitR311, Jan 19, 2025)
5e8a6d1 feat: add scraper to scrape data and store in kafka (RohitR311, Jan 19, 2025)
73c2cc3 feat: add kafka util to consume task data and produce messages (RohitR311, Jan 20, 2025)
6984401 feat: add parallel scraping support using kafka (RohitR311, Jan 20, 2025)
7105749 feat: add initial kafka setup script (RohitR311, Jan 20, 2025)
a931a13 feat: add start consumer kafka script (RohitR311, Jan 20, 2025)
5411484 feat: add limit in task config (RohitR311, Jan 20, 2025)
7650794 chore: add kafka services (RohitR311, Jan 20, 2025)
60 changes: 59 additions & 1 deletion docker-compose.yml
@@ -39,6 +39,60 @@ services:
volumes:
- minio_data:/data

zookeeper:
image: confluentinc/cp-zookeeper:7.4.0
environment:
ZOOKEEPER_CLIENT_PORT: ${ZOOKEEPER_PORT:-2181}
ZOOKEEPER_TICK_TIME: 2000
ports:
- "${ZOOKEEPER_PORT:-2181}:2181"
volumes:
- zookeeper_data:/var/lib/zookeeper/data
- zookeeper_log:/var/lib/zookeeper/log
healthcheck:
test: ["CMD-SHELL", "zookeeper-shell.sh localhost:${ZOOKEEPER_PORT:-2181} ls /"]
interval: 10s
timeout: 5s
retries: 5

# Add Kafka
kafka:
image: confluentinc/cp-kafka:7.4.0
depends_on:
- zookeeper
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:${ZOOKEEPER_PORT:-2181}
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:29092
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
ports:
- "${KAFKA_PORT:-9092}:9092"
- "${KAFKA_EXTERNAL_PORT:-29092}:29092"
volumes:
- kafka_data:/var/lib/kafka/data
healthcheck:
test: ["CMD-SHELL", "kafka-topics.sh --bootstrap-server localhost:9092 --list"]
interval: 10s
timeout: 5s
retries: 5

kafka-ui:
image: provectuslabs/kafka-ui:latest
container_name: kafka-ui
ports:
- "${KAFKA_UI_PORT:-9080}:8080"
environment:
KAFKA_CLUSTERS_0_NAME: maxun-cluster
KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:9092
KAFKA_CLUSTERS_0_ZOOKEEPER: zookeeper:2181
depends_on:
- kafka

Comment on lines +84 to +95

🛠️ Refactor suggestion

Pin the kafka-ui image version.

Using the latest tag can lead to unexpected behavior when the image is updated. Pin to a specific version for reproducible builds.

- image: provectuslabs/kafka-ui:latest
+ image: provectuslabs/kafka-ui:v0.7.1

backend:
#build:
#context: .
@@ -63,6 +117,7 @@ services:
- postgres
- redis
- minio
- kafka

⚠️ Potential issue

Add Kafka configuration for backend service.

The backend service now depends on Kafka but lacks the necessary environment variables for connection configuration.

Add these environment variables to the backend service:

  environment:
+   KAFKA_BROKERS: kafka:9092
+   KAFKA_CLIENT_ID: maxun-backend
+   KAFKA_CONSUMER_GROUP_ID: maxun-scraping-group

Committable suggestion skipped: line range outside the PR's diff.
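For context, a minimal sketch of how the backend could read the proposed variables at startup, assuming the kafkajs client is used (the variable names are the ones suggested above, and the exported `kafka` instance is illustrative, not code from this PR):

import { Kafka } from 'kafkajs';

// Resolve connection settings from the environment, falling back to the
// in-network defaults from docker-compose.yml.
const brokers = (process.env.KAFKA_BROKERS ?? 'kafka:9092').split(',');
const clientId = process.env.KAFKA_CLIENT_ID ?? 'maxun-backend';

// One shared client; producers and consumers are created from it on demand.
export const kafka = new Kafka({ clientId, brokers });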

volumes:
- /var/run/dbus:/var/run/dbus

@@ -83,4 +138,7 @@ services:
volumes:
postgres_data:
minio_data:
redis_data:
redis_data:
kafka_data:
zookeeper_data:
zookeeper_log:
10 changes: 10 additions & 0 deletions maxun-core/src/config/kafka.ts
@@ -0,0 +1,10 @@
export const kafkaConfig = {
clientId: 'maxun-scraper',
brokers: ['localhost:29092'],
topics: {
Comment on lines +1 to +4

⚠️ Potential issue

Use environment variables for Kafka configuration.

The broker address and client ID are hardcoded, which limits deployment flexibility and ties the build to a single environment. Consider using environment variables for configuration.

Apply this diff to make the configuration more flexible:

+import { config } from 'dotenv';
+
+config();
+
 export const kafkaConfig = {
-    clientId: 'maxun-scraper',
-    brokers: ['localhost:29092'],
+    clientId: process.env.KAFKA_CLIENT_ID || 'maxun-scraper',
+    brokers: (process.env.KAFKA_BROKERS || 'localhost:29092').split(','),

Also, consider adding SSL/authentication configuration for production environments:

    // Add these properties to kafkaConfig
    ssl: process.env.KAFKA_SSL === 'true',
    sasl: process.env.KAFKA_SASL === 'true' ? {
      mechanism: process.env.KAFKA_SASL_MECHANISM,
      username: process.env.KAFKA_USERNAME,
      password: process.env.KAFKA_PASSWORD,
    } : undefined,
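One caveat if this is adopted: kafkajs restricts `mechanism` to specific values ('plain', 'scram-sha-256', 'scram-sha-512', among others), so the KAFKA_SASL_MECHANISM value is worth validating at startup rather than passing through unchecked.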

SCRAPING_TASKS: 'scraping-tasks',
SCRAPING_RESULTS: 'scraping-results',
SCRAPING_DLQ: 'scraping-dlq'
},
consumerGroup: 'scraping-group'
};
Comment on lines +9 to +10

🛠️ Refactor suggestion

Add environment separation for consumer groups.

The consumer group should be environment-aware to prevent cross-environment message processing.

Apply this diff:

-    consumerGroup: 'scraping-group'
+    consumerGroup: `${process.env.ENV || 'dev'}-scraping-group`
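Since the PR also adds a Kafka manager to create topics (commit 5becc84), a minimal sketch of idempotent topic creation against this config, assuming the kafkajs client (`ensureTopics` and the import path are illustrative, not code from the PR):

import { Kafka } from 'kafkajs';
import { kafkaConfig } from '../config/kafka';

// Create the three topics declared in kafkaConfig if they do not exist yet;
// kafkajs createTopics() silently skips topics that are already present.
export async function ensureTopics(): Promise<void> {
  const kafka = new Kafka({ clientId: kafkaConfig.clientId, brokers: kafkaConfig.brokers });
  const admin = kafka.admin();
  await admin.connect();
  try {
    await admin.createTopics({
      topics: Object.values(kafkaConfig.topics).map((topic) => ({
        topic,
        numPartitions: 4,     // illustrative parallelism, tune per workload
        replicationFactor: 1, // matches the single-broker compose setup
      })),
    });
  } finally {
    await admin.disconnect();
  }
}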

Comment on lines +1 to +10

💡 Codebase verification

Remove unused Kafka configuration

The Kafka configuration appears to be unused in the codebase. There are no imports of this configuration, no Kafka client instantiations, and no other Kafka-related code present. Consider removing this unused configuration to maintain a clean codebase.

🔗 Analysis chain

Verify Kafka configuration in the codebase.

Let's ensure the configuration is properly utilized across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for proper usage of Kafka configuration and potential security issues

# Check for hardcoded Kafka configurations
echo "Checking for hardcoded Kafka configurations..."
rg -g '!*.{json,md}' -i 'localhost:29092'

# Check for proper import usage
echo "Checking kafka config imports..."
rg "from ['|\"].*config/kafka" -A 3

# Check for potential security-sensitive Kafka operations
echo "Checking for direct Kafka operations..."
ast-grep --pattern 'new Kafka($$$)'

Length of output: 451


Script:

#!/bin/bash
# Search for Kafka-related code and configurations

# Check for usage of kafkaConfig
echo "Checking for kafkaConfig usage..."
rg "kafkaConfig"

# Check for usage of topic names
echo "Checking for topic name usage..."
rg "scraping-tasks|scraping-results|scraping-dlq"

# Check for any Kafka-related imports or code
echo "Checking for Kafka-related code..."
rg -i "kafka" -g "!*.{json,md,txt}" -A 3

Length of output: 844
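If the config is meant to be consumed by the scraper and consumer utilities added later in this PR (commits 5e8a6d1 and 73c2cc3), the consuming side might look roughly like this sketch, assuming the kafkajs client (`startTaskConsumer` and `handleTask` are illustrative names, not code from the PR):

import { Kafka } from 'kafkajs';
import { kafkaConfig } from './config/kafka';

// Subscribe to the tasks topic and forward failed tasks to the DLQ topic.
export async function startTaskConsumer(
  handleTask: (task: unknown) => Promise<void>,
): Promise<void> {
  const kafka = new Kafka({ clientId: kafkaConfig.clientId, brokers: kafkaConfig.brokers });
  const consumer = kafka.consumer({ groupId: kafkaConfig.consumerGroup });
  const producer = kafka.producer();

  await Promise.all([consumer.connect(), producer.connect()]);
  await consumer.subscribe({ topics: [kafkaConfig.topics.SCRAPING_TASKS] });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const payload = message.value?.toString() ?? '';
      try {
        await handleTask(JSON.parse(payload));
      } catch {
        // Dead-letter failed tasks instead of blocking the partition.
        await producer.send({
          topic: kafkaConfig.topics.SCRAPING_DLQ,
          messages: [{ value: payload }],
        });
      }
    },
  });
}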
