Merged
6 changes: 6 additions & 0 deletions Dockerfile
@@ -0,0 +1,6 @@
FROM node:18-alpine

RUN mkdir -p /usr/src/app
COPY . /usr/src/app
WORKDIR /usr/src/app/component
ENTRYPOINT [ "node", "/usr/src/app/component/dist/index.js" ]
66 changes: 65 additions & 1 deletion README.md
@@ -1,2 +1,66 @@
# magda-csv-semantic-indexer
A Magda semantic indexer that indexes CSV files.

![Version: 1.0.0-alpha.0](https://img.shields.io/badge/Version-1.0.0--alpha.0-informational?style=flat-square)

A Helm chart for Magda CSV Semantic Indexer

**Homepage:** <https://github.com/magda-io/magda-csv-semantic-indexer>

## Source Code

* <https://github.com/magda-io/magda-csv-semantic-indexer>

## Requirements

Kubernetes: `>= 1.14.0-0`

| Repository | Name | Version |
|------------|------|---------|
| oci://ghcr.io/magda-io/charts | magda-common | 5.2.0 |

## Values

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| defaultAdminUserId | string | `"00000000-0000-4000-8000-000000000000"` | |
| defaultImage.imagePullSecret | bool | `false` | |
| defaultImage.pullPolicy | string | `"IfNotPresent"` | |
| defaultImage.repository | string | `"ghcr.io/magda-io"` | |
| defaultSemanticIndexerConfig.bulkEmbeddingsSize | int | `1` | |
| defaultSemanticIndexerConfig.bulkIndexSize | int | `50` | |
| defaultSemanticIndexerConfig.chunkSizeLimit | int | `512` | |
| defaultSemanticIndexerConfig.id | string | `"csv-semantic-indexer"` | |
| defaultSemanticIndexerConfig.indexName | string | `"semantic-index"` | |
| defaultSemanticIndexerConfig.indexVersion | int | `1` | |
| defaultSemanticIndexerConfig.overlap | int | `50` | |
| embeddingApiURL | string | `"http://magda-embedding-api"` | |
| global | object | `{"image":{},"rollingUpdate":{},"searchEngine":{"defaultDatasetBucket":"magda-datasets","semanticIndexer":{"indexName":null,"indexVersion":null,"knnVectorFieldConfig":{"compressionLevel":"32x","dimension":768,"efConstruction":100,"efSearch":100,"m":16,"mode":"on_disk","spaceType":"l2"},"numberOfReplicas":0,"numberOfShards":1}}}` | only for providing appropriate default value for helm lint |
| global.searchEngine.semanticIndexer.knnVectorFieldConfig.compressionLevel | string | `"32x"` | The compression_level mapping parameter selects a quantization encoder that reduces vector memory consumption by the given factor. |
| global.searchEngine.semanticIndexer.knnVectorFieldConfig.dimension | int | `768` | Dimension of the embedding vectors. |
| global.searchEngine.semanticIndexer.knnVectorFieldConfig.efConstruction | int | `100` | Similar to efSearch but used during index construction. Higher values improve search quality but increase index build time. |
| global.searchEngine.semanticIndexer.knnVectorFieldConfig.efSearch | int | `100` | The size of the candidate queue during search. Larger values may improve search quality but increase search latency. |
| global.searchEngine.semanticIndexer.knnVectorFieldConfig.m | int | `16` | The maximum number of graph edges per vector. Higher values increase memory usage but may improve search quality. |
| global.searchEngine.semanticIndexer.knnVectorFieldConfig.mode | string | `"on_disk"` | Vector workload mode: `on_disk` or `in_memory`. |
| image.name | string | `"magda-csv-semantic-indexer"` | |
| minioConfig.defaultDatasetBucket | string | `""` | |
| minioConfig.endPoint | string | `"magda-minio"` | |
| minioConfig.port | int | `9000` | |
| minioConfig.region | string | `""` | |
| minioConfig.useSSL | bool | `false` | |
| opensearchURL | string | `"http://opensearch:9200"` | |
| port | int | `6305` | Service port configuration |
| resources.limits.cpu | string | `"100m"` | |
| resources.requests.cpu | string | `"50m"` | |
| resources.requests.memory | string | `"200Mi"` | |
| semanticIndexer.bulkEmbeddingsSize | int | `nil` | Number of strings sent to the embedding API in a single request |
| semanticIndexer.bulkIndexSize | int | `nil` | Number of documents sent to OpenSearch for bulk processing in a single request |
| semanticIndexer.chunkSizeLimit | int | `nil` | The maximum number of tokens in a single chunk. |
| semanticIndexer.id | string | `""` | Semantic indexer ID |
| semanticIndexer.indexName | string | `nil` | Index name |
| semanticIndexer.indexVersion | int | `nil` | Index version |
| semanticIndexer.overlap | int | `nil` | The number of overlapping tokens between chunks. |
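
As an illustration of how these keys combine, here is a minimal sketch of an override file. Going by the chart's helper template, a value set under `semanticIndexer` takes precedence. An unset `indexName` or `indexVersion` falls back to `global.searchEngine.semanticIndexer`, and any other unset key falls back to `defaultSemanticIndexerConfig`. The file name `my-values.yaml`, the index name, and the numbers marked as illustrative are assumptions made for this example, not recommended settings.

```yaml
# my-values.yaml (hypothetical file name)
semanticIndexer:
  # Keys left unset here fall back to defaultSemanticIndexerConfig.
  indexName: "csv-semantic-index"   # illustrative name
  chunkSizeLimit: 256               # illustrative; chart default is 512
  overlap: 32                       # illustrative; chart default is 50

minioConfig:
  endPoint: "magda-minio"           # chart default
  port: 9000                        # chart default
  useSSL: false                     # chart default

resources:
  requests:
    cpu: 50m                        # chart default
    memory: 200Mi                   # chart default
  limits:
    cpu: 200m                       # illustrative; chart default is 100m
```

A file like this could be passed with `helm upgrade --install <release> <chart> -f my-values.yaml`. When the chart is pulled in as a dependency of a parent chart, the same keys would sit under a top-level `magda-csv-semantic-indexer:` key in the parent's values instead.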

----------------------------------------------
Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)
12 changes: 12 additions & 0 deletions deploy/magda-csv-semantic-indexer/Chart.yaml
@@ -0,0 +1,12 @@
apiVersion: v2
name: magda-csv-semantic-indexer
version: 1.0.0-alpha.0
kubeVersion: ">= 1.14.0-0"
description: A Helm chart for Magda CSV Semantic Indexer
home: "https://github.com/magda-io/magda-csv-semantic-indexer"
sources:
  - https://github.com/magda-io/magda-csv-semantic-indexer
dependencies:
  - name: magda-common
    version: "5.2.0"
    repository: "oci://ghcr.io/magda-io/charts"
27 changes: 27 additions & 0 deletions deploy/magda-csv-semantic-indexer/templates/_helper.tpl
@@ -0,0 +1,27 @@
{{- define "magda-csv-semantic-indexer.semanticIndexer.values" }}
{{- $semanticIndexer := get .Values "semanticIndexer" | default dict }}
{{- $globalSemanticIndexer := get .Values.global.searchEngine "semanticIndexer" | default dict }}
{{- $defaultConfig := .Values.defaultSemanticIndexerConfig }}

{{- $id := .Values.semanticIndexer.id | default $defaultConfig.id }}
{{- $indexVersion := .Values.semanticIndexer.indexVersion | default (get $globalSemanticIndexer "indexVersion") | default $defaultConfig.indexVersion }}
{{- $actualIndexName := .Values.semanticIndexer.indexName | default (get $globalSemanticIndexer "indexName") | default $defaultConfig.indexName }}
{{- $chunkSizeLimit := .Values.semanticIndexer.chunkSizeLimit | default $defaultConfig.chunkSizeLimit }}
{{- $overlap := .Values.semanticIndexer.overlap | default $defaultConfig.overlap }}
{{- $bulkEmbeddingsSize := .Values.semanticIndexer.bulkEmbeddingsSize | default $defaultConfig.bulkEmbeddingsSize }}
{{- $bulkIndexSize := .Values.semanticIndexer.bulkIndexSize | default $defaultConfig.bulkIndexSize }}

{{- $_ := set $semanticIndexer "id" $id }}
{{- $_ := set $semanticIndexer "numberOfShards" (get $globalSemanticIndexer "numberOfShards") }}
{{- $_ := set $semanticIndexer "numberOfReplicas" (get $globalSemanticIndexer "numberOfReplicas") }}
{{- $_ := set $semanticIndexer "knnVectorFieldConfig" (get $globalSemanticIndexer "knnVectorFieldConfig") }}

{{- $_ := set $semanticIndexer "indexName" $actualIndexName }}
{{- $_ := set $semanticIndexer "indexVersion" $indexVersion }}
{{- $_ := set $semanticIndexer "chunkSizeLimit" $chunkSizeLimit }}
{{- $_ := set $semanticIndexer "overlap" $overlap }}
{{- $_ := set $semanticIndexer "bulkEmbeddingsSize" $bulkEmbeddingsSize }}
{{- $_ := set $semanticIndexer "bulkIndexSize" $bulkIndexSize }}

{{- $semanticIndexer | mustToRawJson }}
{{- end }}
12 changes: 12 additions & 0 deletions deploy/magda-csv-semantic-indexer/templates/configmap.yaml
@@ -0,0 +1,12 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: "{{ .Chart.Name }}-config"
data:
  semantic-indexer.json: {{ (include "magda-csv-semantic-indexer.semanticIndexer.values" .) | quote }}

  {{- $minioConfig := .Values.minioConfig }}
  {{- $finalBucket := (.Values.minioConfig.defaultDatasetBucket | default .Values.global.defaultDatasetBucket | default "magda-datasets") }}
  {{- $_ := set $minioConfig "defaultDatasetBucket" $finalBucket }}
  minio.json: {{ $minioConfig | mustToRawJson | quote }}

88 changes: 88 additions & 0 deletions deploy/magda-csv-semantic-indexer/templates/deployment.yaml
@@ -0,0 +1,88 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: magda-csv-semantic-indexer
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxUnavailable: {{ .Values.global.rollingUpdate.maxUnavailable | default 0 }}
  selector:
    matchLabels:
      service: magda-csv-semantic-indexer
  template:
    metadata:
      labels:
        service: magda-csv-semantic-indexer
    spec:
      {{- include "magda.imagePullSecrets" . | indent 6 }}
      containers:
      - name: magda-csv-semantic-indexer
        image: {{ include "magda.image" . | quote }}
        imagePullPolicy: {{ include "magda.imagePullPolicy" . | quote }}
        command: [
            "node",
            "/usr/src/app/component/dist/index.js",
            "--semanticIndexerConfig", "/etc/config/semantic-indexer.json",
            "--minioConfig", "/etc/config/minio.json",
            "--opensearchApiURL", "{{ .Values.opensearchURL }}",
            "--embeddingApiURL", "{{ .Values.embeddingApiURL }}",
            "--id", "{{ .Values.semanticIndexer.id | default .Values.defaultSemanticIndexerConfig.id }}",
            "--chunkSizeLimit", "{{ .Values.semanticIndexer.chunkSizeLimit | default .Values.defaultSemanticIndexerConfig.chunkSizeLimit }}",
            "--overlap", "{{ .Values.semanticIndexer.overlap | default .Values.defaultSemanticIndexerConfig.overlap }}"
          ]
{{- if .Values.global.enableLivenessProbes }}
        livenessProbe:
          httpGet:
            path: "/healthz"
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 10
{{- end }}
        resources:
{{ toYaml .Values.resources | indent 10 }}
        env:
        - name: NODE_PORT
          value: {{ .Values.port | quote }}
        - name: REGISTRY_URL
          value: "http://registry-api/v0"
        - name: REGISTRY_READ_ONLY_URL
          value: "http://registry-api-read-only/v0"
        - name: ENABLE_MULTI_TENANTS
{{- if .Values.global.enableMultiTenants }}
          value: "true"
{{- else }}
          value: "false"
{{- end }}
        - name: TENANT_URL
          value: "http://tenant-api/v0"
        - name: USER_ID
          value: {{ .Values.global.defaultAdminUserId | default .Values.defaultAdminUserId }}
        - name: INTERNAL_URL
          value: "http://magda-csv-semantic-indexer"
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: auth-secrets
              key: jwt-secret
        - name: MINIO_SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: storage-secrets
              key: secretkey
        - name: MINIO_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: storage-secrets
              key: accesskey
        - name: PORT
          value: "{{ .Values.port }}"
        volumeMounts:
        - name: "{{ .Chart.Name }}-config"
          mountPath: "/etc/config"
          readOnly: true
      volumes:
      - name: "{{ .Chart.Name }}-config"
        configMap:
          name: "{{ .Chart.Name }}-config"
11 changes: 11 additions & 0 deletions deploy/magda-csv-semantic-indexer/templates/service.yaml
@@ -0,0 +1,11 @@
apiVersion: v1
kind: Service
metadata:
  name: "magda-csv-semantic-indexer"
spec:
  ports:
  - name: http
    port: 80
    targetPort: {{ .Values.port }}
  selector:
    service: magda-csv-semantic-indexer
91 changes: 91 additions & 0 deletions deploy/magda-csv-semantic-indexer/values.yaml
@@ -0,0 +1,91 @@
# -- only for providing appropriate default value for helm lint
global:
  image: {}
  rollingUpdate: {}
  searchEngine:
    semanticIndexer:
      indexName:
      indexVersion:
      numberOfShards: 1
      numberOfReplicas: 0
      knnVectorFieldConfig:
        # -- Vector workload mode: `on_disk` or `in_memory`.
        mode: "on_disk"
        # -- Dimension of the embedding vectors.
        dimension: 768
        # -- The compression_level mapping parameter selects a quantization encoder that reduces vector memory consumption by the given factor.
        compressionLevel: 32x
        # Supported values: l1, l2, innerProduct, cosine, linf
        spaceType: "l2"
        # -- Similar to efSearch but used during index construction. Higher values improve search quality but increase index build time.
        efConstruction: 100
        # -- The size of the candidate queue during search. Larger values may improve search quality but increase search latency.
        efSearch: 100
        # -- The maximum number of graph edges per vector. Higher values increase memory usage but may improve search quality.
        m: 16
        # -- FAISS Encoder configuration (If compressionLevel is set, encoder will be ignored).
        # encoder: null
        # name: "sq"
        # type: "fp16"
        # clip: false
    defaultDatasetBucket: "magda-datasets"

opensearchURL: http://opensearch:9200
embeddingApiURL: http://magda-embedding-api

# -- Service port configuration
port: 6305

defaultSemanticIndexerConfig:
  id: "csv-semantic-indexer"
  chunkSizeLimit: 512
  overlap: 50
  indexName: "semantic-index"
  indexVersion: 1
  bulkEmbeddingsSize: 1
  bulkIndexSize: 50

semanticIndexer:
  # -- (string) Semantic indexer ID
  id: ""
  # -- (string) Index name
  indexName:
  # -- (int) Index version
  indexVersion:
  # -- (int) The maximum number of tokens in a single chunk.
  chunkSizeLimit:
  # -- (int) The number of overlapping tokens between chunks.
  overlap:
  # -- (int) Number of strings sent to the embedding API in a single request
  bulkEmbeddingsSize:
  # -- (int) Number of documents sent to OpenSearch for bulk processing in a single request
  bulkIndexSize:

minioConfig:
  endPoint: "magda-minio"
  port: 9000
  region: ""
  useSSL: false
  defaultDatasetBucket: ""

image:
  name: "magda-csv-semantic-indexer"
  # tag:
  # pullPolicy:
  # imagePullSecret:

defaultImage:
  repository: ghcr.io/magda-io
  pullPolicy: IfNotPresent
  imagePullSecret: false

defaultAdminUserId: "00000000-0000-4000-8000-000000000000"

resources:
  requests:
    cpu: 50m
    memory: 200Mi
  limits:
    cpu: 100m
Empty file added deploy/test-deploy.yaml
Empty file.