Support jsonl file extension in S3 sink #5993

@Zhangxunmt

Is your feature request related to a problem? Please describe.
When using Data Prepper to prepare input files for machine learning batch jobs, many AWS AI services, such as Bedrock and SageMaker, require the input files to be in JSON Lines format with a .jsonl extension. However, Data Prepper's S3 sink currently supports only the .ndjson extension for JSON Lines output, so an additional file-renaming step is needed. Adding a new codec option to the S3 sink that writes the same format with a .jsonl extension would improve compatibility with AWS AI/ML services. Reference: https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference-data.html

Currently, Data Prepper offers only the .ndjson extension for JSON Lines output. We should add a new codec that saves files in the same format but with a .jsonl extension. See https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sinks/s3/#codec

Describe the solution you'd like
Allow the following codec in the S3 sink configuration, which saves the file with a .jsonl extension.

sink:
  - s3:
      codec:
        jsonl: {}
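
For context, a fuller sink definition using the requested codec might look like the sketch below. It assumes the proposed `jsonl` codec takes no required options, the same way it is written above. The bucket name, region, path prefix, and threshold values are placeholders chosen for illustration, not values from this issue.

```yaml
# Sketch only: `jsonl` is the codec proposed in this issue, not an existing option.
# Bucket name, region, path prefix, and thresholds are placeholder values.
sink:
  - s3:
      aws:
        region: us-east-1
      bucket: my-batch-input-bucket
      object_key:
        path_prefix: bedrock-batch/
      threshold:
        event_count: 1000
        maximum_size: 50mb
        event_collect_timeout: 60s
      codec:
        jsonl: {}
```

With a configuration like this, the expectation is that objects written under the prefix would end in .jsonl rather than .ndjson, with no change to their contents.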

Additional context
#5509
