
[Enhancement] Explicit steps to bring previously-Textracted data #17

Open
@athewsey

Description

In some cases, users may already have run their corpus through Amazon Textract and want to get started with the sample without incurring the cost of re-processing all the documents.

Although nothing in the model training code itself prevents this today, the notebook walkthrough steps often make assumptions about the S3 folder structure. More explicit guidance could greatly reduce the notebook debugging currently required to use pre-Textracted data.

Context

Although the model training itself has a pretty broad interface for accepting JSON-lines manifests like:

{
    "source-ref": "s3://.../.../wherever-your-page-thumbnail-image-is.png",  // images_prefix = "s3://.../..."
    "textract-ref": "s3://.../.../corresponding-textract-result.json", // textract_prefix = "s3://.../..."
    "page-num": 2,  // 1-based number of this page in the textract-ref result
    "labels": { "some-smgt-": "-bbox-compatible-label" },
}

...the notebook sections for preparing/curating the dataset and visualizing results often make more explicit assumptions, like:

  • Textract refs correspond 1:1 with input documents and sit at input-doc-path.pdf/consolidated.json
  • Page thumbnail & full-size image S3 paths are constructed in particular ways from the raw document URIs
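
For illustration only, here's a minimal sketch (not part of the repo) of how a JSON-lines manifest in the format above could be generated from existing Textract outputs. The bucket names, prefixes, thumbnail naming scheme, and helper function below are all hypothetical assumptions: it assumes one consolidated result JSON per source document, with the standard Textract response shape (a top-level DocumentMetadata.Pages count).

import json

import boto3

s3 = boto3.client("s3")

# Hypothetical prefixes - substitute wherever your existing outputs actually live:
images_prefix = "s3://doc-example-bucket/imgs-thumb"
textract_prefix = "s3://doc-example-bucket/textracted"


def list_textract_results(prefix_uri):
    """Yield (s3_uri, bucket, key) for each .json object under an s3:// prefix."""
    bucket, _, prefix = prefix_uri[len("s3://"):].partition("/")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                yield f"s3://{bucket}/{obj['Key']}", bucket, obj["Key"]


with open("manifest.jsonl", "w") as fout:
    for textract_uri, bucket, key in list_textract_results(textract_prefix):
        result = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        n_pages = result["DocumentMetadata"]["Pages"]
        # Derive a document identifier from the result's key (placeholder logic):
        doc_name = key.rsplit("/", 1)[-1].rsplit(".json", 1)[0]
        for page_num in range(1, n_pages + 1):
            fout.write(json.dumps({
                # Assumed thumbnail naming scheme - adjust to your own layout:
                "source-ref": f"{images_prefix}/{doc_name}-{page_num:04d}.png",
                "textract-ref": textract_uri,
                "page-num": page_num,  # 1-based page number within textract-ref
                # "labels" would be filled in by the annotation/consolidation step
            }) + "\n")

The point is just that the manifest interface itself imposes no particular S3 layout: only the path-construction logic (the source-ref line here) needs to change to match wherever the pre-Textracted data already sits.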
