
[Enhancement] Explicit steps to bring previously-Textracted data #17

Open
@athewsey

Description

In some cases, users may already have run their corpus through Amazon Textract and want to get started with the sample without incurring the cost of re-processing all the documents.

Although nothing in the model training code itself prevents this today, the notebook walkthrough steps often make assumptions about the S3 folder structure. More explicit guidance could greatly reduce the notebook debugging currently required to use pre-Textracted data.

Context

Although the model training itself has a pretty broad interface for accepting JSON-lines manifests like:

{
    "source-ref": "s3://.../.../wherever-your-page-thumbnail-image-is.png",  // images_prefix = "s3://.../..."
    "textract-ref": "s3://.../.../corresponding-textract-result.json", // textract_prefix = "s3://.../..."
    "page-num": 2,  // 1-based number of this page in the textract-ref result
    "labels": { "some-smgt-": "-bbox-compatible-label" },
}

...the notebook sections for preparing/curating the dataset and visualizing results often make more explicit assumptions, like:

  • Textract refs correspond 1:1 with input documents and sit at input-doc-path.pdf/consolidated.json
  • Page thumbnail & full-size image S3 paths are constructed in particular ways from the raw document URIs
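
For illustration only, here's a minimal sketch (not part of the repo) of how a JSON-lines manifest in the format above could be generated from existing Textract outputs. The bucket names, prefixes, thumbnail naming scheme, and helper function below are all hypothetical assumptions: it assumes one consolidated result JSON per source document, with the standard Textract response shape (a top-level DocumentMetadata.Pages count).

import json

import boto3

s3 = boto3.client("s3")

# Hypothetical prefixes - substitute wherever your existing outputs actually live:
images_prefix = "s3://doc-example-bucket/imgs-thumb"
textract_prefix = "s3://doc-example-bucket/textracted"


def list_textract_results(prefix_uri):
    """Yield (s3_uri, bucket, key) for each .json object under an s3:// prefix."""
    bucket, _, prefix = prefix_uri[len("s3://"):].partition("/")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                yield f"s3://{bucket}/{obj['Key']}", bucket, obj["Key"]


with open("manifest.jsonl", "w") as fout:
    for textract_uri, bucket, key in list_textract_results(textract_prefix):
        result = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        n_pages = result["DocumentMetadata"]["Pages"]
        # Derive a document identifier from the result's key (placeholder logic):
        doc_name = key.rsplit("/", 1)[-1].rsplit(".json", 1)[0]
        for page_num in range(1, n_pages + 1):
            fout.write(json.dumps({
                # Assumed thumbnail naming scheme - adjust to your own layout:
                "source-ref": f"{images_prefix}/{doc_name}-{page_num:04d}.png",
                "textract-ref": textract_uri,
                "page-num": page_num,  # 1-based page number within textract-ref
                # "labels" would be filled in by the annotation/consolidation step
            }) + "\n")

The point is just that the manifest interface itself imposes no particular S3 layout: only the path-construction logic (the source-ref line here) needs to change to match wherever the pre-Textracted data already sits.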
