DocAI Form Parser microservice #12

anuradha-bajpai-google · 2024-08-07T16:52:25Z

DocAI Form Parser microservice with Cloud Run

- Re-sync'd the development constraints (shared) based on the form parser requirements.in
- Moved requirements.txt to requirements.in for form parser
- Updated tasks.py to also generate requirements.txt from requirements.in
- Reformatted terraform from pre-commit

* DocAI Form Parser microservice (#12) * DocAI form parser processor integration * form processor build conatiner image script * DocAI form parser code integration * DocAI Form Parser fixes * Changes: - Re-sync'd the development constraints (shared) based on the form parser requirements.in - Moved requirements.txt to requirements.in for form parser - Updated tasks.py to also generate requirements.txt from requirements.in - Reformatted terraform from pre-commit --------- Co-authored-by: Mark Scannell <[email protected]> * down stream tasks only depends on the supported files are moved but will wait for pdf form processor to finish (#13) * made downstream tasks only depends on files move but wait for pdf forms files moved * ignore pylint import errors * form-parser-metadata-load-bigquery * Composer task to trigger Doc AI Form Parser Cloud Run- first version * form-parser-metadata-load-bigquery (#18) * form-parser-metadata-load-bigquery * fixes in form parser * Updated README.md * Updated DPU to EKS and user agent string and label for revenue tracking * skip pre-commit * removing pre-commit check for terraform fmt --------- Co-authored-by: Dharmesh Patel <[email protected]> * Updated Ref. Arch. diagram and added DATAFLOW.md * Changed labels to eks-solution * Composer task to trigger DocAI Form Parser and metadata update for Form parser * Updated labels for tracking (#19) * DocAI form API microservice trigger from Cloud Composer * fix in form parser * Fixed type issue from assigning a `str | None` type to `str` type when reading environment variables. This is done by calling the `os.environ[]` instead of `os.environ.get()` method. This will fail fast if the environment variable does not exist. * batch deletion based on batch-id (#16) * location parameter and batch-id based deletion * Updated the README.md for batch delete. 
* updated the delete_doc.sh script --------- Co-authored-by: Dharmesh Patel <[email protected]> * Parallelized form parsing and docs parsing, including importing to the data store. (#21) Co-authored-by: Eyal Ben Ivri <[email protected]> * Add docx support (#24) * Fix bug where flow fails if only unsupported files exists in the input bucket. * Fix bug where flow tries to trigger forms parsing job even if no forms were detected * Added docx support in doc-processing job and the dag --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * Fix move files bug with multiple file types (#23) * Fix bug where flow fails if only unsupported files exists in the input bucket. * Fix bug where flow tries to trigger forms parsing job even if no forms were detected --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * Deletion script fix (#22) * correct gcs folder deletion and bq table deletion * adapt gcs rm logic with form parser output format * Documentation changes for release 1.2. * Minor updates to the README.md * Minor change to README.md --------- Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]>
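
The environment-variable fix noted above (using `os.environ[]` instead of `os.environ.get()`) can be illustrated with a minimal sketch. The helper names and the variable name `EXAMPLE_SETTING` are hypothetical and are not taken from the repository; the sketch only shows the difference in return type and failure behavior between the two calls.

```python
from __future__ import annotations

import os


def read_required_env(name: str) -> str:
    """Return a required environment variable (hypothetical helper).

    os.environ[name] is typed as str and raises KeyError when the
    variable is missing, so a misconfigured job fails immediately.
    """
    return os.environ[name]


def read_optional_env(name: str) -> str | None:
    """Return the variable or None when unset (hypothetical helper).

    os.environ.get(name) is typed as str | None, which is what a type
    checker flags when the result is assigned to a plain str.
    """
    return os.environ.get(name)


if __name__ == "__main__":
    # EXAMPLE_SETTING is an illustrative name, not a real project variable.
    os.environ["EXAMPLE_SETTING"] = "demo"
    print(read_required_env("EXAMPLE_SETTING"))
    del os.environ["EXAMPLE_SETTING"]
    print(read_optional_env("EXAMPLE_SETTING"))
```

With the subscript form, a missing variable surfaces as a `KeyError` at the point of the read rather than as a `None` value that fails later in unrelated code.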

* DocAI Form Parser microservice (#12) * DocAI form parser processor integration * form processor build conatiner image script * DocAI form parser code integration * DocAI Form Parser fixes * Changes: - Re-sync'd the development constraints (shared) based on the form parser requirements.in - Moved requirements.txt to requirements.in for form parser - Updated tasks.py to also generate requirements.txt from requirements.in - Reformatted terraform from pre-commit --------- Co-authored-by: Mark Scannell <[email protected]> * down stream tasks only depends on the supported files are moved but will wait for pdf form processor to finish (#13) * made downstream tasks only depends on files move but wait for pdf forms files moved * ignore pylint import errors * form-parser-metadata-load-bigquery * Composer task to trigger Doc AI Form Parser Cloud Run- first version * form-parser-metadata-load-bigquery (#18) * form-parser-metadata-load-bigquery * fixes in form parser * Updated README.md * Updated DPU to EKS and user agent string and label for revenue tracking * skip pre-commit * removing pre-commit check for terraform fmt --------- Co-authored-by: Dharmesh Patel <[email protected]> * Updated Ref. Arch. diagram and added DATAFLOW.md * Changed labels to eks-solution * Composer task to trigger DocAI Form Parser and metadata update for Form parser * Updated labels for tracking (#19) * DocAI form API microservice trigger from Cloud Composer * fix in form parser * Fixed type issue from assigning a `str | None` type to `str` type when reading environment variables. This is done by calling the `os.environ[]` instead of `os.environ.get()` method. This will fail fast if the environment variable does not exist. * batch deletion based on batch-id (#16) * location parameter and batch-id based deletion * Updated the README.md for batch delete. 
* updated the delete_doc.sh script --------- Co-authored-by: Dharmesh Patel <[email protected]> * Parallelized form parsing and docs parsing, including importing to the data store. (#21) Co-authored-by: Eyal Ben Ivri <[email protected]> * refactored many of the operations in the DAG to a utils package, to reduce complixty and code in the DAG file, and move logic to other files, where the logic is seperated from the airflow runtime. * reordered dag steps and dependencies to optimize runtime * down stream tasks only depends on the supported files are moved but will wait for pdf form processor to finish (#13) * made downstream tasks only depends on files move but wait for pdf forms files moved * ignore pylint import errors * Composer task to trigger Doc AI Form Parser Cloud Run- first version * Composer task to trigger DocAI Form Parser and metadata update for Form parser * DocAI form API microservice trigger from Cloud Composer * form-parser-metadata-load-bigquery (#18) * form-parser-metadata-load-bigquery * fixes in form parser * Updated README.md * Updated DPU to EKS and user agent string and label for revenue tracking * skip pre-commit * removing pre-commit check for terraform fmt --------- Co-authored-by: Dharmesh Patel <[email protected]> * Updated labels for tracking (#19) * Fixed type issue from assigning a `str | None` type to `str` type when reading environment variables. This is done by calling the `os.environ[]` instead of `os.environ.get()` method. This will fail fast if the environment variable does not exist. * Parallelized form parsing and docs parsing, including importing to the data store. (#21) Co-authored-by: Eyal Ben Ivri <[email protected]> * refactored many of the operations in the DAG to a utils package, to reduce complixty and code in the DAG file, and move logic to other files, where the logic is seperated from the airflow runtime. 
* reordered dag steps and dependencies to optimize runtime * removed commented out step * added license information to new files. * added license information to new files. * Copy all files in `src` folder to `dags` folder in GCS --------- Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]>

* DocAI form parser processor integration
* form processor build container image script
* DocAI form parser code integration
* DocAI Form Parser fixes
* Changes:
  - Re-sync'd the development constraints (shared) based on the form parser requirements.in
  - Moved requirements.txt to requirements.in for form parser
  - Updated tasks.py to also generate requirements.txt from requirements.in
  - Reformatted terraform from pre-commit

---------

Co-authored-by: Mark Scannell <[email protected]>

* Minor changes to README.md (#29) * Rearrange dag flow (#27) * DocAI Form Parser microservice (#12) * DocAI form parser processor integration * form processor build conatiner image script * DocAI form parser code integration * DocAI Form Parser fixes * Changes: - Re-sync'd the development constraints (shared) based on the form parser requirements.in - Moved requirements.txt to requirements.in for form parser - Updated tasks.py to also generate requirements.txt from requirements.in - Reformatted terraform from pre-commit --------- Co-authored-by: Mark Scannell <[email protected]> * down stream tasks only depends on the supported files are moved but will wait for pdf form processor to finish (#13) * made downstream tasks only depends on files move but wait for pdf forms files moved * ignore pylint import errors * form-parser-metadata-load-bigquery * Composer task to trigger Doc AI Form Parser Cloud Run- first version * form-parser-metadata-load-bigquery (#18) * form-parser-metadata-load-bigquery * fixes in form parser * Updated README.md * Updated DPU to EKS and user agent string and label for revenue tracking * skip pre-commit * removing pre-commit check for terraform fmt --------- Co-authored-by: Dharmesh Patel <[email protected]> * Updated Ref. Arch. diagram and added DATAFLOW.md * Changed labels to eks-solution * Composer task to trigger DocAI Form Parser and metadata update for Form parser * Updated labels for tracking (#19) * DocAI form API microservice trigger from Cloud Composer * fix in form parser * Fixed type issue from assigning a `str | None` type to `str` type when reading environment variables. This is done by calling the `os.environ[]` instead of `os.environ.get()` method. This will fail fast if the environment variable does not exist. * batch deletion based on batch-id (#16) * location parameter and batch-id based deletion * Updated the README.md for batch delete. 
* updated the delete_doc.sh script --------- Co-authored-by: Dharmesh Patel <[email protected]> * Parallelized form parsing and docs parsing, including importing to the data store. (#21) Co-authored-by: Eyal Ben Ivri <[email protected]> * refactored many of the operations in the DAG to a utils package, to reduce complixty and code in the DAG file, and move logic to other files, where the logic is seperated from the airflow runtime. * reordered dag steps and dependencies to optimize runtime * down stream tasks only depends on the supported files are moved but will wait for pdf form processor to finish (#13) * made downstream tasks only depends on files move but wait for pdf forms files moved * ignore pylint import errors * Composer task to trigger Doc AI Form Parser Cloud Run- first version * Composer task to trigger DocAI Form Parser and metadata update for Form parser * DocAI form API microservice trigger from Cloud Composer * form-parser-metadata-load-bigquery (#18) * form-parser-metadata-load-bigquery * fixes in form parser * Updated README.md * Updated DPU to EKS and user agent string and label for revenue tracking * skip pre-commit * removing pre-commit check for terraform fmt --------- Co-authored-by: Dharmesh Patel <[email protected]> * Updated labels for tracking (#19) * Fixed type issue from assigning a `str | None` type to `str` type when reading environment variables. This is done by calling the `os.environ[]` instead of `os.environ.get()` method. This will fail fast if the environment variable does not exist. * Parallelized form parsing and docs parsing, including importing to the data store. (#21) Co-authored-by: Eyal Ben Ivri <[email protected]> * refactored many of the operations in the DAG to a utils package, to reduce complixty and code in the DAG file, and move logic to other files, where the logic is seperated from the airflow runtime. 
* reordered dag steps and dependencies to optimize runtime * removed commented out step * added license information to new files. * added license information to new files. * Copy all files in `src` folder to `dags` folder in GCS --------- Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> * WIP - initial setup to create a cloud run job * Minor changes to README.md (#29) * Rearrange dag flow (#27) * DocAI Form Parser microservice (#12) * DocAI form parser processor integration * form processor build conatiner image script * DocAI form parser code integration * DocAI Form Parser fixes * Changes: - Re-sync'd the development constraints (shared) based on the form parser requirements.in - Moved requirements.txt to requirements.in for form parser - Updated tasks.py to also generate requirements.txt from requirements.in - Reformatted terraform from pre-commit --------- Co-authored-by: Mark Scannell <[email protected]> * down stream tasks only depends on the supported files are moved but will wait for pdf form processor to finish (#13) * made downstream tasks only depends on files move but wait for pdf forms files moved * ignore pylint import errors * form-parser-metadata-load-bigquery * Composer task to trigger Doc AI Form Parser Cloud Run- first version * form-parser-metadata-load-bigquery (#18) * form-parser-metadata-load-bigquery * fixes in form parser * Updated README.md * Updated DPU to EKS and user agent string and label for revenue tracking * skip pre-commit * removing pre-commit check for terraform fmt --------- Co-authored-by: Dharmesh Patel <[email protected]> * Updated Ref. Arch. 
diagram and added DATAFLOW.md * Changed labels to eks-solution * Composer task to trigger DocAI Form Parser and metadata update for Form parser * Updated labels for tracking (#19) * DocAI form API microservice trigger from Cloud Composer * fix in form parser * Fixed type issue from assigning a `str | None` type to `str` type when reading environment variables. This is done by calling the `os.environ[]` instead of `os.environ.get()` method. This will fail fast if the environment variable does not exist. * batch deletion based on batch-id (#16) * location parameter and batch-id based deletion * Updated the README.md for batch delete. * updated the delete_doc.sh script --------- Co-authored-by: Dharmesh Patel <[email protected]> * Parallelized form parsing and docs parsing, including importing to the data store. (#21) Co-authored-by: Eyal Ben Ivri <[email protected]> * refactored many of the operations in the DAG to a utils package, to reduce complixty and code in the DAG file, and move logic to other files, where the logic is seperated from the airflow runtime. 
* reordered dag steps and dependencies to optimize runtime * down stream tasks only depends on the supported files are moved but will wait for pdf form processor to finish (#13) * made downstream tasks only depends on files move but wait for pdf forms files moved * ignore pylint import errors * Composer task to trigger Doc AI Form Parser Cloud Run- first version * Composer task to trigger DocAI Form Parser and metadata update for Form parser * DocAI form API microservice trigger from Cloud Composer * form-parser-metadata-load-bigquery (#18) * form-parser-metadata-load-bigquery * fixes in form parser * Updated README.md * Updated DPU to EKS and user agent string and label for revenue tracking * skip pre-commit * removing pre-commit check for terraform fmt --------- Co-authored-by: Dharmesh Patel <[email protected]> * Updated labels for tracking (#19) * Fixed type issue from assigning a `str | None` type to `str` type when reading environment variables. This is done by calling the `os.environ[]` instead of `os.environ.get()` method. This will fail fast if the environment variable does not exist. * Parallelized form parsing and docs parsing, including importing to the data store. (#21) Co-authored-by: Eyal Ben Ivri <[email protected]> * refactored many of the operations in the DAG to a utils package, to reduce complixty and code in the DAG file, and move logic to other files, where the logic is seperated from the airflow runtime. * reordered dag steps and dependencies to optimize runtime * removed commented out step * added license information to new files. * added license information to new files. 
* Copy all files in `src` folder to `dags` folder in GCS --------- Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> * improve deployment guidance and scripts for minimum IAM roles * enable containerscanning.googleapis.com to automate vuln scans with images are pushed to AR repo * configure oauth consent screen as part of bootstrap script * remove terraform steps to create google_iap_brand * cleanup guidance on how to use the script and input variables * remove boolean deploy_ui from terraform and its references, now that this step has been moved to the setup script it's not useful * cleanup: remove additional references to unneeded TF variables, rewrite some descriptive text for clarity * add custom role to IAP for service, tweak logic of checking creation of iAP resources * Fixed substitution issue in custom role, added additional App Engine roles needed for Web UI deployment * fix missing PROJECT_ID flag on service account creation step, and logic to validate that a service account is defined * fix edge case errors where user environment has a leftover configuration to use different billing quota project * WIP: 1. Seperated cloud build service account from the cloud run job tf, to the common infrastructure 2. renamed multiple tf variables in different modules to refer to specific cloud run jobs, instead of a generic one - the more cloud run jobs we have, the more confusing this gets 3. * WIP: 1. Seperated cloud build service account from the cloud run job tf, to the common infrastructure 2. 
renamed multiple tf variables in different modules to refer to specific cloud run jobs, instead of a generic one - the more cloud run jobs we have, the more confusing this gets 3. updated the classify docs CRJ with the logic moved from the composer files to call the docai batch api - results will still have to be parsed in composer, as CRJs have no "return/response" value, they can only write outputs to GCS, and have some other part read those to interpret the output. 4. * added role for service account to call batch process; fixed issues with Dockerfile and terraform; adapted parsing logic and moved copy of files to the parsing part - this is done since: 1. GCSToGCSOperator reports success even for missing objects, which makes debugging harder, and 2. it has no logs... * forgot some cleanup
* pyright linting * fix for dag deployment * Updated Web UI to highlight references in the search results. UI enhancments * re-added docx default configuration * fix pre-commit issue * Added new scripts for triggering the workflow in Composer and another script to find document from the Agent Builder Datastore (#62) Co-authored-by: Charlie Wang <[email protected]> * fix(setup): improve IAM setup and scripts to minimize user error (#66) * Add missing Cloud Asset Viewer role * refactor: sa creation and service account token creator to be created as part of setup script * refactor logic: make SERVICE_ACCOUNT_ID optional. use default name "deployer" for service_account_id if not set * Remove DOC_AI_REGION, DOC_AI_PROCESSOR_ID from manual setup steps, they are not used by setup scripts. It might be used by trigger_workflow.sh, but that script has logic to check and set those variables (among others) if not already set, so I want to simplify the deployment steps as much as possible * better error handling for unset variables, avoid a scary-looking (but non-blocking) error about unable to find service account, and minimize some noisy output from IAM commands * fix typos in readme * Added feature to move unsupported files to reject bucket (#65) * - Added XCom Rejected Files to the files processing step, with short circuit to not try to move files if no rejected files found - Divided flow into flowgroups to allow for better semanitc seperation of steps - this forces changes in the XCom Pull operations since this changes the task ids, with the taskgroup being the prefix for each task - added dag configuration to parse jinja templates as objects, so we can more easily pull from xcom dict and lists and push them directly to tasks, avoiding some pre-processing or dynamic task mappings, which causes overhead and code confusion (most notably when providing overrides dict to CRJ, but that used to be implemented as an `expand_kwargs` to the CRJ operator. 
* fix pre-commit issue * Added new scripts for triggering the workflow in Composer and another script to find document from the Agent Builder Datastore (#62) Co-authored-by: Charlie Wang <[email protected]> * Fixed some pyright linting * small fix to the trigger_workflow.sh to match configurations * fix(setup): improve IAM setup and scripts to minimize user error (#66) * Add missing Cloud Asset Viewer role * refactor: sa creation and service account token creator to be created as part of setup script * refactor logic: make SERVICE_ACCOUNT_ID optional. use default name "deployer" for service_account_id if not set * Remove DOC_AI_REGION, DOC_AI_PROCESSOR_ID from manual setup steps, they are not used by setup scripts. It might be used by trigger_workflow.sh, but that script has logic to check and set those variables (among others) if not already set, so I want to simplify the deployment steps as much as possible * better error handling for unset variables, avoid a scary-looking (but non-blocking) error about unable to find service account, and minimize some noisy output from IAM commands * fix typos in readme --------- Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: eeaton <[email protected]> * chore: fix syntax error * chore: improve error handling, don't output a scary looking error message when the setup script is wai * Processing refactor (#64) * move cloud_run deployment to terraform folder * reference the terraform folder * fix relative path * update requirements * remove cloud_run deployment * fix(setup): improve IAM setup and scripts to minimize user error (#66) * Add missing Cloud Asset Viewer role * refactor: sa creation and service account token creator to be created as part of setup script * refactor logic: make SERVICE_ACCOUNT_ID optional. 
use default name "deployer" for service_account_id if not set * Remove DOC_AI_REGION, DOC_AI_PROCESSOR_ID from manual setup steps, they are not used by setup scripts. It might be used by trigger_workflow.sh, but that script has logic to check and set those variables (among others) if not already set, so I want to simplify the deployment steps as much as possible * better error handling for unset variables, avoid a scary-looking (but non-blocking) error about unable to find service account, and minimize some noisy output from IAM commands * fix typos in readme * addressing review comments * remove check on none existing files * ignore generated build config --------- Co-authored-by: eeaton <[email protected]> * feat(docs): introduce the Enterprise Foundation Blueprint for an assumed set of capabilities that a customer has in place before deploying a Solution (#67) * Cleanup and reorginize classifer crj (#70) * Simplified as much as possible the doc-classifier setup and structure. * chore: improve error handling, don't output a scary looking error message when the setup script is wai * Processing refactor (#64) * move cloud_run deployment to terraform folder * reference the terraform folder * fix relative path * update requirements * remove cloud_run deployment * fix(setup): improve IAM setup and scripts to minimize user error (#66) * Add missing Cloud Asset Viewer role * refactor: sa creation and service account token creator to be created as part of setup script * refactor logic: make SERVICE_ACCOUNT_ID optional. use default name "deployer" for service_account_id if not set * Remove DOC_AI_REGION, DOC_AI_PROCESSOR_ID from manual setup steps, they are not used by setup scripts. 
It might be used by trigger_workflow.sh, but that script has logic to check and set those variables (among others) if not already set, so I want to simplify the deployment steps as much as possible * better error handling for unset variables, avoid a scary-looking (but non-blocking) error about unable to find service account, and minimize some noisy output from IAM commands * fix typos in readme * addressing review comments * remove check on none existing files * ignore generated build config --------- Co-authored-by: eeaton <[email protected]> * - Removed documentai permission (as it is not needed anymore). - fixed file detection issue in build.tf --------- Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: Elliot <[email protected]> Co-authored-by: Charlie Wang <[email protected]> * Cleaned up config managment so that file supported is configured once in the DAG, and passed to the processing job (#73) * - In the Processing CRJ, supported files is now given as an argument and are mapped to specific operations in the processing-msg job - Argument is given as a CLI argument, and is required - Special argparse.Action was implemented to parse key:value pairs into a dict - DAG was adapted to pass in argument to the CRJ Important Note: This means the permitted values in the DAG config are the same structure, but support different values, that are actually being passed ot the CRJ. 
* adapted trigger_workflow.sh script to match new configuration values * added missing new parameter to call --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * terraform provisioning of tf-remote-state resources (#75) * ci: enable linting, formatting, deps, devcontainer (#71) * ci: enable linting, formatting, deps, devcontainer - Configure Super-linter CI jobs - Add a script to run linting and fixing locally - Configure a devcontainer using Super-linter as a base * non-existent "run-pre-commit.sh" file * fixes to pass checks locally on the following linters: BASH_EXEC, ENV, GITLEAKS, JSON, JSON_PRETTIER, MARKDOWN, MARKDOWN_PRETTIER, NATURAL_LANGUAGE, RENOVATE, SHELL_SHFMT, YAML_PRETTIER * remove ./external_modules/* generated by checkov tests and update gitignore * fixes for CHECKOV * fixes for .gitignore, TERRAFORM_TERRASCAN, BASH, ENV * fixes for TFLINT, CHECKOVconfig * fixes for PYTHON_FLAKE8, PYTHON_RUFF, PYTHON_MYPY * More local fixes, and standardize on the set to include in default superlinter environment * all tests passing locally * Add the FIX_MODE arguments that work reliably * additional fixes to match EDITORCONFIG * tweak CI configuration for running Lint job * fix new yaml issues introduced by last change * Checkov suddenly triggered CKV_GCP_62 for the first time, not sure why it never triggered before but we don't want it. 
(Legacy bucket access logs) * minor corrections after testing functional deployment * re-align tf variable names to match after lint fixes * address comments and validate remaining lint tests --------- Co-authored-by: Elliot <[email protected]> * WebUI provisioning on CloudRun (#72) * WebUI provisioning on CloudRun * fix policy settings * fix policy settings * remove app engine * remove app engine * remove app engine * add dependency to the cloud build * tf format * add permission iap access permission to domain users * to set * Output DNS record for WebUI * change domain reference * update dependency and documentation * fix merge issue * fix lint issues * adapting to the changed variable type * fix MARKDOWN_Prettier * fix TERRASCAN issues * disable TERRAFORM_TERRASCAN for now * fix cloud build issues --------- Co-authored-by: Elliot <[email protected]> * fix: remove setup steps to loosen orgpolicies (#79) * WebUI provisioning on CloudRun * fix policy settings * fix policy settings * remove app engine * remove app engine * remove app engine * add dependency to the cloud build * tf format * add permission iap access permission to domain users * to set * Output DNS record for WebUI * change domain reference * update dependency and documentation * fix merge issue * fix lint issues * adapting to the changed variable type * fix MARKDOWN_Prettier * fix TERRASCAN issues * disable TERRAFORM_TERRASCAN for now * fix: now that App Engine dependency is removed, remove setup steps related to loosening the org policy --------- Co-authored-by: Charlie Wang <[email protected]> * revert format changes on pyproject.toml introduced by lint that cause issues when running invoke.sh * ci: Add conventional commit checks to GitHub Actions (#82) * Add conventional commit checks to GitHub Actions * Fix lint issues * set `add_label` to false * WIP - not working added alloydb instance, according to 
https://github.com/GoogleCloudPlatform/terraform-google-alloy-db/blob/main/examples/simple_example/main.tf currently still trying to decipher the vpc private ip allocation, which doesn't work for me (https://cloud.google.com/alloydb/docs/project-enable-access#gcloud), and the authentication issue (if it's required, and if so, how to manage it) * Looks like we don't need to setup authentication in alloydb instance creation - let's see if the IAM permissions for the using components would be enough. Removed for now the variables. * Cleaned up config managment so that file supported is configured once in the DAG, and passed to the processing job (#73) * - In the Processing CRJ, supported files is now given as an argument and are mapped to specific operations in the processing-msg job - Argument is given as a CLI argument, and is required - Special argparse.Action was implemented to parse key:value pairs into a dict - DAG was adapted to pass in argument to the CRJ Important Note: This means the permitted values in the DAG config are the same structure, but support different values, that are actually being passed ot the CRJ. * adapted trigger_workflow.sh script to match new configuration values * added missing new parameter to call --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * Updated alloydb deployment Added user for the form parser job changed documentation of the DAG in README.md + a few small typos. * Added a local file to .gitignore * linting issues. * fixes to pass lint tests locally * pin tf module for alloydb with a commit hash to pass CKV_TF_1 * linting issues. * linting issues. * linting issues. 
* fix: Webui build change (#83) * change the webui container build process * add gcloud ignore file * more dir to ignore by the build * lint fix * lint fix * lint fix * lint fix * review comments * fix docker build with using venv for specific user access * revert to old method signature * correct location for requirements.txt * fix lint issues * update deployer service account to include roles for AlloyDB * fix: Better folder structure for rejected files (#87) * fix import issue * linter * processor running out of memory * copy relative folder structure for rejected files * linter * default to agent builder processor for excel files * docs: add CONTRIBUTING.md (#80) * draft CONTRIBUTING.md * regenerate the requirements.txt files using relative path references so that we don't show the filesystem of the individual dev who ran the command. `./invoke.sh lock` (no upgrade) * fix lint issues * Fix another lint issue * Fixed merge conflicts in README.md * Fixed formatting in README.md * fix: revert breaking changes from merging main into 1.3 (#92) * revert a few files to last good 1.3 pr state, and manually unpick the deliberate changes. All tests passing locally * remove file that was deliberately deleted in 1.3 but revived by the merge * fix merge issues * linting * Added a new script to empty the Agent Builder datastore in a given GCP project. * minor changes to the reset_datastore.sh script. 
* fix lint issues on new feat * one file that should be deleted --------- Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> * feat: added bq table for form values (#98) * added bq table for form values * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * refactor: Align to python standards (#91) * added requirements.in to the doc-classifier job, and added it to the requirements_all.in fixed paths in pyproject.toml (root) fixed an issue in the build.tf * Linting * re-added the user-agent * fix: revert breaking changes from merging main into 1.3 (#92) * revert a few files to last good 1.3 pr state, and manually unpick the deliberate changes. All tests passing locally * remove file that was deliberately deleted in 1.3 but revived by the merge * fix merge issues * linting * Added a new script to empty the Agent Builder datastore in a given GCP project. * minor changes to the reset_datastore.sh script. * fix lint issues on new feat * one file that should be deleted --------- Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> * feat: added bq table for form values (#98) * added bq table for form values * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * Linting * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: eeaton <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> * feat: Data store layout parser (#102) * add file type specific configuration for data store with layout parser * linter * make layout parser the default parser * ci: e2e deployment integration test (#109) (#118) * CI: integration tests (#61) * squashed commit from forked testing environment. 
Cloud Build e2e test runs successfully, but some LINT and cleanup issues still to address. --------- * misc cleanup leftover from debug and testing * extend the wait_for_alloydb_ready_state timer from 300s to 600s. 300s usually works, but last run did have a flaky error about waiting for instance primary state * Improve image for ci: install gcloud and terratest on a single container, use that container image for each step of e2e test, remove extra steps to download gcloud library within the terraform-google-gcloud module, fix the debug disabling of tf dependencies in pre_tf_setup * fix syntax for referencing image digest * rebuild image with recent gcloud. Some commands that we expect were not available on the older version. * fix lint issues for CHECKOV and DOCKERFILE_HADOLINT for the builder image * working around lint issues for CKV_DOCKER_3. The builder image is expected to run as a highly privileged account, so disable scanning the image and hitting issues with CKV_DOCKER_3 * Narrow the checkov scope to just terraform * cleanup unused parts of go files * replace hardcoded value "deployer@" for cloud_build_service_account_email, and ensure it is used consistently across modules * cleanup common.sh. 1: "yes | gcloud auth" has some different behavior across Cloud Shell vs local terminal, so it's safer to prompt the user and get a consistent result. 2: update logic to not change ADC if a service account is already in use, but give a warning to set ADC separately * Specifying a regional bucket for build logs in previous commit means that the Build service account needs permission to write to that bucket. 
Errors like "(gcloud.builds.submit) FAILED_PRECONDITION: invalid bucket 31272142496-us-central1-cloudbuild-logs; service account [email protected] does not have access to the bucket" Product docs state that Storage Admin is necessary https://cloud.google.com/build/docs/securing-builds/store-manage-build-logs#store_build_logs_in_a_user-owned_and_regionalized_bucket * Identified the missing role as Storage Admin, not Storage Object Admin. Unclear why only gcloud_build_processing fails with an IAM error but other 2 gcloud modules are successful. (My best guess is because cloud build tries to set labels on the bucket when using the pack command) * chore: remove build_container_image.sh (#124) * initial commit: remove build_container_image.sh and refactor to create it in terraform, similar to other instances of module "gcloud-*" * lint issues, missing variable definition in module * Identified the root cause: tf plan fails with the gcloud submodule, if the calling module has a depends_on block. So I can fix this issue, but now I need to find a workaround to wait for AlloyDb ready state, which is not directly tied to completing resource creation in terraform. * workaround to force an explicit dependency between modules without the depends_on block. see https://github.com/kingman/tf-dont-do-depends-on-module-demo/blob/main/demo-flow/README.md * Last changed fixed the alloydb timing issue. Now module form_parser_processor also missing a dependency on the bq dataset created in module.common_infra, so needs to pass variables between modules to ensure the implicit dependency * Merge main to rc1.3 and fixed outstanding issues. * fix linting issues * Rewrite the VPC-related instructions in readme.md that got clobbered when merging divergent branches. 
* feat: doc registry svc (#133) * new doc registry service * handle sub-folder under bucket * Encapsulate GCS folder access into object * re-use RegistryDocuments in GCSFolder object * call doc registry service to detect duplicates from workflow * add update document registry step to the workflow * linting * linting * linting * linting * fix mypy * linting * fix: remove dependencies on default compute service account (#135) * remove and test references to default compute service account. Disable the account while int test is running to validate if there are other unknown dependencies. * 1. identified recent new feat to doc_registry that relies on implicit default service account. 2. Explicitly disable the service account in code to prevent future changes from creeping in * fix: Security Command Center SHA vulnerabilities (#126) * add dns logging, subnet flow log, ssl policy * fix pubsub roles, apply ssl policy to the target proxy resource * improve vpc best practices: configure network firewall policy and rules for explicit egress, use restricted VIP * DNS: add private managed zones for CNAME to use VIP,tivity * Composer needed explicit firewall rules. define the composer subnets as default variables so that the same value can be referenced in multiple places for firewall rules etc. 
https://cloud.google.com/composer/docs/composer-2/configure-vpc-sc#connectivity-restricted * feat: add persona roles and documentation (#134) * first draft of documentation and scripts, more validation needed of specific roles * improve the script logic to more flexibly handle setting some but not all of the persona, and include deployer in the same logic in case customers want to set it independently * add logic to apply the READER's storage object viewer permission at the specific bucket used by web app, not to view all buckets in the project * chore: workflow improvement (#127) * increase timeout for data store import and skipping classifier when no pdf * linting * feat: Composer performance tuning (#136) * Added variables in terraform for performance tuning of the Composer instance. * fix lint --------- Co-authored-by: Elliot <[email protected]> * perf: Read classifier result optimize (#142) * using partial read and oo approach * linting * switch to use the new function from workflow * remove old function * linting * feat: added specialized parser job with alloydb and bigquery (#144) * Initial commit for new component * Added the invoice parser to the main deployment * fix: revert breaking changes from merging main into 1.3 (#92) * revert a few files to last good 1.3 pr state, and manually unpick the deliberate changes. All tests passing locally * remove file that was deliberately deleted in 1.3 but revived by the merge * fix merge issues * linting * Added a new script to empty the Agent Builder datastore in a given GCP project. * minor changes to the reset_datastore.sh script. 
* fix lint issues on new feat * one file that should be deleted --------- Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> * feat: added bq table for form values (#98) * added bq table for form values * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * refactor: Align to python standards (#91) * added requirements.in to the doc-classifier job, and added it to the requirements_all.in fixed paths in pyproject.toml (root) fixed an issue in the build.tf * Linting * re-added the user-agent * fix: revert breaking changes from merging main into 1.3 (#92) * revert a few files to last good 1.3 pr state, and manually unpick the deliberate changes. All tests passing locally * remove file that was deliberately deleted in 1.3 but revived by the merge * fix merge issues * linting * Added a new script to empty the Agent Builder datastore in a given GCP project. * minor changes to the reset_datastore.sh script. 
* fix lint issues on new feat * one file that should be deleted --------- Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> * feat: added bq table for form values (#98) * added bq table for form values * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * Linting * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: eeaton <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> * Initial commit for new component * Linting * renamed due to name clash * Linting * WIP - need to figure out how to pass override env vars to my container, when there is a sidecar Need to make sure the auth-proxy is working * WIP - need to figure out how to pass override env vars to my container, when there is a sidecar Need to make sure the auth-proxy is working * WIP - network connectivity is done - authorization needs fixing * WIP - network connectivity is done - authorization needs fixing * Container is running to success! * Initial commit for new component * Added the invoice parser to the main deployment * refactor: Align to python standards (#91) * added requirements.in to the doc-classifier job, and added it to the requirements_all.in fixed paths in pyproject.toml (root) fixed an issue in the build.tf * Linting * re-added the user-agent * fix: revert breaking changes from merging main into 1.3 (#92) * revert a few files to last good 1.3 pr state, and manually unpick the deliberate changes. All tests passing locally * remove file that was deliberately deleted in 1.3 but revived by the merge * fix merge issues * linting * Added a new script to empty the Agent Builder datastore in a given GCP project. * minor changes to the reset_datastore.sh script. 
* fix lint issues on new feat * one file that should be deleted --------- Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> * feat: added bq table for form values (#98) * added bq table for form values * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * Linting * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: eeaton <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> * Initial commit for new component * Linting * renamed due to name clash * Linting * WIP - need to figure out how to pass override env vars to my container, when there is a sidecar Need to make sure the auth-proxy is working * WIP - need to figure out how to pass override env vars to my container, when there is a sidecar Need to make sure the auth-proxy is working * WIP - network connectivity is done - authorization needs fixing * WIP - network connectivity is done - authorization needs fixing * Container is running to success! * some fixes for vpc after rebasing * some fixes to terraform + create subnet for cloud run jobs * added some readme notes * Initial commit for new component * Added the invoice parser to the main deployment * refactor: Align to python standards (#91) * added requirements.in to the doc-classifier job, and added it to the requirements_all.in fixed paths in pyproject.toml (root) fixed an issue in the build.tf * Linting * re-added the user-agent * fix: revert breaking changes from merging main into 1.3 (#92) * revert a few files to last good 1.3 pr state, and manually unpick the deliberate changes. All tests passing locally * remove file that was deliberately deleted in 1.3 but revived by the merge * fix merge issues * linting * Added a new script to empty the Agent Builder datastore in a given GCP project. * minor changes to the reset_datastore.sh script. 
* fix lint issues on new feat * one file that should be deleted --------- Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> * feat: added bq table for form values (#98) * added bq table for form values * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * Linting * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: eeaton <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> * Initial commit for new component * Linting * renamed due to name clash * Linting * WIP - need to figure out how to pass override env vars to my container, when there is a sidecar Need to make sure the auth-proxy is working * WIP - need to figure out how to pass override env vars to my container, when there is a sidecar Need to make sure the auth-proxy is working * WIP - network connectivity is done - authorization needs fixing * WIP - network connectivity is done - authorization needs fixing * Container is running to success! * Initial commit for new component * Added the invoice parser to the main deployment * fix: revert breaking changes from merging main into 1.3 (#92) * revert a few files to last good 1.3 pr state, and manually unpick the deliberate changes. All tests passing locally * remove file that was deliberately deleted in 1.3 but revived by the merge * fix merge issues * linting * Added a new script to empty the Agent Builder datastore in a given GCP project. * minor changes to the reset_datastore.sh script. 
* fix lint issues on new feat * one file that should be deleted --------- Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> * feat: added bq table for form values (#98) * added bq table for form values * Linting --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * Initial commit for new component * Linting * renamed due to name clash * Linting * WIP - need to figure out how to pass override env vars to my container, when there is a sidecar Need to make sure the auth-proxy is working * WIP - network connectivity is done - authorization needs fixing * some fixes for vpc after rebasing * some fixes to terraform + create subnet for cloud run jobs * added some readme notes * fixed call to new function for move classified documents - was a problem during merge * fixed call to new function for move classified documents - was a problem during merge * Linting * Linting * add build service account * Subnet was missing a flag + Linting * More Linting * More Linting * Changed name to avoid name collision * resolve some comments * resolve some comments * Linting * pass sleep timer into specializer-parser module. specialized_parser_user cannot be created with the implicit dependency graph, because alloydb isn't actually available for several minutes after it tells terraform that the resource is created. 
* Read environ from main entry point instead of class * empty commit to trigger build after flaky test failure --------- Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: eeaton <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Dharmesh Patel <[email protected]> Co-authored-by: Charlie Wang <[email protected]> * feat: Vpc conditions (#143) * choose the right network based on create network parameter * conditional policy creation * linting * cleanup the Readme directions, and remove the unused "/test-vpc" directory * uses conditional local variable to reference network attributes --------- Co-authored-by: Elliot <[email protected]> * chore: Separate alloydb config and superuser (#146) * refactor subnet for vpc serverless access to part of common-infra/. Create a Cloud Run job to connect to db and setup initial schema and permissions, so that subsequent Cloud Run jobs like specialized-parser don't require superuser * add a trigger through gcloud command to automate the dbconfig Cloud Run job as part of the initial terraform apply * more concise code to run through users and give them permissions... * add a post-setup-config module. move the alloydb config commands to the post-setup-config. revert the changes to specialized-parser user so that it remains in the specializer-parser/ module, and pass that output to post-setup-config --------- Co-authored-by: Eyal Ben Ivri <[email protected]> * feat: additional IAM roles for deployer sa (#147) * additional IAM roles for deployer sa * fix typo * retrigger tests with changes to explicitly declare the deployer account to be used with terraform steps. 
GOOGLE_IMPERSONATE_SERVICE_ACCOUNT is used for adc with terraform, auth/impersonate_service_account is used for the local-exec steps running with gcloud * fix syntax for cloud build variable substitution * fix syntax for cloud build variable substitution * fix linebreak syntax in terratest args * fix syntax for `gcloud config set` family of commands --------- Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: anuradha-bajpai-google <[email protected]> Co-authored-by: Mark Scannell <[email protected]> Co-authored-by: Eyal Ben Ivri <[email protected]> Co-authored-by: Elliot <[email protected]> Co-authored-by: Charlie Wang <[email protected]> Co-authored-by: Marco Ferrari <[email protected]> Co-authored-by: Charlie Wang <[email protected]>
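
The config cleanup for the processing job (#73) in the commit message above mentions a special `argparse.Action` that parses key:value pairs into a dict, passed as a required CLI argument. The repository's actual implementation is not shown on this page, so the following is only a minimal sketch of that technique. The class name `KeyValueAction`, the flag name `--supported-files`, and the example values are hypothetical and chosen for illustration.

```python
import argparse


class KeyValueAction(argparse.Action):
    """Collect key:value tokens into a dict (hypothetical name, illustration only)."""

    def __call__(self, parser, namespace, values, option_string=None):
        # Start from any mapping already stored so the flag can be repeated.
        mapping = dict(getattr(namespace, self.dest, None) or {})
        for item in values:
            key, sep, value = item.partition(":")
            if not sep or not key or not value:
                # parser.error prints usage and exits with status 2.
                parser.error(f"{option_string}: expected key:value, got {item!r}")
            mapping[key] = value
        setattr(namespace, self.dest, mapping)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Processing job (illustrative)")
    parser.add_argument(
        "--supported-files",  # hypothetical flag name, not taken from the repo
        nargs="+",
        required=True,
        action=KeyValueAction,
        metavar="EXTENSION:OPERATION",
        help="File types mapped to the operation that handles them.",
    )
    return parser


def parse_supported_files(argv):
    """Return the parsed mapping for a list of CLI tokens."""
    return build_parser().parse_args(argv).supported_files


if __name__ == "__main__":
    # Example values are made up for this sketch.
    print(parse_supported_files(["--supported-files", "pdf:classify", "xlsx:convert"]))
```

With this shape, the DAG can keep a single mapping in its config and serialize it into `key:value` tokens when it launches the job, which matches the "configured once in the DAG, and passed to the processing job" intent described in the commit.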

anuradha-bajpai-google and others added 5 commits August 5, 2024 06:24

DocAI form parser processor integration

73a1a29

form processor build container image script

d9a0a18

DocAI form parser code integration

b27ef91

DocAI Form Parser fixes

6d2cf4f

Changes:

6022f7f

- Re-sync'd the development constraints (shared) based on the form parser requirements.in
- Moved requirements.txt to requirements.in for form parser
- Updated tasks.py to also generate requirements.txt from requirements.in
- Reformatted terraform from pre-commit

dharmez requested review from mescanne and dharmez and removed request for mescanne August 8, 2024 17:21

anuradha-bajpai-google requested a review from mescanne August 8, 2024 17:24

dharmez changed the base branch from main to release1_2 August 8, 2024 18:49

dharmez merged commit ab2d8cd into release1_2 Aug 8, 2024
2 checks passed

anuradha-bajpai-google commented Aug 7, 2024
