-
Notifications
You must be signed in to change notification settings - Fork 0
Feature binaryfix #9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
9f7e7de
ee49bce
87e9837
c7672c4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,6 +7,7 @@ | |
| from decimal import Decimal | ||
| import logging | ||
| from datetime import datetime | ||
| import base64 | ||
| import re | ||
|
|
||
| # Set up logging | ||
|
|
@@ -279,6 +280,24 @@ def get_export_data_files(s3_client, s3_bucket, export_info): | |
| logger.error(f"Failed to get export data files for {export_info.get('table_name', 'unknown')}: {str(e)}") | ||
| raise | ||
|
|
||
| def decode_binary(item): | ||
| if isinstance(item, dict): | ||
| if len(item) == 1: | ||
| key, value = next(iter(item.items())) | ||
| if key == 'B': | ||
| try: | ||
| return base64.b64decode(value) | ||
| except Exception as e: | ||
| logger.warning(f"Failed to decode binary attribute: {e}") | ||
| return value | ||
| return item | ||
| else: | ||
| return {k: decode_binary(v) for k, v in item.items()} | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Making this recursive adds unnecessary complexity since we know what the incoming format is. Speaking of which, should we make this code fully general-purpose, not assuming any particular data schema? If so, this code will need to handle M and L types appropriately. If not, the code can probably be simplified a bit. @forevermatt what would you recommend? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. At first, i also thought of handling multiple attributes type like BS set, M, and L. But decided to go simple for now as we have issue only with the Binary types. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @briskt If we can do so easily enough, I think it's worth making this general purpose, so we can use it for any DynamoDB databases we have (now or in the future), like our mysql-backup-restore and postresql-backup-restore tools). However, since we don't currently use Map, List, or BinarySet data types, would could potentially just return an error / throw an exception if those types are encountered, so that we don't need to add support for them now (when we don't need them) but to make it clear that the restore does not work for those kinds of values. Thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That makes sense. I certainly wouldn't want any one of us to write code that will likely never be utilized. Throwing an exception for an unsupported data type seems like a good way to handle it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @briskt I'm beginning to think you're right about that: it may be worth revisiting the decision to write the restore code ourselves instead of using AWS's ability to do that for us. @devon-sil or @Praveenraj-K, thoughts? I think the downside was that it requires manually deleting the original DynamoDB table (in both regions) before we could do a "Restore from S3" to (re-)create the restored table with the correct name, then set up the replica (i.e. to make it a global table). I might be forgetting something, though. Many of my notes from experimenting with various options are here: https://support.gtis.sil.org/issue/IDP-1262 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this PR be closed or revised? @devon-sil what are your thoughts? |
||
| elif isinstance(item, list): | ||
| return [decode_binary(v) for v in item] | ||
| else: | ||
| return item | ||
|
|
||
|
|
||
| def parse_dynamodb_json_file(s3_client, s3_bucket, s3_key): | ||
| """Parse a single DynamoDB JSON export file from S3""" | ||
|
|
@@ -303,13 +322,16 @@ def parse_dynamodb_json_file(s3_client, s3_bucket, s3_key): | |
| try: | ||
| item_data = json.loads(line) | ||
| if 'Item' in item_data: | ||
| items.append(item_data['Item']) | ||
| item = item_data['Item'] | ||
| elif isinstance(item_data, dict): | ||
| # Handle case where the line is already the item | ||
| items.append(item_data) | ||
| item = item_data | ||
| else: | ||
| continue | ||
| item = decode_binary(item) | ||
| items.append(item) | ||
| except json.JSONDecodeError as e: | ||
| error_count += 1 | ||
| if error_count <= 5: # Log first 5 errors only | ||
| if error_count <= 5: | ||
| logger.warning(f"JSON decode error on line {line_count}: {str(e)}") | ||
|
|
||
| if error_count > 0: | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines are inconsistent.
return valuewill return a string which is the base64-decoded equivalent of the B attribute value.return itemwill return the original dict. The calling code is not equipped to handle this difference.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, thanks @briskt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work on this, @Praveenraj-K.
Along the lines of what @briskt said, I think we will want to ensure the base64-decoded data is assigned back to the value of that
Bkey in the structure being decoded, since DynamoDB will be expecting it there, so I think returningitemis the desired behavior (once we've decoded the value in the item).Some automated tests will probably help clarify that little details like that are all correct, as @briskt said. I don't know Python well enough to ensure we're handling things like data types properly in this code (e.g.
base64.b64decode()apparently returns an array of bytes... does that need to be converted to a string using something likedecode('utf-8')? I don't know.)