30 changes: 26 additions & 4 deletions lambda/disaster_recovery/lambda_function.py
@@ -7,6 +7,7 @@
from decimal import Decimal
import logging
from datetime import datetime
import base64
import re

# Set up logging
@@ -279,6 +280,24 @@ def get_export_data_files(s3_client, s3_bucket, export_info):
logger.error(f"Failed to get export data files for {export_info.get('table_name', 'unknown')}: {str(e)}")
raise

def decode_binary(item):
if isinstance(item, dict):
if len(item) == 1:
key, value = next(iter(item.items()))
if key == 'B':
try:
return base64.b64decode(value)
except Exception as e:
logger.warning(f"Failed to decode binary attribute: {e}")
return value
return item
Comment on lines +292 to +293

Contributor:
These two lines are inconsistent: `return value` returns the base64-decoded equivalent of the `B` attribute value, while `return item` returns the original dict. The calling code is not equipped to handle this difference.

Contributor Author:
yeah, thanks @briskt

Contributor:
Thank you for your work on this, @Praveenraj-K.

Along the lines of what @briskt said, I think we will want to ensure the base64-decoded data is assigned back to the value of that B key in the structure being decoded, since DynamoDB will be expecting it there, so I think returning item is the desired behavior (once we've decoded the value in the item).

Some automated tests will probably help verify that little details like that are all correct, as @briskt said. I don't know Python well enough to ensure we're handling things like data types properly in this code (e.g. base64.b64decode() apparently returns an array of bytes... does that need to be converted to a string using something like decode('utf-8')? I don't know.)
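A minimal sketch of what that could look like (hypothetical, not the PR's code): decode the base64 value in place and return the enclosing structure, keeping the decoded bytes under the `B` key where DynamoDB expects them. Note that `base64.b64decode()` returns `bytes`, which is what the low-level DynamoDB API expects for Binary attribute values, so no `.decode('utf-8')` should be needed.

```python
import base64

def decode_binary(item):
    """Recursively base64-decode DynamoDB 'B' attribute values in place,
    always returning the same shape of structure that was passed in."""
    if isinstance(item, dict):
        if len(item) == 1 and 'B' in item:
            try:
                # b64decode returns bytes, which DynamoDB expects for
                # Binary attribute values.
                item['B'] = base64.b64decode(item['B'])
            except Exception:
                # Leave the original value untouched on failure.
                pass
            return item
        return {k: decode_binary(v) for k, v in item.items()}
    if isinstance(item, list):
        return [decode_binary(v) for v in item]
    return item
```

With this shape, the caller always gets back the same type-tagged structure it passed in, avoiding the string/dict inconsistency noted above.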

else:
return {k: decode_binary(v) for k, v in item.items()}
Contributor:
Making this recursive adds unnecessary complexity since we know what the incoming format is. Speaking of which, should we make this code fully general-purpose, not assuming any particular data schema? If so, this code will need to handle M and L types appropriately. If not, the code can probably be simplified a bit. @forevermatt what would you recommend?

Contributor Author:
At first, I also thought of handling multiple attribute types like BS (Binary Set), M, and L, but I decided to keep it simple for now since we only have an issue with the Binary types.

Contributor:
@briskt If we can do so easily enough, I think it's worth making this general-purpose, so we can use it for any DynamoDB databases we have (now or in the future), like our mysql-backup-restore and postgresql-backup-restore tools.

However, since we don't currently use Map, List, or BinarySet data types, we could potentially just return an error / throw an exception if those types are encountered, so that we don't need to add support for them now (when we don't need them) while still making it clear that the restore does not work for those kinds of values.

Thoughts?

Contributor:
That makes sense. I certainly wouldn't want any one of us to write code that will likely never be utilized. Throwing an exception for an unsupported data type seems like a good way to handle it.
I'll throw this nag in here once more. This feels like reinventing the wheel since AWS does have restore capability ready for use. I know it would require jumping through different hoops with the Terraform state, but it feels more appropriate for our use case.
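A hedged sketch of that "fail loudly" option (the helper name and structure here are illustrative, not from the PR): decode `B` values, pass other scalar types through unchanged, and raise for the set/document types the tool doesn't yet support.

```python
import base64

# Hypothetical helper illustrating the "raise on unsupported types"
# approach discussed above; the PR itself does not contain this code.
UNSUPPORTED_TYPES = {'M', 'L', 'BS'}

def decode_attribute(attr):
    """Decode one type-tagged DynamoDB-JSON attribute value,
    e.g. {'S': 'x'}, {'N': '7'}, or {'B': '<base64>'}."""
    type_tag, value = next(iter(attr.items()))
    if type_tag in UNSUPPORTED_TYPES:
        # Make it obvious the restore does not handle these yet.
        raise NotImplementedError(
            f"DynamoDB type '{type_tag}' is not supported by this restore tool"
        )
    if type_tag == 'B':
        return {'B': base64.b64decode(value)}
    return attr
```

This keeps the code small while guaranteeing that an unsupported value stops the restore instead of silently producing a corrupt table.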

Contributor (@forevermatt), Oct 20, 2025:
@briskt I'm beginning to think you're right about that: it may be worth revisiting the decision to write the restore code ourselves instead of using AWS's ability to do that for us.

@devon-sil or @Praveenraj-K, thoughts? I think the downside was that it requires manually deleting the original DynamoDB table (in both regions) before we could do a "Restore from S3" to (re-)create the restored table with the correct name, then set up the replica (i.e. to make it a global table). I might be forgetting something, though.

Many of my notes from experimenting with various options are here: https://support.gtis.sil.org/issue/IDP-1262

Contributor:
Should this PR be closed or revised? @devon-sil what are your thoughts?

elif isinstance(item, list):
return [decode_binary(v) for v in item]
else:
return item


def parse_dynamodb_json_file(s3_client, s3_bucket, s3_key):
"""Parse a single DynamoDB JSON export file from S3"""
@@ -303,13 +322,16 @@ def parse_dynamodb_json_file(s3_client, s3_bucket, s3_key):
try:
item_data = json.loads(line)
if 'Item' in item_data:
items.append(item_data['Item'])
item = item_data['Item']
elif isinstance(item_data, dict):
# Handle case where the line is already the item
items.append(item_data)
item = item_data
else:
continue
item = decode_binary(item)
items.append(item)
except json.JSONDecodeError as e:
error_count += 1
if error_count <= 5: # Log first 5 errors only
if error_count <= 5:
logger.warning(f"JSON decode error on line {line_count}: {str(e)}")

if error_count > 0:
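For reference, each line in a DynamoDB S3 export data file is a standalone JSON document wrapping a type-tagged item under an `Item` key, which is the format `parse_dynamodb_json_file` iterates over. A minimal sketch of that per-line handling (attribute names and values here are illustrative):

```python
import base64
import json

# One line of a DynamoDB S3 export data file (illustrative values).
line = '{"Item": {"id": {"S": "user-123"}, "token": {"B": "3q2+7w=="}}}'

item_data = json.loads(line)
item = item_data['Item']                       # the type-tagged DynamoDB-JSON item
raw = base64.b64decode(item['token']['B'])     # decode the Binary attribute
print(raw)  # b'\xde\xad\xbe\xef'
```

This is the point in the loop where the new `decode_binary` call is applied before the item is appended to the results.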