Description
Browsertrix Version
v1.9.2 and earlier (likely long-standing bug)
What did you expect to happen? What happened instead?
I expect that when browser profile tar.gz files are saved to s3, the filepath stored in the database is correct, and that replication and deletion of that file work as expected.
Instead, the `resource.filename`s stored in the database take the form `profiles/<TARBALL>`, whereas they are stored in s3 as `<org uuid>/profiles/<TARBALL>`. As a result, operations on these files such as replicating and deleting the tarballs fail silently, as there's no file to perform the operation on.
Other resources such as crawl files do not share this issue, and have their full path from the root of the s3 bucket stored in `filename`.
This issue also does not affect saving and applying browser profiles to crawls.
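The silent failure can be illustrated with a minimal sketch. It uses an in-memory dict as a stand-in for the bucket and assumes that deleting a missing key is not an error, which is how S3-style deletes generally behave. The names `ORG_ID`, `bucket`, and `delete_object` are hypothetical and not taken from the Browsertrix codebase.

```python
# Hypothetical stand-in for an S3 bucket: object key -> contents.
ORG_ID = "org-uuid"  # placeholder for a real org UUID
actual_key = f"{ORG_ID}/profiles/profile.tar.gz"
bucket = {actual_key: b"tarball bytes"}


def delete_object(bucket: dict, key: str) -> None:
    """Mimic S3-style delete semantics: a missing key is silently ignored."""
    bucket.pop(key, None)


# The database holds the path without the org prefix.
stored_filename = "profiles/profile.tar.gz"

# The delete "succeeds", but nothing was removed.
delete_object(bucket, stored_filename)
still_present = actual_key in bucket
```

After the call, `still_present` is `True`: the delete targeted a key that never existed, so no error surfaced and the tarball stayed in the bucket.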
To fix this moving forward, we should:
- Make sure new profiles have the correct and full filepath stored in `resource.filename`
- Add a migration to fix the filepaths for all existing browser profiles
- See if we can configure the rclone replication jobs to fail with a non-0 exit code if the source file is not found
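As a rough sketch of what the migration step might do per profile, the helper below prepends the org UUID to a stored filename. It is written to be idempotent so that re-running the migration does not double-prefix paths. The function name and signature are hypothetical, and the real migration would also need to read and write the database records.

```python
def fix_profile_filename(filename: str, org_id: str) -> str:
    """Return the full bucket path for a profile tarball.

    Hypothetical helper: converts 'profiles/<TARBALL>' into
    '<org uuid>/profiles/<TARBALL>' and leaves already-fixed paths alone.
    """
    prefix = f"{org_id}/"
    if filename.startswith(prefix):
        # Already migrated; do not prefix twice.
        return filename
    return prefix + filename.lstrip("/")


fixed = fix_profile_filename("profiles/profile.tar.gz", "org-uuid")
```

Applying the helper to its own output returns the same value, which makes the migration safe to run more than once.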
In addition, once the patch is applied and migration is run, we'll want to do the following remediation in our hosted service (and provide instructions to self-hosting users for how to do the same on their end):
- After the filepaths are fixed, re-replicate all existing profiles
- Delete any tarballs in s3 that correspond to since-deleted browser profiles
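One way to approach the cleanup step is to compare the profile tarball keys present in the bucket against the corrected filenames in the database. The sketch below assumes both lists have already been fetched and that profile keys follow the `<org uuid>/profiles/<TARBALL>` layout described above. The function name is hypothetical.

```python
from typing import Iterable, List


def find_orphaned_tarballs(
    bucket_keys: Iterable[str], db_filenames: Iterable[str]
) -> List[str]:
    """List profile tarballs in the bucket that no database record points to."""
    known = set(db_filenames)
    orphans = []
    for key in bucket_keys:
        parts = key.split("/")
        # Only consider keys shaped like '<org uuid>/profiles/<TARBALL>'.
        is_profile = len(parts) == 3 and parts[1] == "profiles"
        if is_profile and key not in known:
            orphans.append(key)
    return sorted(orphans)


orphans = find_orphaned_tarballs(
    ["org-a/profiles/kept.tar.gz", "org-a/profiles/gone.tar.gz", "org-a/crawls/c.wacz"],
    ["org-a/profiles/kept.tar.gz"],
)
```

Here `orphans` contains only `org-a/profiles/gone.tar.gz`. The comparison should run after the filepath migration, since unmigrated filenames would make every tarball look orphaned.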
Reproduction instructions
- Create a browser profile
- Compare the filepath in the profile's `resource` against where it's actually stored in the s3 bucket
- Validate that the profile is not copied to any configured replica storage locations
- Delete the profile and validate that the file in s3 is not deleted
Screenshots / Video
No response
Environment
No response
Additional details
No response