So, now that we came up with lists of media sources / feeds to be merged into each other (#799), let's try doing the actual merging.

Given that:

- Temporal workflows are still fairly new to our stack and neither of us has extensive hands-on experience with them;
- general avoidance of scope creep;
- the fact that merging feeds into each other is useful on its own;

I'd suggest that we implement the feed merging first, run it with inputs from feed_actions, see what happens, correct course, and then move on to merging the media sources as a separate task.
As per #799 (comment), the final database of what gets merged into what is:

https://drive.google.com/file/d/1sfQLMwq5OkooDtg3ZjYOTOyNEIzMv2HZ/view?usp=sharing

and for this task we'll need just the feed_actions table.
Outline
The workflow to merge one feed (<src_feeds_id>) into another (<dst_feeds_id>) will look as follows (adapted from #799 (comment); a rough code sketch of these steps follows the list):

1. Move all rows that reference the feeds table with <src_feeds_id>:
   - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the downloads table
   - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_stories_map table, taking into account that there could be duplicates
   - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the scraped_feeds table, taking into account that there could be duplicates
   - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_from_yesterday table, taking into account that there could be duplicates
   - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_tags_map table, taking into account that there could be duplicates
2. Remove the row with feeds_id = <src_feeds_id> from the feeds table:
   - Remove rows with feeds_id = <src_feeds_id> from the downloads table - there shouldn't be any left as we've just merged them
   - Remove rows with feeds_id = <src_feeds_id> from the feeds_stories_map table - there shouldn't be any left as we've just merged them
   - <...>
   - Remove rows with feeds_id = <src_feeds_id> from the feeds_tags_map table - there shouldn't be any left as we've just merged them
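To make the above concrete, here's a minimal sketch of those steps. It assumes a psycopg2-style cursor and assumes feeds_stories_map keys on (feeds_id, stories_id) - both are assumptions for illustration, not the actual implementation; the real workflow would run these as separate, chunked, retriable steps (see the tips below):

```python
def merge_feed(cur, src_feeds_id: int, dst_feeds_id: int) -> None:
    """Hypothetical sketch of merging <src_feeds_id> into <dst_feeds_id>."""

    # downloads has no uniqueness on feeds_id, so a straight repoint is enough;
    # the base table can be updated directly even though it's partitioned.
    cur.execute(
        "UPDATE downloads SET feeds_id = %(dst)s WHERE feeds_id = %(src)s",
        {"src": src_feeds_id, "dst": dst_feeds_id},
    )

    # feeds_stories_map may already have the same story under the destination
    # feed, so move only the rows that wouldn't become duplicates...
    cur.execute(
        """
        UPDATE feeds_stories_map AS fsm
        SET feeds_id = %(dst)s
        WHERE fsm.feeds_id = %(src)s
          AND NOT EXISTS (
            SELECT 1 FROM feeds_stories_map
            WHERE feeds_id = %(dst)s AND stories_id = fsm.stories_id
          )
        """,
        {"src": src_feeds_id, "dst": dst_feeds_id},
    )
    # ...and drop whatever is left (the duplicates).
    cur.execute(
        "DELETE FROM feeds_stories_map WHERE feeds_id = %(src)s",
        {"src": src_feeds_id},
    )

    # scraped_feeds, feeds_from_yesterday and feeds_tags_map would get the same
    # "move non-duplicates, delete the rest" treatment (omitted here).

    # Finally, remove the source feed itself; by now nothing should reference it.
    cur.execute("DELETE FROM feeds WHERE feeds_id = %(src)s", {"src": src_feeds_id})
```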
Referencing tables

Merging feeds is a bit easier than merging media sources because the feed information doesn't end up in Solr (so we don't have to update its index in any way), and there aren't that many tables that reference rows in feeds (open up https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/postgresql-server/pgmigrate/migrations/V0001__initial_schema.sql and Command+F for feeds_id).

Here's how it looks on production (feel free to SSH in and look around yourself):
$ ssh woodward
woodward$ docker exec -it $(docker ps | grep postgresql-server | cut -d ' ' -f1) psql
psql# \d+ feeds
Table "public.feeds"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------------------------+--------------------------+-----------+----------+-----------------------------------------+----------+--------------+-------------
feeds_id | integer | | not null | nextval('feeds_feeds_id_seq'::regclass) | plain | |
media_id | integer | | not null | | plain | |
name | character varying(512) | | not null | | extended | |
url | character varying(1024) | | not null | | extended | |
last_attempted_download_time | timestamp with time zone | | | | plain | |
type | feed_type | | not null | 'syndicated'::feed_type | plain | |
last_new_story_time | timestamp with time zone | | | | plain | |
last_checksum | text | | | | extended | |
last_successful_download_time | timestamp with time zone | | | | plain | |
active | boolean | | not null | true | plain | |
Indexes:
"feeds_pkey" PRIMARY KEY, btree (feeds_id)
"feeds_last_attempted_download_time" btree (last_attempted_download_time)
"feeds_last_successful_download_time" btree (last_successful_download_time)
"feeds_media" btree (media_id)
"feeds_name" btree (name)
Foreign-key constraints:
"feeds_media_id_fkey" FOREIGN KEY (media_id) REFERENCES media(media_id) ON DELETE CASCADE
Referenced by:
TABLE "downloads" CONSTRAINT "downloads_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id)
TABLE "feeds_stories_map_p_00" CONSTRAINT "feeds_stories_map_p_00_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_01" CONSTRAINT "feeds_stories_map_p_01_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_02" CONSTRAINT "feeds_stories_map_p_02_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_03" CONSTRAINT "feeds_stories_map_p_03_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_04" CONSTRAINT "feeds_stories_map_p_04_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_05" CONSTRAINT "feeds_stories_map_p_05_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_06" CONSTRAINT "feeds_stories_map_p_06_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_07" CONSTRAINT "feeds_stories_map_p_07_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_08" CONSTRAINT "feeds_stories_map_p_08_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_09" CONSTRAINT "feeds_stories_map_p_09_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_10" CONSTRAINT "feeds_stories_map_p_10_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_11" CONSTRAINT "feeds_stories_map_p_11_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_12" CONSTRAINT "feeds_stories_map_p_12_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_13" CONSTRAINT "feeds_stories_map_p_13_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_14" CONSTRAINT "feeds_stories_map_p_14_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_15" CONSTRAINT "feeds_stories_map_p_15_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_16" CONSTRAINT "feeds_stories_map_p_16_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_17" CONSTRAINT "feeds_stories_map_p_17_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_18" CONSTRAINT "feeds_stories_map_p_18_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_19" CONSTRAINT "feeds_stories_map_p_19_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_20" CONSTRAINT "feeds_stories_map_p_20_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_21" CONSTRAINT "feeds_stories_map_p_21_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_tags_map" CONSTRAINT "feeds_tags_map_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) ON DELETE CASCADE
TABLE "scraped_feeds" CONSTRAINT "scraped_feeds_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) ON DELETE CASCADE
Access method: heap
So, a bunch of non-partitioned and partitioned tables reference feeds.feeds_id. One table that's missing from this list is feeds_from_yesterday - I remember that the lack of reference is deliberate, I just don't remember why :)
downloads is partitioned by the state column and then further by type, so, for example, a download with state = 'success' and type = 'feed' would end up in one of the downloads_success_feed tables.
For your purposes I think you can pretty much ignore the fact that it's partitioned and just UPDATE the base downloads table directly. Another partitioned table is feeds_stories_map; the remaining referencing tables are a few smaller, non-partitioned ones.
Tips, tricks, notes and other things that came to mind
Bigger tables might have thousands if not hundreds of thousands of rows that reference feeds.feeds_id, so you'll need to chunk your UPDATEs somehow. One way to do this is to get MIN(primary_key) and MAX(primary_key) from every referencing table for a specific feeds_id (make sure that an index exists that would allow you to do this in a timely manner!), and then UPDATE the referencing table in chunks based on primary_key.
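A minimal sketch of that chunking approach (a hypothetical helper, assuming a psycopg2-style connection, an integer primary key, and a commit per chunk to avoid one long transaction):

```python
def update_feeds_id_in_chunks(conn, table: str, pkey: str,
                              src_feeds_id: int, dst_feeds_id: int,
                              chunk_size: int = 100_000) -> None:
    """Repoint feeds_id from src to dst in primary-key-sized chunks.

    `table` and `pkey` are trusted, hardcoded identifiers (not user input).
    """
    with conn.cursor() as cur:
        # Needs an index on feeds_id (or (feeds_id, <pkey>)) to be fast.
        cur.execute(
            f"SELECT MIN({pkey}), MAX({pkey}) FROM {table} WHERE feeds_id = %s",
            (src_feeds_id,),
        )
        min_pk, max_pk = cur.fetchone()

    if min_pk is None:
        return  # nothing references the source feed in this table

    for chunk_start in range(min_pk, max_pk + 1, chunk_size):
        with conn.cursor() as cur:
            cur.execute(
                f"UPDATE {table} SET feeds_id = %s "
                f"WHERE feeds_id = %s AND {pkey} BETWEEN %s AND %s",
                (dst_feeds_id, src_feeds_id,
                 chunk_start, chunk_start + chunk_size - 1),
            )
        conn.commit()
```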
Given the "do this, if that succeeds then do that, then ..., and make sure that it all works for thousands of inputs, and you better track progress of all of it, oh, and external components to be updated might go down at any point, and also it's unclear whether individual steps to be executed will work with production's amount of data" nature of this task, I think this is a good chance to try out Temporal. You can use my podcast ingest as a reference:
IMHO this is one of those tasks that become easier if you write yourself a good test confirming that your code is doing exactly what you want it to do. So, make sure to write at least one good test that preloads a testing database with some mock duplicate feeds (in both feeds and the other referencing tables), runs the workflow, and makes sure that feeds got merged and nothing got lost in the process. The test should also exercise the chunked UPDATEs and other edge cases that come to mind.
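Purely as an illustration of the shape of such a test (the fixture and helper names below are made up, not existing test utilities):

```python
def test_merge_feeds(db):
    # "db" is a hypothetical fixture wrapping a clean testing database.
    src_feeds_id = create_test_feed(db, name="duplicate feed")
    dst_feeds_id = create_test_feed(db, name="canonical feed")

    # Preload referencing tables, including a story already mapped to both
    # feeds (to exercise the duplicate handling) and enough downloads rows
    # to force more than one UPDATE chunk.
    preload_referencing_rows(db, src_feeds_id, dst_feeds_id)
    before = count_rows_per_feed(db, [src_feeds_id, dst_feeds_id])

    run_merge_feeds_workflow(src_feeds_id, dst_feeds_id)
    after = count_rows_per_feed(db, [src_feeds_id, dst_feeds_id])

    # The source feed is gone and nothing references it any more...
    assert not feed_exists(db, src_feeds_id)
    assert after[src_feeds_id] == 0
    # ...and no rows were lost along the way (duplicates aside).
    assert after[dst_feeds_id] >= before[dst_feeds_id]
```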
As always, complain loudly and early if something's unclear!
As for the workflow-as-in-the-how-do-we-code-this-thing-together-workflow, whenever you come up with at least the pre-barebones / scaffolding of what would eventually end up being the feed merging workflow, could you submit a PR and then we'll iteratively stare at it and work it out together?
Or do you have any other ideas on how we could go about implementing this?