Implement workflow to merge feeds into each other #816
Description
So, now that we've come up with lists of media sources / feeds to be merged into each other (#799), let's try doing the actual merging.
Given that:
- We need to merge some individual feeds into each other, and also merge some media sources into each other (the process of which includes merging feeds), as I've mentioned in Come up with a final "what media source gets merged into what / deleted" list #799 (comment);
- Temporal workflows are a bit new to our stack, and neither of us has extensive hands-on experience with them;
- We'd generally like to avoid scope creep;
- Merging feeds into each other is useful on its own;
I'd suggest that we implement the feed merging first, run it with inputs from feed_actions, see what happens, correct course, and then move on to merging the media sources as a separate task.
As per #799 (comment), the final database of what gets merged into what is:
https://drive.google.com/file/d/1sfQLMwq5OkooDtg3ZjYOTOyNEIzMv2HZ/view?usp=sharing
and for this task we'll need just the feed_actions table.
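For what it's worth, once that table is imported somewhere queryable, driving the whole thing could be as simple as reading (source, destination) feed ID pairs out of feed_actions and kicking off one merge per pair. A minimal sketch, assuming hypothetical src_feeds_id / dst_feeds_id column names and with merge_feeds() standing in for whatever entry point we end up with (most likely a Temporal workflow start):

```python
import psycopg2


def pending_feed_merges(dsn: str):
    """Yield (src_feeds_id, dst_feeds_id) pairs from an imported feed_actions table.

    The column names here are assumptions -- adjust them to whatever the actual
    feed_actions dump uses.
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT src_feeds_id, dst_feeds_id FROM feed_actions")
        yield from cur


# Hypothetical usage: one merge (workflow run) per pair.
# for src_feeds_id, dst_feeds_id in pending_feed_merges("dbname=mediacloud"):
#     merge_feeds(src_feeds_id=src_feeds_id, dst_feeds_id=dst_feeds_id)
```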
Outline
The workflow to merge one feed (<src_feeds_id>) into another (<dst_feeds_id>) will look as follows (adapted from #799 (comment)):
- Move all rows that reference the feeds table with <src_feeds_id>:
  - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the downloads table
  - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_stories_map table, taking into account that there could be duplicates
  - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the scraped_feeds table, taking into account that there could be duplicates
  - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_from_yesterday table, taking into account that there could be duplicates
  - Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_tags_map table, taking into account that there could be duplicates
- Remove the row with feeds_id = <src_feeds_id> from the feeds table:
  - Remove rows with feeds_id = <src_feeds_id> from the downloads table - there shouldn't be any left as we've just merged them
  - Remove rows with feeds_id = <src_feeds_id> from the feeds_stories_map table - there shouldn't be any left as we've just merged them
  - <...>
  - Remove rows with feeds_id = <src_feeds_id> from the feeds_tags_map table - there shouldn't be any left as we've just merged them
  - Remove the actual row from feeds.
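To make the "taking into account that there could be duplicates" steps concrete, here's a minimal sketch (plain psycopg2, no Temporal plumbing yet) of the move + cleanup for one of the map tables, using feeds_tags_map and assuming its usual UNIQUE (feeds_id, tags_id) constraint; the same pattern should apply to feeds_stories_map, scraped_feeds and feeds_from_yesterday with their respective unique columns:

```python
import psycopg2


def move_feeds_tags_map(cur, src_feeds_id: int, dst_feeds_id: int) -> None:
    """Re-point feeds_tags_map rows from the source feed to the destination feed.

    Assumes UNIQUE (feeds_id, tags_id): rows that would collide with an existing
    destination row are skipped by the UPDATE and removed by the DELETE below.
    """
    # Move only the rows that won't become duplicates of an existing destination row.
    cur.execute("""
        UPDATE feeds_tags_map AS ftm
        SET feeds_id = %(dst)s
        WHERE ftm.feeds_id = %(src)s
          AND NOT EXISTS (
              SELECT 1
              FROM feeds_tags_map AS dup
              WHERE dup.feeds_id = %(dst)s
                AND dup.tags_id = ftm.tags_id
          )
    """, {"src": src_feeds_id, "dst": dst_feeds_id})

    # Whatever still points at the source feed already exists for the destination
    # feed, so it can simply be dropped.
    cur.execute(
        "DELETE FROM feeds_tags_map WHERE feeds_id = %(src)s",
        {"src": src_feeds_id},
    )


def merge_feed(dsn: str, src_feeds_id: int, dst_feeds_id: int) -> None:
    """One-shot, non-chunked version of the outline above (fine for small feeds)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # downloads has no uniqueness concerns, so a plain re-point is enough.
        cur.execute(
            "UPDATE downloads SET feeds_id = %(dst)s WHERE feeds_id = %(src)s",
            {"src": src_feeds_id, "dst": dst_feeds_id},
        )
        move_feeds_tags_map(cur, src_feeds_id, dst_feeds_id)
        # ... same duplicate-aware move for feeds_stories_map, scraped_feeds and
        # feeds_from_yesterday, with their respective unique columns ...

        # Finally, drop the source feed itself. If any referencing downloads rows
        # were missed, the downloads FK (no ON DELETE CASCADE) makes this DELETE
        # fail loudly, which is what we want.
        cur.execute(
            "DELETE FROM feeds WHERE feeds_id = %(src)s",
            {"src": src_feeds_id},
        )
```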
Referencing tables
Merging feeds is a bit easier than merging media sources because feed information doesn't end up in Solr (so we don't have to update its index in any way), and there aren't that many tables that reference rows in feeds (open up https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/postgresql-server/pgmigrate/migrations/V0001__initial_schema.sql and Command+F for feeds_id).
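As an alternative to Command+F (and as a sanity check that the workflow covers every referencing table), Postgres can also list them itself via pg_constraint; a small sketch - note that feeds_from_yesterday won't show up here since, as mentioned below, it has no foreign key on feeds_id:

```python
import psycopg2

# Every foreign key that points at feeds(feeds_id). Partitions such as
# feeds_stories_map_p_* show up individually because each one carries its own
# constraint.
FK_QUERY = """
    SELECT conrelid::regclass AS referencing_table,
           conname AS constraint_name
    FROM pg_constraint
    WHERE contype = 'f'
      AND confrelid = 'feeds'::regclass
    ORDER BY 1
"""


def tables_referencing_feeds(dsn: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(FK_QUERY)
        return cur.fetchall()
```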
Here's how it looks on production (feel free to SSH in and look around yourself):
$ ssh woodward
woodward$ docker exec -it $(docker ps | grep postgresql-server | cut -d ' ' -f1) psql
psql# \d+ feeds
Table "public.feeds"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------------------------+--------------------------+-----------+----------+-----------------------------------------+----------+--------------+-------------
feeds_id | integer | | not null | nextval('feeds_feeds_id_seq'::regclass) | plain | |
media_id | integer | | not null | | plain | |
name | character varying(512) | | not null | | extended | |
url | character varying(1024) | | not null | | extended | |
last_attempted_download_time | timestamp with time zone | | | | plain | |
type | feed_type | | not null | 'syndicated'::feed_type | plain | |
last_new_story_time | timestamp with time zone | | | | plain | |
last_checksum | text | | | | extended | |
last_successful_download_time | timestamp with time zone | | | | plain | |
active | boolean | | not null | true | plain | |
Indexes:
"feeds_pkey" PRIMARY KEY, btree (feeds_id)
"feeds_last_attempted_download_time" btree (last_attempted_download_time)
"feeds_last_successful_download_time" btree (last_successful_download_time)
"feeds_media" btree (media_id)
"feeds_name" btree (name)
Foreign-key constraints:
"feeds_media_id_fkey" FOREIGN KEY (media_id) REFERENCES media(media_id) ON DELETE CASCADE
Referenced by:
TABLE "downloads" CONSTRAINT "downloads_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id)
TABLE "feeds_stories_map_p_00" CONSTRAINT "feeds_stories_map_p_00_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_01" CONSTRAINT "feeds_stories_map_p_01_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_02" CONSTRAINT "feeds_stories_map_p_02_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_03" CONSTRAINT "feeds_stories_map_p_03_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_04" CONSTRAINT "feeds_stories_map_p_04_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_05" CONSTRAINT "feeds_stories_map_p_05_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_06" CONSTRAINT "feeds_stories_map_p_06_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_07" CONSTRAINT "feeds_stories_map_p_07_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_08" CONSTRAINT "feeds_stories_map_p_08_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_09" CONSTRAINT "feeds_stories_map_p_09_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_10" CONSTRAINT "feeds_stories_map_p_10_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_11" CONSTRAINT "feeds_stories_map_p_11_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_12" CONSTRAINT "feeds_stories_map_p_12_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_13" CONSTRAINT "feeds_stories_map_p_13_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_14" CONSTRAINT "feeds_stories_map_p_14_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_15" CONSTRAINT "feeds_stories_map_p_15_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_16" CONSTRAINT "feeds_stories_map_p_16_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_17" CONSTRAINT "feeds_stories_map_p_17_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_18" CONSTRAINT "feeds_stories_map_p_18_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_19" CONSTRAINT "feeds_stories_map_p_19_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_20" CONSTRAINT "feeds_stories_map_p_20_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_stories_map_p_21" CONSTRAINT "feeds_stories_map_p_21_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
TABLE "feeds_tags_map" CONSTRAINT "feeds_tags_map_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) ON DELETE CASCADE
TABLE "scraped_feeds" CONSTRAINT "scraped_feeds_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) ON DELETE CASCADE
Access method: heap
So, a bunch of non-partitioned and partitioned tables reference feeds.feeds_id. One table that's missing from this list is feeds_from_yesterday - I remember that the lack of a reference is deliberate, I just don't remember why :)
Here are the sizes of all of these tables:
psql# \d+
<...>
public | downloads | partitioned table | mediacloud | permanent | 0 bytes |
public | downloads_error | table | mediacloud | permanent | 34 GB |
public | downloads_feed_error | table | mediacloud | permanent | 5828 MB |
public | downloads_fetching | table | mediacloud | permanent | 11 MB |
public | downloads_in_past_day | view | mediacloud | permanent | 0 bytes |
public | downloads_media | view | mediacloud | permanent | 0 bytes |
public | downloads_non_media | view | mediacloud | permanent | 0 bytes |
public | downloads_pending | table | mediacloud | permanent | 11 MB |
public | downloads_success | partitioned table | mediacloud | permanent | 0 bytes |
public | downloads_success_content | partitioned table | mediacloud | permanent | 0 bytes |
public | downloads_success_content_00 | table | mediacloud | permanent | 5322 MB |
public | downloads_success_content_01 | table | mediacloud | permanent | 7182 MB |
public | downloads_success_content_02 | table | mediacloud | permanent | 11 GB |
public | downloads_success_content_03 | table | mediacloud | permanent | 12 GB |
public | downloads_success_content_04 | table | mediacloud | permanent | 10 GB |
public | downloads_success_content_05 | table | mediacloud | permanent | 10147 MB |
public | downloads_success_content_06 | table | mediacloud | permanent | 12 GB |
public | downloads_success_content_07 | table | mediacloud | permanent | 13 GB |
public | downloads_success_content_08 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_09 | table | mediacloud | permanent | 13 GB |
public | downloads_success_content_10 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_11 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_12 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_13 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_14 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_15 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_16 | table | mediacloud | permanent | 15 GB |
public | downloads_success_content_17 | table | mediacloud | permanent | 16 GB |
public | downloads_success_content_18 | table | mediacloud | permanent | 16 GB |
public | downloads_success_content_19 | table | mediacloud | permanent | 15 GB |
public | downloads_success_content_20 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_21 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_22 | table | mediacloud | permanent | 15 GB |
public | downloads_success_content_23 | table | mediacloud | permanent | 15 GB |
public | downloads_success_content_24 | table | mediacloud | permanent | 15 GB |
public | downloads_success_content_25 | table | mediacloud | permanent | 14 GB |
public | downloads_success_content_26 | table | mediacloud | permanent | 15 GB |
public | downloads_success_content_27 | table | mediacloud | permanent | 16 GB |
public | downloads_success_content_28 | table | mediacloud | permanent | 18 GB |
public | downloads_success_content_29 | table | mediacloud | permanent | 19 GB |
public | downloads_success_content_30 | table | mediacloud | permanent | 15 GB |
public | downloads_success_content_31 | table | mediacloud | permanent | 15 GB |
public | downloads_success_content_32 | table | mediacloud | permanent | 9981 MB |
public | downloads_success_content_33 | table | mediacloud | permanent | 8192 bytes |
public | downloads_success_feed | partitioned table | mediacloud | permanent | 0 bytes |
public | downloads_success_feed_00 | table | mediacloud | permanent | 8366 MB |
public | downloads_success_feed_01 | table | mediacloud | permanent | 5427 MB |
public | downloads_success_feed_02 | table | mediacloud | permanent | 7832 MB |
public | downloads_success_feed_03 | table | mediacloud | permanent | 5881 MB |
public | downloads_success_feed_04 | table | mediacloud | permanent | 6155 MB |
public | downloads_success_feed_05 | table | mediacloud | permanent | 6575 MB |
public | downloads_success_feed_06 | table | mediacloud | permanent | 5742 MB |
public | downloads_success_feed_07 | table | mediacloud | permanent | 5534 MB |
public | downloads_success_feed_08 | table | mediacloud | permanent | 4877 MB |
public | downloads_success_feed_09 | table | mediacloud | permanent | 5132 MB |
public | downloads_success_feed_10 | table | mediacloud | permanent | 5217 MB |
public | downloads_success_feed_11 | table | mediacloud | permanent | 5072 MB |
public | downloads_success_feed_12 | table | mediacloud | permanent | 5091 MB |
public | downloads_success_feed_13 | table | mediacloud | permanent | 5257 MB |
public | downloads_success_feed_14 | table | mediacloud | permanent | 5209 MB |
public | downloads_success_feed_15 | table | mediacloud | permanent | 5229 MB |
public | downloads_success_feed_16 | table | mediacloud | permanent | 3821 MB |
public | downloads_success_feed_17 | table | mediacloud | permanent | 3078 MB |
public | downloads_success_feed_18 | table | mediacloud | permanent | 3128 MB |
public | downloads_success_feed_19 | table | mediacloud | permanent | 4509 MB |
public | downloads_success_feed_20 | table | mediacloud | permanent | 5425 MB |
public | downloads_success_feed_21 | table | mediacloud | permanent | 5463 MB |
public | downloads_success_feed_22 | table | mediacloud | permanent | 5793 MB |
public | downloads_success_feed_23 | table | mediacloud | permanent | 5041 MB |
public | downloads_success_feed_24 | table | mediacloud | permanent | 5319 MB |
public | downloads_success_feed_25 | table | mediacloud | permanent | 5468 MB |
public | downloads_success_feed_26 | table | mediacloud | permanent | 5396 MB |
public | downloads_success_feed_27 | table | mediacloud | permanent | 5224 MB |
public | downloads_success_feed_28 | table | mediacloud | permanent | 5124 MB |
public | downloads_success_feed_29 | table | mediacloud | permanent | 4987 MB |
public | downloads_success_feed_30 | table | mediacloud | permanent | 5155 MB |
public | downloads_success_feed_31 | table | mediacloud | permanent | 5228 MB |
public | downloads_success_feed_32 | table | mediacloud | permanent | 3393 MB |
public | downloads_success_feed_33 | table | mediacloud | permanent | 8192 bytes |
downloads is partitioned by the state column and then further by type, so, for example, a download with state = 'success' and type = 'feed' would end up in one of the downloads_success_feed tables.
For your purposes I think you can pretty much ignore the fact that it's partitioned and just UPDATE the base downloads table directly.
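Not strictly necessary, but if you want to convince yourself that going through the parent table is enough, tableoid shows which physical partition each row actually lives in; a quick sketch:

```python
import psycopg2


def downloads_partitions_for_feed(dsn: str, feeds_id: int):
    """Report which physical downloads_* partitions hold rows for a given feed.

    Purely a sanity check -- the merge itself can just UPDATE the downloads
    parent table and let Postgres route the rows to the right partitions.
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT tableoid::regclass AS partition, COUNT(*) AS row_count
            FROM downloads
            WHERE feeds_id = %(feeds_id)s
            GROUP BY 1
            ORDER BY 1
        """, {"feeds_id": feeds_id})
        return cur.fetchall()
```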
Another partitioned table is feeds_stories_map:
public | feeds_stories_map | view | mediacloud | permanent | 0 bytes |
public | feeds_stories_map_p | table | mediacloud | permanent | 0 bytes |
public | feeds_stories_map_p_00 | table | mediacloud | permanent | 3148 MB |
public | feeds_stories_map_p_01 | table | mediacloud | permanent | 4078 MB |
public | feeds_stories_map_p_02 | table | mediacloud | permanent | 4079 MB |
public | feeds_stories_map_p_03 | table | mediacloud | permanent | 4206 MB |
public | feeds_stories_map_p_04 | table | mediacloud | permanent | 4208 MB |
public | feeds_stories_map_p_05 | table | mediacloud | permanent | 4219 MB |
public | feeds_stories_map_p_06 | table | mediacloud | permanent | 4223 MB |
public | feeds_stories_map_p_07 | table | mediacloud | permanent | 4220 MB |
public | feeds_stories_map_p_08 | table | mediacloud | permanent | 4217 MB |
public | feeds_stories_map_p_09 | table | mediacloud | permanent | 4221 MB |
public | feeds_stories_map_p_10 | table | mediacloud | permanent | 4227 MB |
public | feeds_stories_map_p_11 | table | mediacloud | permanent | 4211 MB |
public | feeds_stories_map_p_12 | table | mediacloud | permanent | 4173 MB |
public | feeds_stories_map_p_13 | table | mediacloud | permanent | 4424 MB |
public | feeds_stories_map_p_14 | table | mediacloud | permanent | 4293 MB |
public | feeds_stories_map_p_15 | table | mediacloud | permanent | 4241 MB |
public | feeds_stories_map_p_16 | table | mediacloud | permanent | 4251 MB |
public | feeds_stories_map_p_17 | table | mediacloud | permanent | 4210 MB |
public | feeds_stories_map_p_18 | table | mediacloud | permanent | 4215 MB |
public | feeds_stories_map_p_19 | table | mediacloud | permanent | 4282 MB |
public | feeds_stories_map_p_20 | table | mediacloud | permanent | 3383 MB |
public | feeds_stories_map_p_21 | table | mediacloud | permanent | 0 bytes |
And then there are a few smaller tables:
public | feeds_tags_map | table | mediacloud | permanent | 696 kB |
<...>
public | scraped_feeds | table | mediacloud | permanent | 10072 kB |
<...>
public | feeds_from_yesterday | table | mediacloud | permanent | 36 MB |
Tips, tricks, notes and other things that came to mind
- Bigger tables might have thousands if not hundreds of thousands of rows that reference feeds.feeds_id, so you'll need to chunk your UPDATEs somehow. One way to do this is to get MIN(primary_key) and MAX(primary_key) from every referencing table for a specific feeds_id (make sure that an index exists that would allow you to do this in a timely manner!), and then UPDATE the referencing table in chunks based on primary_key (see the chunked UPDATE sketch after this list).
- Given the "do this, if that succeeds then do that, then ..., and make sure that it all works for thousands of inputs, and you better track the progress of all of it, oh, and external components to be updated might go down at any point, and also it's unclear whether individual steps to be executed will work with production's amount of data" nature of this task, I think this is a good chance to try out Temporal (a rough sketch of the workflow shape follows this list). You can use my podcast ingest as a reference:
  - Interface: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/podcast-transcribe-episode/src/python/podcast_transcribe_episode/workflow_interface.py
  - Implementation: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/podcast-transcribe-episode/src/python/podcast_transcribe_episode/workflow.py
  - Test that writes some mock data, runs the workflow and checks what came out on the other end: backend/apps/podcast-transcribe-episode/tests/python/test_workflow.py, lines 80 to 183 (at f0c523e)
  - Some docs: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/doc/workflows.markdown
- IMHO this is one of those tasks that become easier if you write yourself a good test that confirms that your code is doing exactly what you want it to do. So, make sure to write at least one good test that preloads a testing database with some mock duplicate feeds (in both feeds and the other referencing tables), runs the workflow, and makes sure that the feeds got merged and nothing got lost in the process. The test should also exercise chunked UPDATEs and other edge cases that come to mind.
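To make the first tip concrete, here's a minimal sketch of the chunked UPDATE idea, using downloads as the example (updated through the base table, as suggested above) and assuming downloads_id is the primary key; one transaction per chunk keeps each statement short and makes progress durable:

```python
import psycopg2

CHUNK_SIZE = 100_000  # rows of primary-key space per UPDATE; tune as needed


def repoint_downloads_in_chunks(dsn: str, src_feeds_id: int, dst_feeds_id: int) -> None:
    """Re-point downloads rows from the source feed to the destination feed in chunks."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Needs an index covering feeds_id for this to run in a timely manner.
        cur.execute(
            "SELECT MIN(downloads_id), MAX(downloads_id) FROM downloads WHERE feeds_id = %(src)s",
            {"src": src_feeds_id},
        )
        min_id, max_id = cur.fetchone()
        if min_id is None:
            return  # nothing to do for this feed

        chunk_start = min_id
        while chunk_start <= max_id:
            chunk_end = chunk_start + CHUNK_SIZE
            cur.execute("""
                UPDATE downloads
                SET feeds_id = %(dst)s
                WHERE feeds_id = %(src)s
                  AND downloads_id >= %(start)s
                  AND downloads_id < %(end)s
            """, {
                "src": src_feeds_id,
                "dst": dst_feeds_id,
                "start": chunk_start,
                "end": chunk_end,
            })
            conn.commit()  # one transaction per chunk so progress sticks
            chunk_start = chunk_end
```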
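And as a very rough illustration of the workflow shape for the Temporal tip: the sketch below is written against the open-source temporalio Python SDK rather than the SDK/style used in the podcast ingest code linked above, and all activity/workflow names are made up, so treat it as a shape, not a drop-in:

```python
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def move_rows_to_dst_feed(table: str, src_feeds_id: int, dst_feeds_id: int) -> None:
    # One referencing table per activity invocation; the chunked / duplicate-aware
    # UPDATEs sketched earlier would live here.
    raise NotImplementedError


@activity.defn
async def delete_src_feed(src_feeds_id: int) -> None:
    # Final cleanup: remove the (now unreferenced) row from the feeds table.
    raise NotImplementedError


@workflow.defn
class MergeFeedsWorkflow:
    """Merge <src_feeds_id> into <dst_feeds_id>, one referencing table at a time.

    Each step is a separate activity, so Temporal retries it independently and
    the overall progress survives worker restarts.
    """

    @workflow.run
    async def run(self, src_feeds_id: int, dst_feeds_id: int) -> None:
        for table in (
            "downloads",
            "feeds_stories_map",
            "scraped_feeds",
            "feeds_from_yesterday",
            "feeds_tags_map",
        ):
            await workflow.execute_activity(
                move_rows_to_dst_feed,
                args=[table, src_feeds_id, dst_feeds_id],
                start_to_close_timeout=timedelta(hours=2),
            )
        await workflow.execute_activity(
            delete_src_feed,
            args=[src_feeds_id],
            start_to_close_timeout=timedelta(minutes=10),
        )
```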
As always, complain loudly and early if something's unclear!