Implement workflow to merge feeds into each other #816

pypt opened this issue Oct 21, 2021 · 1 comment

pypt commented Oct 21, 2021
So, now that we've come up with lists of media sources / feeds to be merged into each other (#799), let's try doing the actual merging.

Given that:

  • We need to merge some individual feeds into each other, and also merge some media sources into each other (the process of which includes merging feeds), as I've mentioned in Come up with a final "what media source gets merged into what / deleted" list #799 (comment);
  • Temporal workflows are a bit new to our stack, and neither of us has extensive hands-on experience with them;
  • We'd generally like to avoid scope creep;
  • Merging feeds into each other is useful in its own right;

I'd suggest that we implement the feed merging first, run it with inputs from feed_actions, see what happens, correct course, and then move on to merging the media sources as a separate task.

As per #799 (comment), the final database of what gets merged into what is:

https://drive.google.com/file/d/1sfQLMwq5OkooDtg3ZjYOTOyNEIzMv2HZ/view?usp=sharing

and for this task we'll need just the feed_actions table.

Outline

The workflow to merge one feed (<src_feeds_id>) into another (<dst_feeds_id>) will look as follows (adapted from #799 (comment); a rough sketch follows the list):

  1. Move all rows that reference the feeds table with <src_feeds_id>:
    1. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the downloads table
    2. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_stories_map table, taking into account that there could be duplicates
    3. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the scraped_feeds table, taking into account that there could be duplicates
    4. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_from_yesterday table, taking into account that there could be duplicates
    5. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_tags_map table, taking into account that there could be duplicates
  2. Remove the row with feeds_id = <src_feeds_id> from the feeds table:
    1. Remove rows with feeds_id = <src_feeds_id> from the downloads table - there shouldn't be any left as we've just merged them
    2. Remove rows with feeds_id = <src_feeds_id> from the feeds_stories_map table - there shouldn't be any left as we've just merged them
    3. <...>
    4. Remove rows with feeds_id = <src_feeds_id> from the feeds_tags_map table - there shouldn't be any left as we've just merged them
    5. Remove the actual row from feeds.
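
To make the "taking into account that there could be duplicates" parts concrete, here's a minimal sketch of one merge pass, using the db.query() style of our database handler. Treat it as pseudocode to be split into Temporal activities, not a finished implementation - note in particular that feeds_stories_map is exposed as a view over partitions (see below), so the real statements may have to target feeds_stories_map_p:

    # A sketch, not the final implementation.
    def merge_feed(db, src_feeds_id: int, dst_feeds_id: int) -> None:
        # downloads has no unique constraint involving feeds_id, so a plain
        # UPDATE is enough (chunked in practice - see "Tips" below):
        db.query("""
            UPDATE downloads SET feeds_id = %(dst)s WHERE feeds_id = %(src)s
        """, {'src': src_feeds_id, 'dst': dst_feeds_id})

        # feeds_stories_map (and similarly scraped_feeds, feeds_from_yesterday,
        # feeds_tags_map) could end up with duplicates, so first drop the source
        # rows that would collide with rows the destination feed already has,
        # then move the rest:
        db.query("""
            DELETE FROM feeds_stories_map
            WHERE feeds_id = %(src)s
              AND stories_id IN (
                  SELECT stories_id FROM feeds_stories_map WHERE feeds_id = %(dst)s
              )
        """, {'src': src_feeds_id, 'dst': dst_feeds_id})
        db.query("""
            UPDATE feeds_stories_map SET feeds_id = %(dst)s WHERE feeds_id = %(src)s
        """, {'src': src_feeds_id, 'dst': dst_feeds_id})

        # <...same pattern for the other referencing tables...>

        # Nothing should reference the source feed anymore, so remove it:
        db.query("DELETE FROM feeds WHERE feeds_id = %(src)s", {'src': src_feeds_id})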

Referencing tables

Merging feeds is a bit easier than merging media sources: feed information doesn't end up in Solr (so we don't have to update its index in any way), and there aren't that many tables that reference rows in feeds (open up https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/postgresql-server/pgmigrate/migrations/V0001__initial_schema.sql and Command+F for feeds_id).

Here's how it looks on production (feel free to SSH in and look around yourself):

$ ssh woodward

woodward$ docker exec -it $(docker ps | grep postgresql-server | cut -d ' ' -f1) psql

psql# \d+ feeds

                                                                       Table "public.feeds"
            Column             |           Type           | Collation | Nullable |                 Default                 | Storage  | Stats target | Description 
-------------------------------+--------------------------+-----------+----------+-----------------------------------------+----------+--------------+-------------
 feeds_id                      | integer                  |           | not null | nextval('feeds_feeds_id_seq'::regclass) | plain    |              | 
 media_id                      | integer                  |           | not null |                                         | plain    |              | 
 name                          | character varying(512)   |           | not null |                                         | extended |              | 
 url                           | character varying(1024)  |           | not null |                                         | extended |              | 
 last_attempted_download_time  | timestamp with time zone |           |          |                                         | plain    |              | 
 type                          | feed_type                |           | not null | 'syndicated'::feed_type                 | plain    |              | 
 last_new_story_time           | timestamp with time zone |           |          |                                         | plain    |              | 
 last_checksum                 | text                     |           |          |                                         | extended |              | 
 last_successful_download_time | timestamp with time zone |           |          |                                         | plain    |              | 
 active                        | boolean                  |           | not null | true                                    | plain    |              | 
Indexes:
    "feeds_pkey" PRIMARY KEY, btree (feeds_id)
    "feeds_last_attempted_download_time" btree (last_attempted_download_time)
    "feeds_last_successful_download_time" btree (last_successful_download_time)
    "feeds_media" btree (media_id)
    "feeds_name" btree (name)
Foreign-key constraints:
    "feeds_media_id_fkey" FOREIGN KEY (media_id) REFERENCES media(media_id) ON DELETE CASCADE
Referenced by:
    TABLE "downloads" CONSTRAINT "downloads_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id)
    TABLE "feeds_stories_map_p_00" CONSTRAINT "feeds_stories_map_p_00_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_01" CONSTRAINT "feeds_stories_map_p_01_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_02" CONSTRAINT "feeds_stories_map_p_02_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_03" CONSTRAINT "feeds_stories_map_p_03_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_04" CONSTRAINT "feeds_stories_map_p_04_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_05" CONSTRAINT "feeds_stories_map_p_05_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_06" CONSTRAINT "feeds_stories_map_p_06_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_07" CONSTRAINT "feeds_stories_map_p_07_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_08" CONSTRAINT "feeds_stories_map_p_08_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_09" CONSTRAINT "feeds_stories_map_p_09_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_10" CONSTRAINT "feeds_stories_map_p_10_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_11" CONSTRAINT "feeds_stories_map_p_11_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_12" CONSTRAINT "feeds_stories_map_p_12_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_13" CONSTRAINT "feeds_stories_map_p_13_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_14" CONSTRAINT "feeds_stories_map_p_14_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_15" CONSTRAINT "feeds_stories_map_p_15_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_16" CONSTRAINT "feeds_stories_map_p_16_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_17" CONSTRAINT "feeds_stories_map_p_17_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_18" CONSTRAINT "feeds_stories_map_p_18_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_19" CONSTRAINT "feeds_stories_map_p_19_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_20" CONSTRAINT "feeds_stories_map_p_20_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_21" CONSTRAINT "feeds_stories_map_p_21_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_tags_map" CONSTRAINT "feeds_tags_map_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) ON DELETE CASCADE
    TABLE "scraped_feeds" CONSTRAINT "scraped_feeds_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) ON DELETE CASCADE
Access method: heap

So, a bunch of non-partitioned and partitioned tables reference feeds.feeds_id. One table that's missing from this list is feeds_from_yesterday - I remember that the lack of a reference is deliberate; I just don't remember why :)

Here are the sizes of all of these tables:

psql# \d+

<...>

 public | downloads                                                       | partitioned table | mediacloud | permanent   | 0 bytes    | 
 public | downloads_error                                                 | table             | mediacloud | permanent   | 34 GB      | 
 public | downloads_feed_error                                            | table             | mediacloud | permanent   | 5828 MB    | 
 public | downloads_fetching                                              | table             | mediacloud | permanent   | 11 MB      | 
 public | downloads_in_past_day                                           | view              | mediacloud | permanent   | 0 bytes    | 
 public | downloads_media                                                 | view              | mediacloud | permanent   | 0 bytes    | 
 public | downloads_non_media                                             | view              | mediacloud | permanent   | 0 bytes    | 
 public | downloads_pending                                               | table             | mediacloud | permanent   | 11 MB      | 
 public | downloads_success                                               | partitioned table | mediacloud | permanent   | 0 bytes    | 
 public | downloads_success_content                                       | partitioned table | mediacloud | permanent   | 0 bytes    | 
 public | downloads_success_content_00                                    | table             | mediacloud | permanent   | 5322 MB    | 
 public | downloads_success_content_01                                    | table             | mediacloud | permanent   | 7182 MB    | 
 public | downloads_success_content_02                                    | table             | mediacloud | permanent   | 11 GB      | 
 public | downloads_success_content_03                                    | table             | mediacloud | permanent   | 12 GB      | 
 public | downloads_success_content_04                                    | table             | mediacloud | permanent   | 10 GB      | 
 public | downloads_success_content_05                                    | table             | mediacloud | permanent   | 10147 MB   | 
 public | downloads_success_content_06                                    | table             | mediacloud | permanent   | 12 GB      | 
 public | downloads_success_content_07                                    | table             | mediacloud | permanent   | 13 GB      | 
 public | downloads_success_content_08                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_09                                    | table             | mediacloud | permanent   | 13 GB      | 
 public | downloads_success_content_10                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_11                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_12                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_13                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_14                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_15                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_16                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_17                                    | table             | mediacloud | permanent   | 16 GB      | 
 public | downloads_success_content_18                                    | table             | mediacloud | permanent   | 16 GB      | 
 public | downloads_success_content_19                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_20                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_21                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_22                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_23                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_24                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_25                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_26                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_27                                    | table             | mediacloud | permanent   | 16 GB      | 
 public | downloads_success_content_28                                    | table             | mediacloud | permanent   | 18 GB      | 
 public | downloads_success_content_29                                    | table             | mediacloud | permanent   | 19 GB      | 
 public | downloads_success_content_30                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_31                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_32                                    | table             | mediacloud | permanent   | 9981 MB    | 
 public | downloads_success_content_33                                    | table             | mediacloud | permanent   | 8192 bytes | 
 public | downloads_success_feed                                          | partitioned table | mediacloud | permanent   | 0 bytes    | 
 public | downloads_success_feed_00                                       | table             | mediacloud | permanent   | 8366 MB    | 
 public | downloads_success_feed_01                                       | table             | mediacloud | permanent   | 5427 MB    | 
 public | downloads_success_feed_02                                       | table             | mediacloud | permanent   | 7832 MB    | 
 public | downloads_success_feed_03                                       | table             | mediacloud | permanent   | 5881 MB    | 
 public | downloads_success_feed_04                                       | table             | mediacloud | permanent   | 6155 MB    | 
 public | downloads_success_feed_05                                       | table             | mediacloud | permanent   | 6575 MB    | 
 public | downloads_success_feed_06                                       | table             | mediacloud | permanent   | 5742 MB    | 
 public | downloads_success_feed_07                                       | table             | mediacloud | permanent   | 5534 MB    | 
 public | downloads_success_feed_08                                       | table             | mediacloud | permanent   | 4877 MB    | 
 public | downloads_success_feed_09                                       | table             | mediacloud | permanent   | 5132 MB    | 
 public | downloads_success_feed_10                                       | table             | mediacloud | permanent   | 5217 MB    | 
 public | downloads_success_feed_11                                       | table             | mediacloud | permanent   | 5072 MB    | 
 public | downloads_success_feed_12                                       | table             | mediacloud | permanent   | 5091 MB    | 
 public | downloads_success_feed_13                                       | table             | mediacloud | permanent   | 5257 MB    | 
 public | downloads_success_feed_14                                       | table             | mediacloud | permanent   | 5209 MB    | 
 public | downloads_success_feed_15                                       | table             | mediacloud | permanent   | 5229 MB    | 
 public | downloads_success_feed_16                                       | table             | mediacloud | permanent   | 3821 MB    | 
 public | downloads_success_feed_17                                       | table             | mediacloud | permanent   | 3078 MB    | 
 public | downloads_success_feed_18                                       | table             | mediacloud | permanent   | 3128 MB    | 
 public | downloads_success_feed_19                                       | table             | mediacloud | permanent   | 4509 MB    | 
 public | downloads_success_feed_20                                       | table             | mediacloud | permanent   | 5425 MB    | 
 public | downloads_success_feed_21                                       | table             | mediacloud | permanent   | 5463 MB    | 
 public | downloads_success_feed_22                                       | table             | mediacloud | permanent   | 5793 MB    | 
 public | downloads_success_feed_23                                       | table             | mediacloud | permanent   | 5041 MB    | 
 public | downloads_success_feed_24                                       | table             | mediacloud | permanent   | 5319 MB    | 
 public | downloads_success_feed_25                                       | table             | mediacloud | permanent   | 5468 MB    | 
 public | downloads_success_feed_26                                       | table             | mediacloud | permanent   | 5396 MB    | 
 public | downloads_success_feed_27                                       | table             | mediacloud | permanent   | 5224 MB    | 
 public | downloads_success_feed_28                                       | table             | mediacloud | permanent   | 5124 MB    | 
 public | downloads_success_feed_29                                       | table             | mediacloud | permanent   | 4987 MB    | 
 public | downloads_success_feed_30                                       | table             | mediacloud | permanent   | 5155 MB    | 
 public | downloads_success_feed_31                                       | table             | mediacloud | permanent   | 5228 MB    | 
 public | downloads_success_feed_32                                       | table             | mediacloud | permanent   | 3393 MB    | 
 public | downloads_success_feed_33                                       | table             | mediacloud | permanent   | 8192 bytes | 

downloads is partitioned by the state column and then further by type, so, for example, a download with state = 'success' and type = 'feed' would end up in one of the downloads_success_feed tables.

For your purposes I think you can pretty much ignore the fact that it's partitioned and just UPDATE the base downloads table directly.
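
If you want to double-check which partitions the affected rows actually live in, tableoid::regclass exposes the concrete partition of each row; something like this (a sketch, same db handle style as in the test further below) would do:

    # Sanity check: count the soon-to-be-moved rows per concrete partition.
    db.query("""
        SELECT tableoid::regclass AS partition_table, COUNT(*) AS row_count
        FROM downloads
        WHERE feeds_id = %(src)s
        GROUP BY 1
    """, {'src': src_feeds_id})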

Another partitioned table is feeds_stories_map (exposed as a view over the feeds_stories_map_p partitions):

 public | feeds_stories_map                                               | view              | mediacloud | permanent   | 0 bytes    | 
 public | feeds_stories_map_p                                             | table             | mediacloud | permanent   | 0 bytes    | 
 public | feeds_stories_map_p_00                                          | table             | mediacloud | permanent   | 3148 MB    | 
 public | feeds_stories_map_p_01                                          | table             | mediacloud | permanent   | 4078 MB    | 
 public | feeds_stories_map_p_02                                          | table             | mediacloud | permanent   | 4079 MB    | 
 public | feeds_stories_map_p_03                                          | table             | mediacloud | permanent   | 4206 MB    | 
 public | feeds_stories_map_p_04                                          | table             | mediacloud | permanent   | 4208 MB    | 
 public | feeds_stories_map_p_05                                          | table             | mediacloud | permanent   | 4219 MB    | 
 public | feeds_stories_map_p_06                                          | table             | mediacloud | permanent   | 4223 MB    | 
 public | feeds_stories_map_p_07                                          | table             | mediacloud | permanent   | 4220 MB    | 
 public | feeds_stories_map_p_08                                          | table             | mediacloud | permanent   | 4217 MB    | 
 public | feeds_stories_map_p_09                                          | table             | mediacloud | permanent   | 4221 MB    | 
 public | feeds_stories_map_p_10                                          | table             | mediacloud | permanent   | 4227 MB    | 
 public | feeds_stories_map_p_11                                          | table             | mediacloud | permanent   | 4211 MB    | 
 public | feeds_stories_map_p_12                                          | table             | mediacloud | permanent   | 4173 MB    | 
 public | feeds_stories_map_p_13                                          | table             | mediacloud | permanent   | 4424 MB    | 
 public | feeds_stories_map_p_14                                          | table             | mediacloud | permanent   | 4293 MB    | 
 public | feeds_stories_map_p_15                                          | table             | mediacloud | permanent   | 4241 MB    | 
 public | feeds_stories_map_p_16                                          | table             | mediacloud | permanent   | 4251 MB    | 
 public | feeds_stories_map_p_17                                          | table             | mediacloud | permanent   | 4210 MB    | 
 public | feeds_stories_map_p_18                                          | table             | mediacloud | permanent   | 4215 MB    | 
 public | feeds_stories_map_p_19                                          | table             | mediacloud | permanent   | 4282 MB    | 
 public | feeds_stories_map_p_20                                          | table             | mediacloud | permanent   | 3383 MB    | 
 public | feeds_stories_map_p_21                                          | table             | mediacloud | permanent   | 0 bytes    | 

And then there are a few smaller tables:

 public | feeds_tags_map                                                  | table             | mediacloud | permanent   | 696 kB     | 

<...>

 public | scraped_feeds                                                   | table             | mediacloud | permanent   | 10072 kB   | 

<...>

 public | feeds_from_yesterday                                            | table             | mediacloud | permanent   | 36 MB      | 

Tips, tricks, notes and other things that came to mind

  • Bigger tables might have thousands if not hundreds of thousands of rows that reference feeds.feeds_id, so you'll need to chunk your UPDATEs somehow. One way to do this is to get MIN(primary_key) and MAX(primary_key) from every referencing table for a specific feeds_id (make sure that an index exists that would allow you to do this in a timely manner!), and then UPDATE the referencing table in chunks based on primary_key, as sketched below.
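
    A minimal sketch of that chunking, for downloads, assuming an integer primary key (downloads_id) and an index covering feeds_id - both assumptions should be verified against the actual schema first:

      # CHUNK_SIZE is a placeholder to tune against production load.
      CHUNK_SIZE = 100_000

      bounds = db.query("""
          SELECT MIN(downloads_id) AS min_id, MAX(downloads_id) AS max_id
          FROM downloads
          WHERE feeds_id = %(src)s
      """, {'src': src_feeds_id}).hashes()[0]

      if bounds['min_id'] is not None:
          chunk_start = bounds['min_id']
          while chunk_start <= bounds['max_id']:
              # Each UPDATE touches at most CHUNK_SIZE rows of the primary key
              # range, so no single statement locks the table for too long:
              db.query("""
                  UPDATE downloads
                  SET feeds_id = %(dst)s
                  WHERE feeds_id = %(src)s
                    AND downloads_id BETWEEN %(start)s AND %(end)s
              """, {
                  'src': src_feeds_id,
                  'dst': dst_feeds_id,
                  'start': chunk_start,
                  'end': chunk_start + CHUNK_SIZE - 1,
              })
              chunk_start += CHUNK_SIZE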

  • Given the "do this, if that succeeds then do that, then ..., and make sure that it all works for thousands of inputs, and you better track progress of all of it, oh, and external components to be updated might go down at any point, and also it's unclear whether individual steps to be executed will work with production's amount of data" nature of this task, I think this is a good chance to try out Temporal (a pre-barebones interface sketch follows the reference links below). You can use my podcast ingest as a reference:

    • Interface: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/podcast-transcribe-episode/src/python/podcast_transcribe_episode/workflow_interface.py
    • Implementation: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/podcast-transcribe-episode/src/python/podcast_transcribe_episode/workflow.py
    • Test that writes some mock data, runs the workflow and checks what came out on the other end:
      @pytest.mark.asyncio
      async def test_workflow():
          db = connect_to_db()

          test_medium = create_test_medium(db=db, label='test')
          test_feed = create_test_feed(db=db, label='test', medium=test_medium)

          # 'label' is important as it will be stored in both stories.title and stories.description, which in turn
          # will be used to guess the probable language of the podcast episode
          test_story = create_test_story(db=db, label='keeping up with Kardashians', feed=test_feed)

          stories_id = test_story['stories_id']

          with open(TEST_MP3_PATH, mode='rb') as f:
              test_mp3_data = f.read()

          # noinspection PyUnusedLocal
          def __mp3_callback(request: HashServer.Request) -> Union[str, bytes]:
              response = "".encode('utf-8')
              response += "HTTP/1.0 200 OK\r\n".encode('utf-8')
              response += "Content-Type: audio/mpeg\r\n".encode('utf-8')
              response += f"Content-Length: {len(test_mp3_data)}\r\n".encode('utf-8')
              response += "\r\n".encode('utf-8')
              response += test_mp3_data
              return response

          port = random_unused_port()
          pages = {
              '/test.mp3': {
                  'callback': __mp3_callback,
              }
          }

          hs = HashServer(port=port, pages=pages)
          hs.start()

          # Not localhost as this might get fetched from a remote worker
          mp3_url = hs.page_url('/test.mp3')

          db.insert(table='story_enclosures', insert_hash={
              'stories_id': stories_id,
              'url': mp3_url,
              'mime_type': 'audio/mpeg',
              'length': len(test_mp3_data),
          })

          client = workflow_client()

          # Start worker
          factory = WorkerFactory(client=client, namespace=client.namespace)
          worker = factory.new_worker(task_queue=TASK_QUEUE)

          # Use an activities implementation with random GCS prefixes set
          activities = _RandomPrefixesPodcastTranscribeActivities()

          worker.register_activities_implementation(
              activities_instance=activities,
              activities_cls_name=PodcastTranscribeActivities.__name__,
          )
          worker.register_workflow_implementation_type(impl_cls=PodcastTranscribeWorkflowImpl)
          factory.start()

          # Initialize workflow instance
          workflow: PodcastTranscribeWorkflow = client.new_workflow_stub(
              cls=PodcastTranscribeWorkflow,
              workflow_options=WorkflowOptions(
                  workflow_id=str(stories_id),

                  # By default, if individual activities of the workflow fail, they will get restarted pretty much
                  # indefinitely, and so this test might run for days (or rather just timeout on the CI). So we cap
                  # the workflow so that if it doesn't manage to complete in X minutes, we consider it as failed.
                  workflow_run_timeout=timedelta(minutes=5),
              ),
          )

          # Wait for the workflow to complete
          await workflow.transcribe_episode(stories_id)

          downloads = db.select(table='downloads', what_to_select='*').hashes()
          assert len(downloads) == 1
          first_download = downloads[0]
          assert first_download['stories_id'] == stories_id
          assert first_download['type'] == 'content'
          assert first_download['state'] == 'success'

          download_content = fetch_content(db=db, download=first_download)

          # It's what gets said in the sample MP3 file
          assert 'Kim Kardashian' in download_content

          # Initiate the worker shutdown in the background while we do the GCS cleanup so that
          # stop_worker_faster() doesn't have to wait that long
          await worker.stop(background=True)

          log.info("Cleaning up GCS...")
          GCSStore(bucket_config=activities.config.raw_enclosures()).delete_object(object_id=str(stories_id))
          GCSStore(bucket_config=activities.config.transcoded_episodes()).delete_object(object_id=str(stories_id))
          GCSStore(bucket_config=activities.config.transcripts()).delete_object(object_id=str(stories_id))
          log.info("Cleaned up GCS")

          log.info("Stopping workers...")
          await stop_worker_faster(worker)
          log.info("Stopped workers")
    • Some docs: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/doc/workflows.markdown
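
    To tie this back to feed merging, a pre-barebones interface could look something like the sketch below. The activity split, names and the queue are all placeholders, and the actual @workflow_method / @activity_method decorators plus client/worker plumbing should be copied from the podcast workflow interface linked above:

      TASK_QUEUE = "merge-feeds"  # hypothetical queue name

      class MergeFeedsActivities(object):
          """One activity per referencing table, so that Temporal tracks and
          retries each step independently."""

          async def move_downloads(self, src_feeds_id: int, dst_feeds_id: int) -> None:
              ...

          async def move_feeds_stories_map(self, src_feeds_id: int, dst_feeds_id: int) -> None:
              ...

          async def move_small_tables(self, src_feeds_id: int, dst_feeds_id: int) -> None:
              # scraped_feeds, feeds_from_yesterday, feeds_tags_map
              ...

          async def delete_feed(self, src_feeds_id: int) -> None:
              ...

      class MergeFeedsWorkflow(object):

          async def merge_feed(self, src_feeds_id: int, dst_feeds_id: int) -> None:
              # The implementation would call the activities above in order; if
              # any one of them fails mid-way, Temporal retries just that
              # activity instead of rerunning the whole merge.
              ...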
  • IMHO this is one of those tasks that become easier if you write yourself a good test that confirms your code is doing exactly what you want it to do. So, make sure to write at least one good test that preloads a testing database with some mock duplicate feeds (in both feeds and the other referencing tables), runs the workflow, and makes sure that feeds got merged and nothing got lost in the process. The test should also exercise the chunked UPDATEs and other border cases that come to mind.
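
    A skeleton for such a test might start out like this - the create_test_*() helpers are the same ones used in the podcast test above, while the workflow invocation is left as a stub since the interface doesn't exist yet:

      # A sketch: just the fixture setup and the post-merge assertions; the
      # worker and workflow stub plumbing would mirror the podcast test above.
      def test_merge_feeds():
          db = connect_to_db()

          medium = create_test_medium(db=db, label='test')
          src_feed = create_test_feed(db=db, label='src', medium=medium)
          dst_feed = create_test_feed(db=db, label='dst', medium=medium)

          # A story "found" by both feeds, to exercise the duplicate handling
          # (assuming create_test_story() links the story to its feed in
          # feeds_stories_map):
          story = create_test_story(db=db, label='dup', feed=src_feed)
          db.insert(table='feeds_stories_map', insert_hash={
              'feeds_id': dst_feed['feeds_id'],
              'stories_id': story['stories_id'],
          })

          # <run the merge workflow for (src_feed, dst_feed) here>

          # The source feed is gone and only the destination feed remains:
          feeds = db.select(table='feeds', what_to_select='*').hashes()
          assert len(feeds) == 1
          assert feeds[0]['feeds_id'] == dst_feed['feeds_id']

          # ...and the story got mapped to the destination feed exactly once:
          fsm = db.select(table='feeds_stories_map', what_to_select='*').hashes()
          assert len(fsm) == 1
          assert fsm[0]['feeds_id'] == dst_feed['feeds_id']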

As always, complain loudly and early if something's unclear!

pypt commented Oct 21, 2021

As for the workflow-as-in-the-how-do-we-code-this-thing-together-workflow, whenever you come up with at least the pre-barebones / scaffolding of what would eventually end up being the feed merging workflow, could you submit a PR and then we'll iteratively stare at it and work it out together?

Or do you have any other ideas on how we could go about implementing this?

@jtotoole jtotoole removed their assignment Dec 1, 2021