use md5 for url_hash #121

@andreawwenyi

Right now we create url_hash with zlib.crc32 in Python. It would be less work to use MySQL's md5 for url_hash, as we did for redirect_to_hash (issue #120). However, this requires some major changes to our current code:

  • migration: change the url_hash column to binary(16), set url_hash = unhex(md5(url)), and create an index on the new url_hash
  • update queries that use url_hash:
    • insert_article.sql
    • get_article_id_by_url.sql
    • get_article_by_url.sql
  • update scripts that use zlib.crc32 for url_hash and use the queries mentioned above:
    • newsSpiders/ptt.py
    • toutiao_discover_spider.py
    • basic_discover_spider.py
    • dcard_dicsover_spider.py
    • webapi/articles.py
    • zs-article.py
    • newsSpiders/webapi/articles.py
    • newsSpiders/items.py
    • newsSpiders/pipelines.py
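
For reference, a minimal Python sketch of the two hash values side by side, which may help when checking migrated rows. The function names and the example URL are hypothetical, and it assumes the url is stored and hashed as UTF-8, so that `hashlib.md5(...).digest()` yields the same 16 bytes as MySQL's `unhex(md5(url))`.

```python
import hashlib
import zlib


def url_hash_crc32(url: str) -> int:
    """Current approach: unsigned 32-bit CRC of the URL, computed in Python."""
    return zlib.crc32(url.encode("utf-8"))


def url_hash_md5(url: str) -> bytes:
    """Proposed value: 16 raw bytes, assumed equal to MySQL unhex(md5(url))
    when the url is hashed as UTF-8 on both sides."""
    return hashlib.md5(url.encode("utf-8")).digest()


example_url = "https://example.com/article/1"  # hypothetical URL
print(url_hash_crc32(example_url))       # fits an unsigned 32-bit integer
print(url_hash_md5(example_url).hex())   # 32 hex chars, i.e. binary(16)
```

If the hash is computed inside the SQL queries, the scripts would no longer need to pass url_hash at all; the Python version above is only useful for spot-checking.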
