skinapi2025/owlkit

OwlKit

AI-Powered User Profiling Tool - Extract browsing history from Chrome and generate structured user profiles for personalization.

Features

  • History Extraction: Extract browsing history from Chrome's SQLite database
  • URL Normalization: Remove tracking parameters, deduplicate URLs, HTTP→HTTPS normalization
  • Interest Classification: Categorize browsing into 10 interest areas
  • Page Type Classification: Domain-specific URL structure classification (product, search, shop, etc.)
  • Entity Recognition: Extract companies and tech stacks from page titles
  • Search Keyword Analysis: Extract search intents from both search engines and e-commerce platforms
  • Product Deduplication: Deduplicate products across domains by item ID
  • Shop Deduplication: Merge shop/shop_category pages by shop name across subdomains
  • Time Pattern Analysis: Analyze browsing patterns by time of day
  • Trend Detection: Compare recent activity vs historical trends
  • Multi-Output Formats: JSON, Markdown, CSV output
  • AI-Ready: Designed for LLM integration with stable JSON schema
  • Configurable: Easy to customize classification rules via JSON config
  • Filterable: Filter by domain presets and page types
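The URL normalization feature can be pictured with a short sketch. This is not OwlKit's own code: the function names and the `TRACKING_PARAMS` list are assumptions made for illustration, and the real tool may strip a different set of parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters; the list OwlKit actually uses is not documented here.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "spm", "fbclid", "gclid",
}


def normalize_url(url: str) -> str:
    """Upgrade http to https, lowercase the host, drop tracking params and fragments."""
    parts = urlsplit(url)
    scheme = "https" if parts.scheme == "http" else parts.scheme
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit((scheme, parts.netloc.lower(), parts.path, urlencode(query), ""))


def dedupe_urls(urls):
    """Return normalized URLs in first-seen order, without duplicates."""
    return list(dict.fromkeys(normalize_url(url) for url in urls))
```

Normalizing before deduplicating means that two visits differing only in scheme or tracking parameters collapse into one entry.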

Installation

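Run the script directly from the repository directory once the prerequisites under Requirements are met:

python owlkit.py -o profile.json -f json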

Quick Start

CLI Usage (Human)

# Default: 30 days, JSON output to console
python owlkit.py

# Export Markdown report (human-friendly)
python owlkit.py -d 30 -o report.md -f md

# Export JSON profile
python owlkit.py -o profile.json -f json

# 7 days of history
python owlkit.py -d 7 -o week_report.json -f json

AI Subprocess Call

# Output to stdout (for AI subprocess integration)
python owlkit.py -d 30 -f json --stdout

# Lightweight mode (exclude recent visits)
python owlkit.py -d 30 -f json --stdout --no-recent

# Filter by e-commerce sites only
python owlkit.py -d 30 -f json --filter-domains ecommerce

# Filter by specific domains
python owlkit.py -d 30 -f json --filter-domains taobao.com,jd.com

# Mix preset and explicit domains
python owlkit.py -d 30 -f json --filter-domains ai,github.com

# Filter by page types (only search and product pages)
python owlkit.py -d 30 -f json --filter-domains ecommerce --page-types search,product

# Filter by shop-related pages
python owlkit.py -d 30 -f json --filter-domains ecommerce --page-types shop,shop_category
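A calling program could wrap these commands roughly as follows. This is a sketch under assumptions: `build_command` and `run_owlkit` are hypothetical helper names, `owlkit.py` is assumed to be in the working directory, and the `--stdout` output is assumed to be UTF-8 JSON with nothing else printed.

```python
import json
import subprocess
import sys


def build_command(days=30, filter_domains=None, page_types=None, include_recent=True):
    """Assemble the CLI arguments documented above into an argv list."""
    cmd = [sys.executable, "owlkit.py", "-d", str(days), "-f", "json", "--stdout"]
    if not include_recent:
        cmd.append("--no-recent")
    if filter_domains:
        cmd += ["--filter-domains", ",".join(filter_domains)]
    if page_types:
        cmd += ["--page-types", ",".join(page_types)]
    return cmd


def run_owlkit(**options):
    """Run OwlKit as a subprocess and parse its stdout as JSON."""
    result = subprocess.run(
        build_command(**options),
        capture_output=True,
        text=True,
        encoding="utf-8",  # assumption: the tool writes UTF-8
        check=True,        # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

For example, `run_owlkit(days=7, filter_domains=["ecommerce"], include_recent=False)` would correspond to the lightweight e-commerce command shown above.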

Domain Presets

# List all available presets
python owlkit.py --list-presets

| Preset | Domains |
| --- | --- |
| ecommerce | taobao.com, tmall.com, jd.com, 1688.com, pinduoduo.com, amazon.com, amazon.cn, youpin.com |
| ai | kimi.com, deepseek.com, claude.ai, chat.openai.com, poe.com |
| social | weibo.com, zhihu.com, xiaohongshu.com, twitter.com, reddit.com |
| video | youtube.com, bilibili.com, douyin.com, netflix.com, twitch.tv |
| finance | xueqiu.com, eastmoney.com, sina.com.cn, cls.cn |
| tech | github.com, gitlab.com, stackoverflow.com, npmjs.com, pypi.org |
| search | google.com, baidu.com, bing.com, sogou.com, duckduckgo.com |
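Mixing presets with explicit domains, as in `--filter-domains ai,github.com`, can be thought of as a simple expansion step. The sketch below is illustrative only: `PRESETS` copies two rows of the preset list, `expand_filter` and `domain_matches` are hypothetical names, and matching subdomains by suffix is an assumption about how the filter behaves.

```python
# Two presets copied from the list above; the remaining presets would follow the same shape.
PRESETS = {
    "ai": ["kimi.com", "deepseek.com", "claude.ai", "chat.openai.com", "poe.com"],
    "tech": ["github.com", "gitlab.com", "stackoverflow.com", "npmjs.com", "pypi.org"],
}


def expand_filter(values):
    """Replace preset names with their domains; pass explicit domains through unchanged."""
    domains = []
    for value in values:
        domains.extend(PRESETS.get(value, [value]))
    return list(dict.fromkeys(domains))  # drop duplicates, keep order


def domain_matches(host, allowed):
    """Assumed rule: a host matches if it equals an allowed domain or is a subdomain of it."""
    return any(host == domain or host.endswith("." + domain) for domain in allowed)
```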

Page Types

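When using --page-types, only records matching the specified page types are included in the analysis sections. user_profile is never narrowed by --page-types: it is computed from all page types, with only --filter-domains applied (see Data Filtering Behavior).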

| Page Type | Description | Example |
| --- | --- | --- |
| product | Product detail page | item.taobao.com/item.htm?id=xxx |
| search | Search results page | s.taobao.com/search?q=xxx |
| shop | Shop homepage | shop123.taobao.com/ |
| shop_category | Shop category page | shop123.taobao.com/category-xxx |
| cart | Shopping cart | cart.taobao.com |
| orders | Order list | buyertrade.taobao.com |
| order | Order confirmation | buy.taobao.com |
| chat | Customer chat | market.m.taobao.com/app/im/chat |
| home | Site homepage | www.taobao.com/ |
| login | Login page | login.taobao.com |
| account | Account settings | i.taobao.com |
| redirect | Ad redirect | click.simba.taobao.com |
| market | Marketing page | market.m.taobao.com (non-chat) |

Python API

from owlkit import get_profile

# Get 7-day profile
profile = get_profile(days=7)

# Get full profile with recent visits
profile = get_profile(days=30, include_recent=True)

# Filter by e-commerce sites only
profile = get_profile(days=30, filter_domains=["ecommerce"])

# Filter by specific domains
profile = get_profile(days=30, filter_domains=["taobao.com", "jd.com"])

# Filter by page types
profile = get_profile(days=30, filter_domains=["ecommerce"], filter_page_types={"search", "product"})

# Access profile data
print(profile["user_profile"]["role_tags"])
print(profile["user_profile"]["primary_interests"])
print(profile["entities"]["tech_stack"])

Configuration

OwlKit uses a JSON configuration file for customizing interest classification. This makes it easy for AI systems to maintain and update classification rules without modifying code.

Config File Location

owlkit/
├── owlkit.py
└── config/
    └── categories.json

Config Structure

{
  "_schema": "owlkit-categories-v1.0",
  "_version": "1.0.0",
  "categories": {
    "Category Name": {
      "weight": 1.0,
      "domains": ["example.com"],
      "forced_domains": ["example.com"],
      "path_patterns": ["/path"],
      "exclude_paths": ["/exclude"],
      "domain_rules": {
        "example.com": [
          {"subdomain": "item", "page_type": "product"},
          {"subdomain": "s", "path": "/search", "page_type": "search"},
          {"subdomain": "*", "page_type": "shop"}
        ]
      },
      "title_keywords": {
        "zh": ["中文关键词"],
        "en": ["English keywords"]
      }
    }
  },
  "tech_keywords": ["ESP32", "Arduino", ...],
  "entity_blacklist": ["无效实体"],
  "valid_company_suffixes": ["科技", "公司"]
}

Domain Rules

Domain rules define how URLs are classified into page types based on subdomain and path patterns:

  • subdomain: Supports exact match ("item"), prefix wildcard ("shop*"), and catch-all ("*")
  • path: Matches if the path contains the given string
  • page_type: The classification result

Rules are evaluated in order; the first match wins. The catch-all * should be placed last.
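The matching semantics described above (exact, prefix wildcard, catch-all, path substring, first match wins) can be sketched as follows. The function names are hypothetical and this is not OwlKit's implementation; in particular, treating the bare domain as an empty subdomain is an assumption.

```python
def subdomain_matches(pattern, subdomain):
    """Exact match ("item"), prefix wildcard ("shop*"), or catch-all ("*")."""
    if pattern == "*":
        return True
    if pattern.endswith("*"):
        return subdomain.startswith(pattern[:-1])
    return subdomain == pattern


def classify_page(host, path, domain_rules):
    """Return the page_type of the first matching rule, or None if nothing matches."""
    for domain, rules in domain_rules.items():
        if host == domain:
            subdomain = ""  # assumption: bare domain counts as an empty subdomain
        elif host.endswith("." + domain):
            subdomain = host[: -len(domain) - 1]
        else:
            continue
        for rule in rules:  # rules are evaluated in order
            if not subdomain_matches(rule["subdomain"], subdomain):
                continue
            if "path" in rule and rule["path"] not in path:
                continue  # path must contain the given string
            return rule["page_type"]
    return None
```

With the rules from the Config Structure example, `s.example.com/search` resolves to `search`, while `s.example.com/other` falls through to the catch-all and resolves to `shop`, which is why the catch-all belongs at the end.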

Adding Custom Categories

Edit config/categories.json to add new categories:

{
  "categories": {
    "IoT Devices": {
      "weight": 1.0,
      "domains": ["esp32.com", "arduino.cc"],
      "path_patterns": ["/board", "/product"],
      "domain_rules": {
        "esp32.com": [
          {"subdomain": "shop", "page_type": "shop"},
          {"subdomain": "*", "page_type": "product"}
        ]
      },
      "title_keywords": {
        "zh": ["开发板", "单片机", "传感器"],
        "en": ["development board", "microcontroller", "sensor", "IoT"]
      }
    }
  }
}
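Because the first matching rule wins, a catch-all placed too early silently shadows every rule after it. A small check like the one below could catch that before running the tool. It is a sketch: `find_misplaced_catch_alls` is a hypothetical helper, not part of OwlKit, and the inline JSON stands in for the contents of config/categories.json.

```python
import json


def find_misplaced_catch_alls(config):
    """Return (category, domain, index) for each path-less "*" rule that is not last."""
    problems = []
    for name, category in config.get("categories", {}).items():
        for domain, rules in category.get("domain_rules", {}).items():
            for index, rule in enumerate(rules[:-1]):  # the last rule may be a catch-all
                if rule.get("subdomain") == "*" and "path" not in rule:
                    problems.append((name, domain, index))
    return problems


# Inline sample with the catch-all deliberately placed first.
sample = json.loads("""
{
  "categories": {
    "IoT Devices": {
      "domain_rules": {
        "esp32.com": [
          {"subdomain": "*", "page_type": "product"},
          {"subdomain": "shop", "page_type": "shop"}
        ]
      }
    }
  }
}
""")
problems = find_misplaced_catch_alls(sample)
```

Here `problems` flags the first rule for esp32.com, since the `shop` rule after it can never match.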

Output Schema

{
  "_schema": "owlkit-profile-v1.0",
  "_version": "1.0.2",
  "_generated_at": "2026-05-04T20:00:00+08:00",
  "_source": "chrome_history",

  "user_profile": {
    "role_tags": ["Shopper", "Embedded Engineer", "Programmer"],
    "primary_interests": ["E-commerce", "Search Engines"]
  },

  "statistics": {
    "time_range_days": 30,
    "total_records": 465,
    "unique_domains": 50,
    "unique_products": 51,
    "classification_coverage": 100.0
  },

  "interest_areas": [
    {"tag": "E-commerce", "count": 459, "weight": 0.981, "percentage": 98.1, "top_sites": ["s.taobao.com", "item.taobao.com"]}
  ],

  "top_domains": [
    {"domain": "s.taobao.com", "visits": 117, "percentage": 19.4}
  ],

  "search_keywords": [
    {"keyword": "乐鑫", "count": 21}
  ],

  "page_type_stats": {
    "product": {"count": 188, "percentage": 40.4, "top_domains": ["item.taobao.com", "detail.tmall.com"]},
    "search": {"count": 136, "percentage": 29.2, "top_domains": ["s.taobao.com", "s.1688.com"]},
    "shop": {"count": 59, "percentage": 12.7, "top_domains": ["shop113593007.taobao.com"]}
  },

  "product_stats": [
    {"item_id": "817841934360", "domain": "detail.tmall.com", "title": "ESP-32 ...", "visits": 54}
  ],

  "entities": {
    "stocks": [],
    "companies": ["乐鑫科技", "涂鸦智能"],
    "tech_stack": ["ESP32", "Zigbee", "BLE", "Python", "WiFi"]
  },

  "browsing_patterns": {
    "morning": 84,
    "afternoon": 100,
    "evening": 261,
    "night": 20
  },

  "trend": {
    "status": "increasing",
    "recent_7d": 291,
    "prior_7_14d": 42
  },

  "recent_visits": [
    {
      "title": "ESP32-S3-DevKitC-1 乐鑫科技 ESP32-S3开发板-淘宝网",
      "url": "https://item.taobao.com/item.htm?id=653155344338",
      "domain": "item.taobao.com",
      "page_type": "product",
      "item_id": "653155344338",
      "search_query": null,
      "shop_name": null,
      "visit_count": 27,
      "last_visit": "2026-05-04T09:02:57+08:00"
    },
    {
      "title": "首页-乐鑫科技Espressif Online-淘宝网",
      "url": "https://shop113593007.taobao.com/",
      "domain": "shop113593007.taobao.com",
      "page_type": "shop",
      "item_id": null,
      "search_query": null,
      "shop_name": "乐鑫科技Espressif Online",
      "visit_count": 25,
      "last_visit": "2026-05-02T10:47:38+08:00"
    }
  ]
}
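A consumer of this schema might reduce a profile to the handful of fields it needs. The helper below is a hypothetical sketch, not part of OwlKit; it reads only keys shown in the schema above and tolerates missing sections, since options such as --no-recent omit parts of the payload.

```python
def summarize_profile(profile):
    """Pick a few headline fields out of an OwlKit profile dict."""
    page_types = profile.get("page_type_stats", {})
    # Page type with the highest count, or None when the section is empty.
    top_page_type = max(page_types, key=lambda name: page_types[name]["count"], default=None)
    return {
        "schema": profile.get("_schema"),
        "roles": profile.get("user_profile", {}).get("role_tags", []),
        "total_records": profile.get("statistics", {}).get("total_records", 0),
        "top_page_type": top_page_type,
        "trend": profile.get("trend", {}).get("status"),
    }
```

Checking `_schema` before relying on the other fields is a cheap way to detect a payload from a different schema version.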

Data Filtering Behavior

When --page-types is specified:

| Section | Data Source | Description |
| --- | --- | --- |
| user_profile | Full data (filter_domains only) | Overall user portrait, always reflects complete profile |
| interest_areas | Filtered data (filter_domains + page_types) | Interest distribution within selected page types |
| search_keywords | Filtered data | Keywords from selected page types |
| entities | Filtered data | Entities from selected page types |
| statistics | Filtered data | Statistics for selected page types |
| top_domains | Filtered data | Domain ranking for selected page types |
| page_type_stats | Filtered data | Page type breakdown |
| browsing_patterns | Filtered data | Time patterns for selected page types |
| trend | Filtered data | Trend for selected page types |
| product_stats | Filtered data | Product stats for selected page types |
| recent_visits | Filtered data | Recent visits for selected page types |
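One way to picture the two data sources is as two views over the same records: one narrowed by domain only, the other narrowed by domain and page type. The sketch below assumes each record is a dict with `domain` and `page_type` keys and uses exact domain comparison for brevity; `split_views` is a hypothetical name, not an OwlKit function.

```python
def split_views(records, allowed_domains=None, page_types=None):
    """Return (domain_view, filtered_view) for a list of visit records."""
    # View behind user_profile: only the domain filter applies.
    domain_view = [
        record for record in records
        if allowed_domains is None or record["domain"] in allowed_domains
    ]
    # View behind every other section: the page type filter is applied on top.
    filtered_view = [
        record for record in domain_view
        if page_types is None or record["page_type"] in page_types
    ]
    return domain_view, filtered_view
```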

Deduplication Rules

Recent visits are deduplicated with the following priority:

  1. Item ID (cross-domain): Same product ID on different domains (e.g., detail.tmall.com and item.taobao.com) → merged into one entry
  2. Search query + domain: Same search keyword on the same domain → merged
  3. Shop name: Same shop name from title (e.g., "首页-启明云端官方企业店-淘宝网") → merged across subdomains, shop_category upgraded to shop
  4. Page type + title: Same page type and title → merged
  5. Domain + title: Same domain and title → merged
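The priority order above can be expressed as a key function that returns the first applicable key, followed by a merge over that key. This is a sketch under assumptions, not OwlKit's code: how merged entries are combined is not documented, so summing `visit_count` and keeping the most recently visited entry are illustrative choices, and the function names are hypothetical.

```python
from datetime import datetime


def dedup_key(visit):
    """Return the highest-priority key that applies to a recent_visits entry."""
    if visit.get("item_id"):
        return ("item", visit["item_id"])  # rule 1: cross-domain
    if visit.get("search_query"):
        return ("search", visit["search_query"], visit["domain"])  # rule 2
    if visit.get("shop_name"):
        return ("shop", visit["shop_name"])  # rule 3: across subdomains
    if visit.get("page_type"):
        return ("type_title", visit["page_type"], visit["title"])  # rule 4
    return ("domain_title", visit["domain"], visit["title"])  # rule 5


def merge_visits(visits):
    """Merge entries that share a dedup key (merge policy is an assumption)."""
    merged = {}
    for visit in visits:
        key = dedup_key(visit)
        entry = dict(visit)
        if key[0] == "shop" and entry.get("page_type") == "shop_category":
            entry["page_type"] = "shop"  # shop_category upgraded to shop
        current = merged.get(key)
        if current is None:
            merged[key] = entry
            continue
        total = current["visit_count"] + entry["visit_count"]
        newer = datetime.fromisoformat(entry["last_visit"]) > datetime.fromisoformat(current["last_visit"])
        kept = entry if newer else current
        kept["visit_count"] = total
        merged[key] = kept
    return list(merged.values())
```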

Interest Categories

| Category | Description | Key Domains |
| --- | --- | --- |
| Stocks & Finance | Stock & Finance | xueqiu.com, eastmoney.com |
| Technology & Development | Technology & Development | github.com, stackoverflow.com |
| AI Assistants | AI Assistants | kimi.com, chatglm.cn, deepseek.com |
| Social Media | Social Media | weibo.com, zhihu.com, xiaohongshu.com |
| E-commerce | E-commerce | taobao.com, jd.com, tmall.com |
| Video & Entertainment | Video & Entertainment | youtube.com, bilibili.com |
| Search Engines | Search Engines | google.com, baidu.com, bing.com |
| News | News | news.163.com, thepaper.cn |
| Productivity Tools | Productivity Tools | notion.so, feishu.cn, slack.com |
| Cloud Services | Cloud Services | aliyun.com, aws.amazon.com |

CLI Options

-d, --days          Time window in days (default: 30)
-o, --output        Output file path
-f, --format        Output format: json, md, csv (default: json)
--filter-domains    Filter by domains (comma-separated) or presets: ecommerce, ai, social, video, finance, tech, search
--page-types        Filter by page_type (comma-separated), e.g. search,product,shop
--list-presets      List available domain presets
--stdout            Output to stdout (AI mode)
--no-recent         Exclude recent visits (reduce payload)
--version           Show version

Requirements

  • Python 3.7+
  • Chrome browser installed
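Chrome is required because the history is read from its local SQLite database. For orientation, the sketch below shows one way such a read can work; it is not OwlKit's code and `read_history` is a hypothetical name. It relies on two facts about Chrome's `History` file: visits are summarized in a `urls` table, and `last_visit_time` is stored as microseconds since 1601-01-01 UTC. The file is copied first because Chrome keeps it locked while running. The path to the file is passed in by the caller, as it differs per operating system and profile.

```python
import shutil
import sqlite3
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path

CHROME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def chrome_time_to_datetime(value):
    """Convert Chrome's microseconds-since-1601 timestamp to an aware datetime."""
    return CHROME_EPOCH + timedelta(microseconds=value)


def datetime_to_chrome_time(moment):
    """Convert an aware datetime to Chrome's microseconds-since-1601 timestamp."""
    return (moment - CHROME_EPOCH) // timedelta(microseconds=1)


def read_history(history_path, days=30):
    """Read URLs visited within the last `days` days from a copy of the History file."""
    cutoff = datetime_to_chrome_time(datetime.now(timezone.utc) - timedelta(days=days))
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / "History"
        shutil.copy2(history_path, copy_path)  # avoid the lock held by a running Chrome
        conn = sqlite3.connect(str(copy_path))
        try:
            rows = conn.execute(
                "SELECT url, title, visit_count, last_visit_time FROM urls "
                "WHERE last_visit_time >= ? ORDER BY last_visit_time DESC",
                (cutoff,),
            ).fetchall()
        finally:
            conn.close()
    return [
        {"url": url, "title": title, "visit_count": count,
         "last_visit": chrome_time_to_datetime(stamp)}
        for url, title, count, stamp in rows
    ]
```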

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

Adding Custom Classification Rules

  1. Edit config/categories.json
  2. Add or modify categories and domain_rules
  3. Test with python owlkit.py -d 7 -f json --stdout
  4. Verify output meets expectations

Changelog

v1.0.2

  • Fixed README CLI commands to use python owlkit.py instead of python -m owlkit
  • user_profile now always based on full data regardless of --page-types filter
  • entities, interest_areas, search_keywords now follow --page-types filter

v1.0.1

  • Added --page-types CLI option for filtering by page type
  • Added page type classification with domain-specific rules (product, search, shop, etc.)
  • Added product deduplication across domains by item ID
  • Added shop name extraction and cross-subdomain deduplication
  • Added shop_name field in recent_visits
  • Added page_type_stats and product_stats sections
  • Fixed login.taobao.com classified as shop (now correctly "login")
  • Fixed stocks entity extraction removing false positives (CP2102, ST7789, etc.)
  • Fixed search_keywords to include e-commerce platform search queries
  • user_profile now always based on full data regardless of --page-types filter
  • Added 1688.com /offer/ URL classification as product
  • Added shop search.htm classification as search
  • Added redirect page type for click.simba.taobao.com

v1.0.0

  • Initial release
