AI-Powered User Profiling Tool - Extract browsing history from Chrome and generate structured user profiles for personalization.
- History Extraction: Extract browsing history from Chrome's SQLite database
- URL Normalization: Remove tracking parameters, deduplicate URLs, HTTP→HTTPS normalization
- Interest Classification: Categorize browsing into 10 interest areas
- Page Type Classification: Domain-specific URL structure classification (product, search, shop, etc.)
- Entity Recognition: Extract companies and tech stacks from page titles
- Search Keyword Analysis: Extract search intents from both search engines and e-commerce platforms
- Product Deduplication: Deduplicate products across domains by item ID
- Shop Deduplication: Merge shop/shop_category pages by shop name across subdomains
- Time Pattern Analysis: Analyze browsing patterns by time of day
- Trend Detection: Compare recent activity vs historical trends
- Multi-Output Formats: JSON, Markdown, CSV output
- AI-Ready: Designed for LLM integration with stable JSON schema
- Configurable: Easy to customize classification rules via JSON config
- Filterable: Filter by domain presets and page types
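
The URL normalization feature listed above can be illustrated with a minimal sketch. This is not OwlKit's actual implementation: the function names `normalize_url` and `deduplicate` and the contents of `TRACKING_PARAMS` are assumptions made for illustration, and the real tool may strip a different set of parameters.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of tracking parameters; the real tool may use another set.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "spm", "fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Upgrade http to https and drop known tracking parameters."""
    parts = urlsplit(url)
    scheme = "https" if parts.scheme == "http" else parts.scheme
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in TRACKING_PARAMS]
    return urlunsplit((scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), parts.fragment))


def deduplicate(urls):
    """Return normalized URLs in first-seen order, without repeats."""
    return list(dict.fromkeys(normalize_url(u) for u in urls))
```

Normalizing before deduplicating means that two history rows differing only in scheme or tracking parameters collapse into one entry.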
python owlkit.py -o profile.json -f json

# Default: 30 days, JSON output to console
python owlkit.py
# Export Markdown report (human-friendly)
python owlkit.py -d 30 -o report.md -f md
# Export JSON profile
python owlkit.py -o profile.json -f json
# 7 days of history
python owlkit.py -d 7 -o week_report.json -f json

# Output to stdout (for AI subprocess integration)
python owlkit.py -d 30 -f json --stdout
# Lightweight mode (exclude recent visits)
python owlkit.py -d 30 -f json --stdout --no-recent
# Filter by e-commerce sites only
python owlkit.py -d 30 -f json --filter-domains ecommerce
# Filter by specific domains
python owlkit.py -d 30 -f json --filter-domains taobao.com,jd.com
# Mix preset and explicit domains
python owlkit.py -d 30 -f json --filter-domains ai,github.com
# Filter by page types (only search and product pages)
python owlkit.py -d 30 -f json --filter-domains ecommerce --page-types search,product
# Filter by shop-related pages
python owlkit.py -d 30 -f json --filter-domains ecommerce --page-types shop,shop_category

# List all available presets
python owlkit.py --list-presets

| Preset | Domains |
|---|---|
| ecommerce | taobao.com, tmall.com, jd.com, 1688.com, pinduoduo.com, amazon.com, amazon.cn, youpin.com |
| ai | kimi.com, deepseek.com, claude.ai, chat.openai.com, poe.com |
| social | weibo.com, zhihu.com, xiaohongshu.com, twitter.com, reddit.com |
| video | youtube.com, bilibili.com, douyin.com, netflix.com, twitch.tv |
| finance | xueqiu.com, eastmoney.com, sina.com.cn, cls.cn |
| tech | github.com, gitlab.com, stackoverflow.com, npmjs.com, pypi.org |
| search | google.com, baidu.com, bing.com, sogou.com, duckduckgo.com |
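
As an illustration of how a `--filter-domains` value that mixes preset names and explicit domains might be expanded, here is a small sketch. The names `PRESETS`, `expand_filter_domains`, and `domain_matches` are hypothetical, only two presets from the table are reproduced, and the subdomain-matching rule is an assumption rather than documented behavior.

```python
# Two presets copied from the table above; the others are omitted for brevity.
PRESETS = {
    "ai": ["kimi.com", "deepseek.com", "claude.ai", "chat.openai.com", "poe.com"],
    "search": ["google.com", "baidu.com", "bing.com", "sogou.com", "duckduckgo.com"],
}


def expand_filter_domains(value: str) -> list:
    """Expand a comma-separated mix of preset names and domains."""
    domains = []
    for token in (t.strip().lower() for t in value.split(",")):
        if not token:
            continue
        # A token that is not a preset name is treated as a literal domain.
        for domain in PRESETS.get(token, [token]):
            if domain not in domains:
                domains.append(domain)
    return domains


def domain_matches(host: str, domains: list) -> bool:
    """Assumed rule: a host matches a domain or any of its subdomains."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in domains)
```

With this sketch, `expand_filter_domains("ai,github.com")` yields the five `ai` domains followed by `github.com`.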
When using --page-types, only records matching the specified page types are included in the analysis data. The user_profile section is the exception: it always reflects the full data set, with only --filter-domains applied.
| Page Type | Description | Example |
|---|---|---|
| product | Product detail page | item.taobao.com/item.htm?id=xxx |
| search | Search results page | s.taobao.com/search?q=xxx |
| shop | Shop homepage | shop123.taobao.com/ |
| shop_category | Shop category page | shop123.taobao.com/category-xxx |
| cart | Shopping cart | cart.taobao.com |
| orders | Order list | buyertrade.taobao.com |
| order | Order confirmation | buy.taobao.com |
| chat | Customer chat | market.m.taobao.com/app/im/chat |
| home | Site homepage | www.taobao.com/ |
| login | Login page | login.taobao.com |
| account | Account settings | i.taobao.com |
| redirect | Ad redirect | click.simba.taobao.com |
| market | Marketing page | market.m.taobao.com (non-chat) |
from owlkit import get_profile
# Get 7-day profile
profile = get_profile(days=7)
# Get full profile with recent visits
profile = get_profile(days=30, include_recent=True)
# Filter by e-commerce sites only
profile = get_profile(days=30, filter_domains=["ecommerce"])
# Filter by specific domains
profile = get_profile(days=30, filter_domains=["taobao.com", "jd.com"])
# Filter by page types
profile = get_profile(days=30, filter_domains=["ecommerce"], filter_page_types={"search", "product"})
# Access profile data
print(profile["user_profile"]["role_tags"])
print(profile["user_profile"]["primary_interests"])
print(profile["entities"]["tech_stack"])

OwlKit uses a JSON configuration file for customizing interest classification. This makes it easy for AI systems to maintain and update classification rules without modifying code.
owlkit/
├── owlkit.py
└── config/
    └── categories.json
{
  "_schema": "owlkit-categories-v1.0",
  "_version": "1.0.0",
  "categories": {
    "Category Name": {
      "weight": 1.0,
      "domains": ["example.com"],
      "forced_domains": ["example.com"],
      "path_patterns": ["/path"],
      "exclude_paths": ["/exclude"],
      "domain_rules": {
        "example.com": [
          {"subdomain": "item", "page_type": "product"},
          {"subdomain": "s", "path": "/search", "page_type": "search"},
          {"subdomain": "*", "page_type": "shop"}
        ]
      },
      "title_keywords": {
        "zh": ["中文关键词"],
        "en": ["English keywords"]
      }
    }
  },
  "tech_keywords": ["ESP32", "Arduino", ...],
  "entity_blacklist": ["无效实体"],
  "valid_company_suffixes": ["科技", "公司"]
}

Domain rules define how URLs are classified into page types based on subdomain and path patterns:
- subdomain: Supports exact match ("item"), prefix wildcard ("shop*"), and catch-all ("*")
- path: Matches if the path contains the given string
- page_type: The classification result
Rules are evaluated in order; the first match wins. The catch-all * should be placed last.
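
The matching behavior described above (exact, prefix wildcard, catch-all, substring path test, first match wins) can be sketched as follows. This is an illustrative reading of the rules, not the code in owlkit.py: the function names `subdomain_matches` and `classify_page` are hypothetical, and treating a missing "subdomain" key as a catch-all is an assumption.

```python
from urllib.parse import urlsplit


def subdomain_matches(pattern: str, subdomain: str) -> bool:
    """Exact match, prefix wildcard ("shop*"), or catch-all ("*")."""
    if pattern == "*":
        return True
    if pattern.endswith("*"):
        return subdomain.startswith(pattern[:-1])
    return pattern == subdomain


def classify_page(url: str, domain_rules: dict):
    """Return the page_type of the first matching rule, or None."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    for domain, rules in domain_rules.items():
        if host == domain:
            subdomain = ""
        elif host.endswith("." + domain):
            subdomain = host[: -len(domain) - 1]
        else:
            continue
        for rule in rules:
            # Assumption: a rule without "subdomain" behaves like "*".
            if not subdomain_matches(rule.get("subdomain", "*"), subdomain):
                continue
            # "path" is optional; when present it is a substring test.
            if "path" in rule and rule["path"] not in parts.path:
                continue
            return rule["page_type"]  # first match wins
    return None


# Rules taken from the schema example above.
RULES = {
    "example.com": [
        {"subdomain": "item", "page_type": "product"},
        {"subdomain": "s", "path": "/search", "page_type": "search"},
        {"subdomain": "*", "page_type": "shop"},
    ]
}
```

Because evaluation stops at the first hit, moving the `"*"` rule to the top of the list would classify every URL on the domain as `shop`, which is why the catch-all belongs last.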
Edit config/categories.json to add new categories:
{
  "categories": {
    "IoT Devices": {
      "weight": 1.0,
      "domains": ["esp32.com", "arduino.cc"],
      "path_patterns": ["/board", "/product"],
      "domain_rules": {
        "esp32.com": [
          {"subdomain": "shop", "page_type": "shop"},
          {"subdomain": "*", "page_type": "product"}
        ]
      },
      "title_keywords": {
        "zh": ["开发板", "单片机", "传感器"],
        "en": ["development board", "microcontroller", "sensor", "IoT"]
      }
    }
  }
}

{
  "_schema": "owlkit-profile-v1.0",
  "_version": "1.0.2",
  "_generated_at": "2026-05-04T20:00:00+08:00",
  "_source": "chrome_history",
  "user_profile": {
    "role_tags": ["Shopper", "Embedded Engineer", "Programmer"],
    "primary_interests": ["E-commerce", "Search Engines"]
  },
  "statistics": {
    "time_range_days": 30,
    "total_records": 465,
    "unique_domains": 50,
    "unique_products": 51,
    "classification_coverage": 100.0
  },
  "interest_areas": [
    {"tag": "E-commerce", "count": 459, "weight": 0.981, "percentage": 98.1, "top_sites": ["s.taobao.com", "item.taobao.com"]}
  ],
  "top_domains": [
    {"domain": "s.taobao.com", "visits": 117, "percentage": 19.4}
  ],
  "search_keywords": [
    {"keyword": "乐鑫", "count": 21}
  ],
  "page_type_stats": {
    "product": {"count": 188, "percentage": 40.4, "top_domains": ["item.taobao.com", "detail.tmall.com"]},
    "search": {"count": 136, "percentage": 29.2, "top_domains": ["s.taobao.com", "s.1688.com"]},
    "shop": {"count": 59, "percentage": 12.7, "top_domains": ["shop113593007.taobao.com"]}
  },
  "product_stats": [
    {"item_id": "817841934360", "domain": "detail.tmall.com", "title": "ESP-32 ...", "visits": 54}
  ],
  "entities": {
    "stocks": [],
    "companies": ["乐鑫科技", "涂鸦智能"],
    "tech_stack": ["ESP32", "Zigbee", "BLE", "Python", "WiFi"]
  },
  "browsing_patterns": {
    "morning": 84,
    "afternoon": 100,
    "evening": 261,
    "night": 20
  },
  "trend": {
    "status": "increasing",
    "recent_7d": 291,
    "prior_7_14d": 42
  },
  "recent_visits": [
    {
      "title": "ESP32-S3-DevKitC-1 乐鑫科技 ESP32-S3开发板-淘宝网",
      "url": "https://item.taobao.com/item.htm?id=653155344338",
      "domain": "item.taobao.com",
      "page_type": "product",
      "item_id": "653155344338",
      "search_query": null,
      "shop_name": null,
      "visit_count": 27,
      "last_visit": "2026-05-04T09:02:57+08:00"
    },
    {
      "title": "首页-乐鑫科技Espressif Online-淘宝网",
      "url": "https://shop113593007.taobao.com/",
      "domain": "shop113593007.taobao.com",
      "page_type": "shop",
      "item_id": null,
      "search_query": null,
      "shop_name": "乐鑫科技Espressif Online",
      "visit_count": 25,
      "last_visit": "2026-05-02T10:47:38+08:00"
    }
  ]
}

When --page-types is specified:
| Section | Data Source | Description |
|---|---|---|
| user_profile | Full data (filter_domains only) | Overall user portrait, always reflects complete profile |
| interest_areas | Filtered data (filter_domains + page_types) | Interest distribution within selected page types |
| search_keywords | Filtered data | Keywords from selected page types |
| entities | Filtered data | Entities from selected page types |
| statistics | Filtered data | Statistics for selected page types |
| top_domains | Filtered data | Domain ranking for selected page types |
| page_type_stats | Filtered data | Page type breakdown |
| browsing_patterns | Filtered data | Time patterns for selected page types |
| trend | Filtered data | Trend for selected page types |
| product_stats | Filtered data | Product stats for selected page types |
| recent_visits | Filtered data | Recent visits for selected page types |
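
The split between "full" and "filtered" data in the table can be pictured with a short sketch. It assumes the records have already been restricted by filter_domains and that each record carries a `page_type` key; the helper names `split_by_page_type` and `page_type_stats` are hypothetical and do not come from owlkit.py.

```python
from collections import Counter


def split_by_page_type(records, page_types=None):
    """Return (full, filtered): full feeds user_profile, filtered feeds the rest."""
    full = list(records)
    if not page_types:
        # Without --page-types, both views are the same data.
        return full, full
    filtered = [r for r in full if r.get("page_type") in page_types]
    return full, filtered


def page_type_stats(records):
    """Count records per page type and express each count as a percentage."""
    counts = Counter(r["page_type"] for r in records)
    total = sum(counts.values())
    return {pt: {"count": n, "percentage": round(100.0 * n / total, 1)}
            for pt, n in counts.items()}
```

In this reading, role tags and primary interests would be computed from `full`, while every other section would be computed from `filtered`.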
Recent visits are deduplicated with the following priority:
- Item ID (cross-domain): Same product ID on different domains (e.g., detail.tmall.com and item.taobao.com) → merged into one entry
- Search query + domain: Same search keyword on the same domain → merged
- Shop name: Same shop name from title (e.g., "首页-启明云端官方企业店-淘宝网") → merged across subdomains, shop_category upgraded to shop
- Page type + title: Same page type and title → merged
- Domain + title: Same domain and title → merged
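
One way to read the priority list above is "use the first key that applies to a visit". The sketch below follows that reading; it is an assumption, not the documented algorithm. The function names are hypothetical, the field names are taken from the recent_visits example, and summing visit counts and keeping the latest timestamp on merge are illustrative choices.

```python
def dedup_key(visit: dict) -> tuple:
    """Return the first applicable key, following the priority order above."""
    if visit.get("item_id"):
        return ("item", visit["item_id"])                       # cross-domain
    if visit.get("search_query"):
        return ("search", visit["search_query"], visit.get("domain"))
    if visit.get("shop_name"):
        return ("shop", visit["shop_name"])                     # cross-subdomain
    if visit.get("page_type") and visit.get("title"):
        return ("type_title", visit["page_type"], visit["title"])
    return ("domain_title", visit.get("domain"), visit.get("title"))


def deduplicate_visits(visits):
    """Merge visits that share a key, keeping first-seen order."""
    merged = {}
    for visit in visits:
        key = dedup_key(visit)
        if key not in merged:
            merged[key] = dict(visit)
            continue
        kept = merged[key]
        # Assumption: merged entries add up their visit counts.
        kept["visit_count"] = kept.get("visit_count", 0) + visit.get("visit_count", 0)
        # Assumption: timestamps are ISO 8601 strings with the same UTC offset,
        # so string comparison orders them correctly.
        kept["last_visit"] = max(kept.get("last_visit", ""), visit.get("last_visit", ""))
        if key[0] == "shop":
            kept["page_type"] = "shop"  # shop_category is upgraded to shop
    return list(merged.values())
```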
| Category | Description | Key Domains |
|---|---|---|
| Stocks & Finance | Stock & Finance | xueqiu.com, eastmoney.com |
| Technology & Development | Technology & Development | github.com, stackoverflow.com |
| AI Assistants | AI Assistants | kimi.com, chatglm.cn, deepseek.com |
| Social Media | Social Media | weibo.com, zhihu.com, xiaohongshu.com |
| E-commerce | E-commerce | taobao.com, jd.com, tmall.com |
| Video & Entertainment | Video & Entertainment | youtube.com, bilibili.com |
| Search Engines | Search Engines | google.com, baidu.com, bing.com |
| News | News | news.163.com, thepaper.cn |
| Productivity Tools | Productivity Tools | notion.so, feishu.cn, slack.com |
| Cloud Services | Cloud Services | aliyun.com, aws.amazon.com |
-d, --days          Time window in days (default: 30)
-o, --output        Output file path
-f, --format        Output format: json, md, csv (default: json)
--filter-domains    Filter by domains (comma-separated) or presets: ecommerce, ai, social, video, finance, tech, search
--page-types        Filter by page_type (comma-separated), e.g. search,product,shop
--list-presets      List available domain presets
--stdout            Output to stdout (AI mode)
--no-recent         Exclude recent visits (reduce payload)
--version           Show version
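
Since `--stdout` is described as intended for AI subprocess integration, a calling program might look roughly like the sketch below. It uses only the options listed above. The helper names `build_command` and `load_profile` are hypothetical, and the sketch assumes that owlkit.py is in the current working directory and that nothing other than the JSON profile is written to stdout.

```python
import json
import subprocess
import sys


def build_command(days=30, filter_domains=None, page_types=None, include_recent=True):
    """Assemble an argv list using only the documented CLI options."""
    cmd = [sys.executable, "owlkit.py", "-d", str(days), "-f", "json", "--stdout"]
    if filter_domains:
        cmd += ["--filter-domains", ",".join(filter_domains)]
    if page_types:
        cmd += ["--page-types", ",".join(page_types)]
    if not include_recent:
        cmd.append("--no-recent")
    return cmd


def load_profile(**options):
    """Run OwlKit as a child process and parse the JSON it prints."""
    result = subprocess.run(build_command(**options), capture_output=True,
                            text=True, check=True)
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` turns a non-zero exit code into an exception instead of a JSON parse error further down.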
- Python 3.7+
- Chrome browser installed
MIT License - see LICENSE for details.
Contributions are welcome! Please feel free to submit issues and pull requests.
- Edit config/categories.json
- Add or modify categories and domain_rules
- Test with python owlkit.py -d 7 -f json --stdout
- Verify output meets expectations
- Fixed README CLI commands to use python owlkit.py instead of python -m owlkit
- user_profile now always based on full data regardless of --page-types filter
- entities, interest_areas, search_keywords now follow --page-types filter
- Added --page-types CLI option for filtering by page type
- Added page type classification with domain-specific rules (product, search, shop, etc.)
- Added product deduplication across domains by item ID
- Added shop name extraction and cross-subdomain deduplication
- Added shop_name field in recent_visits
- Added page_type_stats and product_stats sections
- Fixed login.taobao.com classified as shop (now correctly "login")
- Fixed stocks entity extraction removing false positives (CP2102, ST7789, etc.)
- Fixed search_keywords to include e-commerce platform search queries
- user_profile now always based on full data regardless of --page-types filter
- Added 1688.com /offer/ URL classification as product
- Added shop search.htm classification as search
- Added redirect page type for click.simba.taobao.com
- Initial release