Hugging Face User Crawler

A Scrapy-based web crawler designed to collect and analyze user data from Hugging Face, the popular machine learning platform. This project helps gather insights about user activities, contributions, and interactions within the Hugging Face community.

Data Link

Sample Results

  • Tasks generated

    user         , url
    crossdelenna , https://huggingface.co/crossdelenna
    Teetouch     , https://huggingface.co/Teetouch
    Janouille    , https://huggingface.co/Janouille
    orestxherija , https://huggingface.co/orestxherija
    edmundhui    , https://huggingface.co/edmundhui
    
  • User profile

    {
        "user_id": "Teetouch",
        "user_name": "Teetouch Jaknamon",
        "user_meta": "{\"lastUserActivities\":[],\"blogPosts\":[],\"totalBlogPosts\":0,\"canReadDatabase\":false,\"canManageEntities\":false,\"canReadEntities\":false,\"canImpersonate\":false,\"canManageBilling\":false,\"communityScore\":0,\"collections\":[],\"datasets\":[],\"models\":[{\"author\":\"Teetouch\",\"authorData\":{\"_id\":\"620b0b423c0931626a7c92c2\",\"avatarUrl\":\"/avatars/d150cef7965877a88d7400c431c626d7.svg\",\"fullname\":\"Teetouch Jaknamon\",\"name\":\"Teetouch\",\"type\":\"user\",\"isPro\":false,\"isHf\":false,\"isHfAdmin\":false,\"isMod\":false},\"downloads\":0,\"gated\":false,\"id\":\"Teetouch/TEETOUQQ2222-attacut-th-to-en-pt2\",\"availableInferenceProviders\":[],\"lastModified\":\"2022-03-10T17:45:31.000Z\",\"likes\":0,\"private\":false,\"repoType\":\"model\",\"isLikedByUser\":false}],\"numberLikes\":0,\"papers\":[],\"posts\":[],\"totalPosts\":0,\"spaces\":[],\"u\":{\"avatarUrl\":\"/avatars/d150cef7965877a88d7400c431c626d7.svg\",\"isPro\":false,\"fullname\":\"Teetouch Jaknamon\",\"user\":\"Teetouch\",\"orgs\":[],\"signup\":{},\"isHf\":false,\"isMod\":false,\"type\":\"user\"},\"upvotes\":0,\"repoFilterModels\":{\"sortKey\":\"modified\"},\"repoFilterDatasets\":{\"sortKey\":\"modified\"},\"repoFilterSpaces\":{\"sortKey\":\"modified\"},\"numFollowers\":0,\"numFollowingUsers\":0,\"numFollowingOrgs\":0,\"isFollowing\":false,\"isFollower\":false,\"sampleFollowers\":[],\"isWatching\":false,\"acceptLanguages\":[\"en\"]}",
        "team": null,
        "follower_amount": 0,
        "follower_meta": [],
        "following_amount": 0,
        "following_meta": []
    }
  • Social edges

    source         , target
    mktn           , dc0420
    antoniomae     , Kentuss
    Mohamedelamoury, nlptown
    simarHug       , nlptown
    Wauplin        , nlptown
    
  • Social network graph (the `social_network` figure in the repository)
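Note that in the user-profile sample above, the `user_meta` field is itself a JSON-encoded string, so it needs a second `json.loads` pass before its fields can be used. A minimal sketch of that decoding step (the helper name and the fields selected are illustrative, not the project's actual code):

```python
import json

def parse_user_meta(record):
    """Decode the nested user_meta string and pull out a few summary fields.

    Hypothetical helper; the field selection is illustrative only.
    """
    meta = json.loads(record["user_meta"])  # second decode: user_meta is a JSON string
    return {
        "user_id": record["user_id"],
        "models": len(meta.get("models", [])),
        "followers": meta.get("numFollowers", 0),
    }

record = {
    "user_id": "Teetouch",
    "user_meta": '{"models": [{"id": "Teetouch/TEETOUQQ2222-attacut-th-to-en-pt2"}], "numFollowers": 0}',
}
print(parse_user_meta(record))
# {'user_id': 'Teetouch', 'models': 1, 'followers': 0}
```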

Project Overview

This crawler is built to systematically collect user information from Hugging Face, including:

  • User profiles
  • Model contributions
  • Dataset contributions
  • Community interactions
  • Activity statistics

Folder Structure

hugging-face-user-scrawler/
├── config/              # Configuration files
├── data/                # Scraped data storage
├── doc/                 # Project documentation
├── hf_user_scrawler/    # Main crawler package
│   ├── spiders/         # Spider definitions
│   └── scrapy.cfg       # Scrapy configuration
├── logs/                # Log files
├── notebook/            # Jupyter notebooks for data analysis
├── script/              # Utility scripts
└── tmp/                 # Temporary files

Key Modules

Spider Module

The main spider module is responsible for crawling Hugging Face user pages and extracting relevant information. It handles:

  • User profile navigation
  • Data extraction
  • Rate limiting
  • Error handling

Data Processing

The data processing pipeline includes:

  • Data cleaning and normalization
  • JSON/CSV export functionality
  • Data validation
  • Storage management
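The CSV export step, for example, can be sketched with the standard library alone, producing output shaped like the social-edges sample above (the function name is hypothetical):

```python
import csv
import io

def edges_to_csv(edges):
    """Serialize (source, target) pairs to CSV text. Hypothetical helper."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["source", "target"])  # header row
    writer.writerows(edges)
    return buf.getvalue()

print(edges_to_csv([("mktn", "dc0420"), ("Wauplin", "nlptown")]))
```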

Configuration

The config module manages:

  • Crawler settings
  • API credentials
  • Rate limiting parameters
  • Output formatting
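In a Scrapy project, settings like these typically live in settings.py; a sketch of what such a config might contain (all names and values are illustrative, not this project's actual settings):

```python
import os

# Rate-limiting parameters (illustrative values)
DOWNLOAD_DELAY = 1.0                  # seconds between requests
AUTOTHROTTLE_ENABLED = True           # back off automatically under server load
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Output formatting
FEED_EXPORT_ENCODING = "utf-8"

# API credentials: read from the environment rather than hard-coding
HF_API_TOKEN = os.environ.get("HF_API_TOKEN", "")
```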

Getting Started

  1. Clone the repository

     git clone https://github.com/yourusername/hugging-face-user-scrawler.git
     cd hugging-face-user-scrawler

  2. Install dependencies

     conda env create -f config/conda.env

  3. Configure the crawler
     • Update settings in config/
     • Set up your Hugging Face API credentials

  4. Run the crawler

     scrapy crawl hf_user_scrawler -o ../data/result/users.jsonl

  5. Analyze the data

     The results are stored in data/result/users.jsonl. You can use the Jupyter notebooks in the notebook/ folder to analyze them.
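Since users.jsonl holds one JSON object per line, even a few lines of standard-library Python can summarize it before reaching for a notebook. A sketch (the function name and sample lines are illustrative):

```python
import json
from collections import Counter

def follower_counts(lines):
    """Tally how many users have each follower_amount. Hypothetical helper."""
    counts = Counter()
    for line in lines:
        user = json.loads(line)
        counts[user.get("follower_amount", 0)] += 1
    return counts

sample = [
    '{"user_id": "Teetouch", "follower_amount": 0}',
    '{"user_id": "nlptown", "follower_amount": 2}',
]
print(follower_counts(sample))
```

Against the real output, pass the open file directly: `with open("data/result/users.jsonl") as f: follower_counts(f)`.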

TODO

- [ ] Add frequency control
- [ ] Add proxy settings
- [ ] Add a MySQL pipeline
- [ ] Add a sample analysis notebook

Citation

@misc{huggingface_user_crawler,
  author       = {Rongxin Ouyang},
  title        = {reycn/huggingface-user-crawler},
  year         = {2025},
  url          = {https://github.com/reycn/huggingface-user-crawler},
  note         = {Accessed: 2025-03-26},
  howpublished = {\url{https://github.com/reycn/huggingface-user-crawler}}
}

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
