This project monitors Prebid.js integrations on websites. It is built with TypeScript and uses Vitest for testing.

Prerequisites:

- Node.js (v16 or later recommended)
- npm (comes with Node.js)
- Clone the repository:

  ```bash
  git clone https://github.com/prebid/prebid-integration-monitor.git
  cd prebid-integration-monitor
  ```

- Install dependencies:

  ```bash
  npm install
  ```
Project structure:

- `bin/`: Contains executable scripts for running the CLI.
  - `dev.js`: Entry point for development mode (uses ts-node).
  - `run.js`: Entry point for production mode (uses compiled JS).
- `src/`: Contains the TypeScript source code for the application.
  - `commands/`: Contains the oclif command classes.
    - `index.ts`: The default command that runs when no specific command is provided. It now houses the main application logic.
  - `index.ts`: Previously the main entry point; its role is now superseded by the oclif structure in `bin/` and `src/commands/`.
  - `utils/`: Utility scripts.
  - Other `.ts` files for application logic.
- `dist/`: Contains the compiled JavaScript code, generated by `npm run build`.
  - `commands/`: Contains compiled oclif commands.
- `tests/`: Contains test files written using Vitest.
  - `example.test.ts`: An example test file.
- `package.json`: Lists project dependencies, npm scripts, and oclif configuration.
- `tsconfig.json`: Configuration file for the TypeScript compiler.
- `node_modules/`: Directory where npm installs project dependencies.
To run the application in development mode (using the oclif development script, which leverages `ts-node`):

```bash
npm run dev
```

This executes `node ./bin/dev.js`, which handles TypeScript execution.
To compile the TypeScript code to JavaScript (output will be in the `dist/` directory):

```bash
npm run build
```
Note on `tsc` execution: In some environments, if `tsc` (the TypeScript compiler) is not found via the npm script, you might need to invoke it using `npx`:

```bash
npx -p typescript tsc
```

This command typically isn't needed if `npm run build` works, as `npm run build` should use the `tsc` from your project's `devDependencies`. Type checking alone can be done with `npm run build:check` or `npx -p typescript tsc --noEmit --module nodenext --moduleResolution nodenext src/**/*.ts`.
After building the project (`npm run build`), run the compiled application using its oclif entry point:

```bash
npm start
```

This executes `node ./bin/run.js`.
This application includes several performance optimizations for handling large-scale URL processing.

Database optimizations:

- Advanced Indexing: Composite indexes on frequently queried columns (status, timestamp, retry_count)
- SQLite Optimization: WAL mode, memory mapping, and optimized pragma settings
- Prepared Statements: All database queries use prepared statements for optimal performance
- Maintenance Operations: Built-in database maintenance including VACUUM, ANALYZE, and cleanup
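For orientation, a minimal sketch of these SQLite settings and prepared-statement usage, assuming a `better-sqlite3`-style driver and a hypothetical `urls` table (the project's actual driver and schema may differ):

```typescript
import Database from 'better-sqlite3';

// Sketch only: WAL mode, memory mapping, a composite index over the
// documented columns, and a reusable prepared statement.
const db = new Database('urls.db');
db.pragma('journal_mode = WAL');    // write-ahead logging for concurrent readers
db.pragma('mmap_size = 268435456'); // 256 MB of memory-mapped I/O
db.pragma('synchronous = NORMAL');  // common durability tradeoff in WAL mode

db.exec(`CREATE INDEX IF NOT EXISTS idx_status_ts_retry
         ON urls (status, timestamp, retry_count)`);

// Prepared once, executed many times with different parameters.
const pending = db.prepare(
  'SELECT url FROM urls WHERE status = ? AND retry_count < ? LIMIT ?'
);
const batch = pending.all('pending', 3, 1000);
console.log(`fetched ${batch.length} pending URLs`);
```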
Caching:

- Intelligent Caching: GitHub content is cached to prevent redundant HTTP requests
- Memory-Efficient Eviction: LRU + LFU eviction strategy with configurable size limits
- Persistent Storage: Optional file-based caching with TTL support
- Cache Statistics: Real-time cache performance monitoring
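The eviction idea, reduced to a minimal sketch (plain LRU with per-entry TTL; the actual cache also weighs access frequency and supports file-backed persistence):

```typescript
// Simplified in-memory cache with LRU eviction and per-entry TTL.
// Illustrative only; the real cache combines LRU and LFU scoring.
class SimpleCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private maxSize: number, private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key); // expired
      return undefined;
    }
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (!this.entries.has(key) && this.entries.size >= this.maxSize) {
      const lru = this.entries.keys().next().value; // least recently used
      if (lru !== undefined) this.entries.delete(lru);
    }
    this.entries.delete(key); // refresh position if the key already exists
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```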
Range processing:

- Selective Loading: Process only specified URL ranges without loading entire files
- Memory Efficient: Minimal memory footprint even with large domain lists
- Fast Processing: Optimized for sub-second processing of large ranges
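A minimal sketch of selective loading: stream the file and retain only the requested 1-based line range, so memory use is bounded by the range size rather than the file size (simplified relative to the actual implementation):

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Stream only lines [start, end] (1-based, inclusive) from a large URL list,
// without reading the whole file into memory.
async function readRange(path: string, start: number, end: number): Promise<string[]> {
  const selected: string[] = [];
  const rl = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
  let lineNo = 0;
  for await (const line of rl) {
    lineNo++;
    if (lineNo < start) continue;
    if (lineNo > end) {
      rl.close(); // stop reading once the range is exhausted
      break;
    }
    selected.push(line);
  }
  return selected;
}
```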
Performance benchmarks:

- Database: 10,000+ URL queries per second
- Caching: 1,000+ cache reads per second, 200+ writes per second
- URL Processing: 200+ URLs per second processing rate
- Memory Usage: <50MB increase for 50,000 URL datasets
This application includes advanced optimizations specifically designed to improve Prebid.js detection accuracy and reliability:
- Smart Navigation: Progressive timeout strategy with intelligent retry logic for different error types
- Realistic User Simulation: Mouse movements, clicks, and event triggers that activate ad tech loading
- Dynamic Content Loading: Intelligent scrolling with pause points to allow ad content to load
- Popup Dismissal: Automatic detection and dismissal of cookie banners, modals, and other blocking elements
- Multi-Stage Detection: Detects Prebid instances in various initialization states (complete, partial, queue)
- Enhanced Polling: Adaptive polling intervals with frame-safe error handling
- Ad Tech Initialization: Waits for common ad technology signals before attempting detection
- Robust Error Handling: Retries for detached frame errors and other temporary failures
- Realistic Browser Profile: Authentic user agents, viewport sizes, and browser properties
- Anti-Detection: Removes automation markers and sets realistic hardware characteristics
- Enhanced Stealth: Automatic permission denial and popup blocking for uninterrupted scanning
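As an illustration of multi-stage detection, a simplified Puppeteer sketch that polls for `pbjs` and classifies its initialization state (complete, partial, or queue-only); this is not the project's exact detection code:

```typescript
import puppeteer from 'puppeteer';

// Simplified multi-stage Prebid.js detection: classify the pbjs global as
// complete (version available), partial (global exists), or queue-only
// (commands queued in pbjs.que before the library finished loading).
async function detectPrebid(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30_000 });
    // Poll for the global rather than checking once; ad tech often loads late.
    await page
      .waitForFunction(() => (globalThis as any).pbjs !== undefined, {
        polling: 500,
        timeout: 15_000,
      })
      .catch(() => undefined); // no pbjs within the window; classify below
    return await page.evaluate(() => {
      const pbjs = (globalThis as any).pbjs;
      if (!pbjs) return 'none';
      if (typeof pbjs.version === 'string') return `complete (${pbjs.version})`;
      if (Array.isArray(pbjs.que)) return 'queue-only';
      return 'partial';
    });
  } finally {
    await browser.close();
  }
}
```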
Example invocations for the performance features described above:

```bash
# Enable database maintenance
node ./bin/run.js scan --maintenance --vacuum --analyze

# Performance monitoring
node ./bin/run.js scan --verbose --logDir=performance-logs

# Cache is automatically enabled for GitHub sources
# Cache statistics available in verbose logs
node ./bin/run.js scan --githubRepo URL --verbose

# Process large lists efficiently with ranges
node ./bin/run.js scan --githubRepo URL --range="10000-20000" --batchMode

# Batch processing with optimal performance
node ./bin/run.js scan --batchMode --startUrl=1 --totalUrls=50000 --batchSize=1000
```
This application is now structured as an oclif (Open CLI Framework) command-line interface.

- Running commands:
  - In development: `node ./bin/dev.js [COMMAND] [FLAGS]`
  - In production (after `npm run build`): `node ./bin/run.js [COMMAND] [FLAGS]`
  - If the package is linked globally (`npm link`) or installed globally, you can use the bin name directly: `app [COMMAND] [FLAGS]` (Note: `app` is the default bin name configured in `package.json`.)
- Default Command:
  - Running `node ./bin/run.js` (or `node ./bin/dev.js`) without any specific command will execute the default command defined in `src/commands/index.ts`. This command currently runs the main Prebid integration monitoring logic.
- Getting Help:
  - To see general help for the CLI and a list of available commands: `node ./bin/run.js --help` (or, in development, `node ./bin/dev.js --help`)
  - For help on a specific command: `node ./bin/run.js [COMMAND] --help`
The `scan` command is used to analyze a list of websites for Prebid.js integrations and other specified ad technology libraries. It processes URLs from an input file, launches Puppeteer instances to visit these pages, and collects data, saving the results to an output directory.

Syntax:

```bash
./bin/run scan [INPUTFILE] [FLAGS...]
# or in development: node ./bin/dev.js scan [INPUTFILE] [FLAGS...]
# or using npm script: npm run prebid:scan -- [INPUTFILE] [FLAGS...] (note the -- to pass arguments to the script)
```
Argument:

- `INPUTFILE`: (Optional) Path to a local input file containing URLs.
  - Accepts `.txt` (one URL per line), `.csv` (URLs in the first column), or `.json` (extracts all string values that are URLs) files.
  - This argument is required if `--githubRepo` is not used.
  - If `--githubRepo` is provided, `INPUTFILE` is ignored.
  - If neither `INPUTFILE` nor `--githubRepo` is specified, the command will show an error (unless the default `src/input.txt` exists and is readable).
  - Defaults to `src/input.txt` if no other source is specified.
Flags:

- `--githubRepo <URL>`: Specifies a public GitHub URL from which to fetch URLs.
  - This can be a base repository URL (e.g., `https://github.com/owner/repo`) to scan for URLs within processable files (like `.txt`, `.json`, `.md`) in the repository root.
  - Alternatively, it can be a direct link to a specific processable file within a repository (e.g., `https://github.com/owner/repo/blob/main/some/path/file.txt`). In this case, only the specified file will be fetched and processed.
  - Example (repository): `--githubRepo https://github.com/owner/repo`
  - Example (direct file): `--githubRepo https://github.com/owner/repo/blob/main/urls.txt`
- `--numUrls <number>`: When used with `--githubRepo`, limits the number of URLs extracted and processed from the repository.
  - Default: `100`
  - Example: `--numUrls 50`
- `--puppeteerType <option>`: Specifies the Puppeteer operational mode.
  - Options: `vanilla`, `cluster` (default)
  - `vanilla`: Processes URLs sequentially using a single Puppeteer browser instance.
  - `cluster`: Uses `puppeteer-cluster` to process URLs in parallel, according to the concurrency settings.
- `--concurrency <value>`: Sets the number of concurrent Puppeteer instances when using `puppeteerType=cluster`.
  - Default: `5`
- `--headless`: Runs Puppeteer in headless mode (no UI). This is the default.
- `--no-headless`: Runs Puppeteer with a visible browser UI.
- `--monitor`: Enables `puppeteer-cluster`'s web monitoring interface (available at `http://localhost:21337` by default) when `puppeteerType=cluster`.
  - Default: `false`
- `--outputDir <value>`: Specifies the directory where scan results (JSON files) will be saved.
  - Default: `output` (in the project root)
  - Results are typically saved in a subdirectory structure like `output/Month/YYYY-MM-DD.json`.
  - Note: By default, scan results are automatically stored in the `store/` directory using the format `store/mmm-yyyy/yyyy-mm-dd.json` (e.g., `store/Apr-2025/2025-04-15.json`) and appended to existing files.
- `--logDir <value>`: Specifies the directory where log files (`app.log`, `error.log`) will be saved.
  - Default: `logs` (in the project root)
- `--range <string>`: Specifies a line range (e.g., `10-20`, `5-`, `-15`) to process from the input source (file, CSV, or GitHub-extracted URLs). Uses 1-based indexing. If the source is a GitHub repo, the range applies to the aggregated list of URLs extracted from all targeted files in the repo.
  - Example: `--range 10-50` or `--range 1-`
- `--chunkSize <number>`: Processes URLs in chunks of this size. All URLs (whether from the full input or a specified range) are processed, but only `chunkSize` URLs are loaded and analyzed at a time. Useful for very large URL lists, to manage resources or to process incrementally.
  - Example: `--chunkSize 50`
- `--verbose`: A boolean flag controlling the verbosity of log output, especially for errors.
  - Default: `false`
  - When `false` (default): Error messages related to URL processing are shortened for console display (e.g., `error: Error processing http://example.com: Connection timed out`). Full details are still available in `logs/error.log`.
  - When `true` (`--verbose`): Full error messages, including stack traces where available, are displayed in the console for all logs, including errors. This provides maximum detail for debugging.
Usage Examples:

- Basic scan (using default `input.txt` and cluster mode): `./bin/run scan` (Ensure `./bin/run` has execute permissions, or use `node ./bin/run scan`.)
- Scan using vanilla Puppeteer: `./bin/run scan --puppeteerType=vanilla`
- Scan with a specific input file and output directory: `./bin/run scan my_urls.txt --outputDir=./my_results`
- Scan in non-headless (headed) mode: `./bin/run scan --no-headless`
- Scan with increased concurrency and monitoring for cluster mode: `./bin/run scan --concurrency=10 --monitor`
- Scan URLs from a GitHub repository: `./bin/run scan --githubRepo https://github.com/owner/repo`
- Scan a limited number of URLs from a GitHub repository: `./bin/run scan --githubRepo https://github.com/owner/repo --numUrls 50`
- Scan URLs from a local CSV file (using the INPUTFILE argument): `./bin/run scan ./data/urls_to_scan.csv`
- Scan URLs from a local JSON file (using the INPUTFILE argument): `./bin/run scan ./data/urls_to_scan.json`
- Scan a specific range of URLs from a large input file, in chunks: `./bin/run scan very_large_list_of_sites.txt --range 1001-2000 --chunkSize 100`
By default, the scan command automatically:

- Stores Prebid Data: Results are saved in the `store/mmm-yyyy/yyyy-mm-dd.json` format (e.g., `store/Apr-2025/2025-04-15.json`) and appended to existing files for the current date.
- Logs URLs with No Prebid: URLs where no Prebid.js integration is found are appended to `errors/no_prebid.txt`.
- Logs Error URLs:
  - URLs with name resolution errors (`ERR_NAME_NOT_RESOLVED`) are logged to `errors/navigation_errors.txt` in the format `url,ERROR_CODE`.
  - Other processing errors are logged to `errors/error_processing.txt` with timestamp, URL, message, and error details.
Input parsing notes:

- When processing `.txt` files (or GitHub files treated as text), the scanner looks for fully qualified URLs (e.g., `http://example.com`) and also attempts to identify schemeless domains and prepend `https://` (e.g., `example.com` becomes `https://example.com`).
- For `.csv` files, URLs are expected to be in the first column and should be fully qualified (e.g., `http://example.com`). Schemeless domains from CSVs are currently skipped.
- For `.json` files, all string values are recursively scanned and fully qualified URLs are extracted. If a malformed JSON file is encountered, a fallback regex scan of the raw content is performed.
- Entries in input sources that are malformed (e.g., `htp://missing-t.com`) or use unsupported schemes (e.g., `ftp://`) are generally skipped.
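A simplified sketch of the `.txt` extraction rules above (fully qualified URLs pass through, schemeless domains get `https://`, unsupported schemes and malformed entries are skipped); the real parser may differ:

```typescript
// Normalize one line from a .txt input source per the rules described above.
// Returns the URL to scan, or null if the entry should be skipped.
function normalizeLine(line: string): string | null {
  const trimmed = line.trim();
  if (trimmed === '') return null;
  if (/^https?:\/\//i.test(trimmed)) return trimmed;          // fully qualified
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)) return null;  // ftp://, htp://, etc.
  if (/^[\w-]+(\.[\w-]+)+(\/\S*)?$/.test(trimmed)) {
    return `https://${trimmed}`;                              // schemeless domain
  }
  return null;                                                // malformed entry
}

console.log(normalizeLine('example.com'));        // "https://example.com"
console.log(normalizeLine('ftp://example.com'));  // null (unsupported scheme)
```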
The `stats:generate` command processes stored website scan data (typically found in the `./store` directory, generated by the `scan` command) to generate or update the `api/api.json` file. This JSON file contains aggregated statistics about Prebid.js usage, including version distributions and module usage, after cleaning and categorization.

Syntax:

```bash
./bin/run stats:generate
# or in development: node ./bin/dev.js stats:generate
# or using npm script (if configured): npm run prebid:stats:generate
```
Flags:

This command currently does not take any specific flags.

Usage Example:

- Generate or update the statistics API file: `./bin/run stats:generate` (Ensure `./bin/run` has execute permissions, or use `node ./bin/run stats:generate`.)
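To illustrate the kind of aggregation `stats:generate` performs, here is a hedged sketch; the store-file shape used below (`prebidInstances` entries with a `version` field) is an assumption for illustration, not the actual schema:

```typescript
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

// Hypothetical site-result shape; the real store schema may differ.
interface SiteResult {
  url: string;
  prebidInstances?: { version: string }[];
}

// Count Prebid.js versions across monthly store files (store/mmm-yyyy/*.json).
const versionCounts: Record<string, number> = {};
for (const month of readdirSync('./store')) {
  for (const file of readdirSync(join('./store', month))) {
    const sites: SiteResult[] = JSON.parse(
      readFileSync(join('./store', month, file), 'utf8'),
    );
    for (const site of sites) {
      for (const inst of site.prebidInstances ?? []) {
        versionCounts[inst.version] = (versionCounts[inst.version] ?? 0) + 1;
      }
    }
  }
}
console.log(versionCounts); // e.g. { "9.27.0": 1234, ... } -- illustrative only
```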
The `inspect` command fetches data from a given URL and stores the HTTP request and response details to a file. This is useful for archiving web content or debugging network interactions. It supports saving data in JSON format and a basic HAR (HTTP Archive) format.

Syntax:

```bash
./bin/run inspect <URL> [FLAGS...]
# or in development: node ./bin/dev.js inspect <URL> [FLAGS...]
```

Argument:

- `<URL>`: (Required) The full URL to inspect (e.g., `https://example.com`).
Flags:

- `--output-dir <value>`: Specifies the directory where the inspection data will be saved.
  - Default: `store/inspect`
- `--format <option>`: The format in which to store the inspection data.
  - Options: `json` (default), `har`
  - The HAR implementation is currently basic, capturing essential request/response details.
- `--filename <value>`: The base filename for the inspection data (without extension).
  - If not provided, a filename is generated automatically from the URL's hostname and a timestamp (e.g., `example_com-YYYY-MM-DDTHH-mm-ss-SSSZ`).
Usage Examples:

- Inspect a URL and save as JSON to the default directory: `./bin/run inspect https://example.com`
- Inspect a URL and save as HAR with a custom filename and directory: `./bin/run inspect https://api.example.com/data --format har --output-dir my-inspections --filename api-data-archive`
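Conceptually, the JSON output records the request/response exchange. A standalone sketch of the same idea (field names here are illustrative, not the command's actual output schema):

```typescript
import { writeFileSync } from 'node:fs';

// Fetch a URL and record the exchange as JSON -- an approximation of what
// `inspect` does, using Node's built-in fetch (Node 18+).
async function inspectUrl(url: string, outFile: string): Promise<void> {
  const startedAt = new Date().toISOString();
  const response = await fetch(url);
  const record = {
    request: { url, method: 'GET', startedAt },
    response: {
      status: response.status,
      headers: Object.fromEntries(response.headers.entries()),
      body: await response.text(),
    },
  };
  writeFileSync(outFile, JSON.stringify(record, null, 2));
}

inspectUrl('https://example.com', 'example_com-inspection.json').catch(console.error);
```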
To run the basic test suite using Vitest:

```bash
npm test
```

To run the complete validation pipeline (recommended before commits):

```bash
npm run validate:all
```
This runs:
- TypeScript type checking
- ESLint code linting
- Prettier code formatting
- Documentation synchronization
- All unit tests
- Critical integration tests
- GitHub range processing validation
```bash
# Run all tests including critical integration tests
npm run test:all

# Run integration tests only
npm run test:integration

# Run regression tests (includes performance tests)
npm run test:regression

# Run critical tests (GitHub range, CLI, Puppeteer accuracy)
npm run test:critical
```
Set up automatic validation before commits:

```bash
# Setup git hooks
npm run setup:hooks

# Manual pre-commit validation
npm run validate:pre-commit
```
Note on `vitest` execution: In some environments, if `vitest` is not found via the npm script, you might need to run it using `npx`:

```bash
npx -p vitest vitest run
```
This project uses ESLint for linting and Prettier for code formatting to ensure code quality and consistency.
- ESLint: Configuration is managed in `eslint.config.js`. It uses `@typescript-eslint/parser` for TypeScript support and integrates Prettier rules to avoid conflicts.
- Prettier: Configuration is managed in `.prettierrc.cjs`.
- Ignore files: `.prettierignore` specifies files that Prettier should not format. An `.eslintignore` file may also be created, though for ESLint v9+ the `ignores` option in `eslint.config.js` is preferred.
- To format all supported files in the project:

  ```bash
  npm run format
  ```

  This command uses Prettier to rewrite files in place according to the rules in `.prettierrc.cjs`.

- To lint all TypeScript files and automatically fix fixable issues:

  ```bash
  npm run lint -- --fix
  ```

  This command uses ESLint to analyze the code. The `--fix` flag instructs ESLint to automatically correct problems where possible; any errors that cannot be auto-fixed are reported in the console.

- To check for linting errors without fixing:

  ```bash
  npm run lint
  ```
It's recommended to run these scripts before committing code to maintain a clean and consistent codebase.
This application utilizes the Winston logging library to provide detailed and structured logging.
Log files are stored in the `logs/` directory, which is excluded from Git commits via `.gitignore`.

- `logs/app.log`: Contains all general application logs, including informational messages, warnings, and errors (typically `info` level and above). Entries are stored in JSON format, allowing easy parsing and querying; each includes a timestamp, log level, message, and any additional metadata.
- `logs/error.log`: Dedicated to error-level logs only, providing a focused view of errors that have occurred within the application, also in JSON format. Error logs include stack traces when available.
In addition to file logging, messages are also output to the console:

- Log messages are colorized by severity level for readability (e.g., errors in red, warnings in yellow).
- The console typically displays `info` level messages and above.
- The format includes the timestamp, log level, and message, similar to the file logs.
All log entries, whether in files or on the console, follow a consistent format:

- Timestamp: `YYYY-MM-DD HH:mm:ss`
- Level: The severity of the log (e.g., `info`, `warn`, `error`).
- Message: The main content of the log entry.
- Stack Trace: For error logs, a stack trace is included if available, aiding in debugging.
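A minimal Winston configuration that would produce the transports and formats described above (a sketch for orientation; the project's actual logger setup lives in the source and may differ):

```typescript
import winston from 'winston';

// JSON file logs plus a colorized console, matching the behavior described above.
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp({ format: 'YYYY-MM-DD HH:mm:ss' }),
    winston.format.errors({ stack: true }), // attach stack traces to errors
    winston.format.json(),
  ),
  transports: [
    new winston.transports.File({ filename: 'logs/app.log' }),
    new winston.transports.File({ filename: 'logs/error.log', level: 'error' }),
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(), // color by severity level
        winston.format.printf(
          ({ timestamp, level, message }) => `${timestamp} ${level}: ${message}`,
        ),
      ),
    }),
  ],
});

logger.info('Scan started');
```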
OpenTelemetry has been integrated to provide distributed tracing capabilities.
- The main tracer configuration can be found in `src/tracer.ts`.
- Log messages (both console and file) are automatically enriched with `trace_id` and `span_id` when generated within an active trace.
- The default OTLP HTTP exporter is used. You may need to set the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable to point to your OpenTelemetry collector (e.g., `http://localhost:4318/v1/traces`).
- The service name for OpenTelemetry is configured via the `OTEL_SERVICE_NAME` environment variable (e.g., `export OTEL_SERVICE_NAME="prebid-integration-monitor"`). Setting it directly in `src/tracer.ts` via the `Resource` attribute caused a TypeScript build error (TS2693) due to OpenTelemetry package version incompatibilities, so that code remains commented out.
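For example, to point traces at a local collector when running the compiled CLI (using the endpoint and service name mentioned above):

```bash
export OTEL_SERVICE_NAME="prebid-integration-monitor"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318/v1/traces"
npm start
```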