This project monitors Prebid.js integrations on websites. It is built with TypeScript and uses Vitest for testing.

Prerequisites:

- Node.js (v16 or later recommended)
- npm (comes with Node.js)
- Clone the repository:

  ```bash
  git clone https://github.com/prebid/prebid-integration-monitor.git
  cd prebid-integration-monitor
  ```

- Install dependencies:

  ```bash
  npm install
  ```
Project structure:

- `bin/`: Contains executable scripts for running the CLI.
  - `dev.js`: Entry point for development mode (uses ts-node).
  - `run.js`: Entry point for production mode (uses compiled JS).
- `src/`: Contains the TypeScript source code for the application.
  - `commands/`: Contains the oclif command classes.
    - `index.ts`: The default command that runs when no specific command is provided. It now houses the main application logic.
  - `index.ts`: Previously the main entry point; its role is now superseded by the oclif structure in `bin/` and `src/commands/`.
  - `utils/`: Utility scripts.
  - Other `.ts` files for application logic.
- `dist/`: Contains the compiled JavaScript code, generated by `npm run build`.
  - `commands/`: Contains compiled oclif commands.
- `tests/`: Contains test files written using Vitest.
  - `example.test.ts`: An example test file.
- `package.json`: Lists project dependencies, npm scripts, and oclif configuration.
- `tsconfig.json`: Configuration file for the TypeScript compiler.
- `node_modules/`: Directory where npm installs project dependencies.
To run the application in development mode (using the oclif development script, which leverages `ts-node`):

```bash
npm run dev
```

This executes `node ./bin/dev.js`, which handles TypeScript execution.
To compile the TypeScript code to JavaScript (output will be in the `dist/` directory):

```bash
npm run build
```
Note on `tsc` execution: In some environments, if `tsc` (the TypeScript compiler) is not found via the npm script, you might need to invoke it using `npx`:

```bash
npx -p typescript tsc
```

This command typically isn't needed if `npm run build` works, as `npm run build` should use the `tsc` from your project's `devDependencies`. Type checking alone can be done with `npm run build:check` or `npx -p typescript tsc --noEmit --module nodenext --moduleResolution nodenext src/**/*.ts`.
After building the project (`npm run build`), run the compiled application using its oclif entry point:

```bash
npm start
```

This executes `node ./bin/run.js`.
This application includes several performance optimizations for handling large-scale URL processing.

Database optimizations:

- Advanced Indexing: Composite indexes on frequently queried columns (status, timestamp, retry_count)
- SQLite Optimization: WAL mode, memory mapping, and optimized pragma settings
- Prepared Statements: All database queries use prepared statements for optimal performance
- Maintenance Operations: Built-in database maintenance including VACUUM, ANALYZE, and cleanup
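For orientation, a minimal sketch of these SQLite settings and prepared-statement usage, assuming a `better-sqlite3`-style driver and a hypothetical `urls` table (the project's actual driver and schema may differ):

```typescript
import Database from 'better-sqlite3';

// Sketch only: WAL mode, memory mapping, a composite index over the
// documented columns, and a reusable prepared statement.
const db = new Database('urls.db');
db.pragma('journal_mode = WAL');    // write-ahead logging for concurrent readers
db.pragma('mmap_size = 268435456'); // 256 MB of memory-mapped I/O
db.pragma('synchronous = NORMAL');  // common durability tradeoff in WAL mode

db.exec(`CREATE INDEX IF NOT EXISTS idx_status_ts_retry
         ON urls (status, timestamp, retry_count)`);

// Prepared once, executed many times with different parameters.
const pending = db.prepare(
  'SELECT url FROM urls WHERE status = ? AND retry_count < ? LIMIT ?'
);
const batch = pending.all('pending', 3, 1000);
console.log(`fetched ${batch.length} pending URLs`);
```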
Caching:

- Intelligent Caching: GitHub content is cached to prevent redundant HTTP requests
- Memory-Efficient Eviction: LRU + LFU eviction strategy with configurable size limits
- Persistent Storage: Optional file-based caching with TTL support
- Cache Statistics: Real-time cache performance monitoring
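The eviction idea, reduced to a minimal sketch (plain LRU with per-entry TTL; the actual cache also weighs access frequency and supports file-backed persistence):

```typescript
// Simplified in-memory cache with LRU eviction and per-entry TTL.
// Illustrative only; the real cache combines LRU and LFU scoring.
class SimpleCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private maxSize: number, private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key); // expired
      return undefined;
    }
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (!this.entries.has(key) && this.entries.size >= this.maxSize) {
      const lru = this.entries.keys().next().value; // least recently used
      if (lru !== undefined) this.entries.delete(lru);
    }
    this.entries.delete(key); // refresh position if the key already exists
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```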
Range processing:

- Selective Loading: Process only specified URL ranges without loading entire files
- Memory Efficient: Minimal memory footprint even with large domain lists
- Fast Processing: Optimized for sub-second processing of large ranges
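A minimal sketch of selective loading: stream the file and retain only the requested 1-based line range, so memory use is bounded by the range size rather than the file size (simplified relative to the actual implementation):

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Stream only lines [start, end] (1-based, inclusive) from a large URL list,
// without reading the whole file into memory.
async function readRange(path: string, start: number, end: number): Promise<string[]> {
  const selected: string[] = [];
  const rl = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
  let lineNo = 0;
  for await (const line of rl) {
    lineNo++;
    if (lineNo < start) continue;
    if (lineNo > end) {
      rl.close(); // stop reading once the range is exhausted
      break;
    }
    selected.push(line);
  }
  return selected;
}
```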
Performance benchmarks:

- Database: 10,000+ URL queries per second
- Caching: 1,000+ cache reads per second, 200+ writes per second
- URL Processing: 200+ URLs per second processing rate
- Memory Usage: <50MB increase for 50,000 URL datasets
This application includes advanced optimizations specifically designed to improve Prebid.js detection accuracy and reliability:
- Smart Navigation: Progressive timeout strategy with intelligent retry logic for different error types
- Realistic User Simulation: Mouse movements, clicks, and event triggers that activate ad tech loading
- Dynamic Content Loading: Intelligent scrolling with pause points to allow ad content to load
- Popup Dismissal: Automatic detection and dismissal of cookie banners, modals, and other blocking elements
- Multi-Stage Detection: Detects Prebid instances in various initialization states (complete, partial, queue)
- Enhanced Polling: Adaptive polling intervals with frame-safe error handling
- Ad Tech Initialization: Waits for common ad technology signals before attempting detection
- Robust Error Handling: Retries for detached frame errors and other temporary failures
- Realistic Browser Profile: Authentic user agents, viewport sizes, and browser properties
- Anti-Detection: Removes automation markers and sets realistic hardware characteristics
- Enhanced Stealth: Automatic permission denial and popup blocking for uninterrupted scanning
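As an illustration of multi-stage detection, a simplified Puppeteer sketch that polls for `pbjs` and classifies its initialization state (complete, partial, or queue-only); this is not the project's exact detection code:

```typescript
import puppeteer from 'puppeteer';

// Simplified multi-stage Prebid.js detection: classify the pbjs global as
// complete (version available), partial (global exists), or queue-only
// (commands queued in pbjs.que before the library finished loading).
async function detectPrebid(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30_000 });
    // Poll for the global rather than checking once; ad tech often loads late.
    await page
      .waitForFunction(() => (globalThis as any).pbjs !== undefined, {
        polling: 500,
        timeout: 15_000,
      })
      .catch(() => undefined); // no pbjs within the window; classify below
    return await page.evaluate(() => {
      const pbjs = (globalThis as any).pbjs;
      if (!pbjs) return 'none';
      if (typeof pbjs.version === 'string') return `complete (${pbjs.version})`;
      if (Array.isArray(pbjs.que)) return 'queue-only';
      return 'partial';
    });
  } finally {
    await browser.close();
  }
}
```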
Example invocations for the performance features described above:

```bash
# Enable database maintenance
node ./bin/run.js scan --maintenance --vacuum --analyze

# Performance monitoring
node ./bin/run.js scan --verbose --logDir=performance-logs

# Cache is automatically enabled for GitHub sources
# Cache statistics available in verbose logs
node ./bin/run.js scan --githubRepo URL --verbose

# Process large lists efficiently with ranges
node ./bin/run.js scan --githubRepo URL --range="10000-20000" --batchMode

# Batch processing with optimal performance
node ./bin/run.js scan --batchMode --startUrl=1 --totalUrls=50000 --batchSize=1000
```
This application is now structured as an oclif (Open CLI Framework) command-line interface.

- Running commands:
  - In development: `node ./bin/dev.js [COMMAND] [FLAGS]`
  - In production (after `npm run build`): `node ./bin/run.js [COMMAND] [FLAGS]`
  - If the package is linked globally (`npm link`) or installed globally, you can use the bin name directly: `app [COMMAND] [FLAGS]` (Note: `app` is the default bin name configured in `package.json`.)
- Default Command:
  - Running `node ./bin/run.js` (or `node ./bin/dev.js`) without any specific command will execute the default command defined in `src/commands/index.ts`. This command currently runs the main Prebid integration monitoring logic.
- Getting Help:
  - To see general help for the CLI and a list of available commands: `node ./bin/run.js --help` (or, in development, `node ./bin/dev.js --help`)
  - For help on a specific command: `node ./bin/run.js [COMMAND] --help`
The `scan` command is used to analyze a list of websites for Prebid.js integrations and other specified ad technology libraries. It processes URLs from an input file, launches Puppeteer instances to visit these pages, and collects data, saving the results to an output directory.

Syntax:

```bash
./bin/run scan [INPUTFILE] [FLAGS...]
# or in development: node ./bin/dev.js scan [INPUTFILE] [FLAGS...]
# or using npm script: npm run prebid:scan -- [INPUTFILE] [FLAGS...] (note the -- to pass arguments to the script)
```
Argument:

- `INPUTFILE`: (Optional) Path to a local input file containing URLs.
  - Accepts `.txt` (one URL per line), `.csv` (URLs in the first column), or `.json` (extracts all string values that are URLs) files.
  - This argument is required if `--githubRepo` is not used.
  - If `--githubRepo` is provided, `INPUTFILE` is ignored.
  - If neither `INPUTFILE` nor `--githubRepo` is specified, the command will show an error (unless the default `src/input.txt` exists and is readable).
  - Defaults to `src/input.txt` if no other source is specified.
Flags:

- `--githubRepo <URL>`: Specifies a public GitHub URL from which to fetch URLs.
  - This can be a base repository URL (e.g., `https://github.com/owner/repo`) to scan for URLs within processable files (like `.txt`, `.json`, `.md`) in the repository root.
  - Alternatively, it can be a direct link to a specific processable file within a repository (e.g., `https://github.com/owner/repo/blob/main/some/path/file.txt`). In this case, only the specified file will be fetched and processed.
  - Example (repository): `--githubRepo https://github.com/owner/repo`
  - Example (direct file): `--githubRepo https://github.com/owner/repo/blob/main/urls.txt`
- `--numUrls <number>`: When used with `--githubRepo`, limits the number of URLs extracted and processed from the repository.
  - Default: `100`
  - Example: `--numUrls 50`
- `--puppeteerType <option>`: Specifies the Puppeteer operational mode.
  - Options: `vanilla`, `cluster` (default)
  - `vanilla`: Processes URLs sequentially using a single Puppeteer browser instance.
  - `cluster`: Uses `puppeteer-cluster` to process URLs in parallel, according to the concurrency settings.
- `--concurrency <value>`: Sets the number of concurrent Puppeteer instances when using `puppeteerType=cluster`.
  - Default: `5`
- `--headless`: Runs Puppeteer in headless mode (no UI). This is the default.
- `--no-headless`: Runs Puppeteer with a visible browser UI.
- `--monitor`: Enables `puppeteer-cluster`'s web monitoring interface (available at `http://localhost:21337` by default) when `puppeteerType=cluster`.
  - Default: `false`
- `--outputDir <value>`: Specifies the directory where scan results (JSON files) will be saved.
  - Default: `output` (in the project root)
  - Results are typically saved in a subdirectory structure like `output/Month/YYYY-MM-DD.json`.
  - Note: By default, scan results are automatically stored in the `store/` directory using the format `store/mmm-yyyy/yyyy-mm-dd.json` (e.g., `store/Apr-2025/2025-04-15.json`) and appended to existing files.
- `--logDir <value>`: Specifies the directory where log files (`app.log`, `error.log`) will be saved.
  - Default: `logs` (in the project root)
- `--range <string>`: Specifies a line range (e.g., `10-20`, `5-`, `-15`) to process from the input source (file, CSV, or GitHub-extracted URLs). Uses 1-based indexing. If the source is a GitHub repo, the range applies to the aggregated list of URLs extracted from all targeted files in the repo.
  - Example: `--range 10-50` or `--range 1-`
- `--chunkSize <number>`: Processes URLs in chunks of this size. All URLs (whether from the full input or a specified range) are processed, but only `chunkSize` URLs are loaded and analyzed at a time. Useful for very large URL lists, to manage resources or to process incrementally.
  - Example: `--chunkSize 50`
- `--verbose`: A boolean flag controlling the verbosity of log output, especially for errors.
  - Default: `false`
  - When `false` (default): Error messages related to URL processing are shortened for console display (e.g., `error: Error processing http://example.com: Connection timed out`). Full details are still available in `logs/error.log`.
  - When `true` (`--verbose`): Full error messages, including stack traces where available, are displayed in the console for all logs, including errors. This provides maximum detail for debugging.
Usage Examples:

- Basic scan (using default `input.txt` and cluster mode): `./bin/run scan` (Ensure `./bin/run` has execute permissions, or use `node ./bin/run scan`.)
- Scan using vanilla Puppeteer: `./bin/run scan --puppeteerType=vanilla`
- Scan with a specific input file and output directory: `./bin/run scan my_urls.txt --outputDir=./my_results`
- Scan in non-headless (headed) mode: `./bin/run scan --no-headless`
- Scan with increased concurrency and monitoring for cluster mode: `./bin/run scan --concurrency=10 --monitor`
- Scan URLs from a GitHub repository: `./bin/run scan --githubRepo https://github.com/owner/repo`
- Scan a limited number of URLs from a GitHub repository: `./bin/run scan --githubRepo https://github.com/owner/repo --numUrls 50`
- Scan URLs from a local CSV file (using the INPUTFILE argument): `./bin/run scan ./data/urls_to_scan.csv`
- Scan URLs from a local JSON file (using the INPUTFILE argument): `./bin/run scan ./data/urls_to_scan.json`
- Scan a specific range of URLs from a large input file, in chunks: `./bin/run scan very_large_list_of_sites.txt --range 1001-2000 --chunkSize 100`
By default, the scan command automatically:

- Stores Prebid Data: Results are saved in the `store/mmm-yyyy/yyyy-mm-dd.json` format (e.g., `store/Apr-2025/2025-04-15.json`) and appended to existing files for the current date.
- Logs URLs with No Prebid: URLs where no Prebid.js integration is found are appended to `errors/no_prebid.txt`.
- Logs Error URLs:
  - URLs with name resolution errors (`ERR_NAME_NOT_RESOLVED`) are logged to `errors/navigation_errors.txt` in the format `url,ERROR_CODE`.
  - Other processing errors are logged to `errors/error_processing.txt` with timestamp, URL, message, and error details.
Input parsing notes:

- When processing `.txt` files (or GitHub files treated as text), the scanner looks for fully qualified URLs (e.g., `http://example.com`) and also attempts to identify schemeless domains and prepend `https://` (e.g., `example.com` becomes `https://example.com`).
- For `.csv` files, URLs are expected to be in the first column and should be fully qualified (e.g., `http://example.com`). Schemeless domains from CSVs are currently skipped.
- For `.json` files, all string values are recursively scanned and fully qualified URLs are extracted. If a malformed JSON file is encountered, a fallback regex scan of the raw content is performed.
- Entries in input sources that are malformed (e.g., `htp://missing-t.com`) or use unsupported schemes (e.g., `ftp://`) are generally skipped.
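A simplified sketch of the `.txt` extraction rules above (fully qualified URLs pass through, schemeless domains get `https://`, unsupported schemes and malformed entries are skipped); the real parser may differ:

```typescript
// Normalize one line from a .txt input source per the rules described above.
// Returns the URL to scan, or null if the entry should be skipped.
function normalizeLine(line: string): string | null {
  const trimmed = line.trim();
  if (trimmed === '') return null;
  if (/^https?:\/\//i.test(trimmed)) return trimmed;          // fully qualified
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)) return null;  // ftp://, htp://, etc.
  if (/^[\w-]+(\.[\w-]+)+(\/\S*)?$/.test(trimmed)) {
    return `https://${trimmed}`;                              // schemeless domain
  }
  return null;                                                // malformed entry
}

console.log(normalizeLine('example.com'));        // "https://example.com"
console.log(normalizeLine('ftp://example.com'));  // null (unsupported scheme)
```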
The `stats:generate` command processes stored website scan data (typically found in the `./store` directory, generated by the `scan` command) to generate or update the `api/api.json` file. This JSON file contains aggregated statistics about Prebid.js usage, including version distributions and module usage, after cleaning and categorization.

Syntax:

```bash
./bin/run stats:generate
# or in development: node ./bin/dev.js stats:generate
# or using npm script (if configured): npm run prebid:stats:generate
```
Flags:

This command currently does not take any specific flags.

Usage Example:

- Generate or update the statistics API file: `./bin/run stats:generate` (Ensure `./bin/run` has execute permissions, or use `node ./bin/run stats:generate`.)
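To illustrate the kind of aggregation `stats:generate` performs, here is a hedged sketch; the store-file shape used below (`prebidInstances` entries with a `version` field) is an assumption for illustration, not the actual schema:

```typescript
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

// Hypothetical site-result shape; the real store schema may differ.
interface SiteResult {
  url: string;
  prebidInstances?: { version: string }[];
}

// Count Prebid.js versions across monthly store files (store/mmm-yyyy/*.json).
const versionCounts: Record<string, number> = {};
for (const month of readdirSync('./store')) {
  for (const file of readdirSync(join('./store', month))) {
    const sites: SiteResult[] = JSON.parse(
      readFileSync(join('./store', month, file), 'utf8'),
    );
    for (const site of sites) {
      for (const inst of site.prebidInstances ?? []) {
        versionCounts[inst.version] = (versionCounts[inst.version] ?? 0) + 1;
      }
    }
  }
}
console.log(versionCounts); // e.g. { "9.27.0": 1234, ... } -- illustrative only
```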
The `inspect` command fetches data from a given URL and stores the HTTP request and response details to a file. This is useful for archiving web content or debugging network interactions. It supports saving data in JSON format and a basic HAR (HTTP Archive) format.

Syntax:

```bash
./bin/run inspect <URL> [FLAGS...]
# or in development: node ./bin/dev.js inspect <URL> [FLAGS...]
```

Argument:

- `<URL>`: (Required) The full URL to inspect (e.g., `https://example.com`).
Flags:

- `--output-dir <value>`: Specifies the directory where the inspection data will be saved.
  - Default: `store/inspect`
- `--format <option>`: The format in which to store the inspection data.
  - Options: `json` (default), `har`
  - The HAR implementation is currently basic, capturing essential request/response details.
- `--filename <value>`: The base filename for the inspection data (without extension).
  - If not provided, a filename is generated automatically from the URL's hostname and a timestamp (e.g., `example_com-YYYY-MM-DDTHH-mm-ss-SSSZ`).
Usage Examples:

- Inspect a URL and save as JSON to the default directory: `./bin/run inspect https://example.com`
- Inspect a URL and save as HAR with a custom filename and directory: `./bin/run inspect https://api.example.com/data --format har --output-dir my-inspections --filename api-data-archive`
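Conceptually, the JSON output records the request/response exchange. A standalone sketch of the same idea (field names here are illustrative, not the command's actual output schema):

```typescript
import { writeFileSync } from 'node:fs';

// Fetch a URL and record the exchange as JSON -- an approximation of what
// `inspect` does, using Node's built-in fetch (Node 18+).
async function inspectUrl(url: string, outFile: string): Promise<void> {
  const startedAt = new Date().toISOString();
  const response = await fetch(url);
  const record = {
    request: { url, method: 'GET', startedAt },
    response: {
      status: response.status,
      headers: Object.fromEntries(response.headers.entries()),
      body: await response.text(),
    },
  };
  writeFileSync(outFile, JSON.stringify(record, null, 2));
}

inspectUrl('https://example.com', 'example_com-inspection.json').catch(console.error);
```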
To run the basic test suite using Vitest:

```bash
npm test
```

To run the complete validation pipeline (recommended before commits):

```bash
npm run validate:all
```
This runs:
- TypeScript type checking
- ESLint code linting
- Prettier code formatting
- Documentation synchronization
- All unit tests
- Critical integration tests
- GitHub range processing validation
```bash
# Run all tests including critical integration tests
npm run test:all

# Run integration tests only
npm run test:integration

# Run regression tests (includes performance tests)
npm run test:regression

# Run critical tests (GitHub range, CLI, Puppeteer accuracy)
npm run test:critical
```
Set up automatic validation before commits:

```bash
# Setup git hooks
npm run setup:hooks

# Manual pre-commit validation
npm run validate:pre-commit
```
Note on `vitest` execution: In some environments, if `vitest` is not found via the npm script, you might need to run it using `npx`:

```bash
npx -p vitest vitest run
```
This project uses ESLint for linting and Prettier for code formatting to ensure code quality and consistency.
- ESLint: Configuration is managed in `eslint.config.js`. It uses `@typescript-eslint/parser` for TypeScript support and integrates Prettier rules to avoid conflicts.
- Prettier: Configuration is managed in `.prettierrc.cjs`.
- Ignore files: `.prettierignore` specifies files that Prettier should not format. An `.eslintignore` file may also be created, though for ESLint v9+ the `ignores` option in `eslint.config.js` is preferred.
- To format all supported files in the project:

  ```bash
  npm run format
  ```

  This command uses Prettier to rewrite files in place according to the rules in `.prettierrc.cjs`.

- To lint all TypeScript files and automatically fix fixable issues:

  ```bash
  npm run lint -- --fix
  ```

  This command uses ESLint to analyze the code. The `--fix` flag instructs ESLint to automatically correct problems where possible; any errors that cannot be auto-fixed are reported in the console.

- To check for linting errors without fixing:

  ```bash
  npm run lint
  ```
It's recommended to run these scripts before committing code to maintain a clean and consistent codebase.
This application utilizes the Winston logging library to provide detailed and structured logging.
Log files are stored in the `logs/` directory, which is excluded from Git commits via `.gitignore`.

- `logs/app.log`: Contains all general application logs, including informational messages, warnings, and errors (typically `info` level and above). Entries are stored in JSON format, allowing easy parsing and querying; each includes a timestamp, log level, message, and any additional metadata.
- `logs/error.log`: Dedicated to error-level logs only, providing a focused view of errors that have occurred within the application, also in JSON format. Error logs include stack traces when available.
In addition to file logging, messages are also output to the console:

- Log messages are colorized by severity level for readability (e.g., errors in red, warnings in yellow).
- The console typically displays `info` level messages and above.
- The format includes the timestamp, log level, and message, similar to the file logs.
All log entries, whether in files or on the console, follow a consistent format:

- Timestamp: `YYYY-MM-DD HH:mm:ss`
- Level: The severity of the log (e.g., `info`, `warn`, `error`).
- Message: The main content of the log entry.
- Stack Trace: For error logs, a stack trace is included if available, aiding in debugging.
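A minimal Winston configuration that would produce the transports and formats described above (a sketch for orientation; the project's actual logger setup lives in the source and may differ):

```typescript
import winston from 'winston';

// JSON file logs plus a colorized console, matching the behavior described above.
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp({ format: 'YYYY-MM-DD HH:mm:ss' }),
    winston.format.errors({ stack: true }), // attach stack traces to errors
    winston.format.json(),
  ),
  transports: [
    new winston.transports.File({ filename: 'logs/app.log' }),
    new winston.transports.File({ filename: 'logs/error.log', level: 'error' }),
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(), // color by severity level
        winston.format.printf(
          ({ timestamp, level, message }) => `${timestamp} ${level}: ${message}`,
        ),
      ),
    }),
  ],
});

logger.info('Scan started');
```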
OpenTelemetry has been integrated to provide distributed tracing capabilities.
- The main tracer configuration can be found in `src/tracer.ts`.
- Log messages (both console and file) are automatically enriched with `trace_id` and `span_id` when generated within an active trace.
- The default OTLP HTTP exporter is used. You may need to set the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable to point to your OpenTelemetry collector (e.g., `http://localhost:4318/v1/traces`).
- The service name for OpenTelemetry is configured via the `OTEL_SERVICE_NAME` environment variable (e.g., `export OTEL_SERVICE_NAME="prebid-integration-monitor"`). Setting it directly in `src/tracer.ts` via the `Resource` attribute caused a TypeScript build error (TS2693) due to OpenTelemetry package version incompatibilities, so that code remains commented out.
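For example, to point traces at a local collector when running the compiled CLI (using the endpoint and service name mentioned above):

```bash
export OTEL_SERVICE_NAME="prebid-integration-monitor"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318/v1/traces"
npm start
```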