
Conversation

@alissawu commented Oct 13, 2025

Closes #38
📌 What’s Changed

  • Added scraper dashboard for monitoring job and error status
  • Wrote queries to fetch job and error data, joining errors to jobs so errors can be cross-referenced with job types
  • Added HTML escaping to prevent XSS vulnerabilities (sketched just below)
  • Added seed.ts with simulated data; seed the local database with bun run seed
  • Implemented visual stats bars with job/error breakdowns
  • Added client-side filtering by job type, status, and error type
  • Added URL search for both jobs and errors
  • Added pagination (12 items per page)
  • Added expandable error details for failed jobs
  • Added auto-refresh (30 seconds) with a manual refresh button
Screen.Recording.2025-10-13.at.12.32.51.AM-2.mp4
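
For reference, a minimal sketch of the DOM-based escaping approach this PR describes (the actual helper lives in apps/scraper/public/dashboard.js; the name escapeHtml matches the review notes below, but the exact body is an assumption):

```ts
// Sketch of DOM-based HTML escaping: assigning to textContent lets the
// browser escape <, >, &, etc., and innerHTML reads the escaped markup back.
function escapeHtml(text: string): string {
  const div = document.createElement("div");
  div.textContent = text;
  return div.innerHTML;
}
```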

✅ Actions

  • Possibly make "failed jobs" more prominent by highlighting it, or adding a red dot or similar indicator that the row is expandable
  • Add authentication: basic user/password authentication or Cloudflare Access; not done yet only because I'm not sure which the team needs
  • Make CSS pixel values less hardcoded?
  • Make the auto-refresh timeframe an input? (currently 30 seconds; I think this is fine as-is)

📝 Notes for Reviewer

  1. To run:
    navigate to the scraper directory (apps/scraper)
    bun run db:generate
    bun run db:migrate:local
    bun run seed (this seeds your local DB with the generated data)
    bun run dev
    go to the wrangler:info URL shown in the terminal

  2. Questions / issues

  • The biome check does not pass because of dangerouslySetInnerHTML, but this is intentional: we need to inject client-side JS for the dashboard refresh and pagination. The script is static and server-controlled, and we don't embed any dynamic user input, so it is safe from XSS. I also added HTML escaping for extra future safety, plus a warning suppressor at the end of components.tsx (sketched below).
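
A hedged sketch of the suppression pattern described here, assuming Biome's lint/security/noDangerouslySetInnerHtml rule and Hono JSX (the real code is in components.tsx; the component name is hypothetical):

```tsx
// Sketch only: the injected script is a static, server-controlled string,
// so the lint rule is suppressed at this one call site.
const DashboardScript = ({ content }: { content: string }) => (
  // biome-ignore lint/security/noDangerouslySetInnerHtml: static server-controlled script, no user input
  <script dangerouslySetInnerHTML={{ __html: content }} />
);
```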
    Notes on assumptions:
  • I used innerJoin on the jobs table when querying errors. This assumes every error has a valid job, i.e., if an error exists, it corresponds to a job in the jobs table (a query sketch follows these notes).
    • The catch is that an error without a corresponding job won't be queried. My working assumption is that an error unrelated to any job has already been resolved, but I'm not sure about this.
    • There should be error-deletion handling. When an error is resolved, it should probably be deleted, and a completed job shouldn't retain its errors (imo). This isn't implemented yet, and I'm not sure how to handle the edge cases.
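
For reference, a sketch of the join described above, assuming Drizzle ORM on D1 (suggested by the db:generate/db:migrate scripts); the schema import path and column names (errors.jobId, jobs.id, errors.createdAt) are hypothetical:

```ts
import { drizzle } from "drizzle-orm/d1";
import { desc, eq } from "drizzle-orm";
import { errors, jobs } from "./db/schema"; // hypothetical path

// Inner join: every error row is assumed to reference an existing job, so
// no rows are dropped; an orphaned error (if one could exist) would be
// silently excluded here.
async function getRecentErrors(d1: D1Database) {
  const db = drizzle(d1);
  return db
    .select()
    .from(errors)
    .innerJoin(jobs, eq(errors.jobId, jobs.id))
    .orderBy(desc(errors.createdAt))
    .limit(100); // matches the cap mentioned below
}
```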
  • Currently I've capped the queries: 500 jobs and 100 errors. This can be changed, of course.
  • I made filtering etc. client-side because 1) it's faster (no network round trip) and 2) it lowers D1 query cost, since I'm not sure how many queries per day we're expecting. I'm assuming there will be under ~20k jobs/errors to query, so client-side is still extremely fast (a sketch follows these notes).
  • Right now auto-refresh (30 sec) is an option, but by default refresh is a manual button. This saves query costs, and auto-refresh can always be turned on. I can make auto the default if wanted.
  • CSS heights etc. are hardcoded; I may make them relative later. We're all on computers, so I don't think it'll vary much, but tbd.
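
A sketch of the client-side filter/paginate approach, with assumed row shapes (only the 12-per-page constant and the filter fields come from the PR text):

```ts
interface Job {
  id: number;
  type: string;
  status: string;
  url: string;
}

const ITEMS_PER_PAGE = 12;

// Filter in memory: no network round trip and no extra D1 reads.
function filterJobs(all: Job[], type?: string, status?: string, search?: string): Job[] {
  return all.filter(
    (j) =>
      (!type || j.type === type) &&
      (!status || j.status === status) &&
      (!search || j.url.includes(search)),
  );
}

// Slice out one page of the filtered rows (pages are 1-indexed).
function pageOf(rows: Job[], page: number): Job[] {
  const start = (page - 1) * ITEMS_PER_PAGE;
  return rows.slice(start, start + ITEMS_PER_PAGE);
}
```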

@chenxin-yan changed the title from "Scraper status dashboard" to "feat(scraper): add scraping status dashboard" on Oct 13, 2025
@chenxin-yan left a comment

Great job! Looking good, with some things to consider:

const REFRESH_INTERVAL = 30000; // 30 sec
const ITEMS_PER_PAGE = 12;
// goes inside <script> tags
const scriptContent = `

It isn't ideal or elegant to have a large block of JS code as a string like this. You could move it to a separate JS file and run it as a service worker in the user's browser. I haven't looked into how to do this with Cloudflare Workers and Hono; you can look into it.

Also, we are currently long-polling data from the database, which is not ideal either. You should look into Cloudflare Durable Objects and use WebSockets for real-time updates instead of long polling. Let me know if you need any help with it.
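
For anyone picking this up, a minimal sketch of the Durable Object + WebSocket idea (the class name is hypothetical; wrangler bindings and migrations are omitted):

```ts
// A Durable Object that holds open WebSocket connections and fans out
// dashboard updates, replacing the polling loop.
export class DashboardHub {
  private sockets = new Set<WebSocket>();

  async fetch(request: Request): Promise<Response> {
    if (request.headers.get("Upgrade") !== "websocket") {
      return new Response("Expected a WebSocket upgrade", { status: 426 });
    }
    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair);
    server.accept();
    this.sockets.add(server);
    server.addEventListener("close", () => this.sockets.delete(server));
    return new Response(null, { status: 101, webSocket: client });
  }

  // Call this after the scraper writes a job/error row to push fresh data.
  broadcast(payload: unknown): void {
    const message = JSON.stringify(payload);
    for (const ws of this.sockets) ws.send(message);
  }
}
```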

@chenxin-yan commented

To answer some of your questions:

> Possibly make "failed jobs" more prominent by highlighting it, or adding a red dot or similar indicator that the row is expandable

It's totally up to you. Looks good so far.

> Add authentication: basic user/password authentication or Cloudflare Access; not done yet only because I'm not sure which the team needs

You can just ignore auth for now: there is no sensitive info displayed in the dashboard, and it is purely presentational, so it should be fine.

> Make CSS pixel values less hardcoded?

It's up to you.

> Make the auto-refresh timeframe an input? (currently 30 seconds; I think this is fine as-is)

As I suggested in the code review, check out Durable Objects; we can use a WebSocket connection for real-time updates instead of long polling.

> The catch is that an error without a corresponding job won't be queried. My working assumption is that an error unrelated to any job has already been resolved, but I'm not sure about this.

You can safely make this assumption: we will not delete any jobs, and when an error record is created, the job must exist.
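
For what it's worth, this invariant could also be enforced at the schema level; a hedged Drizzle sketch with hypothetical table and column names:

```ts
import { integer, sqliteTable, text } from "drizzle-orm/sqlite-core";

export const jobs = sqliteTable("jobs", {
  id: integer("id").primaryKey(),
  type: text("type").notNull(),
  status: text("status").notNull(),
});

// A foreign key guarantees an error row can only be created for an
// existing job, which is exactly the inner-join assumption above.
export const errors = sqliteTable("errors", {
  id: integer("id").primaryKey(),
  jobId: integer("job_id")
    .notNull()
    .references(() => jobs.id),
  message: text("message").notNull(),
});
```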

> There should be error-deletion handling. When an error is resolved, it should probably be deleted, and a completed job shouldn't retain its errors (imo). This isn't implemented yet, and I'm not sure how to handle the edge cases.

IMO we shouldn't delete any error records; they're useful for monitoring/debugging, e.g. checking how many retries a given job took to succeed and what the error details were.

> I made filtering etc. client-side because 1) it's faster (no network round trip) and 2) it lowers D1 query cost, since I'm not sure how many queries per day we're expecting. I'm assuming there will be under ~20k jobs/errors to query, so client-side is still extremely fast.

Yes, client-side filtering is good for this case.

> CSS heights etc. are hardcoded; I may make them relative later. We're all on computers, so I don't think it'll vary much, but tbd.

It's up to you. Feel free to hold off on this for now.

@chenxin-yan force-pushed the main branch 4 times, most recently from 7072356 to 503ca3f on October 19, 2025 at 02:11

🤖 Manus AI Review

Summary

This PR adds a scraping status dashboard to the scraper app, including a new client-side JavaScript file for the dashboard UI, a seed script to populate the database with sample data, and updates to package.json for new scripts. The dashboard supports live data fetching, filtering, pagination, expandable error details, and visual stats bars for jobs and errors.

Issues

🟡 MEDIUM: The dashboard.js fetches data from the root path '/' expecting JSON data. This assumes the server serves JSON at this endpoint, which may conflict with other routes or static assets. There is no fallback or configurable API endpoint, which may reduce flexibility.

  • Location: apps/scraper/public/dashboard.js

🟡 MEDIUM: The client-side code stores all jobs and errors in memory and performs filtering and pagination on the client. While this works for moderate data sizes, it may cause performance issues or high memory usage if the dataset grows large.

  • Location: apps/scraper/public/dashboard.js

🟡 MEDIUM: The pagination logic wraps around when clicking previous on the first page or next on the last page, which may be unexpected UX behavior. For example, clicking 'prev' on page 1 jumps to the last page instead of disabling the button or staying on page 1.

  • Location: apps/scraper/public/dashboard.js

🟡 MEDIUM: The error expansion rows are inserted directly as sibling rows in the table without ARIA or accessibility considerations, which may impact screen reader users.

  • Location: apps/scraper/public/dashboard.js

🟢 LOW: The escapeHtml function uses DOM methods, which is good, but the code builds HTML via string concatenation, which is error-prone and harder to maintain. Templating or safer DOM manipulation methods would improve maintainability.

  • Location: apps/scraper/public/dashboard.js

🟢 LOW: The seed.ts script contains a large amount of hardcoded seed data. While useful for development, it may be better to generate data programmatically or split into smaller chunks for maintainability.

  • Location: apps/scraper/scripts/seed.ts

🟢 LOW: The fetchAndUpdate function alerts on fetch failure, which may be intrusive. Consider using a less disruptive UI notification.

  • Location: apps/scraper/public/dashboard.js

Suggestions

💡 Make the API endpoint for fetching dashboard data configurable or use a dedicated endpoint (e.g., '/api/dashboard') to avoid conflicts and improve clarity.

  • Location: apps/scraper/public/dashboard.js
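
A hedged sketch of what a dedicated endpoint could look like, assuming a Hono app with D1 bound as DB (the route name follows the suggestion; everything else is an assumption):

```ts
import { Hono } from "hono";
import { drizzle } from "drizzle-orm/d1";
import { errors, jobs } from "./db/schema"; // hypothetical path

const app = new Hono<{ Bindings: { DB: D1Database } }>();

// Serve dashboard data from its own route so "/" can stay the HTML page.
app.get("/api/dashboard", async (c) => {
  const db = drizzle(c.env.DB);
  const jobRows = await db.select().from(jobs).limit(500);
  const errorRows = await db.select().from(errors).limit(100); // join as sketched earlier
  return c.json({ jobs: jobRows, errors: errorRows });
});

export default app;
```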

💡 Consider implementing server-side filtering and pagination to improve performance and scalability for large datasets.

  • Location: apps/scraper/public/dashboard.js

💡 Update pagination logic to disable navigation buttons at boundaries instead of wrapping around, to align with common UX patterns.

  • Location: apps/scraper/public/dashboard.js
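
A sketch of boundary-aware paging per this suggestion (the button elements are hypothetical):

```ts
// Clamp the requested page and disable the buttons at the edges instead of
// wrapping around.
function setPage(
  requested: number,
  totalPages: number,
  prevBtn: HTMLButtonElement,
  nextBtn: HTMLButtonElement,
): number {
  const page = Math.min(Math.max(requested, 1), Math.max(totalPages, 1));
  prevBtn.disabled = page <= 1; // no jump from page 1 to the last page
  nextBtn.disabled = page >= totalPages; // no jump from the last page to page 1
  return page;
}
```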

💡 Improve accessibility of expandable error rows by adding appropriate ARIA attributes and keyboard navigation support.

  • Location: apps/scraper/public/dashboard.js
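
A sketch of the ARIA wiring this suggestion asks for (the row and detail-row elements are hypothetical):

```ts
// Mark the job row as an expandable, keyboard-operable control so screen
// readers announce its state.
function makeExpandable(row: HTMLTableRowElement, detailRow: HTMLTableRowElement): void {
  row.setAttribute("role", "button");
  row.setAttribute("tabindex", "0");
  row.setAttribute("aria-expanded", "false");
  detailRow.hidden = true;

  const toggle = () => {
    const open = row.getAttribute("aria-expanded") === "true";
    row.setAttribute("aria-expanded", String(!open));
    detailRow.hidden = open;
  };
  row.addEventListener("click", toggle);
  row.addEventListener("keydown", (e) => {
    if (e.key === "Enter" || e.key === " ") {
      e.preventDefault();
      toggle();
    }
  });
}
```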

💡 Refactor HTML rendering in dashboard.js to use safer DOM manipulation or templating libraries to reduce risk of injection and improve maintainability.

  • Location: apps/scraper/public/dashboard.js

💡 Consider programmatic generation or modularization of seed data in seed.ts to improve maintainability and reduce file size.

  • Location: apps/scraper/scripts/seed.ts
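
A sketch of programmatic seed generation (job types, statuses, and field names are illustrative, not the actual schema):

```ts
const JOB_TYPES = ["course", "section", "instructor"] as const;
const STATUSES = ["pending", "running", "completed", "failed"] as const;

// 500 deterministic rows instead of a wall of hardcoded literals.
const seedJobs = Array.from({ length: 500 }, (_, i) => ({
  id: i + 1,
  type: JOB_TYPES[i % JOB_TYPES.length],
  status: STATUSES[i % STATUSES.length],
  createdAt: new Date(Date.now() - i * 60_000).toISOString(),
}));
```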

💡 Replace alert in fetchAndUpdate with a non-blocking UI notification to improve user experience.

  • Location: apps/scraper/public/dashboard.js
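
A sketch of a non-blocking replacement for alert() (the #status-banner element is hypothetical):

```ts
// Show a transient banner instead of a modal alert on fetch failure.
function notify(message: string, ms = 5000): void {
  const el = document.getElementById("status-banner");
  if (!el) return;
  el.textContent = message;
  el.hidden = false;
  setTimeout(() => {
    el.hidden = true;
  }, ms);
}
```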

Positive Notes

✅ The dashboard implements client-side filtering, pagination, and expandable error details, providing a rich interactive UI.

✅ Use of escapeHtml function to sanitize text content helps prevent XSS vulnerabilities.

✅ The stats bars provide a clear visual summary of job and error statuses with legends and tooltips.

✅ The code includes error handling for fetch failures and logs useful messages to the console.

✅ The seed script provides comprehensive sample data covering various job types and statuses, aiding development and testing.

✅ The auto-refresh toggle and manual refresh button give users control over data updates.

✅ Pagination buttons are disabled appropriately when no results are present, maintaining consistent UI layout.

Overall Assessment

This PR delivers a functional and user-friendly scraping status dashboard with good attention to security via escaping and error handling. The client-side architecture is straightforward and aligns with the project's existing patterns. However, there are some medium-impact issues related to scalability, UX behavior in pagination, and accessibility that should be addressed to improve maintainability and user experience. The seed data is extensive but could benefit from modularization. Overall, the quality is good and the feature is a valuable addition, but some refinements are recommended before merging.


This review was automatically generated by Manus AI. Please review and verify before committing to architectural decisions.

@chenxin-yan force-pushed the main branch 2 times, most recently from deef787 to 1d1f688 on October 31, 2025 at 20:45