Add progress tracker for identifying source files #188

Pennycook · 2025-04-23T09:52:33Z

If a code base contains many files (e.g., because a build directory was accidentally included) it can take a long time to identify the source files in the code base. This commit adds a progress tracker with tqdm to make it clear that CBI is doing something and hasn't just stalled.

We don't get a progress bar, because the number of files in the code base is not known a priori, but tqdm will print the number of files found so far.

Related issues

N/A

Proposed changes

Build up the filenames set programmatically instead of using set(codebase) so that we can track progress.
Wrap most of the filenames construction in tqdm.

Note that the addition of files listed explicitly in the configuration (L176-L179) is not tracked. In my testing, these three lines basically take no time at all. The reason that set(codebase) takes so long is that the CodeBase iterator calls rglob("*") and checks every path in the directory where codebasin is run, so the time scales with the number of files in the directory, not the number of files in the compile_commands.json file. It may be possible to improve the performance of this iterator, but I think we can treat that as orthogonal to this UX improvement.
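The proposed change can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `iter_codebase` is a hypothetical stand-in for the CodeBase iterator, and the `desc` and `unit` strings are made up. Because no `total` is passed, tqdm reports a running count and rate rather than drawing a bar.

```python
import io

try:
    from tqdm import tqdm
except ImportError:  # fallback so the sketch still runs without tqdm installed
    def tqdm(iterable, **kwargs):
        return iterable


def iter_codebase():
    """Hypothetical stand-in for iterating over a CodeBase object."""
    yield from ["src/main.cpp", "src/util.cpp", "include/util.h", "src/main.cpp"]


def collect_filenames(paths, file=None):
    """Build the set one path at a time so tqdm can report a running count."""
    filenames = set()
    # No `total` is given, so tqdm prints the count and rate but no bar.
    for path in tqdm(paths, desc="Finding source files", unit=" files", file=file):
        filenames.add(path)
    return filenames


# Progress output is sent to a StringIO here only to keep the example quiet.
filenames = collect_filenames(iter_codebase(), file=io.StringIO())
```

Compared with `set(codebase)`, the loop does the same work but gives tqdm a hook on every iteration, which is what makes the running count possible.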

If a source directory contains many files (e.g., because a build directory was accidentally included) it can take a long time to identify the source files in the code base. This commit adds a progress tracker with tqdm to make it clear that CBI is doing something and hasn't just stalled.

We don't get a progress bar, because the number of files in the code base is not known a priori, but tqdm will print the number of files found so far.

Signed-off-by: John Pennycook <[email protected]>

laserkelvin

I think this is definitely a UX improvement - that said, I don't know if it's just me, but progress bars to the unknown are also kind of irking because I would like to know what the estimated time is, as opposed to just the rate. I definitely think that's out of scope for this PR, but probably worth keeping tabs on whether this refactor can be made.

The alternative perspective is that if the build directory is accidentally included, could we warn the user in case that is not their intention. The bad thing with taking a long (and unknown) amount of time is that it'll take a long time for the user to realize that they made a mistake and I think if we can predict that ahead of time it would be better UX as opposed to just an indication of rate.

Pennycook · 2025-04-23T15:59:18Z

> I think this is definitely a UX improvement - that said, I don't know if it's just me, but progress bars to the unknown are also kind of irking because I would like to know what the estimated time is, as opposed to just the rate. I definitely think that's out of scope for this PR, but probably worth keeping tabs on whether this refactor can be made.

Yeah, I agree. We can't ever know the number of files in the code base without evaluating it, so we can't track progress that way. If we instead tracked how many files we were checking, we might be able to add a bar. Getting a list of all the files in the current directory is easy, but figuring out whether they're part of the code base is hard.

> The alternative perspective is that if the build directory is accidentally included, could we warn the user in case that is not their intention. The bad thing with taking a long (and unknown) amount of time is that it'll take a long time for the user to realize that they made a mistake and I think if we can predict that ahead of time it would be better UX as opposed to just an indication of rate.

I thought about this, but the best we'd be able to do is have some heuristic like "this looks like a build directory" (e.g., based on name). There would probably be some false positives. But you're right, some way to say "Hey, you're about to analyze 10,000 files, is that what you expected?" might be useful 😆 .
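A name-based heuristic of the kind discussed here might look something like the sketch below. Everything in it is an assumption made for illustration: the directory names, the threshold, and the `check_for_surprises` function are hypothetical and are not part of CBI.

```python
import logging
from pathlib import Path

log = logging.getLogger(__name__)

# Hypothetical names that often indicate build output; not taken from CBI.
SUSPICIOUS_DIR_NAMES = {"build", "_build", "CMakeFiles", "cmake-build-debug"}
# Hypothetical threshold, echoing the "10,000 files" example above.
FILE_COUNT_THRESHOLD = 10_000


def check_for_surprises(paths, threshold=FILE_COUNT_THRESHOLD):
    """Return warning messages for a list of paths that may be unintended."""
    messages = []
    # Look only at directory components, not at the file name itself.
    suspicious = sorted(
        {
            part
            for p in paths
            for part in Path(p).parts[:-1]
            if part in SUSPICIOUS_DIR_NAMES
        }
    )
    if suspicious:
        messages.append(
            "These directories look like build output: " + ", ".join(suspicious)
        )
    if len(paths) > threshold:
        messages.append(
            f"About to analyze {len(paths)} files; is that what you expected?"
        )
    for message in messages:
        log.warning(message)
    return messages


messages = check_for_surprises(["src/a.cpp", "build/CMakeFiles/a.o"], threshold=1)
```

As noted above, matching on names will produce false positives (a project may legitimately keep sources in a directory called `build`), so a check like this is better suited to a warning than to an error.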

laserkelvin · 2025-04-23T16:10:50Z

> Yeah, I agree. We can't ever know the number of files in the code base without evaluating it, so we can't track progress that way. If we instead tracked how many files we were checking, we might be able to add a bar. Getting a list of all the files in the current directory is easy, but figuring out whether they're part of the code base is hard.

I think providing an upper limit (i.e. materializing the rglob into a list) might be helpful: you know that your run will take up to that long, so you immediately have the "well that's not what I thought" reaction if it seems to take absurdly long.

> I thought about this, but the best we'd be able to do is have some heuristic like "this looks like a build directory" (e.g., based on name). There would probably be some false positives. But you're right, some way to say "Hey, you're about to analyze 10,000 files, is that what you expected?" might be useful 😆 .

Yeah, I think just a warning message that's emitted if certain patterns are matched is good enough: like you said, it might come up with false positives, but it's a good "you better know what you're doing" check.

Instead of calling the CodeBase iterator directly, finder now splits the scan into two steps:

1. Build a list of all the files that might be in the code base.
2. Determine which files are in the code base.

The first step is a lot quicker than the second one, and tqdm can use the length of the list to create a progress bar.

Signed-off-by: John Pennycook <[email protected]>

Pennycook · 2025-04-24T09:30:10Z

> I think providing an upper limit (i.e. materializing the rglob into a list) might be helpful: you know that your run will take up to that long, so you immediately have the "well that's not what I thought" reaction if it seems to take absurdly long.

I've had to split the difference here.

Materializing the rglob into a list takes a non-trivial amount of time if you run codebasin in a silly place, so I've wrapped that step in a tqdm with an unknown bound. It prints "Scanning current directory" and then keeps you posted regarding how many files it's found. This step takes ~4 seconds for the llama.cpp test including the build directory and all test directories, and is almost instant for the HACC stress test.

Determining which of the files in the list is actually a source file we care about is then implemented as a second pass, wrapped in tqdm with a known bound (i.e., the number of files we found in the first step). It prints "Identifying source files" and shows how far it is in real terms. This step takes ~30 seconds for the llama.cpp test and a few seconds for the HACC stress test.
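The two passes described above can be sketched as follows. This is an illustration under stated assumptions rather than the finder code itself: `is_source_file` and `SOURCE_SUFFIXES` are hypothetical stand-ins for the real (and more expensive) check against the code base, and only the two `desc` strings come from the description above.

```python
import io
import tempfile
from pathlib import Path

try:
    from tqdm import tqdm
except ImportError:  # fallback so the sketch still runs without tqdm installed
    def tqdm(iterable, **kwargs):
        return iterable

SOURCE_SUFFIXES = {".c", ".cpp", ".h", ".hpp"}  # hypothetical filter


def is_source_file(path):
    """Hypothetical stand-in for the expensive code base membership check."""
    return path.suffix in SOURCE_SUFFIXES


def find_source_files(root, file=None):
    # Pass 1: the number of paths is unknown, so tqdm shows only a count.
    candidates = []
    for path in tqdm(Path(root).rglob("*"), desc="Scanning current directory",
                     unit=" files", file=file):
        if path.is_file():
            candidates.append(path)

    # Pass 2: tqdm infers the total from len(candidates) and draws a bar.
    sources = set()
    for path in tqdm(candidates, desc="Identifying source files",
                     unit=" files", file=file):
        if is_source_file(path):
            sources.add(path)
    return candidates, sources


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "build").mkdir()
    (root / "src" / "main.cpp").write_text("int main() {}\n")
    (root / "build" / "main.o").write_text("")
    candidates, sources = find_source_files(root, file=io.StringIO())
    n_candidates = len(candidates)
    source_names = sorted(p.name for p in sources)
```

The length of the candidate list is only an upper bound on the number of source files, but it is enough for tqdm to estimate the remaining time for the second pass.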

> Yeah, I think just a warning message that's emitted if certain patterns are matched is good enough: like you said, it might come up with false positives, but it's a good "you better know what you're doing" check.

I'll open a separate issue to track this.

Pennycook added the ux label (Issues and PRs pertaining to user experience) Apr 23, 2025

Pennycook added this to the 2.0.0 milestone Apr 23, 2025

Pennycook requested a review from laserkelvin April 23, 2025 09:52

laserkelvin approved these changes Apr 23, 2025

Pennycook merged commit adeaca1 into P3HPC:main Apr 30, 2025
4 checks passed

Pennycook deleted the feature/finder-progress branch April 30, 2025 05:57
