Add progress tracker for identifying source files #188

Open
wants to merge 2 commits into base: main

@Pennycook Pennycook commented Apr 23, 2025

If a code base contains many files (e.g., because a build directory was accidentally included) it can take a long time to identify the source files in the code base. This commit adds a progress tracker with tqdm to make it clear that CBI is doing something and hasn't just stalled.

We don't get a progress bar, because the number of files in the code base is not known a priori, but tqdm will print the number of files found so far.

Related issues

  • N/A

Proposed changes

  • Build up the filenames set programmatically instead of using set(codebase) so that we can track progress.
  • Wrap most of the filenames construction in tqdm.

Note that the addition of files listed explicitly in the configuration (L176-L179) is not tracked. In my testing, these three lines basically take no time at all. The reason that set(codebase) takes so long is that the CodeBase iterator calls rglob(*) and checks every path in the directory where codebasin is run -- so time scales with the number of files in the directory, not the number of files in the compile_commands.json file. It may be possible to improve the performance of this iterator, but I think we can treat that as orthogonal to this UX improvement.
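
To illustrate the first proposed change, here is a minimal sketch of building the set one element at a time so that tqdm can report a running count. It is not the code in this PR: `collect_filenames`, `codebase`, and `explicit_files` are hypothetical names, and the `ImportError` fallback exists only so the sketch runs without tqdm installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback (assumption for this sketch): pass the iterable through untouched.
    def tqdm(iterable, **kwargs):
        return iterable


def collect_filenames(codebase, explicit_files=()):
    """Build the filename set incrementally instead of calling set(codebase)."""
    filenames = set()
    # No total is passed and the iterator has no length, so tqdm prints a
    # running count and rate rather than a percentage bar.
    for path in tqdm(codebase, desc="Finding source files", unit=" files"):
        filenames.add(path)
    # Files listed explicitly in the configuration are added without tracking.
    filenames.update(explicit_files)
    return filenames
```

The result is the same set that `set(codebase)` would produce, plus the explicit files; only the reporting differs.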

If a source directory contains many files (e.g., because a build directory was
accidentally included) it can take a long time to identify the source files in
the code base. This commit adds a progress tracker with tqdm to make it clear
that CBI is doing something and hasn't just stalled.

We don't get a progress bar, because the number of files in the code base is
not known a priori, but tqdm will print the number of files found so far.

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook added the ux Issues and PRs pertaining to user experience label Apr 23, 2025
@Pennycook Pennycook added this to the 2.0.0 milestone Apr 23, 2025
@Pennycook Pennycook requested a review from laserkelvin April 23, 2025 09:52
@laserkelvin laserkelvin left a comment

I think this is definitely a UX improvement. That said, I don't know if it's just me, but progress bars with an unknown total are also kind of irksome, because I would like to know the estimated time as opposed to just the rate. I definitely think that's out of scope for this PR, but it's probably worth keeping tabs on whether this refactor can be made.

The alternative perspective is that if the build directory is accidentally included, could we warn the user in case that is not their intention? The bad thing about taking a long (and unknown) amount of time is that it'll take a long time for the user to realize that they made a mistake. I think if we can predict that ahead of time, it would be better UX than just an indication of rate.

@Pennycook

> I think this is definitely a UX improvement. That said, I don't know if it's just me, but progress bars with an unknown total are also kind of irksome, because I would like to know the estimated time as opposed to just the rate. I definitely think that's out of scope for this PR, but it's probably worth keeping tabs on whether this refactor can be made.

Yeah, I agree. We can't ever know the number of files in the code base without evaluating it, so we can't track progress that way. If we instead tracked how many files we were checking, we might be able to add a bar. Getting a list of all the files in the current directory is easy, but figuring out whether they're part of the code base is hard.

> The alternative perspective is that if the build directory is accidentally included, could we warn the user in case that is not their intention? The bad thing about taking a long (and unknown) amount of time is that it'll take a long time for the user to realize that they made a mistake. I think if we can predict that ahead of time, it would be better UX than just an indication of rate.

I thought about this, but the best we'd be able to do is have some heuristic like "this looks like a build directory" (e.g., based on name). There would probably be some false positives. But you're right, some way to say "Hey, you're about to analyze 10,000 files, is that what you expected?" might be useful 😆 .

@laserkelvin

> Yeah, I agree. We can't ever know the number of files in the code base without evaluating it, so we can't track progress that way. If we instead tracked how many files we were checking, we might be able to add a bar. Getting a list of all the files in the current directory is easy, but figuring out whether they're part of the code base is hard.

I think providing an upper limit (i.e. materializing the rglob into a list) might be helpful: you know that your run will take up to that long, so you immediately have the "well that's not what I thought" reaction if it seems to take absurdly long.

> I thought about this, but the best we'd be able to do is have some heuristic like "this looks like a build directory" (e.g., based on name). There would probably be some false positives. But you're right, some way to say "Hey, you're about to analyze 10,000 files, is that what you expected?" might be useful 😆 .

Yeah, I think just a warning message that's emitted if certain patterns are matched is good enough: like you said, it might come up with false positives, but it's a good "you better know what you're doing" signal.
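
As a rough sketch of what such a name-based warning could look like (nothing here is part of CBI: the pattern list, the `max_files` threshold, and both function names are assumptions made up for illustration):

```python
import warnings
from pathlib import Path

# Hypothetical patterns; a real list would need tuning and will still
# produce false positives, as discussed above.
BUILD_DIR_NAMES = {"build", "_build", "cmake-build-debug", "cmake-build-release"}


def find_suspected_build_dirs(paths):
    """Return every ancestor directory whose name looks like a build directory."""
    suspects = set()
    for path in paths:
        for parent in Path(path).parents:
            if parent.name.lower() in BUILD_DIR_NAMES:
                suspects.add(parent)
    return suspects


def warn_if_unexpected(paths, max_files=10_000):
    """Emit warnings for suspected build directories or very large scans."""
    paths = list(paths)
    messages = []
    for suspect in sorted(find_suspected_build_dirs(paths)):
        messages.append(f"'{suspect}' looks like a build directory; is it meant to be analyzed?")
    if len(paths) > max_files:
        messages.append(f"About to analyze {len(paths)} files; is that what you expected?")
    for message in messages:
        warnings.warn(message)
    return messages
```

Because the check only looks at directory names, it is cheap enough to run on the candidate list before the expensive identification step starts.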

Instead of calling the CodeBase iterator directly, finder now splits the scan
into two steps:

1. Build a list of all the files that might be in the code base.
2. Determine which files are in the code base.

The first step is a lot quicker than the second one, and tqdm can use the
length of the list to create a progress bar.

Signed-off-by: John Pennycook <[email protected]>
@Pennycook

> I think providing an upper limit (i.e. materializing the rglob into a list) might be helpful: you know that your run will take up to that long, so you immediately have the "well that's not what I thought" reaction if it seems to take absurdly long.

I've had to split the difference here.

Materializing the rglob into a list takes a non-trivial amount of time if you run codebasin in a silly place, so I've wrapped that step in a tqdm with an unknown bound. It prints "Scanning current directory" and then keeps you posted regarding how many files it's found. This step takes ~4 seconds for the llama.cpp test including the build directory and all test directories, and is almost instant for the HACC stress test.

Determining which of the files in the list is actually a source file we care about is then implemented as a second pass, wrapped in tqdm with a known bound (i.e., the number of files we found in the first step). It prints "Identifying source files" and shows how far it is in real terms. This step takes ~30 seconds for the llama.cpp test and a few seconds for the HACC stress test.
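
For readers following along, the two passes described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `scan` and the `is_source` predicate are hypothetical stand-ins for the real code base membership check, and the `ImportError` fallback exists only so the sketch runs without tqdm installed.

```python
from pathlib import Path

try:
    from tqdm import tqdm
except ImportError:
    # Fallback (assumption for this sketch): pass the iterable through untouched.
    def tqdm(iterable, **kwargs):
        return iterable


def scan(root, is_source):
    """Two-pass scan: list candidate paths first, then filter them."""
    # Pass 1: the bound is unknown, so tqdm only shows a running count.
    # Note that rglob("*") yields directories as well as files.
    candidates = list(
        tqdm(Path(root).rglob("*"), desc="Scanning current directory", unit=" files")
    )
    # Pass 2: the bound is now known, so tqdm can draw a bar with an ETA.
    sources = set()
    for path in tqdm(candidates, desc="Identifying source files", total=len(candidates)):
        if is_source(path):
            sources.add(path)
    return sources
```

The trade-off is that the candidate list is held in memory, in exchange for a known upper bound on the second, slower pass.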

> Yeah, I think just a warning message that's emitted if certain patterns are matched is good enough: like you said, it might come up with false positives, but it's a good "you better know what you're doing" signal.

I'll open a separate issue to track this.
