-
Notifications
You must be signed in to change notification settings - Fork 10
Add progress tracker for identifying source files #188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
If a source directory contains many files (e.g., because a build directory was accidentally included) it can take a long time to identify the source files in the code base. This commit adds a progress tracker with tqdm to make it clear that CBI is doing something and hasn't just stalled. We don't get a progress bar, because the number of files in the code base is not known a priori, but tqdm will print the number of files found so far. Signed-off-by: John Pennycook <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is definitely a UX improvement - that said, I don't know if it's just me, but progress bars to the unknown are also kind of irking because I would like to know what the estimated time is, as opposed to just the rate. I definitely think that's out of scope for this PR, but probably worth keeping tabs on whether this refactor can be made.
The alternative perspective is that if the build directory is accidentally included, could we warn the user in case that is not their intention. The bad thing with taking a long (and unknown) amount of time is that it'll take a long time for the user to realize that they made a mistake and I think if we can predict that ahead of time it would be better UX as opposed to just an indication of rate.
Yeah, I agree. We can't ever know the number of files in the code base without evaluating it, so we can't track progress that way. If we instead tracked how many files we were checking, we might be able to add a bar. Getting a list of all the files in the current directory is easy, but figuring out whether they're part of the code base is hard.
I thought about this, but the best we'd be able to do is have some heuristic like "this looks like a build directory" (e.g., based on name). There would probably be some false positives. But you're right, some way to say "Hey, you're about to analyze 10,000 files, is that what you expected?" might be useful 😆 . |
I think providing an upper limit (i.e. materializing the
Yeah I think just a warning message that's emitted if a certain patterns are matched are good enough: like you said, it might come up with false positives but it's like a good "you better know what you're doing" |
Instead of calling the CodeBase iterator directly, finder now splits the scan into two steps: 1. Build a list of all the files that might be in the code base. 2. Determine which files are in the code base. The first step is a lot quicker than the second one, and tqdm can use the length of the list to create a progress bar. Signed-off-by: John Pennycook <[email protected]>
I've had to split the difference here. Materializing the Determining which of the files in the list is actually a source file we care about is then implemented as a second pass, wrapped in
I'll open a separate issue to track this. |
If a code base contains many files (e.g., because a build directory was accidentally included) it can take a long time to identify the source files in the code base. This commit adds a progress tracker with tqdm to make it clear that CBI is doing something and hasn't just stalled.
We don't get a progress bar, because the number of files in the code base is not known a priori, but tqdm will print the number of files found so far.
Related issues
Proposed changes
filenames
set programmatically instead of usingset(codebase)
so that we can track progress.filenames
construction intqdm
.Note that the addition of files listed explicitly in the configuration (L176-L179) is not tracked. In my testing, these three lines basically take no time at all. The reason that
set(codebase)
takes so long is that theCodeBase
iterator callsrglob(*)
and checks every path in the directory wherecodebasin
is run -- so time scales with the number of files in the directory, not the number of files in thecompile_commands.json
file. It may be possible to improve the performance of this iterator, but I think we can treat that as orthogonal to this UX improvement.