Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add local version of the [Colab notebook]( https://github.com/tensor… #847

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

stellarpower
Copy link

It seems that a number of users are having problems running the profiler on their own setups.
Currently, the main source of documentation is a colaboratory notebook - which is therefore limited to Google environments, and has specific packages and jupyter magic set up. And is currently broken, not displaying any data or even the profile tab. The installa and run script only verifies that tensorboard launches - it does not . It is also not suitable ofr users not using virtualenv for their package management (I have installed the tensorflow packages through a mamba environment and conda-forge).

This commit adds a script that is currently just a copy of the notebook in a vanilla python. With one change that the number of epochs is matched to the setup in the trianing loop (as otherwise no data are collected).

However, as a starting point, I think it ought to be expanded out to include further test cases and any edits made as appropriate to produce the sort of data that is in the /data folder of the repository.

In my case for example, I can open up the data and tensorboard and the profiler correctly display it. But I am getting generic errors that data were not captured for the timesteps involved if I then try to generate my own data. It is not clear what the issue is here, as I have resolved any problems relating to warnings output (and if this script can detect some of those programmatically, then that would be great - e.g. not being able to find the cuda PTI libraries).

Hopefully if a script like this can be expanded, then that would help both users and maintainers with diagnosing issues, especially as it seems the whole TF ecosystem is changing in a way that is breaking links. I think a full reproducible example all the way through from verifying setup through to data generation and then visualisation, that works with GPUs and a user's current installation environment, would be a good thing to add.

Relevant issues: !578 !835 !653 !839 !613

===============================

Add local version of the Colab notebook as a starting point for more reproducible testing for a local installation. This file ideally should be fleshed out to include more tests as part of installation and to ensure that data are being gathered correctly.

…low/tensorboard/blob/master/docs/tensorboard_profiling_keras.ipynb  ) as a starting point for more reproducible testing for a local installation. This file ideally should be fleshed out to include more tests as part of installation and to ensure that data are being gathered correctly.
Copy link

google-cla bot commented Apr 24, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@stellarpower
Copy link
Author

BTW: I assume my use of the profile_batch here may be wrong - but in the notebook, it is a string between '(500,520)' and this logs no data - the data are also batched to a batch size of 128 so that would in theory always be the case. I therefore assumed maybe it was meant to reference how many epochs to log for, but if it is indeed meant to monitor only some examples in each batch, then this will need to be changed, and I think the docs could be clearer on what it means. The use of a string also confuses things a bit, whe na plain tuple would be clearer and amount to the same. The docs for the parameter to the Tensorboard constructor call also claim it is not supported currently but this doesn't appear to be the case(?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant