Writing a Tutorial

In this repository, a tutorial is a Jupyter notebook written for a specific dataset, either to analyze it in depth or to apply an ML technique to it. Some possibilities for tutorials are as follows:

A walkthrough of a dataset that goes over its main features and uses data visualization to tell a story about it
Correlation/causation analysis
Time series analysis
Supervised learning, such as classification, regression, forecasting, etc.
Unsupervised learning, such as clustering

Choose a Dataset

Make sure the dataset you choose is tabular and has been onboarded by our team. There should be a directory available for that dataset here.

Write your Tutorial

Your tutorial can be anything you want, as long as it shows something interesting about the dataset.

Colab Development

You may start by downloading a copy of the template and uploading it to Colab.

Development Tips

While Colab offers many convenient macros and shortcuts, we ask you not to use them, since these tutorials should also be runnable in Workbench and locally.
To help readers follow your tutorial more easily, add enough description in markdown cells before your code cells.
We encourage you to submit your notebook for code review via a Pull Request on GitHub. This helps us keep track of your progress and gives you access to the history of your work later on.
When you are done with your code, download your notebook to your local machine and run it there to make sure it still runs without any issues.
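
The last tip can be scripted. The sketch below is one possible way to re-run a downloaded notebook from top to bottom with jupyter nbconvert. It assumes Jupyter is installed locally; the helper names build_execute_command and run_notebook, the 600-second default timeout, and the "_executed" output suffix are illustrative choices, not part of this repository.

```python
import subprocess
from pathlib import Path


def build_execute_command(notebook_path, timeout_s=600):
    """Build a `jupyter nbconvert` command that executes every cell in order.

    Hypothetical helper: the name, the default timeout, and the output
    suffix are assumptions made for this sketch.
    """
    path = Path(notebook_path)
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",          # keep the result as a notebook
        "--execute",                 # run all cells from top to bottom
        f"--ExecutePreprocessor.timeout={timeout_s}",
        "--output", f"{path.stem}_executed",
        str(path),
    ]


def run_notebook(notebook_path):
    """Execute the notebook; raises CalledProcessError if a cell fails."""
    subprocess.run(build_execute_command(notebook_path), check=True)
```

Calling run_notebook("my_notebook.ipynb") should write the executed copy next to the original, leaving the notebook you submit untouched.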

Providing Metadata

Each tutorial requires some metadata, which should be stored in an artifact.yaml file. Here is an example of what the file should look like:

artifact:
  title: The title of your tutorial
  description: A brief description of what the tutorial is about.
  tags:
    - libraries:sklearn,matplotlib
    - ml:classification
    - vertical:government
    - tier:free

The vertical value is one of healthcare, environment, finance, information, education, retail, government, and manufacturing. The tier value is either free or paid; it is paid only when the tutorial requires some GCP services such as Vertex AI.
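
A quick consistency check on the tags can catch typos before review. The sketch below assumes the key:value tag layout from the example above and the allowed values stated in this section. validate_tags is a hypothetical helper, not an existing tool in this repository; it takes the already-parsed tags list so that no YAML library is needed, and it treats both vertical and tier as required, which is also an assumption.

```python
ALLOWED_VERTICALS = {
    "healthcare", "environment", "finance", "information",
    "education", "retail", "government", "manufacturing",
}
ALLOWED_TIERS = {"free", "paid"}


def validate_tags(tags):
    """Return a list of problems found in the `tags` entries of artifact.yaml.

    Hypothetical helper: only `vertical` and `tier` are checked, because
    those are the only tags whose allowed values are listed in this guide.
    """
    problems = []
    parsed = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if not sep:
            problems.append(f"tag {tag!r} is not in key:value form")
            continue
        parsed[key.strip()] = value.strip()

    vertical = parsed.get("vertical")
    if vertical not in ALLOWED_VERTICALS:
        problems.append(f"unknown or missing vertical: {vertical!r}")
    tier = parsed.get("tier")
    if tier not in ALLOWED_TIERS:
        problems.append(f"unknown or missing tier: {tier!r}")
    return problems


# The tags from the example artifact.yaml above.
example_tags = [
    "libraries:sklearn,matplotlib",
    "ml:classification",
    "vertical:government",
    "tier:free",
]
print(validate_tags(example_tags))
```

An empty list means no problems were found; each string in a non-empty result names one offending tag.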

Testing

Each notebook tutorial requires a test file. We use testbook to write our unit tests. Let's assume we have a simple notebook called my_notebook.ipynb with the following four cells:

# Cell 1
import pandas as pd
from google.cloud import bigquery

# Cell 2
QUERY = 'SELECT * FROM table LIMIT 1000'

# Cell 3
bqclient = bigquery.Client(project='my_project')
dataframe = bqclient.query(QUERY).result().to_dataframe()

# Cell 4
var = 3 + 4

In our test, we want to mock the BigQuery client and avoid making a real request. The trick is to inject a cell that patches bigquery.Client before the cell that calls it. Here is how to do it using testbook:

from testbook import testbook

@testbook('./my_notebook.ipynb')
def test_get_details(tb):
    tb.inject(
        """
        from unittest import mock
        mock_client = mock.MagicMock()
        mock_df = pd.DataFrame()
        mock_df['week'] = range(10)
        mock_df['count'] = 5
        p1 = mock.patch.object(bigquery, 'Client', return_value=mock_client)
        mock_client.query().result().to_dataframe.return_value = mock_df
        p1.start()
        """,
        before=2,
        run=False
    )
    tb.execute()
    dataframe = tb.get('dataframe')
    assert dataframe.shape == (10, 2)

    var = tb.get('var')
    assert var == 7
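
The line that sets mock_client.query().result().to_dataframe.return_value relies on how MagicMock chains calls: calling a mock returns the same child mock regardless of the arguments, so a value configured through query() is also what query(QUERY) yields inside the notebook. The stand-alone sketch below illustrates that behaviour with the standard library only. FakeBigQuery is a made-up stand-in for the bigquery module and a plain list replaces the DataFrame, so nothing here touches Google Cloud.

```python
from unittest import mock


class FakeBigQuery:
    """Made-up stand-in for the `bigquery` module (illustration only)."""

    class Client:
        def __init__(self, project):
            raise RuntimeError("a real client would contact Google Cloud")


mock_client = mock.MagicMock()
fake_rows = [{"week": week, "count": 5} for week in range(10)]

# Configure the call chain once, without arguments.
mock_client.query().result().to_dataframe.return_value = fake_rows

# Replace the class so that constructing a client returns our mock.
patcher = mock.patch.object(FakeBigQuery, "Client", return_value=mock_client)
patcher.start()
try:
    client = FakeBigQuery.Client(project="my_project")
    # Passing a query string still reaches the value configured above.
    rows = client.query("SELECT * FROM table LIMIT 1000").result().to_dataframe()
finally:
    patcher.stop()  # undo the patch so later code sees the original class

print(len(rows))
```

Stopping the patcher matters less inside a testbook test, where the notebook kernel is discarded afterwards, but it is a good habit whenever the patch lives in the test process itself.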

A full example can be found here.
