Writing a Tutorial
A tutorial in this repository is a Jupyter notebook written for a dataset, either to analyze it in depth or to apply an ML technique to it. Some possibilities for tutorials are as follows:
- A walkthrough of a dataset: go over its main features and use data visualization to tell a story about it
- Correlation/causation analysis
- Time series analysis
- Supervised learning, such as classification, regression, forecasting, etc.
- Unsupervised learning, such as clustering
Make sure the dataset you choose is tabular and has been onboarded by our team. There should be a directory available for that dataset here.
Your tutorial can be anything you want, as long as it shows something interesting about the dataset.
You may start by downloading a copy of the template and uploading it to Colab.
- While Colab offers many cool macros and shortcuts, we ask you not to use them, since these tutorials should also be runnable in Workbench and locally.
- To help the reader follow your tutorial, add enough description in markdown cells before your code cells.
- We encourage you to submit your notebook for code review via a Pull Request on GitHub. This helps us keep track of your progress, and gives you access to the history of your work later on.
- When you are done with your code, download your notebook to your local machine and run it there to make sure it still runs without any issues.
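As a rough aid for the first point above, the sketch below scans a notebook's JSON for lines that only work in Colab. It is not part of this repository's tooling: the function name `find_colab_only_lines` is hypothetical, and the two patterns (imports from `google.colab` and form annotations such as `#@param`) are an assumption about what counts as a Colab-only feature.

```python
import json
import re

# Assumed "Colab-only" patterns for this sketch: imports from the google.colab
# package and Colab form annotations such as `#@param` or `#@title`.
COLAB_PATTERNS = [
    re.compile(r"^\s*(from|import)\s+google\.colab\b"),
    re.compile(r"#\s*@(param|title|markdown)\b"),
]


def find_colab_only_lines(notebook_json):
    """Return (cell_index, line) pairs from code cells that match a pattern."""
    notebook = json.loads(notebook_json)
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores `source` as one string or a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        for line in text.splitlines():
            if any(pattern.search(line) for pattern in COLAB_PATTERNS):
                hits.append((index, line.strip()))
    return hits


# A tiny in-memory notebook, so the sketch runs without any file on disk.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Intro"]},
        {"cell_type": "code",
         "source": ["from google.colab import auth\n", "auth.authenticate_user()\n"]},
        {"cell_type": "code", "source": "threshold = 0.5  #@param {type:\"number\"}"},
        {"cell_type": "code", "source": ["import pandas as pd\n"]},
    ]
})
hits = find_colab_only_lines(example)
```

For a real notebook, the same function could be given the contents of the `.ipynb` file read as text. A clean result does not guarantee the notebook runs elsewhere, so running it locally is still needed.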
Each tutorial requires some metadata, which should be stored in an artifact.yaml file. Here is an example of what the file should look like:
artifact:
  title: The title of your tutorial
  description: A brief description of what the tutorial is about.
  tags:
    - libraries:sklearn,matplotlib
    - ml:classification
    - vertical:government
    - tier:free
The vertical variable is one of healthcare, environment, finance, information, education, retail, government, and manufacturing. tier is one of free or paid, and it is paid only when the tutorial requires some GCP services such as Vertex AI.
Each notebook tutorial requires a test file. We use testbook to write our unit tests. Let's assume we have a simple notebook called my_notebook.ipynb with the following four cells:
# Cell 1
import pandas as pd
from google.cloud import bigquery
# Cell 2
QUERY = 'SELECT * FROM table LIMIT 1000'
# Cell 3
bqclient = bigquery.Client(project='my_project')
dataframe = bqclient.query(QUERY).result().to_dataframe()
# Cell 4
var = 3 + 4
In our test, we want to mock the BigQuery client and avoid making a real request. The trick is to inject a cell before the call to bigquery.Client and mock it there. Here is how to do it using testbook:
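from testbook import testbook


@testbook('./my_notebook.ipynb')
def test_get_details(tb):
    # Inject the mock setup before the cell at index 2 (Cell 3, zero-based),
    # which is the cell that creates the client. `pd` and `bigquery` are
    # already imported by Cell 1 when the injected cell runs.
    tb.inject(
        """
        import mock
        mock_client = mock.MagicMock()
        mock_df = pd.DataFrame()
        mock_df['week'] = range(10)
        mock_df['count'] = 5
        p1 = mock.patch.object(bigquery, 'Client', return_value=mock_client)
        mock_client.query().result().to_dataframe.return_value = mock_df
        p1.start()
        """,
        before=2,
        run=False
    )
    # Run the whole notebook, including the injected cell.
    tb.execute()
    dataframe = tb.get('dataframe')
    assert dataframe.shape == (10, 2)
    var = tb.get('var')
    assert var == 7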
A full example can be found here.
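To see why the injected cell is enough to intercept the query, the patching step can be reproduced outside a notebook. The sketch below uses only the standard library. `RealClient` and the `bigquery` namespace are hypothetical stand-ins for the real library, and a plain list stands in for the DataFrame. It also spells the chain with `return_value` instead of calling `query()` during setup, which is equivalent but leaves the mock's call history clean.

```python
from types import SimpleNamespace
from unittest import mock


class RealClient:
    """Hypothetical stand-in for bigquery.Client; the real one would hit the network."""

    def __init__(self, project):
        raise RuntimeError("a real request would be made here")


# Stand-in for the imported `bigquery` module (an assumption for this sketch).
bigquery = SimpleNamespace(Client=RealClient)

QUERY = "SELECT * FROM table LIMIT 1000"
fake_rows = [{"week": week, "count": 5} for week in range(10)]

mock_client = mock.MagicMock()
# Whatever arguments query() and result() receive, the chain ends in fake_rows.
mock_client.query.return_value.result.return_value.to_dataframe.return_value = fake_rows

patcher = mock.patch.object(bigquery, "Client", return_value=mock_client)
patcher.start()
try:
    # This mirrors Cell 3 of the notebook: no RealClient is ever constructed.
    bqclient = bigquery.Client(project="my_project")
    dataframe = bqclient.query(QUERY).result().to_dataframe()
finally:
    patcher.stop()  # put the original attribute back

client_restored = bigquery.Client is RealClient
```

Because `bigquery.Client` is replaced before Cell 3 runs, the notebook code is unchanged and still ends up with the prepared data. Calling `mock_client.query.assert_called_once_with(QUERY)` afterwards would additionally confirm that the notebook issued the expected query.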