This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Commit 37f804b

Author: saidbleik
Commit message: update examples folder refs
1 parent 2b92e13, commit 37f804b

File tree

12 files changed: +95 additions, -104 deletions

.amlignore

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 data/
-scenarios/
+examples/

README.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

 In recent years, natural language processing has seen rapid growth in quality and usability, which has helped drive business adoption of artificial intelligence solutions. In the last few years, researchers have been applying newer deep learning methods to NLP, and data scientists have started moving from traditional methods to state-of-the-art DNN algorithms that let them use language models pretrained on large text corpora.

-This repository contains examples and best practices for building natural language processing (NLP) systems, provided as [Jupyter notebooks](scenarios) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
+This repository contains examples and best practices for building natural language processing (NLP) systems, provided as [Jupyter notebooks](examples) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

 ## Overview

examples/embeddings/README.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+# Word Embedding
+
+This folder contains examples and best practices, written in Jupyter notebooks, for training word embeddings on custom data from scratch. There are three typical methods for training word embeddings: [Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf), [GloVe](https://nlp.stanford.edu/pubs/glove.pdf), and [fastText](https://arxiv.org/abs/1607.01759). All three methods provide pretrained models ([pretrained model with Word2Vec](https://code.google.com/archive/p/word2vec/), [pretrained model with GloVe](https://github.com/stanfordnlp/GloVe), [pretrained model with fastText](https://fasttext.cc/docs/en/crawl-vectors.html)). These pretrained models are trained on general corpora such as Wikipedia and Common Crawl data and may not serve well when you have a domain-specific language problem or there is no pretrained model for the language you need to work with. In this folder, we provide examples of how to apply each of the three methods to train your own word embeddings.
+
+## What is Word Embedding?
+
+Word embedding is a technique for mapping words or phrases from a vocabulary to vectors of real numbers. The learned vector representations of words capture syntactic and semantic word relationships and can therefore be very useful for tasks like sentence similarity, text classification, etc.
+
+## Summary
+
+|Notebook|Environment|Description|Dataset|
+|---|---|---|---|
+|[Developing Word Embeddings](embedding_trainer.ipynb)|Local|A notebook that shows how to learn word representations with Word2Vec, fastText, and GloVe|[STS Benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#STS_benchmark_dataset_and_companion_dataset)|
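For orientation, here is a minimal sketch of what training embeddings from scratch can look like with gensim. The corpus, hyperparameters, and the choice of gensim itself are illustrative assumptions rather than the contents of the referenced notebook, and GloVe (which gensim does not implement) is omitted.

```python
# Minimal sketch (assumed setup): train Word2Vec and fastText vectors on a toy
# tokenized corpus with gensim. The referenced embedding_trainer.ipynb may use
# different tooling, data, and parameters.
from gensim.models import FastText, Word2Vec

# Toy corpus; replace with tokenized sentences from your domain-specific data.
sentences = [
    ["natural", "language", "processing", "with", "embeddings"],
    ["word", "vectors", "capture", "semantic", "relationships"],
    ["train", "embeddings", "on", "domain", "specific", "text"],
]

# Word2Vec skip-gram trained from scratch (older gensim 3.x names these size=/iter=).
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# fastText additionally learns character n-gram (subword) information.
ft = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=10)

print(w2v.wv["embeddings"][:5])           # first dimensions of a learned vector
print(ft.wv.most_similar("embeddings"))   # nearest neighbours in the fastText space
```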

examples/question_answering/question_answering_system_bidaf_quickstart.ipynb

Lines changed: 1 addition & 1 deletion
@@ -176,7 +176,7 @@
    "metadata": {},
    "source": [
     "This step downloads the pre-trained [AllenNLP](https://allennlp.org/models) model and registers it in our Workspace. The pre-trained AllenNLP model we use is called Bidirectional Attention Flow for Machine Comprehension ([BiDAF](https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/007ab5528b3bd310a80d553cccad4b78dc496b02\n",
-    ")). It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BiDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BiDAF deep dive notebook](https://github.com/microsoft/nlp/blob/courtney-bidaf/scenarios/question_answering/bidaf_deep_dive.ipynb\n",
+    ")). It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BiDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BiDAF deep dive notebook](https://github.com/microsoft/nlp/examples/question_answering/bidaf_deep_dive.ipynb\n",
     ") for more information on this algorithm and AllenNLP implementation."
    ]
   },
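For context on the model being registered, below is a hedged sketch of querying AllenNLP's pre-trained BiDAF model locally. The archive URL and output key are assumptions based on AllenNLP's public BiDAF release from that era; the quickstart notebook itself registers and deploys the model through an Azure ML Workspace instead.

```python
# Sketch (assumed): load AllenNLP's published BiDAF archive and ask a question.
# The model URL is an assumption, not taken from this notebook.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://allennlp.s3.amazonaws.com/models/bidaf-model-2017.09.15-charpad.tar.gz"
)
result = predictor.predict(
    passage="BiDAF matches the question against the passage with a "
            "bidirectional attention flow mechanism.",
    question="What mechanism does BiDAF use?",
)
print(result["best_span_str"])  # extracted answer span from the passage
```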

examples/sentence_similarity/gensen_local.ipynb

Lines changed: 5 additions & 5 deletions
@@ -94,7 +94,7 @@
    "from utils_nlp.dataset import snli, preprocess\n",
    "from utils_nlp.models.pretrained_embeddings.glove import download_and_extract\n",
    "from utils_nlp.dataset import Split\n",
-   "from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
+   "from examples.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
    "\n",
    "print(\"System version: {}\".format(sys.version))"
   ]
@@ -602,21 +602,21 @@
   "text": [
    "/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.8 and num_layers=1\n",
    "  \"num_layers={}\".format(dropout, num_layers))\n",
-   "../../scenarios/sentence_similarity/gensen_train.py:431: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
+   "../../examples/sentence_similarity/gensen_train.py:431: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
    "  torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)\n",
    "../../utils_nlp/models/gensen/utils.py:364: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
    "  Variable(torch.LongTensor(sorted_src_lens), volatile=True)\n",
    "/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/functional.py:1332: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.\n",
    "  warnings.warn(\"nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.\")\n",
    "/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/functional.py:1320: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.\n",
    "  warnings.warn(\"nn.functional.tanh is deprecated. Use torch.tanh instead.\")\n",
-   "../../scenarios/sentence_similarity/gensen_train.py:523: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
+   "../../examples/sentence_similarity/gensen_train.py:523: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
    "  torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)\n",
    "/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/horovod/torch/__init__.py:163: UserWarning: optimizer.step(synchronize=True) called after optimizer.synchronize(). This can cause training slowdown. You may want to consider using optimizer.step(synchronize=False) if you use optimizer.synchronize() in your code.\n",
    "  warnings.warn(\"optimizer.step(synchronize=True) called after \"\n",
-   "../../scenarios/sentence_similarity/gensen_train.py:243: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
+   "../../examples/sentence_similarity/gensen_train.py:243: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
    "  f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)\n",
-   "../../scenarios/sentence_similarity/gensen_train.py:262: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
+   "../../examples/sentence_similarity/gensen_train.py:262: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
    "  f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)\n"
   ]
  },
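The warnings captured above come from PyTorch APIs that were already deprecated when the notebook ran. As context only (this commit does not touch gensen_train.py), here is a small sketch of the modern equivalents:

```python
# Context sketch: modern replacements for the deprecated calls in the warnings.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 3)
class_logits = model(torch.randn(4, 8))
loss = F.cross_entropy(class_logits, torch.tensor([0, 1, 2, 0]))
loss.backward()

# clip_grad_norm -> clip_grad_norm_ (in-place variant with trailing underscore)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

# F.softmax now requires an explicit dim; F.sigmoid/F.tanh -> torch.sigmoid/torch.tanh
probs = F.softmax(class_logits, dim=-1)
predictions = probs.detach().cpu().numpy().argmax(axis=-1)

# Variable(..., volatile=True) -> torch.no_grad() context for inference
with torch.no_grad():
    _ = model(torch.randn(2, 8))
```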

examples/sentence_similarity/gensen_wrapper.py

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 import json
 import os

-from scenarios.sentence_similarity.gensen_train import train
+from examples.sentence_similarity.gensen_train import train
 from utils_nlp.eval.classification import compute_correlation_coefficients
 from utils_nlp.models.gensen.create_gensen_model import (
     create_multiseq2seq_model,

scenarios/embeddings/README.md

Lines changed: 0 additions & 30 deletions
This file was deleted.

setup.py

Lines changed: 10 additions & 20 deletions
@@ -6,34 +6,29 @@
 import io

 import re
-from glob import glob
-from os.path import basename, dirname, join, splitext
+from os.path import dirname, join

-from setuptools import find_packages, setup
+from setuptools import setup
 from setuptools_scm import get_version

 # Determine semantic versioning automatically
 # from git commits
 __version__ = get_version()

+
 def read(*names, **kwargs):
-    with io.open(
-        join(dirname(__file__), *names),
-        encoding=kwargs.get("encoding", "utf8"),
-    ) as fh:
+    with io.open(join(dirname(__file__), *names), encoding=kwargs.get("encoding", "utf8")) as fh:
         return fh.read()


 setup(
     name="utils_nlp",
-    version = __version__,
+    version=__version__,
     license="MIT License",
     description="NLP Utility functions that are used for best practices in building state-of-the-art NLP methods and scenarios. Developed by Microsoft AI CAT",
     long_description="%s\n%s"
     % (
-        re.compile("^.. start-badges.*^.. end-badges", re.M | re.S).sub(
-            "", read("README.md")
-        ),
+        re.compile("^.. start-badges.*^.. end-badges", re.M | re.S).sub("", read("README.md")),
         re.sub(":[a-z]+:`~?(.*?)`", r"``\1``", read("CONTRIBUTING.md")),
     ),
     author="AI CAT",
@@ -68,16 +63,11 @@ def read(*names, **kwargs):
         "Documentation": "https://github.com/microsoft/nlp/",
         "Issue Tracker": "https://github.com/microsoft/nlp/issues",
     },
-    keywords=[
-        "Microsoft NLP",
-        "Natural Language Processing",
-        "Text Processing",
-        "Word Embedding",
-    ],
+    keywords=["Microsoft NLP", "Natural Language Processing", "Text Processing", "Word Embedding"],
     python_requires=">=3.6",
-    install_requires=['setuptools_scm>=3.2.0',],
+    install_requires=["setuptools_scm>=3.2.0"],
     dependency_links=[],
     extras_require={},
-    use_scm_version = {"root": ".", "relative_to": __file__},
-    setup_requires=['setuptools_scm'],
+    use_scm_version={"root": ".", "relative_to": __file__},
+    setup_requires=["setuptools_scm"],
 )
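As a note on the versioning used here, setup.py derives the package version from git metadata via setuptools_scm. A small hedged check is shown below; the tag and resulting version string in the comment are hypothetical.

```python
# Sketch: setuptools_scm computes a PEP 440 version from git tags/commits.
# Assuming a hypothetical tag "v1.2.0" plus later commits, this prints
# something like "1.2.1.dev3+g37f804b". Run from the repository root.
from setuptools_scm import get_version

print(get_version(root=".", relative_to=__file__))
```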

tests/integration/test_notebooks_question_answering.py

Lines changed: 40 additions & 36 deletions
@@ -11,45 +11,49 @@

 @pytest.mark.integration
 @pytest.mark.azureml
-def test_bidaf_deep_dive(notebooks,
-                         subscription_id,
-                         resource_group,
-                         workspace_name,
-                         workspace_region):
+def test_bidaf_deep_dive(
+    notebooks, subscription_id, resource_group, workspace_name, workspace_region
+):
     notebook_path = notebooks["bidaf_deep_dive"]
-    pm.execute_notebook(notebook_path,
-                        OUTPUT_NOTEBOOK,
-                        parameters = {'NUM_EPOCHS':2,
-                                      'config_path': "tests/ci",
-                                      'PROJECT_FOLDER': "scenarios/question_answering/bidaf-question-answering",
-                                      'SQUAD_FOLDER': "scenarios/question_answering/squad",
-                                      'LOGS_FOLDER': "scenarios/question_answering/",
-                                      'BIDAF_CONFIG_PATH': "scenarios/question_answering/",
-                                      'subscription_id': subscription_id,
-                                      'resource_group': resource_group,
-                                      'workspace_name': workspace_name,
-                                      'workspace_region': workspace_region})
+    pm.execute_notebook(
+        notebook_path,
+        OUTPUT_NOTEBOOK,
+        parameters={
+            "NUM_EPOCHS": 2,
+            "config_path": "tests/ci",
+            "PROJECT_FOLDER": "examples/question_answering/bidaf-question-answering",
+            "SQUAD_FOLDER": "examples/question_answering/squad",
+            "LOGS_FOLDER": "examples/question_answering/",
+            "BIDAF_CONFIG_PATH": "examples/question_answering/",
+            "subscription_id": subscription_id,
+            "resource_group": resource_group,
+            "workspace_name": workspace_name,
+            "workspace_region": workspace_region,
+        },
+    )
     result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["validation_EM"]
     assert result == pytest.approx(0.5, abs=ABS_TOL)


 @pytest.mark.usefixtures("teardown_service")
 @pytest.mark.integration
 @pytest.mark.azureml
-def test_bidaf_quickstart(notebooks,
-                          subscription_id,
-                          resource_group,
-                          workspace_name,
-                          workspace_region):
+def test_bidaf_quickstart(
+    notebooks, subscription_id, resource_group, workspace_name, workspace_region
+):
     notebook_path = notebooks["bidaf_quickstart"]
-    pm.execute_notebook(notebook_path,
-                        OUTPUT_NOTEBOOK,
-                        parameters = {'config_path': "tests/ci",
-                                      'subscription_id': subscription_id,
-                                      'resource_group': resource_group,
-                                      'workspace_name': workspace_name,
-                                      'workspace_region': workspace_region,
-                                      'webservice_name': "aci-test-service"})
+    pm.execute_notebook(
+        notebook_path,
+        OUTPUT_NOTEBOOK,
+        parameters={
+            "config_path": "tests/ci",
+            "subscription_id": subscription_id,
+            "resource_group": resource_group,
+            "workspace_name": workspace_name,
+            "workspace_region": workspace_region,
+            "webservice_name": "aci-test-service",
+        },
+    )
     result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["answer"]
     assert result == "Bi-Directional Attention Flow"

@@ -64,12 +68,12 @@ def test_bert_qa_runs(notebooks):
         OUTPUT_NOTEBOOK,
         parameters=dict(
             AZUREML_CONFIG_PATH="./tests/integration/.azureml",
-            DATA_FOLDER='./tests/integration/squad',
-            PROJECT_FOLDER='./tests/integration/pytorch-transformers',
-            EXPERIMENT_NAME='NLP-QA-BERT-deepdive',
-            BERT_UTIL_PATH='./utils_nlp/azureml/azureml_bert_util.py',
-            EVALUATE_SQAD_PATH = './utils_nlp/eval/evaluate_squad.py',
-            TRAIN_SCRIPT_PATH="./scenarios/question_answering/bert_run_squad_azureml.py",
+            DATA_FOLDER="./tests/integration/squad",
+            PROJECT_FOLDER="./tests/integration/pytorch-transformers",
+            EXPERIMENT_NAME="NLP-QA-BERT-deepdive",
+            BERT_UTIL_PATH="./utils_nlp/azureml/azureml_bert_util.py",
+            EVALUATE_SQAD_PATH="./utils_nlp/eval/evaluate_squad.py",
+            TRAIN_SCRIPT_PATH="./examples/question_answering/bert_run_squad_azureml.py",
             BERT_MODEL="bert-base-uncased",
             NUM_TRAIN_EPOCHS=1.0,
             NODE_COUNT=1,

tests/integration/test_notebooks_sentence_similarity.py

Lines changed: 3 additions & 3 deletions
@@ -43,7 +43,7 @@ def test_gensen_local(notebooks):
         kernel_name=KERNEL_NAME,
         parameters=dict(
             max_epoch=1,
-            config_filepath="scenarios/sentence_similarity/gensen_config.json",
+            config_filepath="examples/sentence_similarity/gensen_config.json",
             base_data_path="data",
         ),
     )
@@ -143,8 +143,8 @@ def test_similarity_gensen_azureml_runs(notebooks):
             AZUREML_CONFIG_PATH="./tests/integration/.azureml",
             UTIL_NLP_PATH="./utils_nlp",
             MAX_EPOCH=1,
-            TRAIN_SCRIPT="./scenarios/sentence_similarity/gensen_train.py",
-            CONFIG_PATH="./scenarios/sentence_similarity/gensen_config.json",
+            TRAIN_SCRIPT="./examples/sentence_similarity/gensen_train.py",
+            CONFIG_PATH="./examples/sentence_similarity/gensen_config.json",
             MAX_TOTAL_RUNS=1,
             MAX_CONCURRENT_RUNS=1,
         ),

tests/notebooks_common.py

Lines changed: 1 addition & 4 deletions
@@ -11,7 +11,4 @@

 def path_notebooks():
     """Returns the path of the notebooks folder"""
-    return os.path.abspath(
-        os.path.join(os.path.dirname(__file__), os.path.pardir, "scenarios")
-    )
-
+    return os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir, "examples"))

utils_nlp/interpreter/README.md

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@ This submodule contains a tool for explaining hidden states of models. It is an

 ## How to use

-We provide a notebook tutorial [here](../../scenarios/interpret_NLP_models/understand_models.ipynb) to help you get started quickly. The main class needed is the `Interpreter` in [Interpreter.py](Interpreter.py). Given any input word embeddings and a forward function $\Phi$ that transforms the word embeddings $\bf x$ to a hidden state $\bf s$, the Interpreter helps understand how much each input word contributes to the hidden state. Suppose the $\Phi$, the input $\bf x$ and the input words are defined as:
+We provide a notebook tutorial [here](../../examples/interpret_NLP_models/understand_models.ipynb) to help you get started quickly. The main class needed is the `Interpreter` in [Interpreter.py](Interpreter.py). Given any input word embeddings and a forward function $\Phi$ that transforms the word embeddings $\bf x$ to a hidden state $\bf s$, the Interpreter helps understand how much each input word contributes to the hidden state. Suppose the $\Phi$, the input $\bf x$ and the input words are defined as:
 ```
 import torch
@@ -63,5 +63,5 @@ which means that the second and forth words are most important to $\Phi$, which

 ## Explain a certain layer in any saved pytorch model

-We provide an example on how to use our method to explain a saved pytorch model (*pre-trained BERT model in our case*) [here](../../scenarios/interpret_NLP_models/understand_models.ipynb).
+We provide an example on how to use our method to explain a saved pytorch model (*pre-trained BERT model in our case*) [here](../../examples/interpret_NLP_models/understand_models.ipynb).
 > NOTE: This result may not be consistent with the result in the paper because we use the pre-trained BERT model directly for simplicity, while the BERT model we use in paper is fine-tuned on a specific dataset like SST-2.
