GitHub - manasgaur/CLPsych2022: Computational Psychology Shared Task organized with NAACL 2022

CLPsych 2022 Shared Task Structure

Organize the files (Next 2 weeks)
Task : Predict the moments of change in the posts made by a user. The following files are needed to create baseline models:
- data_reader.py : Should take as input a path to a training dataset containing all the timelines
- evaluator.py : Custom functions for Precision, Recall, F1-Score, and other relevant metrics
- utils.py : Stores the results.
- model_building.py : Deep language models specific to the task. NumPy and Torch are acceptable
Each data point would be an array of:
- Timeline ID
- Post ID
- User ID
- Date
- Label : ['IS', 'IE', 'O']
- Text : Post made by a user at a particular instant of time.
While assessing the baseline models, we are specifically interested in 'IS' and 'IE' labels.
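The data point layout above can be sketched as a small record type. This is only an illustration: the class name `PostRecord`, the field names, and the sample posts are assumptions, not the repository's actual data structures.

```python
from dataclasses import dataclass

# Labels named in the task description; "IS" and "IE" are the ones of interest.
LABELS = ("IS", "IE", "O")
CHANGE_LABELS = ("IS", "IE")


@dataclass
class PostRecord:
    """One data point. Field names are illustrative, not the repository's."""
    timeline_id: str
    post_id: str
    user_id: str
    date: str
    label: str
    text: str

    def __post_init__(self):
        # Reject labels outside the three values listed in the task.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")

    @property
    def is_change(self) -> bool:
        return self.label in CHANGE_LABELS


def change_posts(records):
    """Keep only the posts labelled 'IS' or 'IE'."""
    return [r for r in records if r.is_change]


# Made-up sample timeline, for illustration only.
records = [
    PostRecord("t1", "p1", "u1", "2020-01-01", "O", "Feeling okay today."),
    PostRecord("t1", "p2", "u1", "2020-01-05", "IS", "Everything changed overnight."),
    PostRecord("t1", "p3", "u1", "2020-01-09", "IE", "It keeps getting worse."),
]
print([r.post_id for r in change_posts(records)])
```

Validating the label at construction time keeps malformed rows from silently reaching the evaluation step.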

Usage

Loading dataset

You can load a dataset by passing its file name to csv_reader.

from data_reader import csv_reader
dataset = csv_reader("data/sample.csv")
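For readers without the repository at hand, the following is a minimal sketch of what a reader of this kind might do, using only the standard library. The function name `read_timeline_rows` and the column names are assumptions; the actual `csv_reader` and the header of `data/sample.csv` may differ.

```python
import csv
import io

# Assumed column names; the real header in data/sample.csv may differ.
EXPECTED_COLUMNS = ["timeline_id", "post_id", "user_id", "date", "label", "text"]


def read_timeline_rows(handle):
    """Read rows from an open CSV file and check that the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [dict(row) for row in reader]


# An in-memory file stands in for a CSV on disk.
sample = io.StringIO(
    "timeline_id,post_id,user_id,date,label,text\n"
    "t1,p1,u1,2020-01-01,O,Feeling okay today.\n"
    't1,p2,u1,2020-01-05,IS,"Everything changed, overnight."\n'
)
rows = read_timeline_rows(sample)
print(len(rows), rows[1]["label"])
```

Using `csv.DictReader` rather than splitting on commas keeps post text that contains commas intact, as in the second sample row.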

Create embeddings

Select the embeddings model type. This repository provides three ways to compute numerical embeddings of the textual data: (a) TF-IDF, (b) Sentence Transformer, and (c) GloVe. You can use any of these built-in embedding methods or introduce your own by adding another if-else block inside the call function. For instance, BERT embeddings can be used as described in https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/ .

from model_embeddings import modelEmbeddings
embeddings_model = modelEmbeddings(model_type='glove')
embeddings = embeddings_model(documents)

Here model_type can take the following values:

glove : GloVe embeddings
tfidf : TF-IDF vectorizer
sentence_transformer : bert-base-uncased pre-trained embeddings; stsb-roberta-large (https://huggingface.co/sentence-transformers/stsb-roberta-large)
Location of other, more complex models that can be used as model_type: https://huggingface.co/sentence-transformers/
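The idea of selecting a method through model_type, and of adding a new method as another branch in the call function, can be sketched as follows. `ToyEmbeddings` and its two branches (`bow`, `char_count`) are hypothetical stand-ins chosen to keep the example dependency-free; they are not the repository's `modelEmbeddings` class or its supported values.

```python
from collections import Counter


class ToyEmbeddings:
    """Hypothetical stand-in showing the if/else dispatch on model_type."""

    def __init__(self, model_type="bow"):
        self.model_type = model_type

    def __call__(self, documents):
        if self.model_type == "bow":
            return self._bag_of_words(documents)
        elif self.model_type == "char_count":
            # A new method is added by extending the chain with another branch.
            return [[float(len(doc))] for doc in documents]
        else:
            raise ValueError(f"unsupported model_type: {self.model_type!r}")

    @staticmethod
    def _bag_of_words(documents):
        # Build a sorted vocabulary, then count each token per document.
        tokenized = [doc.lower().split() for doc in documents]
        vocab = sorted({tok for doc in tokenized for tok in doc})
        vectors = []
        for doc in tokenized:
            counts = Counter(doc)
            vectors.append([float(counts[tok]) for tok in vocab])
        return vectors


documents = ["I feel fine", "I feel worse today"]
print(ToyEmbeddings(model_type="bow")(documents))
```

Raising an error in the final branch makes a misspelled model_type fail immediately instead of returning nothing.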

The difference between TF-IDF and embedding models lies in the engineered features. TF-IDF is a discrete, bag-of-words style representation, whereas embedding models produce continuous semantic representations of words or sentences. A practical way to select the model_type is to compute the similarity between words, then visualize it with t-SNE or a heatmap to judge which model_type gives intuitively sensible word similarity scores.
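As a rough illustration of the similarity check, the sketch below computes a cosine similarity matrix that could then be plotted as a heatmap. The three word vectors are made up for the example; in practice they would come from the chosen model_type.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, suitable as input for a heatmap."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Made-up 3-dimensional vectors, for illustration only.
word_vectors = {
    "sad": [0.9, 0.1, 0.0],
    "unhappy": [0.8, 0.2, 0.1],
    "bicycle": [0.0, 0.1, 0.9],
}
words = list(word_vectors)
matrix = similarity_matrix([word_vectors[w] for w in words])
for word, row in zip(words, matrix):
    print(word, [round(x, 2) for x in row])
```

A model_type whose matrix scores "sad" closer to "unhappy" than to "bicycle" is behaving sensibly on this pair; repeating the check on task-relevant words gives a quick qualitative comparison.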

Loading pre-trained embeddings

from model_embeddings import modelEmbeddings
embeddings_model = modelEmbeddings(model_type='tfidf')
embeddings = embeddings_model(documents, load_path='models/tfidf_vectorizer.pkl')

Saving trained model to custom location

from model_embeddings import modelEmbeddings
embeddings_model = modelEmbeddings(model_type='tfidf')
embeddings = embeddings_model(documents, save_path='models/tfidf_vectorizer.pkl')
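The save and load behaviour can be pictured with a small pickle round trip. This is a sketch under assumptions: the helper names `fit_vocabulary`, `transform`, and `embed` are hypothetical, and a plain token-count vocabulary stands in for the TF-IDF vectorizer so that the example needs no third-party packages.

```python
import os
import pickle
import tempfile


def fit_vocabulary(documents):
    """Learn a sorted vocabulary from the documents (toy stand-in for fitting)."""
    return sorted({tok for doc in documents for tok in doc.lower().split()})


def transform(documents, vocab):
    """Count how often each vocabulary token occurs in each document."""
    return [[doc.lower().split().count(tok) for tok in vocab] for doc in documents]


def embed(documents, save_path=None, load_path=None):
    if load_path is not None:
        # Reuse a previously fitted vocabulary instead of refitting.
        with open(load_path, "rb") as handle:
            vocab = pickle.load(handle)
    else:
        vocab = fit_vocabulary(documents)
        if save_path is not None:
            with open(save_path, "wb") as handle:
                pickle.dump(vocab, handle)
    return transform(documents, vocab)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectorizer.pkl")
    first = embed(["good day", "bad day"], save_path=path)
    second = embed(["bad bad day"], load_path=path)
print(first, second)
```

Loading the fitted state keeps the feature columns identical between training and prediction. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.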

Training

Train a basic SVM classifier and save the trained model to a custom location

from data_reader import csv_reader
from evaluator import evaluator
from train import Classifier
# Load the data only after csv_reader has been imported.
dataset = csv_reader("data/sample.csv")
classifier = Classifier(dataframe=dataset, embeddings_model_type='sentence_transformer', vectorizer_path=None)
model_path = classifier.train_predict()
eval = evaluator(classifier)

Evaluation

Evaluate trained model using evaluator

from evaluator import evaluator
eval = evaluator(classifier)
print(eval.precision())
print(eval.recall())
print(eval.accuracy())
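Since the task focuses on the 'IS' and 'IE' labels, per-class scores are more informative than a single overall number. The following is a minimal sketch of per-class precision, recall, and F1 in plain Python; the function name `per_class_scores` and the sample label lists are illustrative and not taken from `evaluator.py`.

```python
def per_class_scores(gold, predicted, label):
    """Precision, recall, and F1 for one label, treating it as the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    # Guard against division by zero when a label is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up gold and predicted labels, for illustration only.
gold = ["O", "IS", "IS", "IE", "O", "IE"]
predicted = ["O", "IS", "O", "IE", "IE", "IE"]
for label in ("IS", "IE"):
    print(label, per_class_scores(gold, predicted, label))
```

In this toy example 'IS' has perfect precision but misses one of two gold posts, while 'IE' finds both gold posts at the cost of one false positive.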

Prediction

Predictions using a pre-trained model at a given path

from data_reader import csv_reader
from train import Classifier
test_dataset = csv_reader("data/test.csv")
# Pass the test data, not the training dataset, to the classifier.
classifier = Classifier(dataframe=test_dataset, embeddings_model_type='sentence_transformer', vectorizer_path=None)
pred_list, test_list = classifier.predict(model_path='models/svm.pkl')
print(pred_list)

Download Embeddings

This repository stores all your pre-trained or fine-tuned embedding models. We also suggest storing your trained models here (.pkl (pickle), .npy (NumPy), and .hd5 are good formats for storing trained models).

There are two sources from which you can download the GloVe embeddings:

If you are interested in converting GloVe to word2vec, a good resource is https://radimrehurek.com/gensim/scripts/glove2word2vec.html
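The conversion itself is small: the word2vec text format is the GloVe text format with one extra header line giving the vocabulary size and the vector dimension. The sketch below shows that idea in plain Python; the function name `glove_to_word2vec` and the two sample vectors are illustrative, and for real files the gensim script linked above is the safer choice.

```python
import io


def glove_to_word2vec(glove_lines):
    """Return word2vec-format lines: a 'count dim' header plus the GloVe rows."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Each row is a word followed by its vector components.
    dim = len(rows[0].split(" ")) - 1
    for row in rows:
        if len(row.split(" ")) - 1 != dim:
            raise ValueError("inconsistent vector length")
    return [f"{len(rows)} {dim}"] + rows


# Two made-up 3-dimensional vectors in GloVe text format.
glove_text = io.StringIO("sad 0.9 0.1 0.0\nhappy 0.1 0.8 0.2\n")
converted = glove_to_word2vec(glove_text)
print("\n".join(converted))
```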

Download Word2Vec Embeddings: https://radimrehurek.com/gensim/models/word2vec.html

One Stop Shops for Embeddings:

For issues, please email: [email protected]

