Computational predictive models to assist in the identification, classification, and study of scientific software.
This model is a binary classifier that predicts whether a developer and an author are the same person. It is trained on a dataset of 3000 developer-author pairs that have been annotated as either matching or not matching.
Given a set of developers and a set of authors, we run the model on every possible developer-author pair to predict whether the two are the same person. Only the pairs predicted to match are returned, as a list of MatchedDevAuthor objects, each containing the developer, the author, and the confidence of the prediction.
from sci_soft_models import dev_author_em
devs = [
    dev_author_em.DeveloperDetails(
        username="evamaxfield",
        name="Eva Maxfield Brown",
    ),
    dev_author_em.DeveloperDetails(
        username="nniiicc",
    ),
]
authors = [
    "Eva Brown",
    "Nicholas Weber",
]
matches = dev_author_em.match_devs_and_authors(devs=devs, authors=authors)
print(matches)
# [
#     MatchedDevAuthor(
#         dev=DeveloperDetails(
#             username='evamaxfield',
#             name='Eva Maxfield Brown',
#             email=None,
#         ),
#         author='Eva Brown',
#         confidence=0.9851127862930298
#     )
# ]
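The pairwise matching described above can be illustrated with a small standalone sketch. It does not use the trained classifier: `toy_score` is a placeholder based on plain string similarity, and `Dev`, `Match`, `match_all_pairs`, and the `threshold` parameter are hypothetical names introduced only for this illustration. The point is the control flow: score every developer-author pair, then keep only the pairs judged to match.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import product
from typing import List, Optional


@dataclass
class Dev:
    """Hypothetical stand-in for DeveloperDetails (not the package class)."""

    username: str
    name: Optional[str] = None


@dataclass
class Match:
    """Hypothetical stand-in for MatchedDevAuthor (not the package class)."""

    dev: Dev
    author: str
    confidence: float


def toy_score(dev: Dev, author: str) -> float:
    """Placeholder scorer: string similarity, NOT the trained model."""
    candidate = dev.name or dev.username
    return SequenceMatcher(None, candidate.lower(), author.lower()).ratio()


def match_all_pairs(
    devs: List[Dev], authors: List[str], threshold: float = 0.5
) -> List[Match]:
    """Score every (developer, author) pair and keep only likely matches."""
    matches = []
    for dev, author in product(devs, authors):
        score = toy_score(dev, author)
        # The threshold is an assumption made for this sketch.
        if score >= threshold:
            matches.append(Match(dev=dev, author=author, confidence=score))
    return matches


devs = [
    Dev(username="evamaxfield", name="Eva Maxfield Brown"),
    Dev(username="nniiicc"),
]
authors = ["Eva Brown", "Nicholas Weber"]

matches = match_all_pairs(devs, authors)
for match in matches:
    print(match.dev.username, "<->", match.author)
```

Because every developer is compared against every author, the number of model calls grows with the product of the two list sizes, which is worth keeping in mind for large repositories or long author lists.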
This model was originally created and managed as part of rs-graph. To regenerate the dataset for annotation, run the following steps:
git clone https://github.com/evamaxfield/rs-graph.git
cd rs-graph
git checkout c1d8ec89
pip install -e .
rs-graph-modeling create-developer-author-em-dataset-for-annotation