This repository contains the code and data for an upcoming article about a workflow to classify English Short Title Catalogue (ESTC) records by genre. A preprint of the article is available at Zenodo. For proprietary reasons, we cannot publicly share all the data sets needed for replication. Those data sets can be shared for replication purposes upon request. Data sets that are called in the scripts but missing from the publicly available version of the repo are marked with X. Calls to libraries and repositories that contain ESTC data we cannot share publicly have been left as they were when the scripts were run.
The article describes the classification workflow, including its structure, in detail. The descriptions here are therefore concise and written to complement and connect to the article.
This folder contains the R scripts related to the article: classification, analysis and tool scripts (e.g. exploratory tools used in the article).
The main script of the workflow to classify ESTC records by genre. Takes steps 1-2 as the starting point and continues with the remaining steps. In addition to the genre data, this script also generates the evaluation samples analysed in the article.
A script that contains the data analyses of the manually annotated samples and the genre data generated by the workflow. Figures 2-5 and table 4 are produced in this script.
Minor polishing and implementation of the supercategories for the genre data.
A script that gathers information about periodical publications and combines that data with the manually annotated documents, executing steps 1 and 2 of the workflow.
Demonstrates how statistical approaches were used to find potentially useful keywords and title ngrams for the genre classification workflow.
This subfolder contains files that relate to the results (e.g. classified ESTC records, manually annotated data) obtained in the article.
The final output, which incorporates the findings of the evaluation into the genre data. One ESTC record per row, with the main category, subcategory and supercategory in their respective columns.
The evaluated sample of records classified by the workflow. In this file the manually annotated records have the source variable value 'not available'; this was fixed in the post-processed version of the genre data.
The evaluated sample of records not classified by the workflow.
The output of the genre classification workflow, with each row corresponding to an ESTC record and the main category and subcategories in their respective columns.
An example of the tables that were used to look for useful ngrams and keywords to include in the workflow. The last column of the table, topic_category_ratio, measures how disproportionately a given keyword (named topic in the table) is concentrated in a main category (first column) in the manually annotated data. The second-to-last column measures the absolute concentration of a keyword in a main category. These measures allow a more focused analysis of keywords that have at least a strong correlational relationship with some main category. For proprietary reasons, we cannot share the full tables that result from such computational analyses in a publicly available repo.
The data about erroneous classifications used to create figures 2 and 3. Each row is a record in the sample, and the columns indicate the real and the assigned category. The table summary_table_of_pairwise_labeling_errors.csv is the starting point for this table.
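The two concentration measures described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the repository's scripts are in R) and is not the authors' implementation. It assumes that the absolute concentration is the share of a keyword's occurrences that fall in a main category, and that topic_category_ratio is that share divided by the category's share of all annotated keyword occurrences. The toy data, the function name `concentration_table` and the column name `topic_category_share` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: (record id, main category, keyword) triples
# standing in for the manually annotated records and their keywords.
annotated = [
    ("r1", "law", "trial"),
    ("r2", "law", "trial"),
    ("r3", "law", "statute"),
    ("r4", "religion", "sermon"),
    ("r5", "religion", "sermon"),
    ("r6", "religion", "trial"),
    ("r7", "poetry", "ode"),
    ("r8", "poetry", "sermon"),
]


def concentration_table(rows):
    """Return one row per (main category, keyword) pair with two measures.

    Assumed definitions (not confirmed by the article):
    - topic_category_share: fraction of the keyword's occurrences that
      fall in the category (absolute concentration).
    - topic_category_ratio: that fraction divided by the category's share
      of all rows (values above 1 mean over-representation).
    """
    pair_counts = Counter((cat, kw) for _, cat, kw in rows)
    keyword_counts = Counter(kw for _, _, kw in rows)
    category_counts = Counter(cat for _, cat, _ in rows)
    total = len(rows)

    table = []
    for (cat, kw), n in pair_counts.items():
        share = n / keyword_counts[kw]
        baseline = category_counts[cat] / total
        table.append(
            {
                "main_category": cat,
                "topic": kw,
                "topic_category_share": share,
                "topic_category_ratio": share / baseline,
            }
        )
    # Most disproportionately concentrated keywords first.
    table.sort(key=lambda r: r["topic_category_ratio"], reverse=True)
    return table


table = concentration_table(annotated)
for row in table:
    print(row)
```

Sorting by the ratio surfaces keywords that are concentrated in one category, while the share column helps to spot cases where a high ratio rests on only a few occurrences.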
Table that includes the rows and columns of table 5, as well as other columns needed to derive the final coverage estimates.
This subfolder contains files used by the classification and analysis pipeline that relate more to the 'process' than to the 'results' of the article. By default, data sets related to ESTC contain one edition per row, and the columns list attributes (id, title, etc.).
The manually annotated ESTC records used in the workflow (step 1). This is combined with the periodicals to create manually_annotated_documents_and_periodicals.csv. Periodical information in this table was not used.
A file that post-processes some of the labels the authors assigned to the manually evaluated samples into the standardised format needed for statistical analysis.
The random sample of documents labelled by the workflow that the two authors evaluated. The annotated version is in the subfolder data_final.
The random sample of documents not labelled by the workflow that the two authors evaluated. The annotated version is in the subfolder data_final.
Files that list ngrams that are at least statistically correlated with a main category, with a field (varies by file) that can be used to filter out ngrams that did not pass the manual evaluation stage.
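As a rough illustration of how such a field could be used to drop ngrams that failed manual evaluation, here is a minimal Python sketch (the repository's own scripts are in R). The inline CSV content, the column names `ngram`, `main_category` and `accepted`, and the TRUE/FALSE encoding are all hypothetical; the real field name varies by file.

```python
import csv
import io

# Hypothetical file contents; the real files and their flag column differ.
NGRAM_CSV = """ngram,main_category,accepted
a sermon preached,religion,TRUE
at the,religion,FALSE
an act for,law,TRUE
"""


def accepted_ngrams(csv_text, flag_field="accepted"):
    """Keep only the rows whose manual-evaluation flag is TRUE."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[flag_field].strip().upper() == "TRUE"]


kept = accepted_ngrams(NGRAM_CSV)
print([row["ngram"] for row in kept])
```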
Full titles of the ESTC records.
The keyword data (ESTC record-keyword pairs) used in the article/workflow.
The documents from steps 1 and 2 of the workflow, combined.
Keywords that were manually selected for classification step 3 of the workflow.
Keywords used to detect periodicals.
Strings in microfilm and bibliographic information fields deemed potentially useful in detecting periodicals. The harmonised version of this file includes a column that indicates whether the string was then manually selected to be an indicator of a related record actually being a periodical.
File that harmonises subcategory information to a more standardised format (e.g. trials is converted to trial).
A table that maps the main categories to supercategories.
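The two lookup tables described above (subcategory harmonisation and the main category to supercategory mapping) could be applied to a classified record roughly as follows. This is a Python sketch for illustration, not the repository's R code. Only the trials to trial conversion comes from the description above; every other mapping entry, the category and supercategory names, the field names and the function `postprocess` are hypothetical.

```python
# Hypothetical lookup tables. Only "trials" -> "trial" is taken from the
# README; the remaining entries are invented for illustration.
SUBCATEGORY_HARMONISATION = {"trials": "trial", "sermons": "sermon"}
SUPERCATEGORY_BY_MAIN = {"law": "society", "religion": "belief"}


def postprocess(record):
    """Return a copy of the record with a harmonised subcategory and an
    added supercategory. Unknown subcategories are kept unchanged, and an
    unmapped main category yields a supercategory of None."""
    sub = record.get("subcategory")
    main = record.get("main_category")
    return {
        **record,
        "subcategory": SUBCATEGORY_HARMONISATION.get(sub, sub),
        "supercategory": SUPERCATEGORY_BY_MAIN.get(main),
    }


example = postprocess(
    {"id": "x1", "main_category": "law", "subcategory": "trials"}
)
print(example)
```

Keeping the mappings in separate tables, as the repository does with its files, means the vocabulary can be revised without touching the classification logic.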
Monograms used in the second round of labeling ESTC records with genre based on title ngrams (step 5).
Keywords used in the second round of labeling ESTC records with genre based on the ESTC keywords (step 6).
Some of the scripts produce these files to describe more accurately the conditions under which the script was last run.
The scripts save the visualisations shown in the article here.