
Summary Project

This repository contains our summarization team's project for the LING 575 course at UW Seattle.

Setup

To set up the project environment, follow the steps below:

  1. Navigate to the "setup" folder using the command line:

    cd setup
  2. Change the permission of the create_env.sh script to make it executable:

    chmod +x create_env.sh
  3. Run the create_env.sh script to create the conda environment:

    ./create_env.sh
  4. Activate the newly created environment:

    conda activate Summary

Components

  • Preprocessing

    • File: src/doc_processor/doc_processing.py
    • This file contains functions and code for preprocessing raw data, including cleaning, formatting, and transforming the data.
  • Content Selection

    • File: src/model/content_selector.py
    • This file contains methods for selecting salient sentences for extractive summarization.
  • Information Ordering

    • File: src/model/information_orderer.py
    • This file contains methods for coherently reordering selected content for inclusion in a summary.
  • Content Realization

    • File: src/model/content_realizer.py
    • This file contains methods for realizing ordered content as a summary.
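Taken together, the four components form a simple pipeline: preprocess, select, order, realize. The sketch below illustrates that data flow with toy stand-in functions; the function names and heuristics here are illustrative assumptions, not the actual implementations in the files listed above.

```python
# Simplified sketch of the four-stage extractive pipeline (illustrative only).
# The real implementations live in src/doc_processor/doc_processing.py
# and src/model/{content_selector,information_orderer,content_realizer}.py.

def preprocess(raw_docs):
    """Clean raw documents and split them into sentences."""
    return [s.strip() for doc in raw_docs for s in doc.split(".") if s.strip()]

def select_content(sentences, num_sentences=2):
    """Pick salient sentences (toy heuristic: the longest ones)."""
    return sorted(sentences, key=len, reverse=True)[:num_sentences]

def order_information(selected):
    """Reorder selected sentences for coherence (toy: keep selection order)."""
    return selected

def realize_content(ordered, max_length=100):
    """Render the ordered sentences as a summary within a word-token budget."""
    summary, used = [], 0
    for sent in ordered:
        tokens = sent.split()
        if used + len(tokens) > max_length:
            break
        summary.append(sent)
        used += len(tokens)
    return ". ".join(summary) + "."

docs = ["A storm hit the coast. Thousands lost power. Crews worked overnight."]
summary = realize_content(order_information(select_content(preprocess(docs))))
print(summary)  # → Crews worked overnight. A storm hit the coast.
```

In the actual system, each stage's behavior is controlled by the corresponding section of config.json rather than hard-coded defaults.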

Scripts

  • scripts/run_main.sh: This script runs the system using the parameters found in config.json. Usage:

    cd scripts
    ./run_main.sh
  • scripts/proc_docset.sh: This script finds, collates, and tokenizes data stored in the corpora/ directory on patas. Note that data/ is not tracked by git because of the size of the files. Usage:

      cd scripts
      ./proc_docset.sh "/mnt/dropbox/23-24/575x/Data/Documents/training/2009/UpdateSumm09_test_topics.xml"  "../data/training"  
      ./proc_docset.sh "/mnt/dropbox/23-24/575x/Data/Documents/devtest/GuidedSumm10_test_topics.xml"  "../data/devtest"
      ./proc_docset.sh "/mnt/dropbox/23-24/575x/Data/Documents/evaltest/GuidedSumm11_test_topics.xml"  "../data/evaltest"

Configs

All arguments for the system are passed through the config file (config.json):

  • Primary Task

    • "document_processing": identifies arguments associated with ingesting and processing the original AQUAINT and AQUAINT-2 files.

      • "data_ingested": if any of the values are set to true, the system will load cached data from data/
      • "input_xml_file": identifies the location on patas for the XML files which identify the training, devtest, and evaltest documents.
      • "output_dir": identifies the directories in which to write out preprocessed files from the corpora
    • "model": arguments associated with the core summarization features of the system.

      • "content_selection": arguments associated with the content selection component.
        • "approach": identifies the content selection approach to use ("tf-idf", "textrank", "topic_focused", or "baseline").
          • "num_sentences_per_doc": max number of sentences to select from each document
          • "similarity_threshold": (textrank only) minimum similarity score for a sentence pair to be included in the graph
          • "model_id": (textrank, topic_focused) the transformers (textrank) or sentence-transformers (topic_focused) model identifier. Recommended models:
            • textrank: bert-base-cased, distilbert-base-cased, etc.
            • topic_focused: paraphrase-distilroberta-base-v1
      • "information_ordering": arguments associated with the information ordering component.
        • "approach": identifies the information ordering approach to use ("random", "TSP", "entity_grid", "baseline").
          • "training_data_path": ("entity_grid") path to the directory containing training data
          • "all_possible_permutations_threshold": ("entity_grid") if selected sentences are under this threshold, calculate all permutations
          • "max_permutations": ("entity_grid") if not under threshold, the maximum number of permutations to search
      • "content_realization": arguments associated with the content realization component.
        • "approach": identifies the content realization approach to use ("simple", "advanced", "generative", "baseline").
          • "max_length": maximum number of word tokens allowed in output
          • "min_length": ("advanced") minimum word tokens required in output
          • "model_id": ("advanced", "generative") transformers or openai identifier for the desired model. Recommended models:
            • advanced: t5-small, t5-base, etc.
            • generative: gpt-3.5-turbo
          • "compression_method": ("advanced") defines the sentence compression method; must be "neural"
          • "do_sample": ("advanced") enables decoding strategies for next token selection
          • "temperature": ("generative") a number between 0 and 1 reflecting the sampling temperature for the generative approach
          • "n": ("generative") the number of desired responses from the OpenAI API.
    • "evaluation": identifies the evaluation metrics and associated output paths for results.

      • "output_dir": identifies the directory where the system summaries are written.
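For illustration, a config.json consistent with the keys above might look like the following sketch. All values, and the exact nesting of keys such as "data_ingested", are assumptions for illustration rather than the project's actual defaults:

```json
{
  "document_processing": {
    "data_ingested": { "training": false, "devtest": false, "evaltest": false },
    "input_xml_file": "/mnt/dropbox/23-24/575x/Data/Documents/devtest/GuidedSumm10_test_topics.xml",
    "output_dir": "data/devtest"
  },
  "model": {
    "content_selection": {
      "approach": "textrank",
      "num_sentences_per_doc": 5,
      "similarity_threshold": 0.3,
      "model_id": "distilbert-base-cased"
    },
    "information_ordering": {
      "approach": "entity_grid",
      "training_data_path": "data/training",
      "all_possible_permutations_threshold": 7,
      "max_permutations": 1000
    },
    "content_realization": {
      "approach": "advanced",
      "max_length": 100,
      "min_length": 10,
      "model_id": "t5-small",
      "compression_method": "neural",
      "do_sample": false
    }
  },
  "evaluation": {
    "output_dir": "outputs"
  }
}
```

The system then picks up this file when run via scripts/run_main.sh.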

License

Distributed under the Apache License 2.0. See LICENSE for more information.

Authors

  • Ben Cote
    • Email: bpc23 at uw.edu
  • Mohamed Elkamhawy
    • Email: mohame at uw.edu
  • Karl Haraldsson
    • Email: kharalds at uw.edu
  • Alyssa Vecht
    • Email: avecht at uw.edu
  • Josh Warzecha
    • Email: jmwar73 at uw.edu
