Knowledge Graph Extractor

A Python tool for extracting knowledge graphs from unstructured text using LLM-based triple extraction. It splits the input into chunks, asks an LLM to extract Subject-Predicate-Object (SPO) triples from each chunk, and returns the triples so they can be assembled into a knowledge graph.

Features

  • Text chunking with configurable overlap
  • LLM-based triple extraction using the OpenAI or DeepSeek API
  • JSON-formatted output
  • Automatic pronoun resolution
  • Error handling and failed chunk tracking
  • Results export to pandas DataFrame

Requirements

  • Python 3.8+
  • OpenAI API key or DeepSeek API credentials
  • Required Python packages (see requirements.txt)

Installation

  1. Clone the repository:
git clone <repository-url>
cd knowledge-graph-extractor
  2. Create and activate a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Configuration

Set your API credentials either through environment variables:

export OPENAI_API_KEY="your-api-key"
export OPENAI_API_BASE="your-api-base-url"  # Optional, for DeepSeek or other providers

Or provide them directly when initializing the extractor:

extractor = KnowledgeGraphExtractor(
    api_key="your-api-key",
    api_base="your-api-base-url"  # Optional
)
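
For example, to route requests to DeepSeek's OpenAI-compatible endpoint, the base URL is typically the one below (an assumption here; confirm against DeepSeek's current documentation):

extractor = KnowledgeGraphExtractor(
    api_key="your-deepseek-api-key",
    api_base="https://api.deepseek.com"  # assumed DeepSeek endpoint; verify before use
)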

Usage

Basic Usage

from extract import KnowledgeGraphExtractor

# Initialize the extractor
extractor = KnowledgeGraphExtractor()

# Process text
text = """
Your unstructured text here...
"""
results = extractor.process_text(text)

# Get results as DataFrame
df = extractor.get_results_dataframe()
print(df)

Running the Example

The repository includes an example script that demonstrates typical usage on a sample text about Albert Einstein:

python example.py

Customizing Chunk Size and Overlap

extractor = KnowledgeGraphExtractor()
extractor.chunk_size = 200  # Adjust chunk size
extractor.overlap = 40      # Adjust overlap
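
With these settings, consecutive chunks share overlap words: the chunker slides a window of chunk_size over the input and steps forward by chunk_size - overlap each time. A minimal sketch of that approach, assuming word-based chunking (the actual implementation lives in extract.py and may differ; chunk_words is an illustrative name):

def chunk_words(text, chunk_size=200, overlap=40):
    # Slide a fixed-size window over the word list; each step advances by
    # chunk_size - overlap, so neighbouring chunks share `overlap` words.
    # Assumes 0 <= overlap < chunk_size.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]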

Handling Results

results = extractor.process_text(text)

# Access extracted triples
triples = results['triples']

# Check for failed chunks
failed_chunks = results['failed_chunks']
for chunk in failed_chunks:
    print(f"Chunk {chunk['chunk_number']} failed: {chunk['error']}")

Testing

The repository includes unit tests to verify the core functionality:

python -m unittest test_extract.py

The tests cover:

  • Text chunking functionality
  • Configuration validation
  • Empty text handling
  • Results DataFrame structure
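
An additional test in the same style might look like this sketch, which relies only on the process_text return shape documented under Output Format below:

import unittest
from extract import KnowledgeGraphExtractor

class TestEmptyText(unittest.TestCase):
    def test_empty_text_yields_no_triples(self):
        extractor = KnowledgeGraphExtractor()
        results = extractor.process_text("")
        # No chunks means nothing to extract and nothing to fail
        self.assertEqual(results['triples'], [])
        self.assertEqual(results['failed_chunks'], [])

if __name__ == '__main__':
    unittest.main()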

Output Format

The extracted triples are returned in the following format:

{
    'triples': [
        {
            'subject': 'entity1',
            'predicate': 'relation',
            'object': 'entity2',
            'chunk': 1  # chunk number where this triple was found
        },
        # ... more triples
    ],
    'failed_chunks': [
        {
            'chunk_number': 2,
            'error': 'error message',
            'response': 'raw response'
        },
        # ... any failed chunks
    ]
}
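
A quick way to sanity-check a run is to flatten the triples into readable statements; this uses only the keys shown above:

for t in results['triples']:
    print(f"{t['subject']} --{t['predicate']}--> {t['object']}  (chunk {t['chunk']})")

if results['failed_chunks']:
    print(len(results['failed_chunks']), "chunk(s) failed; see 'error' and 'response' for details")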

Project Structure

.
├── extract.py           # Main implementation
├── example.py           # Example usage
├── test_extract.py      # Unit tests
├── requirements.txt     # Dependencies
├── README.md            # Documentation
├── LICENSE              # MIT License
└── .gitignore           # Git ignore file

Limitations

  • API rate limits apply based on your OpenAI/DeepSeek account
  • Processing large texts may take time due to API calls
  • Quality of extraction depends on the LLM model used
  • Text chunks are processed independently, which may miss cross-chunk relationships

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
