Created a new pipeline which takes owl files as direct input #1529
base: main
Conversation
Signed-off-by: Vattikuti-Manideep-Sitaram <[email protected]>
Please make sure all the checkboxes are checked:
Walkthrough: Adds a new ontology graph ingestion pipeline module implementing async RDF/OWL parsing into triple dicts and mapping those triples into Cognee-compatible data points.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant Runner as run_ontology_pipeline
    participant Loader as load_ontology_data
    participant Converter as ontology_to_datapoints
    participant Task as ontology_ingestion_task
    participant Store as add_data_points
    Client->>Runner: run_ontology_pipeline(file, format, dataset_name)
    activate Runner
    Runner->>Runner: prune/reset env, init Cognee, create dataset
    Runner->>Loader: load_ontology_data(file, format)
    Loader-->>Runner: triples (list[dict])
    Runner->>Converter: ontology_to_datapoints(triples)
    Converter-->>Runner: list[OntologyEntity] datapoints
    Runner->>Task: ontology_ingestion_task(datapoints, format)
    Task->>Store: add_data_points(datapoints)
    Store-->>Task: ack/result
    loop stream statuses
        Task-->>Runner: status update
        Runner-->>Client: yield status update
    end
    deactivate Runner
```
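The status-streaming loop at the end of the diagram can be sketched as a plain Python async generator. The stage names below are hypothetical stand-ins for illustration, not the actual cognee APIs:

```python
import asyncio

# Hypothetical stand-in for the pipeline stages shown above (not the real cognee APIs).
async def run_ontology_pipeline_sketch(triples):
    # "load" and "convert" collapsed into a trivial mapping step for illustration
    datapoints = [{"id": i, "triple": t} for i, t in enumerate(triples)]
    # the ingestion task streams status updates back to the caller as it processes
    for i, dp in enumerate(datapoints, start=1):
        await asyncio.sleep(0)  # yield control to the event loop, as real I/O would
        yield {"status": "processed", "item": dp["id"], "progress": f"{i}/{len(datapoints)}"}

async def main():
    statuses = []
    async for status in run_ontology_pipeline_sketch(["t1", "t2", "t3"]):
        statuses.append(status)
    return statuses

statuses = asyncio.run(main())
print(statuses[-1]["progress"])  # → 3/3
```

The caller consumes statuses with `async for`, which is why `run_ontology_pipeline` in the PR yields rather than returns.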
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Hello @Vattikuti-Manideep-Sitaram, thank you for submitting a PR! We will respond as soon as possible.
Actionable comments posted: 5
🧹 Nitpick comments (2)
cognee/api/v1/cognify/ontology_graph_pipeline.py (2)
Line 50: Consider preserving RDF triple structure in the text representation.

The simple concatenation `f"{t['subject']} {t['predicate']} {t['object']}"` creates a space-separated string that loses semantic boundaries and makes URIs hard to distinguish. For better downstream processing and readability, consider using N-Triples format or a more structured representation.

Use N-Triples format for clarity:

```diff
-            text = f"{t['subject']} {t['predicate']} {t['object']}"
+            # N-Triples format: <subject> <predicate> <object> .
+            text = f"<{t['subject']}> <{t['predicate']}> <{t['object']}> ."
```

Or preserve structure with delimiters:

```diff
-            text = f"{t['subject']} {t['predicate']} {t['object']}"
+            text = f"Subject: {t['subject']} | Predicate: {t['predicate']} | Object: {t['object']}"
```
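As a quick illustration of the difference between the two representations (the URIs are made-up samples, not data from the PR):

```python
# A sample triple dict in the shape the pipeline produces (URIs are illustrative).
t = {
    "subject": "http://example.org/Alice",
    "predicate": "http://xmlns.com/foaf/0.1/knows",
    "object": "http://example.org/Bob",
}

# Plain concatenation: term boundaries are only whitespace, easy to lose
plain = f"{t['subject']} {t['predicate']} {t['object']}"

# N-Triples serialization: each term is angle-bracket delimited, statement ends with " ."
ntriples = f"<{t['subject']}> <{t['predicate']}> <{t['object']}> ."

print(ntriples)
# → <http://example.org/Alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/Bob> .
```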
Lines 69-70: Move imports to module level.

Importing modules inside a function is a code smell that can cause issues with:
- IDE autocomplete and type checking
- Import-time side effects
- Code readability

Move these imports to the top of the file:

```diff
+from cognee.low_level import setup
+import cognee
+
 from cognee.modules.pipelines import run_tasks
 from cognee.modules.pipelines.tasks.task import Task
```

And remove them from the function:

```diff
 async def run_ontology_pipeline(ontology_file, dataset_name: str = "ontology_dataset"):
     """
     Run the ontology ingestion pipeline directly from a file object (no file path).
     """
-    from cognee.low_level import setup
-    import cognee
-
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
- cognee/api/v1/cognify/ontology_graph_pipeline.py (1 hunks)

🧰 Additional context used
🧬 Code graph analysis (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py (5)
- cognee/modules/pipelines/operations/run_tasks.py (1): run_tasks (62-336)
- cognee/modules/pipelines/tasks/task.py (1): Task (5-97)
- cognee/modules/users/methods/get_default_user.py (1): get_default_user (13-41)
- cognee/modules/data/methods/create_dataset.py (1): create_dataset (10-32)
- cognee/infrastructure/databases/relational/get_relational_engine.py (1): get_relational_engine (5-21)
🔇 Additional comments (2)
cognee/api/v1/cognify/ontology_graph_pipeline.py (2)

Lines 83-84: Pass the format parameter through the pipeline.

If you implement format auto-detection or configurability in `load_ontology_data`, the format should be passed through to `convert_triples_to_chunks` for correct MIME type assignment. After implementing format configurability, update the pipeline to pass the format:

```diff
+async def run_ontology_pipeline(
+    ontology_file,
+    dataset_name: str = "ontology_dataset",
+    format: Optional[str] = None
+):
     ...
     # ✅ Process ontology file directly
-    triples = await load_ontology_data(ontology_file)
-    chunks = convert_triples_to_chunks(triples)
+    triples = await load_ontology_data(ontology_file, format=format)
+    chunks = convert_triples_to_chunks(triples, format=format or "xml")
```
Lines 87-94: LGTM! Task configuration is appropriate.

The task definitions for `extract_graph_from_data` and `add_data_points` with batch size 20 are well-structured and follow the existing Task patterns from the codebase.
```python
    from cognee.low_level import setup
    import cognee

    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    await setup()
```
Remove destructive prune operations or make them optional.

The `prune_data()` and `prune_system()` calls delete all existing data and system metadata without warning. This is unexpected behavior for a data ingestion pipeline and could result in data loss if a user runs this on a populated instance.

Either:
- Remove these calls entirely and let users manage cleanup separately, or
- Make them opt-in via a parameter:
```diff
-async def run_ontology_pipeline(ontology_file, dataset_name: str = "ontology_dataset"):
+async def run_ontology_pipeline(
+    ontology_file,
+    dataset_name: str = "ontology_dataset",
+    clear_existing: bool = False
+):
     """
     Run the ontology ingestion pipeline directly from a file object (no file path).
+
+    Args:
+        ontology_file: File-like object containing RDF data
+        dataset_name: Name for the dataset
+        clear_existing: If True, prune existing data and system metadata before ingestion
     """
     from cognee.low_level import setup
     import cognee

-    await cognee.prune.prune_data()
-    await cognee.prune.prune_system(metadata=True)
+    if clear_existing:
+        await cognee.prune.prune_data()
+        await cognee.prune.prune_system(metadata=True)
     await setup()
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
async def run_ontology_pipeline(
    ontology_file,
    dataset_name: str = "ontology_dataset",
    clear_existing: bool = False
):
    """
    Run the ontology ingestion pipeline directly from a file object (no file path).

    Args:
        ontology_file: File-like object containing RDF data
        dataset_name: Name for the dataset
        clear_existing: If True, prune existing data and system metadata before ingestion
    """
    from cognee.low_level import setup
    import cognee

    if clear_existing:
        await cognee.prune.prune_data()
        await cognee.prune.prune_system(metadata=True)
    await setup()
```
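The opt-in guard suggested above can be demonstrated with stand-in prune functions; the names below are illustrative placeholders, not the cognee API:

```python
# Stand-ins for destructive operations, to show the opt-in guard (illustrative only).
destroyed = []

def prune_data():
    destroyed.append("data")

def prune_system():
    destroyed.append("system")

def run_pipeline(clear_existing: bool = False):
    # Default is non-destructive; pruning requires an explicit opt-in.
    if clear_existing:
        prune_data()
        prune_system()
    return "ok"

run_pipeline()                     # default call: nothing is pruned
assert destroyed == []

run_pipeline(clear_existing=True)  # explicit opt-in: pruning runs
print(destroyed)  # → ['data', 'system']
```

Defaulting the flag to `False` means existing callers keep their data unless they explicitly ask for a reset.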
🤖 Prompt for AI Agents
In cognee/api/v1/cognify/ontology_graph_pipeline.py around lines 69 to 74, the
code calls cognee.prune.prune_data() and
cognee.prune.prune_system(metadata=True) which perform destructive deletions;
remove these calls or make them opt‑in by adding a parameter (e.g.,
allow_prune=False) to the surrounding function or entry point, wrap the prune
calls in an if allow_prune: block, add a clear warning log before performing
pruning, and ensure the default behavior is non‑destructive (no pruning) so
users must explicitly enable pruning to run those operations.
Followed the same pattern as the code_graph_pipeline code. Would love to get some input from the reviewer before moving ahead.
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Find the code_graph_pipeline file
fd -t f "code_graph_pipeline"
# Search for files containing code_graph_pipeline
rg -l "code_graph_pipeline"
```
Length of output: 210
🏁 Script executed:

```shell
#!/bin/bash
# Examine the code_graph_pipeline.py file, particularly looking for prune operations
cat cognee/api/v1/cognify/code_graph_pipeline.py
```
Length of output: 4106
@Vattikuti-Manideep-Sitaram, you're absolutely right! I can confirm that `code_graph_pipeline.py` has the exact same pattern:

```python
await cognee.prune.prune_data()
await cognee.prune.prune_system(metadata=True)
await setup()
```
This appears to be an established pattern for pipeline operations in the codebase. My concern about the destructive nature of these operations applies equally to both pipelines.
Given that you've followed the existing convention, here are some thoughts for the reviewers/maintainers:
- If this is intentional behavior: Consider adding clear documentation in both pipeline functions warning users that running these pipelines will clear all existing data and system metadata. For example:

```python
async def run_ontology_pipeline(ontology_file, dataset_name: str = "ontology_dataset"):
    """
    Run the ontology ingestion pipeline directly from a file object.

    ⚠️ WARNING: This pipeline clears all existing data and system metadata
    before processing. Ensure you have backups if needed.

    Args:
        ontology_file: File-like object containing RDF data
        dataset_name: Name for the dataset
    """
```
- Alternative approach: If the prune operations are meant for testing/example purposes, consider extracting them to a setup utility that both pipelines can optionally call, or making it configurable across all pipelines consistently.
For the current PR: Since you're following the established pattern, this implementation is consistent with the codebase. However, I'd recommend waiting for maintainer input on whether they want to:
- Keep the current pattern as-is with better documentation
- Refactor both pipelines to make pruning optional
- Keep it as-is without changes
What are your thoughts, reviewers?
✏️ Learnings added
Learnt from: Vattikuti-Manideep-Sitaram
PR: topoteretes/cognee#1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.585Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Actionable comments posted: 3
🧹 Nitpick comments (5)
cognee/api/v1/cognify/ontology_graph_pipeline.py (5)
Line 20: Add a type hint for the ontology_file parameter.

The `ontology_file` parameter lacks a type hint. Consider adding a type annotation to improve code clarity and IDE support. Apply this diff:

```diff
-async def load_ontology_data(ontology_file) -> list[dict]:
+async def load_ontology_data(ontology_file: Union[str, bytes, "BinaryIO"]) -> list[dict]:
```

Note: You may need to import `BinaryIO` from `typing` if using that type hint.
Line 54: Chunk size calculation may be inaccurate for RDF triples.

The `len(text.split())` calculation splits on whitespace, so each URI counts as a single token while a multi-word literal inflates the count; either way the result does not reflect the triple's three semantic components. Consider using a more accurate calculation method for RDF triples. Apply this diff:

```diff
-            chunk_size=len(text.split()),
+            chunk_size=3,  # Each triple has exactly 3 components (subject, predicate, object)
```
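A quick check of why whitespace token counts are misleading for triples (the sample values are made up for illustration):

```python
# Three URIs: split() happens to return 3 tokens here
uri_text = "http://example.org/subject http://example.org/predicate http://example.org/object"
print(len(uri_text.split()))  # → 3

# But a multi-word literal object inflates the count
literal_text = "http://example.org/book http://purl.org/dc/terms/title Moby Dick"
print(len(literal_text.split()))  # → 4, though the triple still has 3 components
```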
Line 46: Hardcoded MIME type doesn't reflect the actual input format.

The `mime_type` is hardcoded to `"application/rdf+xml"`, but the input ontology file could be in Turtle, N3, or other RDF formats (especially if the suggested format parameter is added to `load_ontology_data`). Consider parameterizing the MIME type or deriving it from the actual input format. If you add a `format` parameter to `load_ontology_data`, pass it through to this function:

```python
def convert_triples_to_chunks(triples: list[dict], format: str = "xml") -> list[DocumentChunk]:
    # Map format to MIME type
    mime_type_map = {
        "xml": "application/rdf+xml",
        "turtle": "text/turtle",
        "n3": "text/n3",
        "nt": "application/n-triples",
        # ... add more as needed
    }
    mime_type = mime_type_map.get(format, "application/rdf+xml")
    ontology_doc = Document(
        # ...
        mime_type=mime_type
    )
```
Lines 69-70: Consider moving imports to module level.

Imports are placed inside the function, which is unconventional. While this can be useful for avoiding circular imports or lazy loading, it's generally better to place imports at the module level for clarity and consistency. Apply this diff to move imports to the top of the file:

```diff
+from cognee.low_level import setup
+import cognee
+
 # ---------- Step 3: Run Ontology Pipeline ----------
 async def run_ontology_pipeline(ontology_file, dataset_name: str = "ontology_dataset"):
     """
     Run the ontology ingestion pipeline directly from a file object (no file path).
     """
-    from cognee.low_level import setup
-    import cognee
-
     await cognee.prune.prune_data()
```

Only keep imports inside the function if there's a specific reason (e.g., circular import issues).
Lines 87-94: Consider parameterizing the task batch size.

The `batch_size` is hardcoded to 20 for both tasks. While this may be a reasonable default, different ontologies may benefit from different batch sizes based on their size and complexity. Apply this diff:

```diff
-async def run_ontology_pipeline(ontology_file, dataset_name: str = "ontology_dataset"):
+async def run_ontology_pipeline(
+    ontology_file,
+    dataset_name: str = "ontology_dataset",
+    batch_size: int = 20
+):
     """
     Run the ontology ingestion pipeline directly from a file object (no file path).
+
+    Args:
+        ontology_file: File-like object containing RDF/OWL ontology data
+        dataset_name: Name for the dataset to create
+        batch_size: Number of chunks to process per batch
     """
     # ... existing code ...

     # Define pipeline tasks
     tasks = [
         Task(
             extract_graph_from_data,
             graph_model=KnowledgeGraph,
-            task_config={"batch_size": 20},
+            task_config={"batch_size": batch_size},
         ),
-        Task(add_data_points, task_config={"batch_size": 20}),
+        Task(add_data_points, task_config={"batch_size": batch_size}),
     ]
```
🔇 Additional comments (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py (1)

Lines 1-16: LGTM!

The imports are appropriate for the ontology ingestion pipeline functionality.
Signed-off-by: Vattikuti-Manideep-Sitaram <[email protected]>
…b.com/Vattikuti-Manideep-Sitaram/cognee into feature/owf-or-rdf-files-direct-input
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/api/v1/cognify/ontology_graph_pipeline.py (4)
Lines 1-2: Remove unused imports.

The imports `asyncio` and `Optional` are not used in this file. Apply this diff:

```diff
-import asyncio
-from typing import List, Union, Optional
+from typing import List, Union
```
Lines 27-40: Consider validating a non-empty graph.

The function doesn't check if the parsed graph contains any triples. An empty or invalid ontology file would return an empty list without warning, potentially causing confusion downstream. Consider adding a check after parsing:

```diff
     g = Graph()
     g.parse(ontology_file, format="xml")  # Works for standard OWL/RDF/XML

+    if len(g) == 0:
+        raise ValueError("Ontology file contains no triples")
+
     triples = []
```
Lines 44-71: Add input validation.

The function doesn't validate that the `triples` list is non-empty. An empty list would create an in-memory Document but no chunks, which may not be the intended behavior. Consider adding validation:

```diff
 def convert_triples_to_chunks(triples: list[dict]) -> list[DocumentChunk]:
     """
     Convert ontology triples into Cognee-compatible DocumentChunk objects.
     """
+    if not triples:
+        raise ValueError("Cannot convert empty triples list to chunks")
+
     chunks = []
```
Line 107: Consider making the pipeline name configurable.

The pipeline name is hardcoded as `"ontology_pipeline"`. For flexibility, consider adding it as an optional parameter. Apply this diff:

```diff
-async def run_ontology_pipeline(ontology_file, dataset_name: str = "ontology_dataset"):
+async def run_ontology_pipeline(
+    ontology_file,
+    dataset_name: str = "ontology_dataset",
+    pipeline_name: str = "ontology_pipeline"
+):
     """
     Run the ontology ingestion pipeline directly from a file object (no file path).
     """
     # ... rest of function ...

     # Run tasks with chunks
-    async for run_status in run_tasks(tasks, dataset.id, chunks, user, "ontology_pipeline"):
+    async for run_status in run_tasks(tasks, dataset.id, chunks, user, pipeline_name):
         yield run_status
```
🧠 Learnings (1)
📚 Learning: 2025-10-11T04:18:24.585Z
Learnt from: Vattikuti-Manideep-Sitaram
PR: topoteretes/cognee#1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.585Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.
Applied to files:
cognee/api/v1/cognify/ontology_graph_pipeline.py
Signed-off-by: Vattikuti-Manideep-Sitaram <[email protected]>
Actionable comments posted: 2
🧹 Nitpick comments (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py (1)

Lines 91-124: Consider adding error handling for better diagnostics.

While not critical, wrapping the pipeline in a try-except block would improve error messages and make debugging easier. The revised function would look like:

```python
async def run_ontology_pipeline(
    ontology_file,
    dataset_name: str = "ontology_dataset",
    format: Optional[str] = None,
):
    """
    Run the ontology ingestion pipeline directly from a file object (no file path).

    Args:
        ontology_file: File-like object containing RDF/OWL data
        dataset_name: Name for the dataset to create
        format: RDF serialization format (xml, turtle, n3, json-ld, etc.).
                If None, rdflib will attempt auto-detection.
    """
    try:
        from cognee.low_level import setup
        import cognee

        await cognee.prune.prune_data()
        await cognee.prune.prune_system(metadata=True)
        await setup()

        user = await get_default_user()
        db_engine = get_relational_engine()

        async with db_engine.get_async_session() as session:
            dataset = await create_dataset(dataset_name, user, session)

        # ✅ Process ontology file directly
        triples = await load_ontology_data(ontology_file, format=format)

        if not triples:
            raise ValueError("No triples found in ontology file")

        chunks = convert_triples_to_chunks(triples, format=format or "xml")

        # Define pipeline tasks
        tasks = [
            Task(
                extract_graph_from_data,
                graph_model=KnowledgeGraph,
                task_config={"batch_size": 20},
            ),
            Task(add_data_points, task_config={"batch_size": 20}),
        ]

        # Run tasks with chunks
        async for run_status in run_tasks(tasks, dataset.id, chunks, user, "ontology_pipeline"):
            yield run_status
    except Exception as e:
        import logging
        logging.error(f"Ontology pipeline failed: {str(e)}")
        raise RuntimeError(f"Failed to process ontology file: {str(e)}") from e
```
🔇 Additional comments (2)
cognee/api/v1/cognify/ontology_graph_pipeline.py (2)

Lines 51-87: LGTM! Well-structured triple-to-chunk conversion.

The MIME type mapping is comprehensive, and the chunk creation logic correctly preserves triple metadata. The format parameter enables proper content-type handling.

Lines 98-100: Prune operations follow the established codebase pattern.

The prune calls are consistent with other pipelines (e.g., `code_graph_pipeline.py`) and appear to be intentional for pipeline initialization. Based on learnings.
```python
async def load_ontology_data(ontology_file: Union[str, bytes, "IOBase"]) -> list[dict]:
    """
    Loads OWL/RDF ontology directly from a file-like object and extracts RDF triples.

    Args:
        ontology_file: File-like object or path to RDF data
        format: RDF serialization format (xml, turtle, n3, json-ld, etc.).
                If None, rdflib will attempt auto-detection.
    """
    g = Graph()
    try:
        g.parse(ontology_file, format=format)
    except Exception as e:
        raise ValueError(f"Failed to parse ontology file: {str(e)}") from e
```
CRITICAL: Missing `format` parameter in the function signature.

The function body uses `format` on line 31, but it's not declared as a parameter on line 20. At runtime the name resolves to Python's built-in `format` function, so `g.parse` receives a callable instead of a format string and parsing fails. The docstring (lines 26-27) documents the parameter, but it's missing from the actual signature. Apply this diff to add the missing parameter:

```diff
-async def load_ontology_data(ontology_file: Union[str, bytes, "IOBase"]) -> list[dict]:
+async def load_ontology_data(ontology_file: Union[str, bytes, "IOBase"], format: Optional[str] = None) -> list[dict]:
     """
     Loads OWL/RDF ontology directly from a file-like object and extracts RDF triples.

     Args:
         ontology_file: File-like object or path to RDF data
         format: RDF serialization format (xml, turtle, n3, json-ld, etc.).
                 If None, rdflib will attempt auto-detection.
     """
```

Based on learnings: rdflib 7.2.1 supports auto-detection when format is `None`.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
async def load_ontology_data(
    ontology_file: Union[str, bytes, "IOBase"],
    format: Optional[str] = None
) -> list[dict]:
    """
    Loads OWL/RDF ontology directly from a file-like object and extracts RDF triples.

    Args:
        ontology_file: File-like object or path to RDF data
        format: RDF serialization format (xml, turtle, n3, json-ld, etc.).
                If None, rdflib will attempt auto-detection.
    """
    g = Graph()
    try:
        g.parse(ontology_file, format=format)
    except Exception as e:
        raise ValueError(f"Failed to parse ontology file: {str(e)}") from e
```
🤖 Prompt for AI Agents
In cognee/api/v1/cognify/ontology_graph_pipeline.py around lines 20 to 33, the
function load_ontology_data uses a variable named format but the parameter is
missing from the signature; add a new parameter format: Optional[str] = None to
the function signature (and import typing.Optional if not already imported) so
rdflib can auto-detect when None, and ensure the docstring matches the new
parameter description; keep the default None to preserve rdflib auto-detection
and raise the same ValueError on parse failure.
We definitely need format to parse the graph, right? I don't think it should be an optional field. What is your take on this?
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Find all calls to load_ontology_data in the codebase
rg -n -A3 -B3 'load_ontology_data\(' --type=py
```
🏁 Script executed:

```shell
#!/bin/bash
# Check how run_ontology_pipeline calls load_ontology_data
ast-grep --pattern $'async def run_ontology_pipeline($$$) {
  $$$
  load_ontology_data($$$)
  $$$
}'
```
🌐 Web query:
rdflib Graph.parse format parameter: is format auto-detection reliable? Should format be required or optional?
💡 Result:
Short answer: auto-detection is convenient but not fully reliable — prefer to pass format when you can.
Why:
- rdflib will try to "guess" formats (e.g. by file extension or using rdflib.util.guess_format) and the html parser can auto-detect RDFa/HTurtle/Microdata. This makes format optional in many simple cases. [1][2]
- However, detection is best‑effort and brittle: guess is based on filename/extension or HTTP Content-Type heuristics, and remote content‑type/auto‑detection is explicitly called unreliable in rdflib issues and docs. For JSON‑LD / other formats there are known pitfalls when parsing from URLs vs. strings. [3][4]
Recommendation:
- If you control the input (file, API, or content-type), pass format explicitly (format="turtle" or a mime type) — that is robust and inexpensive. [1][2]
- If you must auto-detect: use rdflib.util.guess_format() first and fall back to trying parsers in a defined order, and handle parse errors (try/except) so you can retry with an explicit format or provide a clear error to the caller. [1][3]
Sources
- rdflib plugin parsers / docs (format names, guess_format, html auto-detect). [1]
- rdflib docs (parsers/plugins). [2]
- rdflib parser implementation / Accept-header and behavior when format not given. [4]
- GitHub issue discussing auto-detection limits (JSON-LD, URL vs string). [3]
@Vattikuti-Manideep-Sitaram, you're absolutely right! Format should be required, not optional.
Here's why:

- Reliability: According to rdflib documentation, format auto-detection is "best-effort and brittle" - it relies on file extensions or HTTP Content-Type headers, which can be unreliable, especially for formats like JSON-LD.
- Explicit is better: When you control the input (which you do in this pipeline), explicitly specifying the format is the robust approach and prevents ambiguous parse failures.
- Your code is already correct: Looking at the actual file, line 20 already has `format: str` as a required parameter:

```python
async def load_ontology_data(ontology_file: Union[str, bytes, "IOBase"],format: str) -> list[dict]:
```

The only minor issue is a missing space after the comma, but the parameter is there and required.
My original suggestion to make it `Optional[str] = None` was incorrect. The current implementation is the right approach - requiring callers to explicitly specify the ontology format ensures reliable parsing and clear error messages when something goes wrong. The only cosmetic fix needed is spacing:

```python
async def load_ontology_data(ontology_file: Union[str, bytes, "IOBase"], format: str) -> list[dict]:
```
✏️ Learnings added
Learnt from: Vattikuti-Manideep-Sitaram
PR: topoteretes/cognee#1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:0-0
Timestamp: 2025-10-13T13:18:46.294Z
Learning: In cognee/api/v1/cognify/ontology_graph_pipeline.py, the `format` parameter in `load_ontology_data` function must be required (not optional) to ensure reliable RDF parsing, as rdflib's auto-detection is unreliable.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Signed-off-by: Vattikuti-Manideep-Sitaram <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py (1)
110-110: Thread the `format` parameter through to maintain consistency.

Line 110 calls `convert_triples_to_chunks(triples)` without passing the `format` parameter. This means if a user specifies `format="turtle"`, the chunks will be created with the default `mime_type="application/rdf+xml"` instead of `mime_type="text/turtle"`, resulting in inconsistent metadata.

Apply this diff:

```diff
 triples = await load_ontology_data(ontology_file,format)
-chunks = convert_triples_to_chunks(triples)
+chunks = convert_triples_to_chunks(triples, format)
```

Also fix the spacing on line 109:

```diff
-triples = await load_ontology_data(ontology_file,format)
+triples = await load_ontology_data(ontology_file, format)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-11T04:18:24.585Z
Learnt from: Vattikuti-Manideep-Sitaram
PR: topoteretes/cognee#1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.585Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.
Applied to files:
cognee/api/v1/cognify/ontology_graph_pipeline.py
🧬 Code graph analysis (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py (5)
- cognee/modules/pipelines/operations/run_tasks.py (1): run_tasks (62-336)
- cognee/modules/pipelines/tasks/task.py (1): Task (5-97)
- cognee/modules/users/methods/get_default_user.py (1): get_default_user (13-41)
- cognee/modules/data/methods/create_dataset.py (1): create_dataset (10-32)
- cognee/infrastructure/databases/relational/get_relational_engine.py (1): get_relational_engine (5-21)
🔇 Additional comments (3)

cognee/api/v1/cognify/ontology_graph_pipeline.py (3)

35-47: LGTM! RDF term type information is preserved.

The implementation correctly preserves semantic information like datatypes, language tags, and term types for downstream processing. This addresses the concern from previous reviews about losing RDF metadata.

51-87: LGTM! Clean implementation with proper MIME type mapping.

The function correctly:
- Maps RDF formats to their appropriate MIME types
- Creates an in-memory Document representation
- Converts each triple into a DocumentChunk with preserved metadata
- Includes a sensible fallback for unknown formats

98-99: Prune operations follow established codebase pattern.

The `prune_data()` and `prune_system()` calls follow the same pattern used in `code_graph_pipeline.py`. This appears to be intentional behavior for pipeline operations in the cognee codebase. Based on learnings.
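The format-to-MIME-type mapping described above can be sketched as a plain dictionary with a fallback. The table below is illustrative; the pipeline's actual mapping may differ:

```python
# Illustrative RDF-format-to-MIME-type mapping, using format names rdflib
# accepts; the reviewed pipeline's actual table may differ.
RDF_FORMAT_MIME_TYPES = {
    "xml": "application/rdf+xml",
    "turtle": "text/turtle",
    "n3": "text/n3",
    "nt": "application/n-triples",
    "json-ld": "application/ld+json",
}


def mime_type_for(rdf_format: str) -> str:
    # Fall back to RDF/XML for unknown formats, mirroring the reviewed code's
    # "sensible fallback" behavior.
    return RDF_FORMAT_MIME_TYPES.get(rdf_format, "application/rdf+xml")
```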
@Vattikuti-Manideep-Sitaram Thanks for the PR. Could you please review the CodeRabbit suggestions? I'd also appreciate it if you could add the PR description.
Signed-off-by: Vattikuti-Manideep-Sitaram <[email protected]>
Hey hi @pazone. Thanks for the suggestion. I have made the changes that CodeRabbit suggested; there are a few things which I would like to confirm with you. This is the sample code which I am using to test the function:

```python
import asyncio
from io import BytesIO

from cognee.api.v1.cognify.ontology_graph_pipeline import run_ontology_pipeline
from cognee.api.v1.search import search, SearchType


async def main():
    # Path to your updated ontology file
    ONTOLOGY_FILE_PATH = "sample_ontology.owl"

    # Read ontology in-memory as bytes
    with open(ONTOLOGY_FILE_PATH, "rb") as f:
        ontology_bytes = BytesIO(f.read())

    print("\n🚀 Starting Ontology Ingestion Pipeline...\n")

    # Run the pipeline using the in-memory ontology file
    async for status in run_ontology_pipeline(ontology_bytes, format="xml"):
        print(f"📦 Pipeline Run ID: {status.pipeline_run_id}, Status: {status.status}")

    print("\n✅ Ontology ingestion complete. Running validation searches...\n")

    # Perform multiple searches to verify the ontology knowledge graph
    queries = [
        "Which disease has symptom Fever?",
        "List all classes defined in the ontology.",
        "What are the object properties in this ontology?",
    ]

    for query in queries:
        print(f"🔍 Query: {query}")
        try:
            results = await search(query, query_type=SearchType.GRAPH_COMPLETION)
            if not results:
                print("   ⚠️ No results found.")
            else:
                for r in results:
                    print("   ➜", r)
        except Exception as e:
            print(f"   ❌ Search failed: {e}")
        print()  # spacing between queries

    print("✅ Ontology graph successfully indexed and searchable!\n")


if __name__ == "__main__":
    asyncio.run(main())
```
Actionable comments posted: 1
I think it's a great example and it deserves to be added to
@Vattikuti-Manideep-Sitaram I'd recommend considering implementing a custom task: https://docs.cognee.ai/core-concepts/building-blocks/tasks. It is much easier. Please let me know if you need any assistance with the task implementation.
Thanks a lot for the feedback! @pazone |
```python
text = f"{t['subject']} {t['predicate']} {t['object']}"
chunk = DocumentChunk(
    id=uuid4(),
    text=text,
```
Placing triples as plain text won't have the desired effect. The relations (edges) between the entities will be lost. We need to keep the structure using proper DataPoints, and to do that we need to extract the proper types.

It's not an easy task. Please take a look at the example in Cognee-starter (custom-model).
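The reviewer's point can be illustrated with plain dataclasses. `Entity` and `Edge` here are hypothetical stand-ins, not cognee's actual `DataPoint` API:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


# Illustrative stand-ins for cognee's DataPoint/Edge models; the real classes
# live in cognee's infrastructure layer and differ in detail.
@dataclass
class Edge:
    relationship_type: str


@dataclass
class Entity:
    uri: str
    name: str
    related_to: List[Tuple[Edge, "Entity"]] = field(default_factory=list)


fever = Entity(uri="http://example.org/onto#Fever", name="Fever")
flu = Entity(uri="http://example.org/onto#Flu", name="Flu")

# The triple (Flu, hasSymptom, Fever) becomes a typed edge between two nodes,
# not a text blob, so graph queries can still traverse the relationship.
flu.related_to.append((Edge(relationship_type="hasSymptom"), fever))
```

Flattening the same triple into the string `"Flu hasSymptom Fever"` would keep the words but discard the edge, which is exactly the loss the review warns about.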
…p triples to DataPoint graph edges Signed-off-by: Vattikuti-Manideep-Sitaram <[email protected]>
…p triples to DataPoint graph edges Signed-off-by: Vattikuti-Manideep-Sitaram <[email protected]>
…b.com/Vattikuti-Manideep-Sitaram/cognee into feature/owf-or-rdf-files-direct-input
Actionable comments posted: 2
🧹 Nitpick comments (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py (1)

133-140: Consider expanding the docstring.

The function docstring is minimal and doesn't document the parameters or the destructive prune operations. Consider adding parameter documentation and a warning about the data clearing behavior.

Example improvement:

```python
async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str = "xml"):
    """
    Run the ontology ingestion pipeline directly from a file object.

    WARNING: This pipeline clears all existing data and system metadata
    before processing.

    Args:
        ontology_file: File-like object, bytes, or path containing RDF/OWL ontology data
        format: RDF serialization format (e.g., "xml", "turtle", "n3", "json-ld")

    Yields:
        Pipeline status updates as the ingestion progresses
    """
```

Note: The prune operations are intentional and follow the established pattern from `code_graph_pipeline.py`. Based on learnings.
🪛 Pylint (4.0.0)
cognee/api/v1/cognify/ontology_graph_pipeline.py
[error] 1-1: Unrecognized option found: suggestion-mode
(E0015)
🔇 Additional comments (3)
cognee/api/v1/cognify/ontology_graph_pipeline.py (3)

50-63: LGTM! Well-structured DataPoint implementation.

The `OntologyEntity` class correctly extends `DataPoint` and properly represents ontology resources with relationships. The use of `List[Tuple[Edge, "OntologyEntity"]]` for `related_to` preserves the graph structure, and `model_rebuild()` correctly handles forward references.

75-113: LGTM! Solid triple-to-DataPoint conversion.

The function correctly converts RDF triples into a graph of `OntologyEntity` instances:
- Deduplicates entities by URI
- Properly constructs edges with relationship types
- Prevents duplicate relationships (lines 107-110)
- Returns all entities for complete graph representation

119-127: LGTM! Clean task implementation.

The custom task function correctly orchestrates the ontology ingestion flow:
- Extracts the ontology file from inputs
- Parses it into triples with the specified format
- Converts triples to DataPoints
- Persists DataPoints via `add_data_points`
Actionable comments posted: 2
♻️ Duplicate comments (1)

cognee/api/v1/cognify/ontology_graph_pipeline.py (1)

38-43: Preserve RDF term semantics: distinguish Literals from URIs.

Converting all RDF terms to plain strings loses critical semantic information and causes downstream issues:
- Literal datatypes are lost: values like `"42"^^xsd:integer` or `"2024-01-01"^^xsd:date` become plain strings
- Language tags are lost: `"hello"@en` becomes just `"hello"`
- Term type distinction is lost: you cannot differentiate between a URIRef (resource) and a Literal (value)

This causes `ontology_to_datapoints` (lines 95-100) to incorrectly create `OntologyEntity` nodes for literals, producing semantically invalid ontology graphs where values like `"42"` become entity nodes.

Consider preserving term type information:

```diff
 triples = []
 for s, p, o in g:
-    triples.append({
-        "subject": str(s),
-        "predicate": str(p),
-        "object": str(o)
-    })
+    triple = {
+        "subject": str(s),
+        "predicate": str(p),
+        "object": str(o),
+        "object_type": type(o).__name__,  # 'URIRef', 'Literal', 'BNode'
+    }
+    if hasattr(o, 'datatype') and o.datatype:
+        triple["object_datatype"] = str(o.datatype)
+    if hasattr(o, 'language') and o.language:
+        triple["object_language"] = o.language
+    triples.append(triple)
 return triples
```

Then update `ontology_to_datapoints` to only create entities for URIRef objects, and handle Literals as edge properties or separate literal value nodes.
🧹 Nitpick comments (2)

cognee/api/v1/cognify/ontology_graph_pipeline.py (2)

75-113: Add error handling and validation.

The function lacks validation for empty or malformed input and has no error handling for potential issues during entity/edge creation. Add input validation and error handling:

```diff
 async def ontology_to_datapoints(triples: list[dict]) -> list[DataPoint]:
     """
     Converts parsed triples into Cognee DataPoints (entities + relations).
     This preserves the ontology's structure as a graph.
     """
+    if not triples:
+        raise ValueError("No triples provided for conversion")
+
     entities: Dict[str, OntologyEntity] = {}

+    try:
         for t in triples:
             subj = t["subject"]
             pred = t["predicate"]
             obj = t["object"]
             # ... rest of logic
+    except KeyError as e:
+        raise ValueError(f"Malformed triple missing required field: {e}") from e
+    except Exception as e:
+        raise RuntimeError(f"Failed to convert triples to datapoints: {e}") from e

     return list(entities.values())
```

119-127: Validate task inputs to prevent runtime errors.

The function assumes `inputs[0]` exists and is an ontology file without validation, which will raise an `IndexError` if `inputs` is empty. Add input validation:

```diff
 async def ontology_ingestion_task(inputs: list, format: str):
     """
     Custom Cognee Task: Ingest OWL/RDF ontology and store as structured DataPoints.
     """
+    if not inputs:
+        raise ValueError("No inputs provided to ontology ingestion task")
+
     ontology_file = inputs[0]
```
🔇 Additional comments (5)

cognee/api/v1/cognify/ontology_graph_pipeline.py (5)

1-16: LGTM!

All necessary imports are present and correctly organized for the ontology ingestion pipeline.

50-63: LGTM!

The `OntologyEntity` DataPoint structure appropriately represents ontology resources with graph relationships, and `model_rebuild()` correctly resolves the forward reference.

66-72: LGTM!

The label extraction logic handles common RDF URI patterns (fragment identifiers and path components) with a sensible fallback.
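A label extractor of the kind described might look like this. It is an illustrative sketch; the pipeline's actual `_extract_label` may differ in detail:

```python
def extract_label(uri: str) -> str:
    # Prefer the fragment identifier: http://example.org/onto#Person -> "Person".
    if "#" in uri:
        fragment = uri.rsplit("#", 1)[1]
        if fragment:
            return fragment
    # Otherwise use the last path component: http://example.org/onto/Person -> "Person".
    if "/" in uri:
        segment = uri.rstrip("/").rsplit("/", 1)[1]
        if segment:
            return segment
    # Fallback: return the URI unchanged.
    return uri
```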
133-133
: Consider requiring the format parameter.Per project learnings, the
format
parameter should be required (not optional) to ensure reliable RDF parsing, as rdflib's auto-detection is unreliable. While providing a default"xml"
may improve user experience, it could lead to parse failures if the actual format differs.Based on learnings
Consider either:
- Remove the default to force explicit format specification:
-async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str = "xml"): +async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str):
- Or keep the default but add a warning in the docstring:
async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str = "xml"): """ Run the ontology ingestion pipeline. Args: ontology_file: File-like object or path to RDF data format: RDF serialization format (xml, turtle, n3, json-ld, etc.) Defaults to "xml" but explicit specification is recommended for reliability. """What is your preference for balancing usability vs. reliability?
138-140
: LGTM—prune operations follow established pattern.The prune operations at the start of the pipeline match the pattern used in
code_graph_pipeline.py
and are intentional for pipeline operations in the cognee codebase.Based on learnings
```python
entities[subj] = OntologyEntity(
    id=uuid4(),
```
Use stable IDs derived from URIs instead of `uuid4()`.

Generating new UUIDs with `uuid4()` on every pipeline run breaks referential stability: the same ontology resource will have different IDs across runs, preventing:
- Incremental updates
- Cross-run queries and linking
- Idempotent pipeline execution

Use the URI as the ID or generate a stable hash:

```diff
 if subj not in entities:
     entities[subj] = OntologyEntity(
-        id=uuid4(),
+        id=UUID(hashlib.md5(subj.encode()).hexdigest()),
         name=_extract_label(subj),
         uri=subj,
     )
```

Add the imports at the top:

```python
import hashlib
from uuid import UUID
```

Apply the same fix at lines 96-97 for object entities.
🤖 Prompt for AI Agents
In cognee/api/v1/cognify/ontology_graph_pipeline.py around lines 89-90 (and also
apply the same change at lines 96-97 for object entities), replace the unstable
uuid4() generation with a stable ID derived from the entity URI: compute a
deterministic hash from the URI (e.g., SHA-1) and construct a UUID from that hex
digest (or use uuid5 with a fixed namespace and the URI) so the same URI yields
the same ID across runs; also add the suggested imports at the top (import
hashlib and from uuid import UUID) and ensure the OntologyEntity id field is
assigned the deterministic UUID instead of uuid4().
Can I use the same logic for the object UUID, or should I just use it for the subject?
@Vattikuti-Manideep-Sitaram Yes, absolutely use the same stable UUID generation logic for both subject and object entities. This is critical because:

- The same URI can appear as both subject and object in different triples. For example, a class definition might have a URI as the subject, but that same URI could appear as the object in another triple (e.g., when defining relationships between classes).

- Referential integrity: If you generate different UUIDs for the same URI depending on whether it appears as a subject or object, you'll create duplicate entities in your knowledge graph instead of properly linking them together.

- Consistency: The stable ID generation ensures that `http://example.org/Person` always maps to the same UUID, regardless of whether it appears as a subject or object in any triple.

Here's what the logic should look like for both:

```python
import hashlib
from uuid import UUID

# For subjects (around lines 89-90)
if subj not in entities:
    entities[subj] = OntologyEntity(
        id=UUID(hashlib.md5(subj.encode()).hexdigest()),
        name=_extract_label(subj),
        uri=subj,
    )

# For objects (around lines 96-97) - same pattern
if obj not in entities:
    entities[obj] = OntologyEntity(
        id=UUID(hashlib.md5(obj.encode()).hexdigest()),
        name=_extract_label(obj),
        uri=obj,
    )
```

This ensures that any given URI will always resolve to the same entity ID across your entire ontology graph, enabling proper relationship modeling and idempotent pipeline runs.
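The deterministic-ID idea can be verified with the standard library alone; `uuid5` with a fixed namespace is a common alternative to the MD5 construction above:

```python
import hashlib
import uuid


def stable_id_md5(uri: str) -> uuid.UUID:
    # UUID accepts a 32-character hex string, which is exactly what an MD5
    # hexdigest yields.
    return uuid.UUID(hashlib.md5(uri.encode()).hexdigest())


def stable_id_uuid5(uri: str) -> uuid.UUID:
    # uuid5 is purpose-built for deriving stable, name-based IDs.
    return uuid.uuid5(uuid.NAMESPACE_URL, uri)

# Either way, the same URI always yields the same ID, across runs and across
# subject/object positions, while distinct URIs get distinct IDs.
```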
Signed-off-by: Vattikuti-Manideep-Sitaram <[email protected]>
Signed-off-by: Vattikuti-Manideep-Sitaram <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (2)

cognee/api/v1/cognify/ontology_graph_pipeline.py (2)

39-46: CRITICAL: Preserve RDF term type information to fix broken literal detection.

Converting all RDF terms to plain strings loses critical semantic information that the code relies on:
- Line 99 in `ontology_to_datapoints` checks `t.get("object_type") == "URIRef"` to distinguish entities from literals
- This check will always fail because `object_type` is never set, causing all objects to be skipped (line 107)
- Result: the pipeline will crash on line 118 when trying to access `entities[obj]` for a skipped literal object

Apply this diff to preserve term types:

```diff
 triples = []
 for s, p, o in g:
-    triples.append({
-        "subject": str(s),
-        "predicate": str(p),
-        "object": str(o)
-    })
+    triple = {
+        "subject": str(s),
+        "predicate": str(p),
+        "object": str(o),
+        "object_type": type(o).__name__,  # 'URIRef', 'Literal', 'BNode'
+    }
+    if hasattr(o, 'datatype') and o.datatype:
+        triple["object_datatype"] = str(o.datatype)
+    if hasattr(o, 'language') and o.language:
+        triple["object_language"] = o.language
+    triples.append(triple)
 return triples
```

Based on learnings.

92-92: Use stable IDs consistently for all entities.

Subject entities use stable MD5-based UUIDs (line 92), but object entities use random `uuid4()` (line 101). This inconsistency breaks referential stability: the same ontology resource will have different IDs depending on whether it appears as a subject or object, and different IDs across pipeline runs.

Apply this diff to use stable IDs for object entities:

```diff
 if t.get("object_type") == "URIRef":
     entities[obj] = OntologyEntity(
-        id=uuid4(),
+        id=UUID(hashlib.md5(obj.encode()).hexdigest()),
         name=_extract_label(obj),
         uri=obj,
     )
```
🧹 Nitpick comments (3)

cognee/api/v1/cognify/ontology_graph_pipeline.py (3)

1-1: Remove unused import.

The `asyncio` module is imported but never used in this file.

```diff
-import asyncio
 from typing import Union, Dict, List, Tuple
```

126-134: Add input validation for robustness.

The function accesses `inputs[0]` without checking if `inputs` is non-empty. While `run_tasks` should always pass data, defensive validation improves robustness.

```diff
 async def ontology_ingestion_task(inputs: list, format: str):
     """
     Custom Cognee Task: Ingest OWL/RDF ontology and store as structured DataPoints.
     """
+    if not inputs:
+        raise ValueError("ontology_ingestion_task requires at least one input file")
+
     ontology_file = inputs[0]
     triples = await load_ontology_data(ontology_file, format)
     datapoints = await ontology_to_datapoints(triples)
     await add_data_points(datapoints)
     return datapoints
```

157-157: Fix spacing in Task instantiation.

Missing space after the comma before the `format` parameter.

```diff
-        Task(ontology_ingestion_task,format=format, task_config={"batch_size": 50}),
+        Task(ontology_ingestion_task, format=format, task_config={"batch_size": 50}),
```
🔇 Additional comments (1)

cognee/api/v1/cognify/ontology_graph_pipeline.py (1)

145-146: LGTM: prune operations follow established pattern.

These prune operations are consistent with the established pattern used in `code_graph_pipeline.py` and other pipelines in the cognee codebase. Based on learnings.
```python
if not any(
    existing_edge.relationship_type == predicate_label and target.uri == obj
    for existing_edge, target in entities[subj].related_to
):
    entities[subj].related_to.append((edge, entities[obj]))
```
CRITICAL: Guard against accessing skipped literal objects.

When an object is a Literal (not a URIRef), the code skips creating an entity (line 107). However, line 118 unconditionally tries to access `entities[obj]`, which will raise a `KeyError` for skipped literals, crashing the pipeline.

Apply this diff to only create edges for URI objects:

```diff
 predicate_label = _extract_label(pred)
 edge = Edge(
     relationship_type=predicate_label,
     properties={"uri": pred},
 )

+# Only create edges between entity nodes (URIs), not to literals
+if obj not in entities:
+    continue
+
 if not any(
     existing_edge.relationship_type == predicate_label and target.uri == obj
     for existing_edge, target in entities[subj].related_to
 ):
     entities[subj].related_to.append((edge, entities[obj]))
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
predicate_label = _extract_label(pred)
edge = Edge(
    relationship_type=predicate_label,
    properties={"uri": pred},
)

# Only create edges between entity nodes (URIs), not to literals
if obj not in entities:
    continue

if not any(
    existing_edge.relationship_type == predicate_label and target.uri == obj
    for existing_edge, target in entities[subj].related_to
):
    entities[subj].related_to.append((edge, entities[obj]))
```
🤖 Prompt for AI Agents
In cognee/api/v1/cognify/ontology_graph_pipeline.py around lines 114 to 118, the
code unconditionally accesses entities[obj] when appending an edge which raises
a KeyError for literal objects that were intentionally skipped; change the logic
to only consider and append edges when the object is a URI entity present in
entities (e.g., guard with if obj in entities before performing the any(...)
check and before appending), so edges are only created for URI objects that
exist in the entities mapping.
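The guard's effect can be checked with plain dicts standing in for the parsed triples; nothing cognee-specific is assumed here:

```python
def build_entities_and_edges(triples):
    # Entities are created for all subjects but only for URI-typed objects;
    # edges pointing at skipped literal objects are dropped instead of
    # raising a KeyError.
    entities = {}
    edges = []
    for t in triples:
        subj, pred, obj = t["subject"], t["predicate"], t["object"]
        entities.setdefault(subj, {"uri": subj})
        if t.get("object_type") == "URIRef":
            entities.setdefault(obj, {"uri": obj})
        if obj not in entities:
            continue  # literal object: no edge, no crash
        edges.append((subj, pred, obj))
    return entities, edges


triples = [
    {"subject": "ex:Flu", "predicate": "ex:hasSymptom", "object": "ex:Fever", "object_type": "URIRef"},
    {"subject": "ex:Fever", "predicate": "rdfs:label", "object": "Fever", "object_type": "Literal"},
]
entities, edges = build_entities_and_edges(triples)
```

Without the `continue` guard, the second triple would look up `entities["Fever"]` and fail, which is exactly the crash the review describes.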
```python
# -------------------------------------
# STEP 4: Build and run the pipeline
# -------------------------------------
async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str = "xml"):
```
Remove the default value for the `format` parameter.

Based on learnings: the `format` parameter must be required (not optional) to ensure reliable RDF parsing, as rdflib's auto-detection is unreliable. The current default value of `"xml"` could lead to parse failures if the actual file format differs.

Apply this diff:

```diff
-async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str = "xml"):
+async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str):
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str):
```
🤖 Prompt for AI Agents
In cognee/api/v1/cognify/ontology_graph_pipeline.py around line 140, the
function signature defines format: str = "xml" which makes format optional;
remove the default so the parameter is required (change to format: str) and then
update any internal usages/call sites and tests to pass an explicit format
string; ensure the docstring/type hints reflect that format is mandatory and run
tests to catch callers that need updating.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py (1)
135-135: Remove the default value for the format parameter.

The parameter currently has a default value of "xml", but this contradicts the established requirement that format must be explicitly specified for reliable RDF parsing. Auto-detection is unreliable, and defaulting to XML may cause parse failures if files are in other formats (Turtle, N3, JSON-LD, etc.). Based on learnings.

Apply this diff:

-async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str = "xml"):
+async def run_ontology_pipeline(ontology_file: Union[str, bytes, IOBase], format: str):
🧹 Nitpick comments (2)
cognee/api/v1/cognify/ontology_graph_pipeline.py (2)
77-115: Advisory: Current design treats all RDF objects as entities.

The implementation creates OntologyEntity instances for all objects (lines 97-102), including values that may be RDF Literals (e.g., "42"^^xsd:integer, "text"@en) rather than URI references. This can produce semantically incorrect graphs where data values become entity nodes.

Based on past review discussion, an attempt was made to distinguish URIs from Literals and handle them differently, but that approach caused relationship loss. The current simpler string-based approach was retained. If downstream queries or semantic correctness issues arise from treating literal values as entities, you may need to revisit the term-type preservation approach with additional testing to ensure relationships are maintained.
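If that approach is revisited, the partition could be sketched as follows. This is a heuristic using stdlib URL parsing for illustration only; rdflib code would instead check isinstance(term, URIRef) / isinstance(term, Literal) on the parsed terms:

```python
from urllib.parse import urlparse

def is_uri(term: str) -> bool:
    # Heuristic stand-in for rdflib's URIRef check: a term counts as a URI
    # reference if it carries a scheme.
    parsed = urlparse(term)
    return bool(parsed.scheme) and bool(parsed.netloc or parsed.path)

def split_objects(triples):
    # URI-valued objects stay entity candidates; literal-valued objects are
    # routed toward node attributes instead of becoming entity nodes.
    entity_triples, attribute_triples = [], []
    for triple in triples:
        (entity_triples if is_uri(triple["object"]) else attribute_triples).append(triple)
    return entity_triples, attribute_triples
```

The hard part, per the earlier discussion, is doing this split without dropping relationships, so any retry would need tests covering both branches.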
152-152: Fix formatting: add space after comma.

Apply this diff:

- Task(ontology_ingestion_task,format=format, task_config={"batch_size": 50}),
+ Task(ontology_ingestion_task, format=format, task_config={"batch_size": 50}),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-13T13:18:46.294Z
Learnt from: Vattikuti-Manideep-Sitaram
PR: topoteretes/cognee#1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:0-0
Timestamp: 2025-10-13T13:18:46.294Z
Learning: In cognee/api/v1/cognify/ontology_graph_pipeline.py, the `format` parameter in `load_ontology_data` function must be required (not optional) to ensure reliable RDF parsing, as rdflib's auto-detection is unreliable.
Applied to files:
cognee/api/v1/cognify/ontology_graph_pipeline.py
📚 Learning: 2025-10-11T04:18:24.585Z
Learnt from: Vattikuti-Manideep-Sitaram
PR: topoteretes/cognee#1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.585Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.
Applied to files:
cognee/api/v1/cognify/ontology_graph_pipeline.py
🧬 Code graph analysis (1)
cognee/api/v1/cognify/ontology_graph_pipeline.py (6)
cognee/infrastructure/engine/models/DataPoint.py (1): DataPoint (20-220)
cognee/modules/users/methods/get_default_user.py (1): get_default_user (13-41)
cognee/modules/data/methods/create_dataset.py (1): create_dataset (10-32)
cognee/infrastructure/databases/relational/get_relational_engine.py (1): get_relational_engine (5-21)
cognee/modules/pipelines/operations/run_tasks.py (1): run_tasks (62-336)
cognee/modules/pipelines/tasks/task.py (1): Task (5-97)
🪛 Pylint (4.0.0)
cognee/api/v1/cognify/ontology_graph_pipeline.py
[error] 1-1: Unrecognized option found: suggestion-mode
(E0015)
Description
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.