Skip to content

HERD minor cleanups: dead assignment, repeated to_dataframe in loop, O(n^2) lookups #1503

Description

@rly

A few minor, non-blocking cleanups noticed while reviewing HERD (src/hdmf/common/resources.py). None change behavior for correct usage; grouping them here rather than filing separately.

1. Dead assignment in add_ref (resources.py:679-682)

if entity_uri is not None:
    entity_uri = entity.entity_uri          # value is never read afterward
    msg = 'This entity already exists. Ignoring new entity uri'
    warn(msg, stacklevel=3)

In this branch the entity already exists, so add_entity is False and _add_entity is never called. The reassigned entity_uri is never used again. The line can be removed (keep the warning).

2. to_dataframe rebuilds the files DataFrame inside a loop (resources.py:954-955)

for idx in result_df['files_idx']:
    file_id_val = self.files.to_dataframe().iloc[int(idx)]['file_object_id']
    file_id_col.append(file_id_val)

self.files.to_dataframe() is recomputed on every iteration. Hoist it out of the loop (or map via the already-available files DataFrame).

3. O(n^2) population via repeated linear which() scans

add_ref / _check_object_field / to_dataframe rely on repeated linear Table.which() scans (e.g. _check_object_field scans the objects table up to three times per call). Building a large HERD is therefore roughly O(n^2). Not a bug and fine for small tables, but worth noting for scalability if HERD is used at scale (e.g. an index/cache keyed by object_id/key would remove the repeated scans).

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions