Commit 427f6e8

Automatically propagate/condense slots when parsing/writing. (#629)
* Enable automatic propagation and condensation.

Add a `propagate` parameter to all `parse_*` functions. When that parameter is true, the parsing function automatically propagates all condensed slots in the parsed mapping set before returning a MappingSetDataFrame. For `parse_obographs_json` the parameter does not actually make much sense (a mapping set extracted from an OBOGraph-JSON document cannot be in a condensed state), but having all parsing functions accept the same `propagate` parameter lets us keep using the `get_parsing_function` logic without special-casing the functions that would otherwise not accept it. Likewise for `parse_alignment_xml`. The `propagate` parameter defaults to True (the behaviour recommended by the SSSOM specification), except for the aforementioned `parse_obographs_json` and `parse_alignment_xml` functions.

Propagation is normally performed by calling the `MappingSetDataFrame#propagate()` method, but for `parse_sssom_table()` this cannot work: propagation must happen before a MappingSetDataFrame instance can be obtained, because constructing one could otherwise fail if the mapping set contains literal mappings and the `subject_type` or `object_type` slot is condensed. That function therefore uses a new `propagate_condensed_slots()` helper instead, which does not require a MappingSetDataFrame object.

Conversely, we also add a `condense` parameter to all the `write_*` functions, to perform the opposite operation before writing a set. The parameter defaults to True for all writing functions except `write_rdf`, since writing condensed slots in RDF is most likely not wanted.

* Update tests for the new behaviour of parsing functions.

Now that (most) parsing functions propagate condensed slots by default, some tests need to be updated. Parsing-time propagation must be disabled for:

* tests performed on "half-condensed" test files that do not expect the structure of the set to be modified;
* the tests for the propagation/condensation feature itself (propagation cannot be properly tested if the parser has already done it for us).

In addition, we add a test checking that parsing-time propagation allows a set containing literal mappings to be read successfully when the `object_type` slot is condensed (#606).

* Add --[no-]propagate and --[no-]condense options.

Now that parsing and writing functions automatically propagate and condense (respectively) by default, we add `--no-propagate` and `--no-condense` options to some commands to alter the default behaviour. Specifically, the options are added to:

* `sssom convert`,
* `sssom parse`,
* `sssom validate`,
* `sssom merge`.

Adding these options to every single command is currently not deemed useful (by me, at least).
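To illustrate what the commit message means by propagation and condensation, here is a minimal pure-Python sketch. It is not the sssom-py implementation (which operates on a pandas DataFrame and consults the SSSOM schema); the functions and the `mapping_tool` slot here are just illustrative.

```python
# Simplified sketch of SSSOM slot propagation and condensation.
# "Condensed": a shared slot value is stored once in the set-level metadata.
# "Propagated": that value is copied onto every individual mapping record.

def propagate(metadata: dict, mappings: list, slot: str) -> None:
    """Copy a set-level slot value onto every record, removing it from the metadata."""
    if slot in metadata and all(slot not in m for m in mappings):
        value = metadata.pop(slot)
        for m in mappings:
            m[slot] = value

def condense(metadata: dict, mappings: list, slot: str) -> None:
    """Move a slot back to the set level if every record shares the same value."""
    values = {m.get(slot) for m in mappings}
    if len(values) == 1 and None not in values:
        metadata[slot] = values.pop()
        for m in mappings:
            del m[slot]

meta = {"mapping_tool": "demo-matcher"}  # condensed slot
records = [
    {"subject_id": "A:1", "object_id": "B:1"},
    {"subject_id": "A:2", "object_id": "B:2"},
]

propagate(meta, records, "mapping_tool")
assert all(m["mapping_tool"] == "demo-matcher" for m in records)
assert "mapping_tool" not in meta

condense(meta, records, "mapping_tool")
assert meta["mapping_tool"] == "demo-matcher"
assert "mapping_tool" not in records[0]
```

When all records share the value, the propagate/condense round trip is lossless, which is why the commit can default to propagating at parse time and condensing at write time.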
1 parent 8e2f8b7 commit 427f6e8

File tree

9 files changed: +180 −45 lines changed

src/sssom/cli.py

Lines changed: 36 additions & 8 deletions

@@ -103,13 +103,20 @@
     default=("subject_category", "object_category"),
     help="Fields.",
 )
-
 predicate_filter_option = click.option(
     "-F",
     "--mapping-predicate-filter",
     multiple=True,
     help="A list of predicates or a file path containing the list of predicates to be considered.",
 )
+propagate_option = click.option(
+    "--propagate/--no-propagate",
+    default=True,
+    help="Automatically propagate condensed slots upon parsing.",
+)
+condense_option = click.option(
+    "--condense/--no-condense", default=True, help="Automatically condense slots upon writing."
+)


 @click.group()
@@ -148,9 +155,19 @@ def help(ctx: click.Context, subcommand: str) -> None:
 @input_argument
 @output_option
 @output_format_option
-def convert(input: str, output: TextIO, output_format: str) -> None:
+@propagate_option
+@condense_option
+def convert(
+    input: str, output: TextIO, output_format: str, propagate: bool, condense: bool
+) -> None:
     """Convert a file."""
-    convert_file(input_path=input, output=output, output_format=output_format)
+    convert_file(
+        input_path=input,
+        output=output,
+        output_format=output_format,
+        propagate=propagate,
+        condense=condense,
+    )


 # Input and metadata would be files (file paths). Check if exists.
@@ -193,6 +210,8 @@ def convert(input: str, output: TextIO, output_format: str) -> None:
 )
 @predicate_filter_option
 @output_option
+@propagate_option
+@condense_option
 def parse(
     input: str,
     input_format: str,
@@ -203,6 +222,8 @@ def parse(
     output: TextIO,
     embedded_mode: bool,
     mapping_predicate_filter: list[str],
+    propagate: bool,
+    condense: bool,
 ) -> None:
     """Parse a file in one of the supported formats (such as obographs) into an SSSOM TSV file."""
     parse_file(
@@ -215,6 +236,8 @@ def parse(
         strict_clean_prefixes=strict_clean_prefixes,
         embedded_mode=embedded_mode,
         mapping_predicate_filter=mapping_predicate_filter,
+        propagate=propagate,
+        condense=condense,
     )


@@ -227,10 +250,11 @@ def parse(
     multiple=True,
     default=DEFAULT_VALIDATION_TYPES,
 )
-def validate(input: str, validation_types: List[SchemaValidationType]) -> None:
+@propagate_option
+def validate(input: str, validation_types: List[SchemaValidationType], propagate: bool) -> None:
     """Produce an error report for an SSSOM file."""
     validation_type_list = [t for t in validation_types]
-    validate_file(input_path=input, validation_types=validation_type_list)
+    validate_file(input_path=input, validation_types=validation_type_list, propagate=propagate)


 @main.command()
@@ -522,11 +546,15 @@ def correlations(input: str, output: TextIO, transpose: bool, fields: Tuple[str,
     help="If true, the deduplicate (i.e., remove redundant lower confidence mappings) and reconcile (if msdf contains a higher confidence _negative_ mapping, then remove lower confidence positive one. If confidence is the same, prefer HumanCurated. If both HumanCurated, prefer negative mapping)",
 )
 @output_option
-def merge(inputs: str, output: TextIO, reconcile: bool = False) -> None:
+@propagate_option
+@condense_option
+def merge(
+    inputs: str, output: TextIO, propagate: bool, condense: bool, reconcile: bool = False
+) -> None:
     """Merge multiple MappingSetDataFrames into one."""  # noqa: DAR101
-    msdfs = [parse_sssom_table(i) for i in inputs]
+    msdfs = [parse_sssom_table(i, propagate=propagate) for i in inputs]
     merged_msdf = merge_msdf(*msdfs, reconcile=reconcile)
-    write_table(merged_msdf, output)
+    write_table(merged_msdf, output, condense=condense)


 @main.command(
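The CLI changes above rely on Click's paired boolean flags (`--propagate/--no-propagate`). For readers unfamiliar with that pattern, the standard library offers an analogous on/off flag via `argparse.BooleanOptionalAction` (Python 3.9+). This is only an illustrative analogue, not the project's actual Click-based CLI:

```python
import argparse

# Hypothetical stand-in for the options added in the diff:
# --propagate/--no-propagate and --condense/--no-condense, both defaulting to True.
parser = argparse.ArgumentParser(prog="sssom-demo")
parser.add_argument(
    "--propagate",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Automatically propagate condensed slots upon parsing.",
)
parser.add_argument(
    "--condense",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Automatically condense slots upon writing.",
)

# Passing only --no-condense leaves propagate at its default (True).
args = parser.parse_args(["--no-condense"])
assert args.propagate is True
assert args.condense is False
```

Either way, the defaults match the spec-recommended behaviour, and the `--no-*` forms exist purely to opt out.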

src/sssom/io.py

Lines changed: 15 additions & 4 deletions

@@ -43,18 +43,22 @@ def convert_file(
     input_path: str,
     output: TextIO,
     output_format: Optional[str] = None,
+    propagate: bool = True,
+    condense: bool = True,
 ) -> None:
     """Convert a file from one format to another.

     :param input_path: The path to the input SSSOM tsv file
     :param output: The path to the output file. If none is given, will default to using stdout.
     :param output_format: The format to which the SSSOM TSV should be converted.
+    :param propagate: Propagate condensed slots in the input file.
+    :param condense: Condense slots in the output file.
     """
     raise_for_bad_path(input_path)
-    doc = parse_sssom_table(input_path)
+    doc = parse_sssom_table(input_path, propagate=propagate)
     write_func, fileformat = get_writer_function(output_format=output_format, output=output)
     # TODO cthoyt figure out how to use protocols for this
-    write_func(doc, output, serialisation=fileformat)  # type:ignore
+    write_func(doc, output, serialisation=fileformat, condense=condense)  # type:ignore


 def parse_file(
@@ -68,6 +72,8 @@
     strict_clean_prefixes: bool = True,
     embedded_mode: bool = True,
     mapping_predicate_filter: RecursivePathList | None = None,
+    propagate: bool = True,
+    condense: bool = True,
 ) -> None:
     """Parse an SSSOM metadata file and write to a table.
@@ -86,6 +92,8 @@
         (tsv), else two separate files (tsv and yaml).
     :param mapping_predicate_filter: Optional list of mapping predicates or filepath containing the
         same.
+    :param propagate: If true, propagate all condensed slots in the input set.
+    :param condense: If true, condense slots in the output set.
     """
     raise_for_bad_path(input_path)
     converter, meta = _get_converter_and_metadata(
@@ -102,31 +110,34 @@
         prefix_map=converter,
         meta=meta,
         mapping_predicates=mapping_predicates,
+        propagate=propagate,
     )
     if clean_prefixes:
         # We do this because we got a lot of prefixes from the default SSSOM prefixes!
         doc.clean_prefix_map(strict=strict_clean_prefixes)
-    write_table(doc, output, embedded_mode)
+    write_table(doc, output, embedded_mode, condense=condense)


 def validate_file(
     input_path: str,
     validation_types: Optional[List[SchemaValidationType]] = None,
     fail_on_error: bool = True,
+    propagate: bool = True,
 ) -> dict[SchemaValidationType, ValidationReport]:
     """Validate the incoming SSSOM TSV according to the SSSOM specification.

     :param input_path: The path to the input file in one of the legal formats, eg obographs,
         aligmentapi-xml
     :param validation_types: A list of validation types to run.
     :param fail_on_error: Should an exception be raised on error of _any_ validator?
+    :param propagate: If true, propagate condensed slots in the input set.

     :returns: A dictionary from validation types to validation reports
     """
     # Two things to check:
     # 1. All prefixes in the DataFrame are define in prefix_map
     # 2. All columns in the DataFrame abide by sssom-schema.
-    msdf = parse_sssom_table(file_path=input_path)
+    msdf = parse_sssom_table(file_path=input_path, propagate=propagate)
     return validate(msdf=msdf, validation_types=validation_types, fail_on_error=fail_on_error)
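The `convert_file` change threads the new flags through a lookup-then-call pipeline: parse with `propagate`, look up a writer by output format, and forward `condense` to it. A hypothetical miniature of that structure (names and string formats are invented for the sketch, not the sssom-py API):

```python
from io import StringIO
from typing import Callable, Dict, TextIO

# Toy writers standing in for write_table / write_rdf etc.
def write_tsv(doc: str, output: TextIO, condense: bool = True) -> None:
    output.write(f"tsv:{doc}:condense={condense}")

def write_rdf(doc: str, output: TextIO, condense: bool = False) -> None:
    # Mirrors the commit's choice: condensation is off by default for RDF.
    output.write(f"rdf:{doc}:condense={condense}")

WRITERS: Dict[str, Callable[..., None]] = {"tsv": write_tsv, "rdf": write_rdf}

def convert(doc: str, output_format: str, output: TextIO, condense: bool = True) -> None:
    """Look up the writer for the requested format and forward the condense flag."""
    write_func = WRITERS[output_format]
    write_func(doc, output, condense=condense)

buf = StringIO()
convert("mappings", "tsv", buf, condense=False)
assert buf.getvalue() == "tsv:mappings:condense=False"
```

Keeping one keyword argument name across all writers is what lets `convert_file` pass `condense=condense` blindly to whichever `write_func` the dispatch returns.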

src/sssom/parsers.py

Lines changed: 21 additions & 1 deletion

@@ -87,6 +87,7 @@
     MappingSetDataFrame,
     get_file_extension,
     is_multivalued_slot,
+    propagate_condensed_slots,
     raise_for_bad_path,
     safe_compress,
     to_mapping_set_dataframe,
@@ -304,6 +305,7 @@ def parse_sssom_table(
     *,
     strict: bool = False,
     sep: Optional[str] = None,
+    propagate: bool = True,
     **kwargs: Any,
 ) -> MappingSetDataFrame:
     """Parse a SSSOM CSV or TSV file.
@@ -315,6 +317,7 @@
         within the document itself. For example, this may come from a companion SSSOM YAML file.
     :param strict: If true, will fail parsing for undefined prefixes, CURIEs, or IRIs
     :param sep: The seperator. If not given, inferred from file name
+    :param propagate: If true, propagate all condensed slots.
     :param kwargs: Additional keyword arguments (unhandled)

     :returns: A parsed dataframe wrapper object
@@ -358,6 +361,9 @@
             )
         )

+    if propagate:
+        propagate_condensed_slots(df, combine_meta)
+
     msdf = from_sssom_dataframe(df, prefix_map=converter, meta=combine_meta)
     return msdf

@@ -379,6 +385,7 @@ def parse_sssom_rdf(
     prefix_map: ConverterHint = None,
     meta: Optional[MetadataType] = None,
     serialisation: str = SSSOM_DEFAULT_RDF_SERIALISATION,
+    propagate: bool = True,
     **kwargs: Any,
     # mapping_predicates: Optional[List[str]] = None,
 ) -> MappingSetDataFrame:
@@ -406,6 +413,8 @@
         ]
     )
     msdf = from_sssom_rdf(g, prefix_map=converter, meta=meta)
+    if propagate:
+        msdf.propagate()
     # df: pd.DataFrame = msdf.df
     # if mapping_predicates and not df.empty():
     #     msdf.df = df[df["predicate_id"].isin(mapping_predicates)]
@@ -416,6 +425,7 @@ def parse_sssom_json(
     file_path: Union[str, Path],
     prefix_map: ConverterHint = None,
     meta: Optional[MetadataType] = None,
+    propagate: bool = True,
     **kwargs: Any,
 ) -> MappingSetDataFrame:
     """Parse a TSV to a :class:`MappingSetDocument` to a :class:`MappingSetDataFrame`."""
@@ -443,6 +453,8 @@
     )

     msdf = from_sssom_json(jsondoc=jsondoc, prefix_map=converter, meta=meta)
+    if propagate:
+        msdf.propagate()
     return msdf


@@ -454,13 +466,15 @@ def parse_obographs_json(
     prefix_map: ConverterHint = None,
     meta: Optional[MetadataType] = None,
     mapping_predicates: Optional[List[str]] = None,
+    propagate: bool = False,
 ) -> MappingSetDataFrame:
     """Parse an obographs file as a JSON object and translates it into a MappingSetDataFrame.

     :param file_path: The path to the obographs file
     :param prefix_map: an optional prefix map
     :param meta: an optional dictionary of metadata elements
     :param mapping_predicates: an optional list of mapping predicates that should be extracted
+    :param propagate: If true, propagate all condensed slots.

     :returns: A SSSOM MappingSetDataFrame
     """
@@ -471,12 +485,15 @@
     with open(file_path) as json_file:
         jsondoc = json.load(json_file)

-    return from_obographs(
+    msdf = from_obographs(
         jsondoc,
         prefix_map=converter,
         meta=meta,
         mapping_predicates=mapping_predicates,
     )
+    if propagate:
+        msdf.propagate()
+    return msdf


 def _get_prefix_map_and_metadata(
@@ -539,6 +556,7 @@ def parse_alignment_xml(
     prefix_map: ConverterHint = None,
     meta: Optional[MetadataType] = None,
     mapping_predicates: Optional[List[str]] = None,
+    propagate: bool = False,
 ) -> MappingSetDataFrame:
     """Parse a TSV -> MappingSetDocument -> MappingSetDataFrame."""
     raise_for_bad_path(file_path)
@@ -552,6 +570,8 @@
         meta=meta,
         mapping_predicates=mapping_predicates,
     )
+    if propagate:
+        msdf.propagate()
     return msdf
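The commit message explains why even `parse_obographs_json` and `parse_alignment_xml` accept a `propagate` parameter they barely need: all parsers must share one signature so `get_parsing_function` can dispatch on input format and call the result uniformly. A hypothetical miniature of that pattern (the toy parsers and their return values are invented for illustration):

```python
# Uniform-signature dispatch: every parser accepts `propagate`, even when it
# is effectively a no-op, so the caller never needs format-specific branches.

def parse_table(path: str, propagate: bool = True) -> dict:
    # TSV sets can be condensed; honour the flag.
    return {"path": path, "condensed": not propagate}

def parse_obographs(path: str, propagate: bool = False) -> dict:
    # An OBOGraph-JSON set can never be condensed; `propagate` only keeps
    # the signature uniform with the other parsers.
    return {"path": path, "condensed": False}

PARSERS = {"tsv": parse_table, "obographs-json": parse_obographs}

def get_parsing_function(input_format: str):
    """Return the parser for the given format; all parsers share a signature."""
    return PARSERS[input_format]

parse = get_parsing_function("tsv")
msdf = parse("mappings.tsv", propagate=True)
assert msdf["condensed"] is False
```

The alternative, special-casing which parsers take which keyword arguments at the call site, is exactly what the uniform parameter avoids.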

src/sssom/util.py

Lines changed: 46 additions & 25 deletions

@@ -342,31 +342,7 @@ def propagate(self, fill_empty: bool = False) -> List[str]:

         :returns: The list of slots that were effectively propagated.
         """
-        schema = SSSOMSchemaView()
-        propagated = []
-
-        for slot in schema.propagatable_slots:
-            if slot not in self.metadata:  # Nothing to propagate
-                continue
-            is_present = slot in self.df.columns
-            if is_present and not fill_empty:
-                logging.warning(
-                    f"Not propagating value for '{slot}' because the slot is already set on individual records."
-                )
-                continue
-
-            if schema.view.get_slot(slot).multivalued:
-                value = "|".join(self.metadata.pop(slot))
-            else:
-                value = self.metadata.pop(slot)
-
-            if is_present:
-                self.df.loc[self.df[slot].eq("") | self.df[slot].isna(), slot] = value
-            else:
-                self.df[slot] = value
-            propagated.append(slot)
-
-        return propagated
+        return propagate_condensed_slots(self.df, self.metadata, fill_empty)

     def condense(self) -> List[str]:
         """Condense record-level slot values to the set whenever possible.
@@ -1306,6 +1282,51 @@ def deal_with_negation(df: pd.DataFrame) -> pd.DataFrame:
     return return_df


+def propagate_condensed_slots(
+    df: pd.DataFrame, metadata: MetadataType, fill_empty: bool = False
+) -> List[str]:
+    """Propagate slot values from the set level down to individual records.
+
+    This function performs the same operation as the `MappingSetDataFrame#propagate()`
+    method. It is intended to allow propagating a mapping set before an instance of
+    the `MappingSetDataFrame` class can be obtained.
+
+    :param df: The DataFrame into which values should be propagated.
+    :param metadata: The dictionary of set-level metadata.
+    :param fill_empty: If True, propagation of a slot is allowed even if some individual records
+        already have a value for that slot. The set-level value will be propagated to all the
+        records for which the slot is empty. Note that (1) this is not spec-compliant behaviour,
+        and (2) this makes the operation non-reversible by a subsequent condensation.
+
+    :returns: The list of slots that were effectively propagated.
+    """
+    schema = SSSOMSchemaView()
+    propagated = []
+
+    for slot in schema.propagatable_slots:
+        if slot not in metadata:  # Nothing to propagate
+            continue
+        is_present = slot in df.columns
+        if is_present and not fill_empty:
+            logging.warning(
+                f"Not propagating value for '{slot}' because the slot is already set on individual records."
+            )
+            continue
+
+        if schema.view.get_slot(slot).multivalued:
+            value = "|".join(metadata.pop(slot))
+        else:
+            value = metadata.pop(slot)
+
+        if is_present:
+            df.loc[df[slot].eq("") | df[slot].isna(), slot] = value
+        else:
+            df[slot] = value
+        propagated.append(slot)
+
+    return propagated
+
+
 ExtensionLiteral = Literal["tsv", "csv"]
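The extracted `propagate_condensed_slots` in the diff above depends on `SSSOMSchemaView` and the project's metadata types. A self-contained, simplified variant can be run directly; here the list of propagatable slots is passed in as a parameter (an assumption made for the sketch, replacing the schema lookup, and multivalued slots are not handled):

```python
import pandas as pd

def propagate_condensed_slots(df, metadata, slots, fill_empty=False):
    """Copy set-level slot values into DataFrame columns (simplified sketch).

    The real function derives `slots` from the SSSOM schema and joins
    multivalued slots with '|'; this version takes the slot list directly.
    """
    propagated = []
    for slot in slots:
        if slot not in metadata:  # nothing to propagate
            continue
        is_present = slot in df.columns
        if is_present and not fill_empty:
            continue  # the real code logs a warning here
        value = metadata.pop(slot)
        if is_present:
            # fill only empty/missing cells (fill_empty mode)
            df.loc[df[slot].eq("") | df[slot].isna(), slot] = value
        else:
            df[slot] = value
        propagated.append(slot)
    return propagated

df = pd.DataFrame({"subject_id": ["A:1", "A:2"]})
meta = {"object_type": "rdfs literal"}

done = propagate_condensed_slots(df, meta, ["object_type"])
assert done == ["object_type"]
assert list(df["object_type"]) == ["rdfs literal", "rdfs literal"]
assert "object_type" not in meta  # slot moved out of the set-level metadata
```

This also shows why the helper matters for `parse_sssom_table`: it needs only a raw DataFrame and a metadata dict, so it can run before a MappingSetDataFrame is constructed.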
