Member

@fealho fealho commented Oct 6, 2025

CU-86b6fugf3, Resolve #2661
CU-86b6fugf3, Resolve #2663
CU-86b6xkdcf, Resolve #2686
CU-86b6xjuk7, Resolve #2685
CU-86b6xjugh, Resolve #2684
CU-86b6xrcb1, Resolve #2689
CU-86b6xrqf0, Resolve #2691
CU-86b6xnx26, Resolve #2687
CU-86b6xp7a0, Resolve #2688
CU-86b6xrcah, Resolve #2690

codecov bot commented Oct 6, 2025

Codecov Report

❌ Patch coverage is 96.31336% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.10%. Comparing base (7a720b2) to head (78fee6b).

Files with missing lines    Patch %    Lines
sdv/datasets/demo.py        96.27%     8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2705      +/-   ##
==========================================
- Coverage   98.15%   98.10%   -0.06%     
==========================================
  Files          74       74              
  Lines        7826     7989     +163     
==========================================
+ Hits         7682     7838     +156     
- Misses        144      151       +7     
Flag           Coverage Δ
integration    75.94% <76.03%> (-0.39%) ⬇️
unit           96.84% <92.62%> (-0.12%) ⬇️


@fealho fealho marked this pull request as ready for review October 16, 2025 16:44
@fealho fealho requested a review from a team as a code owner October 16, 2025 16:44
@fealho fealho requested review from sarahmish and removed request for a team October 16, 2025 16:44
@fealho fealho requested review from amontanez24 and pvk-developer and removed request for sarahmish October 16, 2025 16:45
_validate_modalities(modality)
if output_filepath is not None and not str(output_filepath).endswith('.txt'):
fname = (filename or '').lower()
file_type = 'README' if 'readme' in fname else 'source'
Member

Member Author

In the S3 folder they are lowercase. @amontanez24, which one should we follow?

Contributor

Can you post on team engineering? The docs do say capital, but we would have to run a script to rename all the files. Let's see what Neha thinks.

Member Author

@pvk-developer @amontanez24 The code is already case-insensitive. As for the comment above, it's just a matter of which error message we want to show. I'll keep it as is if that's alright.
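
To illustrate the case-insensitivity point, here is a minimal sketch built from the two diff lines quoted above. The wrapper name `classify_text_file` is hypothetical and exists only for this example; the body mirrors the quoted lines and is not the PR's actual function.

```python
def classify_text_file(filename):
    """Return 'README' or 'source' for a filename (hypothetical wrapper)."""
    # Lowercasing first makes the substring check case-insensitive;
    # `or ''` guards against filename being None.
    fname = (filename or '').lower()
    return 'README' if 'readme' in fname else 'source'


for name in ('README.txt', 'readme.txt', 'SOURCE.txt', None):
    print(name, '->', classify_text_file(name))
```

Under this logic the capitalization of the file stored in S3 does not affect matching; only the label used in the message ('README' vs. 'source') is fixed.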

Comment on lines 389 to 421
try:
    raw = _get_data_from_bucket(yaml_key)
    info = yaml.safe_load(raw) or {}

    size_mb_val = info.get('dataset-size-mb')
    try:
        size_mb = float(size_mb_val) if size_mb_val is not None else np.nan
    except (ValueError, TypeError):
        LOGGER.info(
            f'Invalid dataset-size-mb {size_mb_val} for dataset '
            f'{dataset_name}; defaulting to NaN.'
        )
        size_mb = np.nan

    num_tables_val = info.get('num-tables', np.nan)
    if isinstance(num_tables_val, str):
        try:
            num_tables_val = float(num_tables_val)
        except (ValueError, TypeError):
            LOGGER.info(
                f'Could not cast num_tables_val {num_tables_val} to float for '
                f'dataset {dataset_name}; defaulting to NaN.'
            )
            num_tables_val = np.nan

    try:
        num_tables = int(num_tables_val) if not pd.isna(num_tables_val) else np.nan
    except (ValueError, TypeError):
        LOGGER.info(
            f'Invalid num-tables {num_tables_val} for '
            f'dataset {dataset_name} when parsing as int.'
        )
        num_tables = np.nan
Member

This could be broken down into smaller functions that each do one thing, in a similar fashion to the download.

Also, avoid nesting a try/except inside another try/except, especially around such a large block. What if the tables_info['size_MB'] list was not defined? Would we know which part of this block failed?

            raw = _get_data_from_bucket(yaml_key)
            info = yaml.safe_load(raw) or {}
            size_mb_val = info.get('dataset-size-mb')
            ......
            tables_info['dataset_name'].append(dataset_name)
            tables_info['size_MB'].append(size_mb)
            tables_info['num_tables'].append(num_tables)

@fealho fealho requested a review from pvk-developer October 17, 2025 16:59
@fealho fealho force-pushed the feature-branch-download-demo branch from 80cfbdb to 78fee6b Compare October 17, 2025 17:02