Caching S3 WIP #882
Conversation
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
| Issue | Status |
|---|---|
| Inefficient nested loop with repeated filtering | |
| Incomplete internal function docstring | |
| Precision 0 excluded from payload | |
| Unstructured column drop list | |
| Unclear null check logic | |
| Inefficient row-wise apply operation | |
Files scanned
| File Path | Reviewed |
|---|---|
| aws/backtest_data_lambda.py | ✅ |
| lumibot/tools/thetadata_helper.py | ✅ |
for idx in df.index:
    day_minutes = minute_df[minute_df["date"] == idx.date()]
    if not day_minutes.empty and "open" in df.columns:
        df.loc[idx, "open"] = day_minutes.iloc[0]["open"]
Inefficient nested loop with repeated filtering 
What is the issue?
The nested loop re-filters the entire minute DataFrame for every EOD row, giving O(n*m) time complexity.
Why this matters
This approach scales poorly with large datasets, as each iteration performs a full scan of the minute DataFrame, leading to quadratic time complexity and excessive Lambda execution time.
Suggested change
Use groupby and merge operations for better performance:
if not minute_df.empty and "open" in df.columns:
    first_minutes = minute_df.groupby("date").first()["open"]
    df["open"] = df.index.map(lambda x: first_minutes.get(x.date(), df.loc[x, "open"]))

Or use a more efficient merge-based approach:

if not minute_df.empty and "open" in df.columns:
    df_with_date = df.copy()
    df_with_date["date"] = df_with_date.index.date
    first_minutes = minute_df.groupby("date").first().reset_index()
    merged = df_with_date.merge(first_minutes[["date", "open"]], on="date", how="left", suffixes=("", "_minute"))
    merged.index = df.index  # realign: the merge resets the index, so fillna would not match df otherwise
    df["open"] = merged["open_minute"].fillna(df["open"])
def _normalize_datetime_for_payload(value: DatetimeLike) -> str:
    """Return a UTC ISO-8601 string for the provided date/datetime."""
Incomplete internal function docstring 
What is the issue?
The docstring only describes what the function returns, but not what it accepts or why it's used.
Why this matters
Without understanding the purpose and input requirements, developers may misuse the function or have to read the implementation to understand its constraints.
Suggested change
"""Convert a date/datetime to UTC ISO-8601 string for API payloads.
Args:
value: A date or datetime object to convert
Returns:
str: UTC ISO-8601 formatted string
"""
if hasattr(asset, "multiplier") and asset.multiplier != 1:
    payload["multiplier"] = asset.multiplier

if hasattr(asset, "precision") and asset.precision:
Precision 0 excluded from payload 
What is the issue?
The condition checks if asset.precision is truthy, which will fail for precision=0 (a valid precision value).
Why this matters
When precision is 0, the condition evaluates to False and the precision won't be included in the payload, potentially causing incorrect data handling for assets that legitimately have 0 precision.
Suggested change
Change the condition to explicitly check for None:
if hasattr(asset, "precision") and asset.precision is not None:
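A quick illustration of the pitfall, using a hypothetical asset object; only the truthiness behaviour is the point here.

from types import SimpleNamespace

# Hypothetical asset with a legitimate precision of 0 (whole-unit pricing).
asset = SimpleNamespace(precision=0)
payload = {}

if hasattr(asset, "precision") and asset.precision:              # falsy check: skips precision=0
    payload["precision"] = asset.precision
print(payload)  # {}

if hasattr(asset, "precision") and asset.precision is not None:  # explicit None check: keeps 0
    payload["precision"] = asset.precision
print(payload)  # {'precision': 0}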
drop_cols = [
    "ms_of_day",
    "ms_of_day2",
    "date",
    "bid_size",
    "bid_exchange",
    "bid",
    "bid_condition",
    "ask_size",
    "ask_exchange",
    "ask",
    "ask_condition",
]
Unstructured column drop list 
What is the issue?
The drop_cols list mixes different categories of columns (time-related and bid/ask-related) without visual separation or grouping.
Why this matters
Mixed categories make it harder to understand the purpose of each column being dropped and to maintain the list over time.
Suggested change ∙ Feature Preview
time_cols = ["ms_of_day", "ms_of_day2", "date"]
bid_cols = ["bid_size", "bid_exchange", "bid", "bid_condition"]
ask_cols = ["ask_size", "ask_exchange", "ask", "ask_condition"]
drop_cols = [
*time_cols,
*bid_cols,
*ask_cols,
]Provide feedback to improve future suggestions
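For illustration, a minimal sketch of how the grouped list could be applied; the strip_quote_columns name and the errors="ignore" guard are assumptions, added in case some columns are absent from a given ThetaData response.

import pandas as pd

time_cols = ["ms_of_day", "ms_of_day2", "date"]
bid_cols = ["bid_size", "bid_exchange", "bid", "bid_condition"]
ask_cols = ["ask_size", "ask_exchange", "ask", "ask_condition"]
drop_cols = [*time_cols, *bid_cols, *ask_cols]


def strip_quote_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Drop time bookkeeping and bid/ask quote columns before caching OHLCV data.
    # errors="ignore" tolerates responses that omit some of these columns (assumption).
    return df.drop(columns=drop_cols, errors="ignore")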
responses.append(payload.get("response", []))
next_page = header.get("next_page") if header else None
if not next_page or next_page in ("", "null"):
Unclear null check logic 
What is the issue?
The condition mixes Python None check with string-based null checks in a non-obvious way.
Why this matters
Mixing a falsy check with explicit string comparisons obscures which next_page values are treated as "no more pages" and could lead to maintenance issues.
Suggested change
INVALID_NEXT_PAGE = {"", "null", None}
if next_page in INVALID_NEXT_PAGE:
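A minimal sketch of how the constant could slot into the pagination loop; the fetch_page callable and the overall loop shape are assumptions based on the quoted lines, not the actual implementation.

INVALID_NEXT_PAGE = {"", "null", None}  # module-level: every value that means "no more pages"


def collect_all_pages(fetch_page, first_url):
    """Hypothetical pagination driver; fetch_page returns (header, payload) dicts."""
    responses = []
    url = first_url
    while True:
        header, payload = fetch_page(url)
        responses.append(payload.get("response", []))
        next_page = header.get("next_page") if header else None
        if next_page in INVALID_NEXT_PAGE:
            break
        url = next_page
    return responses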
if df.empty:
    return df

df["datetime"] = df.apply(_combine_date_ms, axis=1)
Inefficient row-wise apply operation 
What is the issue?
Using DataFrame.apply() with axis=1 for row-wise operations is significantly slower than vectorized operations, especially for large datasets.
Why this matters
The apply() function with axis=1 iterates through each row individually, which can be orders of magnitude slower than vectorized pandas operations for large DataFrames, impacting Lambda execution time and costs.
Suggested change
Replace with vectorized operations:
base_dates = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
ms_deltas = pd.to_timedelta(df["ms_of_day"], unit="ms")
df["datetime"] = base_dates + ms_deltasProvide feedback to improve future suggestions
Description by Korbit AI
What change is being made?
Add a new AWS Lambda entrypoint (aws/backtest_data_lambda.py) that populates the LumiBot backtest cache by fetching data from ThetaData, normalising it to the local cache format, and uploading it to S3, along with supporting helper changes and tests that wire up remote cache requests and payload contracts.
Why are these changes being made?
Introduce a working path to populate and synchronize backtest cache data from ThetaData into S3, enabling cache misses to be fulfilled remotely while keeping the deployment lightweight and provider-agnostic in the payload contract. The changes also expand test coverage and mocks to validate remote cache interactions and payload contents.
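To make the described flow concrete, here is a minimal sketch of what such a Lambda entrypoint could look like; every name (handler, fetch_from_thetadata, normalize_to_cache_format, the bucket environment variable, the key scheme) is a hypothetical stand-in, not the actual code in aws/backtest_data_lambda.py.

import json
import os

import boto3
import pandas as pd

s3 = boto3.client("s3")
CACHE_BUCKET = os.environ.get("BACKTEST_CACHE_BUCKET", "")  # hypothetical env var


def fetch_from_thetadata(payload: dict) -> pd.DataFrame:
    # Placeholder for the real ThetaData fetch; returns an empty frame here.
    return pd.DataFrame(columns=["datetime", "open", "high", "low", "close", "volume"])


def normalize_to_cache_format(df: pd.DataFrame) -> bytes:
    # Placeholder normalisation: serialise to the cache's on-disk format (parquet assumed).
    return df.to_parquet(index=False)


def handler(event, context):
    """Hypothetical entrypoint: fetch from ThetaData, normalise, upload to S3."""
    payload = event if isinstance(event, dict) else json.loads(event)

    raw = fetch_from_thetadata(payload)                    # 1. fetch raw bars
    cache_bytes = normalize_to_cache_format(raw)           # 2. normalise to cache format
    key = f"backtest-cache/{payload['symbol']}/{payload['timespan']}.parquet"
    s3.put_object(Bucket=CACHE_BUCKET, Key=key, Body=cache_bytes)  # 3. upload to S3

    return {"statusCode": 200, "body": json.dumps({"key": key})}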