
Conversation


@Al4ise Al4ise commented Oct 29, 2025

Description by Korbit AI

What change is being made?

Add a new AWS Lambda entrypoint (aws/backtest_data_lambda.py) that populates the LumiBot backtest cache by fetching data from ThetaData, normalising it to the local cache format, and uploading it to S3, together with supporting helper changes and tests that wire up remote cache requests and payload contracts.

Why are these changes being made?

Introduce a working path to populate and synchronize backtest cache data from ThetaData into S3, enabling cache misses to be fulfilled remotely while keeping the deployment lightweight and the payload contract provider-agnostic. The changes also expand test coverage and mocks to validate remote cache interactions and payload contents.



@korbit-ai korbit-ai bot left a comment


Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category | Issue
Performance | Inefficient nested loop with repeated filtering
Documentation | Incomplete internal function docstring
Functionality | Precision 0 excluded from payload
Readability | Unstructured column drop list
Readability | Unclear null check logic
Performance | Inefficient row-wise apply operation
Files scanned

aws/backtest_data_lambda.py
lumibot/tools/thetadata_helper.py


Comment on lines +347 to +350
for idx in df.index:
    day_minutes = minute_df[minute_df["date"] == idx.date()]
    if not day_minutes.empty and "open" in df.columns:
        df.loc[idx, "open"] = day_minutes.iloc[0]["open"]

Inefficient nested loop with repeated filtering (category: Performance)
What is the issue?

Nested loop with repeated DataFrame filtering operations has O(n*m) time complexity, where filtering is performed for each EOD row against the entire minute DataFrame.

Why this matters

This approach scales poorly with large datasets, as each iteration performs a full scan of the minute DataFrame, leading to quadratic time complexity and excessive Lambda execution time.

Suggested change

Use groupby and merge operations for better performance:

if not minute_df.empty and "open" in df.columns:
    first_minutes = minute_df.groupby("date").first()["open"]
    df["open"] = df.index.map(lambda x: first_minutes.get(x.date(), df.loc[x, "open"]))

Or use a more efficient merge-based approach:

if not minute_df.empty and "open" in df.columns:
    df_with_date = df.copy()
    df_with_date["date"] = df_with_date.index.date
    first_minutes = minute_df.groupby("date").first().reset_index()
    merged = df_with_date.merge(first_minutes[["date", "open"]], on="date", how="left", suffixes=("", "_minute"))
    merged.index = df.index  # merge resets the index; restore it so alignment with df works
    df["open"] = merged["open_minute"].fillna(df["open"])
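As a concrete check, here is a self-contained toy run of the merge-based approach (the frame shapes and values are invented for illustration); note that pandas `merge` resets the index, so it has to be restored before assigning back into the EOD frame:

```python
import pandas as pd

# Toy frames: an EOD frame indexed by trading day, and a minute frame
# carrying a plain "date" column (values invented for illustration).
eod_index = pd.to_datetime(["2025-10-28", "2025-10-29"])
df = pd.DataFrame({"open": [100.0, 200.0]}, index=eod_index)
minute_df = pd.DataFrame({
    "date": [eod_index[0].date(), eod_index[0].date(), eod_index[1].date()],
    "open": [101.0, 102.0, 201.5],
})

# One groupby plus one merge replaces the per-row filtering loop.
first_minutes = minute_df.groupby("date").first().reset_index()
df_with_date = df.copy()
df_with_date["date"] = df_with_date.index.date
merged = df_with_date.merge(first_minutes[["date", "open"]], on="date", how="left", suffixes=("", "_minute"))
merged.index = df.index  # merge resets the index; restore it so assignment aligns
df["open"] = merged["open_minute"].fillna(df["open"])
print(df["open"].tolist())  # [101.0, 201.5]
```

Each day's EOD open is overwritten by the first minute bar of that day, with `fillna` keeping the original value for days that have no minute data.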

Comment on lines +33 to +34
def _normalize_datetime_for_payload(value: DatetimeLike) -> str:
"""Return a UTC ISO-8601 string for the provided date/datetime."""

Incomplete internal function docstring (category: Documentation)
What is the issue?

The docstring only describes what the function returns, but not what it accepts or why it's used.

Why this matters

Without understanding the purpose and input requirements, developers may misuse the function or have to read the implementation to understand its constraints.

Suggested change

"""Convert a date/datetime to UTC ISO-8601 string for API payloads.

Args:
    value: A date or datetime object to convert

Returns:
    str: UTC ISO-8601 formatted string
"""

if hasattr(asset, "multiplier") and asset.multiplier != 1:
    payload["multiplier"] = asset.multiplier

if hasattr(asset, "precision") and asset.precision:

Precision 0 excluded from payload (category: Functionality)
What is the issue?

The condition checks if asset.precision is truthy, which will fail for precision=0 (a valid precision value).

Why this matters

When precision is 0, the condition evaluates to False and the precision won't be included in the payload, potentially causing incorrect data handling for assets that legitimately have 0 precision.

Suggested change

Change the condition to explicitly check for None:

if hasattr(asset, "precision") and asset.precision is not None:
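A minimal repro of the pitfall, using a stand-in Asset class (the real lumibot Asset has more fields than this):

```python
# Stand-in Asset class for illustration only.
class Asset:
    def __init__(self, precision):
        self.precision = precision

asset = Asset(precision=0)

payload = {}
if hasattr(asset, "precision") and asset.precision:  # buggy: 0 is falsy
    payload["precision"] = asset.precision
print(payload)  # {} -- a legitimate precision of 0 is silently dropped

payload = {}
if hasattr(asset, "precision") and asset.precision is not None:  # fixed
    payload["precision"] = asset.precision
print(payload)  # {'precision': 0}
```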

Comment on lines +316 to +328
drop_cols = [
"ms_of_day",
"ms_of_day2",
"date",
"bid_size",
"bid_exchange",
"bid",
"bid_condition",
"ask_size",
"ask_exchange",
"ask",
"ask_condition",
]

Unstructured column drop list (category: Readability)
What is the issue?

The drop_cols list mixes different categories of columns (time-related and bid/ask-related) without visual separation or grouping.

Why this matters

Mixed categories make it harder to understand the purpose of each column being dropped and to maintain the list over time.

Suggested change
time_cols = ["ms_of_day", "ms_of_day2", "date"]
bid_cols = ["bid_size", "bid_exchange", "bid", "bid_condition"]
ask_cols = ["ask_size", "ask_exchange", "ask", "ask_condition"]

drop_cols = [
    *time_cols,
    *bid_cols,
    *ask_cols,
]


responses.append(payload.get("response", []))
next_page = header.get("next_page") if header else None
if not next_page or next_page in ("", "null"):

Unclear null check logic (category: Readability)
What is the issue?

The condition mixes Python None check with string-based null checks in a non-obvious way.

Why this matters

The mix of falsy check and explicit string comparisons makes the intent of valid/invalid next_page values unclear and could lead to maintenance issues.

Suggested change
INVALID_NEXT_PAGE = {"", "null", None}
if next_page in INVALID_NEXT_PAGE:
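To show the sentinel set in context, here is a hedged sketch of the pagination loop it serves; `collect_pages` and `fetch_page` are hypothetical stand-ins for the real ThetaData request helpers, which are not part of this diff:

```python
# Hypothetical pagination loop; fetch_page is a stand-in for the real
# ThetaData request helper and returns a (header, payload) pair.
INVALID_NEXT_PAGE = {"", "null", None}

def collect_pages(fetch_page, first_url):
    responses = []
    url = first_url
    while url not in INVALID_NEXT_PAGE:
        header, payload = fetch_page(url)
        responses.append(payload.get("response", []))
        url = header.get("next_page") if header else None
    return responses

# Two fake pages: the second signals the end with the "null" string.
pages = {
    "p1": ({"next_page": "p2"}, {"response": [1, 2]}),
    "p2": ({"next_page": "null"}, {"response": [3]}),
}
print(collect_pages(lambda url: pages[url], "p1"))  # [[1, 2], [3]]
```

The single membership test covers the `None` returned on a missing header as well as the empty and "null" strings the API can emit.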

if df.empty:
    return df

df["datetime"] = df.apply(_combine_date_ms, axis=1)

Inefficient row-wise apply operation (category: Performance)
What is the issue?

Using DataFrame.apply() with axis=1 for row-wise operations is significantly slower than vectorized operations, especially for large datasets.

Why this matters

The apply() function with axis=1 iterates through each row individually, which can be orders of magnitude slower than vectorized pandas operations for large DataFrames, impacting Lambda execution time and costs.

Suggested change

Replace with vectorized operations:

base_dates = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
ms_deltas = pd.to_timedelta(df["ms_of_day"], unit="ms")
df["datetime"] = base_dates + ms_deltas
