Caching S3 WIP #882
Conversation
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
| Issue | Status |
|---|---|
| Inefficient nested loop with repeated filtering | |
| Incomplete internal function docstring | |
| Precision 0 excluded from payload | |
| Unstructured column drop list | |
| Unclear null check logic | |
| Inefficient row-wise apply operation | |
Files scanned
| File Path | Reviewed |
|---|---|
| aws/backtest_data_lambda.py | ✅ |
| lumibot/tools/thetadata_helper.py | ✅ |
for idx in df.index:
    day_minutes = minute_df[minute_df["date"] == idx.date()]
    if not day_minutes.empty and "open" in df.columns:
        df.loc[idx, "open"] = day_minutes.iloc[0]["open"]
Inefficient nested loop with repeated filtering 
What is the issue?
The nested loop re-filters the entire minute DataFrame for every EOD row, giving O(n*m) time complexity.
Why this matters
This approach scales poorly with large datasets, as each iteration performs a full scan of the minute DataFrame, leading to quadratic time complexity and excessive Lambda execution time.
Suggested change
Use groupby and merge operations for better performance:
if not minute_df.empty and "open" in df.columns:
    first_minutes = minute_df.groupby("date").first()["open"]
    df["open"] = df.index.map(lambda x: first_minutes.get(x.date(), df.loc[x, "open"]))

Or use a more efficient merge-based approach:

if not minute_df.empty and "open" in df.columns:
    df_with_date = df.copy()
    df_with_date["date"] = df_with_date.index.date
    first_minutes = minute_df.groupby("date").first().reset_index()
    merged = df_with_date.merge(first_minutes[["date", "open"]], on="date", how="left", suffixes=("", "_minute"))
    merged.index = df.index  # realign: the merge resets the index, so fillna would not match df otherwise
    df["open"] = merged["open_minute"].fillna(df["open"])
def _normalize_datetime_for_payload(value: DatetimeLike) -> str:
    """Return a UTC ISO-8601 string for the provided date/datetime."""
Incomplete internal function docstring 
What is the issue?
The docstring only describes what the function returns, but not what it accepts or why it's used.
Why this matters
Without understanding the purpose and input requirements, developers may misuse the function or have to read the implementation to understand its constraints.
Suggested change
"""Convert a date/datetime to UTC ISO-8601 string for API payloads.
Args:
value: A date or datetime object to convert
Returns:
str: UTC ISO-8601 formatted string
"""
if hasattr(asset, "multiplier") and asset.multiplier != 1:
    payload["multiplier"] = asset.multiplier

if hasattr(asset, "precision") and asset.precision:
Precision 0 excluded from payload 
What is the issue?
The condition checks if asset.precision is truthy, which will fail for precision=0 (a valid precision value).
Why this matters
When precision is 0, the condition evaluates to False and the precision won't be included in the payload, potentially causing incorrect data handling for assets that legitimately have 0 precision.
Suggested change
Change the condition to explicitly check for None:
if hasattr(asset, "precision") and asset.precision is not None:
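A quick illustration of the pitfall, using a hypothetical asset object; only the truthiness behaviour is the point here.

from types import SimpleNamespace

# Hypothetical asset with a legitimate precision of 0 (whole-unit pricing).
asset = SimpleNamespace(precision=0)
payload = {}

if hasattr(asset, "precision") and asset.precision:              # falsy check: skips precision=0
    payload["precision"] = asset.precision
print(payload)  # {}

if hasattr(asset, "precision") and asset.precision is not None:  # explicit None check: keeps 0
    payload["precision"] = asset.precision
print(payload)  # {'precision': 0}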
drop_cols = [
    "ms_of_day",
    "ms_of_day2",
    "date",
    "bid_size",
    "bid_exchange",
    "bid",
    "bid_condition",
    "ask_size",
    "ask_exchange",
    "ask",
    "ask_condition",
]
Unstructured column drop list 
What is the issue?
The drop_cols list mixes different categories of columns (time-related and bid/ask-related) without visual separation or grouping.
Why this matters
Mixed categories make it harder to understand the purpose of each column being dropped and to maintain the list over time.
Suggested change ∙ Feature Preview
time_cols = ["ms_of_day", "ms_of_day2", "date"]
bid_cols = ["bid_size", "bid_exchange", "bid", "bid_condition"]
ask_cols = ["ask_size", "ask_exchange", "ask", "ask_condition"]
drop_cols = [
*time_cols,
*bid_cols,
*ask_cols,
]Provide feedback to improve future suggestions
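For illustration, a minimal sketch of how the grouped list could be applied; the strip_quote_columns name and the errors="ignore" guard are assumptions, added in case some columns are absent from a given ThetaData response.

import pandas as pd

time_cols = ["ms_of_day", "ms_of_day2", "date"]
bid_cols = ["bid_size", "bid_exchange", "bid", "bid_condition"]
ask_cols = ["ask_size", "ask_exchange", "ask", "ask_condition"]
drop_cols = [*time_cols, *bid_cols, *ask_cols]


def strip_quote_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Drop time bookkeeping and bid/ask quote columns before caching OHLCV data.
    # errors="ignore" tolerates responses that omit some of these columns (assumption).
    return df.drop(columns=drop_cols, errors="ignore")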
responses.append(payload.get("response", []))
next_page = header.get("next_page") if header else None
if not next_page or next_page in ("", "null"):
Unclear null check logic 
What is the issue?
The condition mixes Python None check with string-based null checks in a non-obvious way.
Why this matters
Mixing a falsy check with explicit string comparisons obscures which next_page values are treated as "no more pages" and could lead to maintenance issues.
Suggested change
INVALID_NEXT_PAGE = {"", "null", None}
if next_page in INVALID_NEXT_PAGE:
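A minimal sketch of how the constant could slot into the pagination loop; the fetch_page callable and the overall loop shape are assumptions based on the quoted lines, not the actual implementation.

INVALID_NEXT_PAGE = {"", "null", None}  # module-level: every value that means "no more pages"


def collect_all_pages(fetch_page, first_url):
    """Hypothetical pagination driver; fetch_page returns (header, payload) dicts."""
    responses = []
    url = first_url
    while True:
        header, payload = fetch_page(url)
        responses.append(payload.get("response", []))
        next_page = header.get("next_page") if header else None
        if next_page in INVALID_NEXT_PAGE:
            break
        url = next_page
    return responses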
if df.empty:
    return df

df["datetime"] = df.apply(_combine_date_ms, axis=1)
Inefficient row-wise apply operation 
What is the issue?
Using DataFrame.apply() with axis=1 for row-wise operations is significantly slower than vectorized operations, especially for large datasets.
Why this matters
The apply() function with axis=1 iterates through each row individually, which can be orders of magnitude slower than vectorized pandas operations for large DataFrames, impacting Lambda execution time and costs.
Suggested change
Replace with vectorized operations:
base_dates = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
ms_deltas = pd.to_timedelta(df["ms_of_day"], unit="ms")
df["datetime"] = base_dates + ms_deltasProvide feedback to improve future suggestions
Description by Korbit AI
What change is being made?
Add a new AWS Lambda entrypoint (aws/backtest_data_lambda.py) that populates the LumiBot backtest cache by fetching data from ThetaData, normalising it to the local cache format, and uploading it to S3, along with supporting helper changes and tests that wire up remote cache requests and payload contracts.
Why are these changes being made?
Introduce a working path to populate and synchronize backtest cache data from ThetaData into S3, enabling cache misses to be fulfilled remotely while keeping the deployment lightweight and provider-agnostic in the payload contract. The changes also expand test coverage and mocks to validate remote cache interactions and payload contents.
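To make the described flow concrete, here is a minimal sketch of what such a Lambda entrypoint could look like; every name (handler, fetch_from_thetadata, normalize_to_cache_format, the bucket environment variable, the key scheme) is a hypothetical stand-in, not the actual code in aws/backtest_data_lambda.py.

import json
import os

import boto3
import pandas as pd

s3 = boto3.client("s3")
CACHE_BUCKET = os.environ.get("BACKTEST_CACHE_BUCKET", "")  # hypothetical env var


def fetch_from_thetadata(payload: dict) -> pd.DataFrame:
    # Placeholder for the real ThetaData fetch; returns an empty frame here.
    return pd.DataFrame(columns=["datetime", "open", "high", "low", "close", "volume"])


def normalize_to_cache_format(df: pd.DataFrame) -> bytes:
    # Placeholder normalisation: serialise to the cache's on-disk format (parquet assumed).
    return df.to_parquet(index=False)


def handler(event, context):
    """Hypothetical entrypoint: fetch from ThetaData, normalise, upload to S3."""
    payload = event if isinstance(event, dict) else json.loads(event)

    raw = fetch_from_thetadata(payload)                    # 1. fetch raw bars
    cache_bytes = normalize_to_cache_format(raw)           # 2. normalise to cache format
    key = f"backtest-cache/{payload['symbol']}/{payload['timespan']}.parquet"
    s3.put_object(Bucket=CACHE_BUCKET, Key=key, Body=cache_bytes)  # 3. upload to S3

    return {"statusCode": 200, "body": json.dumps({"key": key})}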