Skip to content

Conversation

cherukum-Amazon
Copy link

@cherukum-Amazon cherukum-Amazon commented Oct 16, 2025

  1. Backport bug fixes to Valkey 9.0

List of commits backported -

Valkey 9.0 (view)

  1. Update version.h
  2. Update release notes to prepare for Valkey 9.0 GA release

@cherukum-Amazon cherukum-Amazon marked this pull request as draft October 16, 2025 23:18
Copy link

codecov bot commented Oct 16, 2025

Codecov Report

❌ Patch coverage is 94.96403% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.44%. Comparing base (6700272) to head (b556657).

Files with missing lines Patch % Lines
src/replication.c 25.00% 3 Missing ⚠️
src/scripting_engine.c 62.50% 3 Missing ⚠️
src/functions.c 96.87% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              9.0    #2744      +/-   ##
==========================================
- Coverage   72.58%   72.44%   -0.15%     
==========================================
  Files         128      128              
  Lines       71241    71298      +57     
==========================================
- Hits        51709    51649      -60     
- Misses      19532    19649     +117     
Files with missing lines Coverage Δ
src/blocked.c 91.32% <100.00%> (ø)
src/cluster_migrateslots.c 92.07% <ø> (+0.44%) ⬆️
src/db.c 93.24% <100.00%> (ø)
src/eval.c 89.23% <100.00%> (ø)
src/kvstore.c 96.01% <100.00%> (ø)
src/lazyfree.c 92.56% <100.00%> (+6.17%) ⬆️
src/lua/engine_lua.c 88.12% <100.00%> (-0.22%) ⬇️
src/lua/function_lua.c 98.24% <100.00%> (-0.46%) ⬇️
src/networking.c 88.46% <100.00%> (+0.14%) ⬆️
src/server.h 100.00% <ø> (ø)
... and 4 more

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@cherukum-Amazon cherukum-Amazon marked this pull request as ready for review October 16, 2025 23:39
@cherukum-Amazon
Copy link
Author

@valkey-io/core-team

@zuiderkwast zuiderkwast added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Oct 17, 2025
@rjd15372
Copy link
Member

@cherukum-Amazon please include #1826, which will be merged today.

@hpatro hpatro requested a review from a team October 17, 2025 17:48
@hpatro
Copy link
Collaborator

hpatro commented Oct 17, 2025

@cherukum-Amazon Could you also bring this in #2741 ?

00-RELEASENOTES Outdated
Comment on lines 17 to 19
Upgrade urgency LOW: This is the ga release candidate of Valkey 9.0.0,
focused on stability, bug fixes, and incremental improvements.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the wording for 8.0 GA. We could just have the same structure.

Suggested change
Upgrade urgency LOW: This is the ga release candidate of Valkey 9.0.0,
focused on stability, bug fixes, and incremental improvements.
Upgrade urgency LOW: This is the first release of Valkey 9.0 which
includes stability, bug fixes, and incremental improvements over the third release candidate.

00-RELEASENOTES Outdated
Comment on lines 20 to 23
## Bug fixes
* HSETEX with FXX should not create an object if it does not exist (#2716)
* Fix crash when aborting a slot migration while child snapshot is active (#2721)
* Fix double MOVED reply on unblock at failover (#2734)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the complete set. Needs to be updated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be added #2362

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated per comment.

@cherukum-Amazon
Copy link
Author

@cherukum-Amazon please include #1826, which will be merged today.

This is complete.

Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update version.h file as well. https://github.com/valkey-io/valkey/blob/9.0/src/version.h#L13

@cherukum-Amazon
Copy link
Author

We need to update version.h file as well. https://github.com/valkey-io/valkey/blob/9.0/src/version.h#L13

0c9f48a - this commit has the change, is there anything missing?.

@hpatro
Copy link
Collaborator

hpatro commented Oct 18, 2025

We need to update version.h file as well. https://github.com/valkey-io/valkey/blob/9.0/src/version.h#L13

0c9f48a - this commit has the change, is there anything missing?.

Weird, I don't see it in the file diff section on Github.

@murphyjacob4
Copy link
Contributor

Sorry it's last minute - can we add #2749 ?

@cherukum-Amazon
Copy link
Author

Sorry it's last minute - can we add #2749 ?

Done.

cherukum-Amazon and others added 12 commits October 18, 2025 18:14
Signed-off-by: cherukum-amazon <[email protected]>
With valkey-io#1401, we introduced additional filters to CLIENT LIST/KILL
subcommand. The intended behavior was to pick the last value of the
filter. However, we introduced memory leak for all the preceding
filters.

Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    valkey-io#1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    valkey-io#2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    valkey-io#3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    valkey-io#4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    valkey-io#5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    valkey-io#6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    valkey-io#7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    valkey-io#8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    valkey-io#9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    valkey-io#10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    valkey-io#11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    valkey-io#12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    valkey-io#13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    valkey-io#14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    valkey-io#15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    valkey-io#16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```

Note: For filter ID / NOT-ID we group all the option and perform
filtering whereas for remaining filters we only pick the last filter
option.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
(cherry picked from commit 155b0bb)
Signed-off-by: cherukum-amazon <[email protected]>
* Add cross version compatibility test to run with Valkey 7.2 and 8.0
* Add mechanism in TCL test to skip tests dynamically - valkey-io#2711

---------

Signed-off-by: Harkrishn Patro <[email protected]>
Signed-off-by: Harkrishn Patro <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
(cherry picked from commit 18214be)
Signed-off-by: cherukum-amazon <[email protected]>
…key-io#2716)

When the hash object does not exist FXX should simply fail the check
without creating the object while FNX should be trivial and succeed.

Note - also fix a potential compilation warning on some COMPILERS doing
constant folding of variable length array when size is const expression.

Signed-off-by: Ran Shidlansik <[email protected]>
(cherry picked from commit 8182f4a)
Signed-off-by: cherukum-amazon <[email protected]>
…child snapshot is active (valkey-io#2721)

The race condition causes the client to be used and subsequently double
freed by the slot migration read pipe handler. The order of events is:

1. We kill the slot migration child process during CANCELSLOTMIGRATIONS
1. We then free the associated client to the target node
1. Although we kill the child process, it is not guaranteed that the
pipe will be empty from child to parent
1. If the pipe is not empty, we later will read that out in the
slotMigrationPipeReadHandler
1. In the pipe read handler, we attempt to write to the connection. If
writing to the connection fails, we will attempt to free the client
1. However, the client was already freed, so this a double free

Notably, the slot migration being aborted doesn't need to be triggered
by `CANCELSLOTMIGRATIONS`, it can be any failure.

To solve this, we simply:

1. Set the slot migration pipe connection to NULL whenever it is
unlinked
2. Bail out early in slot migration pipe read handler if the connection
is NULL

I also consolidate the killSlotMigrationChild call to one code path,
which is executed on client unlink. Before, there were two code paths
that would do this twice (once on slot migration job finish, and once on
client unlink). Sending the signal twice is fine, but inefficient.

Also, add a test to cancel during the slot migration snapshot to make
sure this case is covered (we only caught it during the module test).

---------

Signed-off-by: Jacob Murphy <[email protected]>
(cherry picked from commit 28e5dcc)
Signed-off-by: cherukum-amazon <[email protected]>
This test was failing, and causing the next test to throw an exception.
It is failing since we never waited for the slot migration to connect
before proceeding.

Fixes valkey-io#2692

Signed-off-by: Jacob Murphy <[email protected]>
(cherry picked from commit 19474c8)
Signed-off-by: cherukum-amazon <[email protected]>
DEBUG LOADAOF sometimes works but it results in -LOADING responses to
the primary so there are lots of race conditions. It isn't something we
should be doing anyways. To test, I just disconnect the replica before
loading the AOF, then reconnect it.

Fixes valkey-io#2712

Signed-off-by: Jacob Murphy <[email protected]>
(cherry picked from commit dbcf022)
Signed-off-by: cherukum-amazon <[email protected]>
valkey-io#2329 intoduced a bug that causes a blocked client in cluster mode to
receive two MOVED redirects instead of one. This was not seen in tests,
except in the reply schema validator.

The fix makes sure the client's pending command is cleared after sending
the MOVED redirect, to prevent if from being reprocessed.

Fixes valkey-io#2676.

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
(cherry picked from commit 54da834)
Signed-off-by: cherukum-amazon <[email protected]>
…load crash (valkey-io#1826)

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing
FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```

The second one is there is a crash when loading the functions during the
async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not
trigger the gc,
it causes us to not be able to effectively reclaim memory after the
FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION
FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash
because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like
SCRIPT FLUSH.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ricardo Dias <[email protected]>
Co-authored-by: Ricardo Dias <[email protected]>
(cherry picked from commit b4c93cc)
Signed-off-by: cherukum-amazon <[email protected]>
increased the wait time to a total of 10 seconds where we check the log
for `Done loading RDB` message

Fixes valkey-io#2694

CI run (100 times):
https://github.com/roshkhatri/valkey/actions/runs/18576201712/job/52961907806

Signed-off-by: Roshan Khatri <[email protected]>
(cherry picked from commit 898172b)
Signed-off-by: cherukum-amazon <[email protected]>
Signed-off-by: Harkrishn Patro <[email protected]>
(cherry picked from commit 95154fe)
Signed-off-by: cherukum-amazon <[email protected]>
…on (valkey-io#2749)

When working on valkey-io#2635 I errorneously duplicated the
setSlotImportingStateInAllDbs call for successful imports. This resulted
in us doubling the key count in the kvstore. This results in DBSIZE
reporting an incorrect sum, and also causes BIT corruption that can
eventually result in a crash.

The solution is:

1. Only call setSlotImportingStateInAllDbs once (in our
finishSlotMigrationJob function)
2. Make setSlotImportingStateInAllDbs idempotent by checking if the
delete from the kvstore importing hashtable is a no-op

This also fixes a bug where the number of importing keys is not lowered
after the migration, but this is less critical since it is only used
when resizing the dictionary on RDB load. However, it could result in
un-loadable RDBs if the importing key count gets large enough.

Signed-off-by: Jacob Murphy <[email protected]>
(cherry picked from commit 2a914aa)
Signed-off-by: cherukum-amazon <[email protected]>
Signed-off-by: cherukum-amazon <[email protected]>

Update Release notes for 9.0.0-GA

Signed-off-by: cherukum-amazon <[email protected]>

Signed-off-by: cherukum-amazon <[email protected]>
@rjd15372
Copy link
Member

Another late bug fix: #2750

Can we get it in?

* Fix crash when aborting a slot migration while child snapshot is active (#2721)
* Fix double MOVED reply on unblock at failover (#2734)
* Fix memory leak with CLIENT LIST/KILL duplicate filters (#2362)
* Fix incorrect kvstore size and BIT accounting after completed migration (#2749)
Copy link
Contributor

@zuiderkwast zuiderkwast Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is hard to understand. I'm suggesting a change.

Also I think we should add #1826 and #2750 (together).

Suggested change
* Fix incorrect kvstore size and BIT accounting after completed migration (#2749)
* Fix incorrect accounting after completed atomic slot migration (#2749)
* Fix Lua VM crash after FUNCTION FLUSH ASYNC + FUNCTION LOAD (#1826, #2750)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants