-
Notifications
You must be signed in to change notification settings - Fork 928
Backport bug fixes to Valkey 9.0 and prepare for Valkey 9.0 GA release #2744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 9.0
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 9.0 #2744 +/- ##
==========================================
- Coverage 72.58% 72.44% -0.15%
==========================================
Files 128 128
Lines 71241 71298 +57
==========================================
- Hits 51709 51649 -60
- Misses 19532 19649 +117
🚀 New features to boost your workflow:
|
@valkey-io/core-team |
@cherukum-Amazon please include #1826, which will be merged today. |
@cherukum-Amazon Could you also bring this in #2741 ? |
00-RELEASENOTES
Outdated
Upgrade urgency LOW: This is the ga release candidate of Valkey 9.0.0, | ||
focused on stability, bug fixes, and incremental improvements. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the wording for 8.0 GA. We could just have the same structure.
Upgrade urgency LOW: This is the ga release candidate of Valkey 9.0.0, | |
focused on stability, bug fixes, and incremental improvements. | |
Upgrade urgency LOW: This is the first release of Valkey 9.0 which | |
includes stability, bug fixes, and incremental improvements over the third release candidate. |
00-RELEASENOTES
Outdated
## Bug fixes | ||
* HSETEX with FXX should not create an object if it does not exist (#2716) | ||
* Fix crash when aborting a slot migration while child snapshot is active (#2721) | ||
* Fix double MOVED reply on unblock at failover (#2734) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the complete set. Needs to be updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be added #2362
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated per comment.
This is complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to update version.h file as well. https://github.com/valkey-io/valkey/blob/9.0/src/version.h#L13
0c9f48a - this commit has the change, is there anything missing?. |
Weird, I don't see it in the file diff section on Github. |
Sorry it's last minute - can we add #2749 ? |
Done. |
Signed-off-by: cherukum-amazon <[email protected]>
With valkey-io#1401, we introduced additional filters to CLIENT LIST/KILL subcommand. The intended behavior was to pick the last value of the filter. However, we introduced memory leak for all the preceding filters. Before this change: ``` > CLIENT LIST IP 127.0.0.1 IP 127.0.0.1 id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0 ``` Leak: ``` Direct leak of 11 byte(s) in 1 object(s) allocated from: #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d) valkey-io#1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156 valkey-io#2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200 valkey-io#3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113 valkey-io#4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264 valkey-io#5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600 valkey-io#6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772 valkey-io#7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434 valkey-io#8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571 valkey-io#9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702 valkey-io#10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812 valkey-io#11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79 valkey-io#12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301 valkey-io#13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486 valkey-io#14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543 valkey-io#15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319 valkey-io#16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139) ``` Note: For filter ID / NOT-ID we group all the option and perform filtering whereas for remaining filters we only pick the last filter option. --------- Signed-off-by: Harkrishn Patro <[email protected]> (cherry picked from commit 155b0bb) Signed-off-by: cherukum-amazon <[email protected]>
* Add cross version compatibility test to run with Valkey 7.2 and 8.0 * Add mechanism in TCL test to skip tests dynamically - valkey-io#2711 --------- Signed-off-by: Harkrishn Patro <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> (cherry picked from commit 18214be) Signed-off-by: cherukum-amazon <[email protected]>
…key-io#2716) When the hash object does not exist FXX should simply fail the check without creating the object while FNX should be trivial and succeed. Note - also fix a potential compilation warning on some COMPILERS doing constant folding of variable length array when size is const expression. Signed-off-by: Ran Shidlansik <[email protected]> (cherry picked from commit 8182f4a) Signed-off-by: cherukum-amazon <[email protected]>
…child snapshot is active (valkey-io#2721) The race condition causes the client to be used and subsequently double freed by the slot migration read pipe handler. The order of events is: 1. We kill the slot migration child process during CANCELSLOTMIGRATIONS 1. We then free the associated client to the target node 1. Although we kill the child process, it is not guaranteed that the pipe will be empty from child to parent 1. If the pipe is not empty, we later will read that out in the slotMigrationPipeReadHandler 1. In the pipe read handler, we attempt to write to the connection. If writing to the connection fails, we will attempt to free the client 1. However, the client was already freed, so this a double free Notably, the slot migration being aborted doesn't need to be triggered by `CANCELSLOTMIGRATIONS`, it can be any failure. To solve this, we simply: 1. Set the slot migration pipe connection to NULL whenever it is unlinked 2. Bail out early in slot migration pipe read handler if the connection is NULL I also consolidate the killSlotMigrationChild call to one code path, which is executed on client unlink. Before, there were two code paths that would do this twice (once on slot migration job finish, and once on client unlink). Sending the signal twice is fine, but inefficient. Also, add a test to cancel during the slot migration snapshot to make sure this case is covered (we only caught it during the module test). --------- Signed-off-by: Jacob Murphy <[email protected]> (cherry picked from commit 28e5dcc) Signed-off-by: cherukum-amazon <[email protected]>
This test was failing, and causing the next test to throw an exception. It is failing since we never waited for the slot migration to connect before proceeding. Fixes valkey-io#2692 Signed-off-by: Jacob Murphy <[email protected]> (cherry picked from commit 19474c8) Signed-off-by: cherukum-amazon <[email protected]>
DEBUG LOADAOF sometimes works but it results in -LOADING responses to the primary so there are lots of race conditions. It isn't something we should be doing anyways. To test, I just disconnect the replica before loading the AOF, then reconnect it. Fixes valkey-io#2712 Signed-off-by: Jacob Murphy <[email protected]> (cherry picked from commit dbcf022) Signed-off-by: cherukum-amazon <[email protected]>
valkey-io#2329 intoduced a bug that causes a blocked client in cluster mode to receive two MOVED redirects instead of one. This was not seen in tests, except in the reply schema validator. The fix makes sure the client's pending command is cleared after sending the MOVED redirect, to prevent if from being reprocessed. Fixes valkey-io#2676. --------- Signed-off-by: Viktor Söderqvist <[email protected]> (cherry picked from commit 54da834) Signed-off-by: cherukum-amazon <[email protected]>
…load crash (valkey-io#1826) There will be two issues in this test: ``` test {FUNCTION - test function flush} { for {set i 0} {$i < 10000} {incr i} { r function load [get_function_code LUA test_$i test_$i {return 'hello'}] } set before_flush_memory [s used_memory_vm_functions] r function flush sync set after_flush_memory [s used_memory_vm_functions] puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory" for {set i 0} {$i < 10000} {incr i} { r function load [get_function_code LUA test_$i test_$i {return 'hello'}] } set before_flush_memory [s used_memory_vm_functions] r function flush async set after_flush_memory [s used_memory_vm_functions] puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory" for {set i 0} {$i < 10000} {incr i} { r function load [get_function_code LUA test_$i test_$i {return 'hello'}] } puts "Test done" } ``` The first one is the test output, we can see that after executing FUNCTION FLUSH, used_memory_vm_functions has not changed at all: ``` flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432 flush async, before_flush_memory: 4504576, after_flush_memory: 4504576 ``` The second one is there is a crash when loading the functions during the async flush: ``` === VALKEY BUG REPORT START: Cut & paste starting from here === # valkey 255.255.255 crashed by signal: 11, si_code: 2 # Accessing address: 0xe0429b7100000a3c # Crashed running the instruction at: 0x102e0b09c ------ STACK TRACE ------ EIP: 0 valkey-server 0x0000000102e0b09c luaH_getstr + 52 Backtrace: 0 libsystem_platform.dylib 0x000000018b066584 _sigtramp + 56 1 valkey-server 0x0000000102e01054 luaD_precall + 96 2 valkey-server 0x0000000102e01b10 luaD_call + 104 3 valkey-server 0x0000000102e00d1c luaD_rawrunprotected + 76 4 valkey-server 0x0000000102e01e3c luaD_pcall + 60 5 valkey-server 0x0000000102dfc630 lua_pcall + 300 6 valkey-server 0x0000000102f77770 luaEngineCompileCode + 708 7 valkey-server 0x0000000102f71f50 scriptingEngineCallCompileCode + 104 8 valkey-server 0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088 9 valkey-server 0x0000000102f70898 functionLoadCommand + 312 10 valkey-server 0x0000000102e3978c call + 416 11 valkey-server 0x0000000102e3b5b8 processCommand + 3340 12 valkey-server 0x0000000102e563cc processInputBuffer + 520 13 valkey-server 0x0000000102e55808 readQueryFromClient + 92 14 valkey-server 0x0000000102f696e0 connSocketEventHandler + 180 15 valkey-server 0x0000000102e20480 aeProcessEvents + 372 16 valkey-server 0x0000000102e4aad0 main + 26412 17 dyld 0x000000018acab154 start + 2476 ------ STACK TRACE DONE ------ ``` The reason is that, in the old implementation (introduced in 7.0), FUNCTION FLUSH use lua_unref to remove the script from lua VM. lua_unref does not trigger the gc, it causes us to not be able to effectively reclaim memory after the FUNCTION FLUSH. The other issue is that, since we don't re-create the lua VM in FUNCTION FLUSH, loading the functions during a FUNCTION FLUSH ASYNC will result a crash because lua engine state is not thread-safe. The correct solution is to re-create a new Lua VM to use, just like SCRIPT FLUSH. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Ricardo Dias <[email protected]> Co-authored-by: Ricardo Dias <[email protected]> (cherry picked from commit b4c93cc) Signed-off-by: cherukum-amazon <[email protected]>
increased the wait time to a total of 10 seconds where we check the log for `Done loading RDB` message Fixes valkey-io#2694 CI run (100 times): https://github.com/roshkhatri/valkey/actions/runs/18576201712/job/52961907806 Signed-off-by: Roshan Khatri <[email protected]> (cherry picked from commit 898172b) Signed-off-by: cherukum-amazon <[email protected]>
Signed-off-by: Harkrishn Patro <[email protected]> (cherry picked from commit 95154fe) Signed-off-by: cherukum-amazon <[email protected]>
…on (valkey-io#2749) When working on valkey-io#2635 I errorneously duplicated the setSlotImportingStateInAllDbs call for successful imports. This resulted in us doubling the key count in the kvstore. This results in DBSIZE reporting an incorrect sum, and also causes BIT corruption that can eventually result in a crash. The solution is: 1. Only call setSlotImportingStateInAllDbs once (in our finishSlotMigrationJob function) 2. Make setSlotImportingStateInAllDbs idempotent by checking if the delete from the kvstore importing hashtable is a no-op This also fixes a bug where the number of importing keys is not lowered after the migration, but this is less critical since it is only used when resizing the dictionary on RDB load. However, it could result in un-loadable RDBs if the importing key count gets large enough. Signed-off-by: Jacob Murphy <[email protected]> (cherry picked from commit 2a914aa) Signed-off-by: cherukum-amazon <[email protected]>
Signed-off-by: cherukum-amazon <[email protected]> Update Release notes for 9.0.0-GA Signed-off-by: cherukum-amazon <[email protected]> Signed-off-by: cherukum-amazon <[email protected]>
Another late bug fix: #2750 Can we get it in? |
* Fix crash when aborting a slot migration while child snapshot is active (#2721) | ||
* Fix double MOVED reply on unblock at failover (#2734) | ||
* Fix memory leak with CLIENT LIST/KILL duplicate filters (#2362) | ||
* Fix incorrect kvstore size and BIT accounting after completed migration (#2749) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is hard to understand. I'm suggesting a change.
Also I think we should add #1826 and #2750 (together).
* Fix incorrect kvstore size and BIT accounting after completed migration (#2749) | |
* Fix incorrect accounting after completed atomic slot migration (#2749) | |
* Fix Lua VM crash after FUNCTION FLUSH ASYNC + FUNCTION LOAD (#1826, #2750) |
List of commits backported -
Valkey 9.0 (view)