Fast-Path String and Vector Hash Code Methods on Power #21081
Conversation
Weirdly, the baseline build is outperforming the fast path implementation on small arrays. Vectors:
Strings:
Some updated string data with compressed and uncompressed strings. Compressed:
Decompressed:
It seems like …
fyi @zl-wang
Thank you, Luke. I have one area of the code that I'm not sure about, but the rest are suggestions or things to bring to your attention.
Also, wondering if you tested the code by comparing hashcode output with and without the fast-path to make sure the hashing is the same.
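For illustration, a check along these lines (a hypothetical harness, not part of this PR; the class and names are made up) would exercise both the compressed and decompressed string layouts and compare the result against the scalar definition of the hash:

```java
// Hypothetical correctness harness (not part of this PR): compares
// String.hashCode(), which may take the JIT fast path once compiled,
// against the scalar definition h = 31*h + c.
public class HashCheck {
    static int scalarHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(42);
        for (int iter = 0; iter < 1_000_000; iter++) {
            int len = rnd.nextInt(64);
            // alternate between Latin-1-only and full-char-range strings so that
            // both the compressed and decompressed layouts are covered
            int bound = (iter % 2 == 0) ? 0x80 : 0x10000;
            StringBuilder sb = new StringBuilder(len);
            for (int i = 0; i < len; i++) {
                sb.append((char) rnd.nextInt(bound));
            }
            String s = sb.toString();
            if (s.hashCode() != scalarHash(s)) {
                throw new AssertionError("hash mismatch for length " + len);
            }
        }
    }
}
```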
// Skip header of the array
intptr_t hdrSize = TR::Compiler->om.contiguousArrayHeaderSizeInBytes();
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, valueReg, valueReg, hdrSize);
Seems to me that adding a load of the dataAddr pointer instead of adding headerSize here is the only thing holding back enabling this for OffHeap, @zl-wang. Not sure if it's worth doing separately here or together with the other platforms later.
Given that we're in the final stages of enabling OffHeap, I think deferring this would be better so as not to introduce new OffHeap issues. I added it to the OffHeap TODO list.
yes, we can simply do that to support off-heap here.
Yep, the code was tested and the values were correct for all the scenarios I could think of.
i can come back to review this later, but i would start by suggesting you read some of the vectorized fast-path implementations, e.g. String.equal/compareTo or String.indexOf. At least, you are able to handle the misaligned part much better on POWER10 (there are vector load/store with length instructions ... or look at arrayCopy helper code on POWER10).
I will clean up the commits later; so far this build is the fastest overall.
LGTM
An example of the generated instructions:
break;
default:
TR_ASSERT_FATAL(false, "Unsupported hashCodeHelper elementType");
}
need to review carefully if you need this many registers alive in parallel at the same time.
TR_ASSERT_FATAL(false, "Unsupported hashCodeHelper elementType");
}
generateTrg1Src2Instruction(cg, TR::InstOpCode::add, node, hashReg, hashReg, tempReg);
generateLabelInstruction(cg, TR::InstOpCode::b, node, endLabel);
need to understand if this sequence is optimal ... it will take some time.
if (cg->getSupportsInlineVectorizedHashCode())
   {
   resultReg = inlineVectorizedHashCode(node, cg);
   return resultReg != NULL;
why is this return conditional, whereas the previous one isn't?
at least, it sounds like it can be improved from the perspective of how many registers are used in parallel. the number of GPRs is close to the limit available on x and z (architecturally 16 in total ... 13 or 14 available at most for codegen). how many did they use in their inlining?
It seems Z used 12 registers while this implementation uses 14-19. The Z implementation uses far fewer registers because their loading facility can load from memory of arbitrary length, while the vector load on P is fixed to 128 bits. This means I can probably reduce the registers required at a cost of parallelism; not sure if that's a worthwhile trade-off.
without looking at the exact sequence, i would think it must be worth doing (minimizing register footprint), since parallelism (in core) likely is not what you think. written registers are dynamically renamed at the instruction dispatch stage, such that Read/Write and Write/Write don't cause dependencies or block parallel instruction issuing.
I am just worried because the original implementation used 2 registers, which has got to be for a good reason.
I got a new implementation with 14-16 registers (compared to 12 on Z), but the performance did dip. Compressed string:
Decompressed string (where the master build has existing fast-pathing):
In conclusion, having fewer accumulating registers does impact the performance, but it is still better than the baseline, except for decompressed strings, which already have an existing intrinsic that uses more registers.
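To make the trade-off concrete, here is a scalar sketch in plain Java (illustration only, not the generated Power code; the class and method names are made up) of how splitting the polynomial hash across independent accumulators trades one live register per lane for one more independent dependency chain. The 31^4 = 923521, 31^3 = 29791 and 31^2 = 961 factors are the same ones that appear in the multiplier tables later in the diff, and the result equals the plain h = 31*h + v[i] loop starting from 0.

```java
final class PolynomialHashSketch {
    // Scalar illustration of the accumulator-splitting idea behind the vector
    // fast path (not the generated code). Each of the 4 accumulators is scaled
    // by 31^4 = 923521 per step and carries every 4th element; the lanes are
    // combined with 31^3, 31^2, 31 and 1 only at the end.
    static int hashWith4Accumulators(int[] v) {
        int h0 = 0, h1 = 0, h2 = 0, h3 = 0;
        int i = 0;
        for (; i + 4 <= v.length; i += 4) {
            h0 = 923521 * h0 + v[i];
            h1 = 923521 * h1 + v[i + 1];
            h2 = 923521 * h2 + v[i + 2];
            h3 = 923521 * h3 + v[i + 3];
        }
        int h = 29791 * h0 + 961 * h1 + 31 * h2 + h3;
        // residual elements, same as the serial loop
        for (; i < v.length; i++) {
            h = 31 * h + v[i];
        }
        return h;
    }
}
```

Doubling the number of lanes shortens the critical multiply-add chain but needs proportionally more live accumulators, which is exactly the register-footprint versus parallelism knob being discussed.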
The Java source code:
TR::Register* hashReg = NULL;

switch (node->getChild(4)->getConstValue())
   {
other codegen(s) also used these hard-coded constants instead of symbolic constants? how about VM side, since there might be symbolic constants defined already?
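For reference, symbolic names for these codes do exist on the Java side; a sketch reproduced from memory of jdk.internal.util.ArraysSupport (JDK 21+, worth verifying against the exact JDK level being built), with a hypothetical helper added only to make the snippet self-contained:

```java
// Sketch only: the basicType codes taken by ArraysSupport.vectorizedHashCode
// (values reproduced from memory; they match the standard JVMS BasicType
// codes), so child(4) of the call node should be one of these.
final class VectorizedHashCodeTypes {
    static final int T_BOOLEAN = 4;
    static final int T_CHAR    = 5;
    static final int T_BYTE    = 8;
    static final int T_SHORT   = 9;
    static final int T_INT     = 10;

    // Hypothetical helper, not in the JDK: maps a basicType code to its element size.
    static int elementSizeInBytes(int basicType) {
        switch (basicType) {
            case T_BOOLEAN:
            case T_BYTE:  return 1;
            case T_CHAR:
            case T_SHORT: return 2;
            case T_INT:   return 4;
            default: throw new IllegalArgumentException("unsupported basicType " + basicType);
        }
    }
}
```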
hashReg = hashCodeHelper(node, cg, TR::Int32, node->getChild(3), true);
break;
}
if (hashReg != NULL)
why not take care of the children's referenceCount consistently in an evaluator, instead of doing this conditionally?
// Skip header of the array
intptr_t hdrSize = TR::Compiler->om.contiguousArrayHeaderSizeInBytes();
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, valueReg, valueReg, hdrSize);
high-level comment: the vector path should only be executed under a certain condition (i.e. a condition under which we know it is generally beneficial for performance). some more work/investigation is needed in this area to come up with the right condition. you might choose easy/straightforward over best-performing here.
1, 31, 961, 29791, 923521, 28629151, 887503681, 0x67E12CDF, 0, 0, 0, 0};
static uint32_t multiplierVectors_be32[12] = {923521, 923521, 923521, 923521,
29791, 961, 31, 1, 0, 0, 0, 0};
static uint32_t multiplierVectors_le32[12] = {923521, 923521, 923521, 923521,
this made me uneasy at least. these constants are set up as C/C++ static data in the libj9jit29.so module. what happens if that module is unloaded? better to put them in PersistentMemory like the pseudo TOC. Also, this must be the reason it is disabled for AOT and JITServer.
generateTrg1ImmInstruction(cg, TR::InstOpCode::li, node, constant0Reg, 0x0);

// using the serial loop is faster if there are less than 16 items
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::cmpi4, node, condReg, vendReg, 16*elementSize);
i naturally have questions re whether this condition is optimal (verified) for the different scenarios (byte/char/short/int).
you are still checking for no fewer than 16 elements, instead of 16 bytes here, as you indicated in #21081 (comment), so i took it that your changes haven't made it into this PR.
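For what it's worth, a trivial sketch of the distinction (hypothetical names, not the PR's code): the check as written scales the cutoff with the element size, whereas the suggested check is a fixed 16-byte cutoff, i.e. one 128-bit vector register's worth of data.

```java
// Illustration only (not the PR's code): the two readings of the cutoff that
// decides between the serial loop and the vector path.
final class SerialCutoffSketch {
    // lengthInElements * elementSize is the array length in bytes;
    // elementSize is 1 (byte/boolean), 2 (char/short) or 4 (int).
    static boolean serialPathAsWritten(int lengthInElements, int elementSize) {
        return lengthInElements * elementSize < 16 * elementSize;   // fewer than 16 elements
    }

    static boolean serialPathAsSuggested(int lengthInElements, int elementSize) {
        return lengthInElements * elementSize < 16;                 // fewer than 16 bytes (one 128-bit vector)
    }
}
```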
generateTrg1ImmInstruction(cg, TR::InstOpCode::li, node, tempReg, 0xFFFFFFF0);
generateTrg1Src2Instruction(cg, TR::InstOpCode::AND, node, vendReg, endReg, tempReg);

// load the initial value
there are pipeline nuances re short-latency vector instructions on POWER7 and POWER8. most permute instructions are short-latency instructions (with 2-cycle latency). if there are long-latency instructions (6-7 cycle latency) before them, it might cause a performance problem (a result-bus collision leading to so-called starvation of short-latency instructions). that could be the reason you observed inconsistent performance on POWER8. to avoid it, you have to regularly insert a group-ending no-op (ori r2, r2, 0, i believe). we might need to come back to this issue later, not now though.
Fast-path ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed methods on Power. The String.hashCodeImplDecompressed method has already been fast-pathed. Since the other methods use the same logic, the existing code can be modified to recognise and accommodate them. Signed-off-by: Luke Li <[email protected]>
The new implementation I just pushed uses […]. There is just one caveat: it tends to confuse the microbenchmark I am using without tinkering. For example, when run with an int[8] array, this is the throughput:
The peak throughput of 140M is well beyond the baseline, but it dips back to less than 100M over time. I checked the vlog, and it seems the regression happens when the method gets recompiled to scorching (from very-hot). I was able to get rid of the regression by setting […]. My guess is that it had something to do with profiling, and something went wrong with branch prediction when the recompilation was taking place. I would argue this is more of a fluke with the microbenchmark, and is not relevant to its real performance.
agreed. a change for the good really ...
The latest commit loads the static arrays into persistent memory, similar to how it's done for the pseudo TOC. It doesn't really make it work for AOT or JITServer though, since the TOC doesn't work for AOT and JITServer. I am not quite sure how I can eventually free that memory.
The only way I can think of to make the code relocatable is to load the arrays into the vector registers using a lot of immediate loads instead. It worked, but the performance did suffer.
// use a similar concept to the TableOfConstants to load the multiplierPtr into the memory
// TOC uses relocation data, so we use the same here
uint32_t *multiplierPtr = (uint32_t*) fej9->allocateRelocationData(comp, mvSize * sizeof(uint32_t));
this appears to be the wrong approach, since it can cause a gradual memory leak. imagine you have thousands of these calls in a java program; you would allocate this constant array thousands of times during JIT-ing. it certainly has performance implications too. you still need to set up these arrays statically.