Fast-Path String and Vector Hash Code Methods on Power #21081

Open: luke-li-2003 wants to merge 4 commits into master from FastPathHashCode

luke-li-2003
Contributor

Fast-path the ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed methods on Power. The String.hashCodeImplDecompressed method has already been fast-pathed. Since the other methods use the same logic, the existing code can be modified to recognise and accommodate them.

@luke-li-2003 luke-li-2003 changed the title from "giFast-Path String and Vector Hash Code Methods on Power" to "Fast-Path String and Vector Hash Code Methods on Power" Feb 6, 2025
@luke-li-2003
Contributor Author

@luke-li-2003 luke-li-2003 force-pushed the FastPathHashCode branch 2 times, most recently from d61d6a1 to ef1ccb2 Compare February 7, 2025 19:09
@luke-li-2003
Contributor Author

luke-li-2003 commented Feb 7, 2025

Weirdly, the baseline build is outperforming the fast path implementation on small arrays.

Vectors:

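| Data Type | Array Length | Base Build | Fast Path |
| --- | --- | --- | --- |
| Byte | 4 | 159M | 78M |
| Byte | 8 | 107M | 60M |
| Byte | 16 | 52M | 62M |
| Byte | 128 | 5M | 27M |
| Int | 4 | 170M | 125M |
| Int | 8 | 122M | 109M |
| Int | 16 | 62M | 79M |
| Int | 128 | 6M | 13M |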

Strings:

| Length | Base Build | Fast Path |
| --- | --- | --- |
| 10 | 65M | 48M |
| 20 | 37M | 38M |
| 40 | 14M | 37M |
| 80 | 7M | 16M |

@luke-li-2003
Contributor Author

Some updated string data, with compressed and decompressed strings:

Compressed:

| Length | Base Build | Fast Path |
| --- | --- | --- |
| 8 | 69M | 70M |
| 16 | 37M | 52M |
| 32 | 20M | 50M |

Decompressed:

| Length | Base Build | Fast Path |
| --- | --- | --- |
| 8 | 81M | 78M |
| 16 | 79M | 83M |
| 32 | 40M | 42M |

@luke-li-2003
Contributor Author

It seems that the existing String.hashCodeImplDecompressed fast path behaves the same way as my changes: it is slower than a call-out for strings shorter than 8.

@luke-li-2003
Contributor Author

fyi @zl-wang

Contributor

@rmnattas rmnattas left a comment

Thank you, Luke. There is one area of the code that I'm not sure about; the rest are suggestions or things to bring to your attention.

Also, did you test the code by comparing the hash code output with and without the fast path, to make sure the results are the same?


    // Skip header of the array
    intptr_t hdrSize = TR::Compiler->om.contiguousArrayHeaderSizeInBytes();
    generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, valueReg, valueReg, hdrSize);
Contributor

It seems to me that loading the dataAddr pointer here, instead of adding the header size, is the only thing holding back enabling this for OffHeap, @zl-wang. Not sure whether it is worth doing separately here or later, together with the other platforms.

Contributor

Given that we're in the final stages of enabling OffHeap, I think deferring this would be better, to avoid introducing new OffHeap issues. I added it to the OffHeap TODO list.

Contributor

yes, we can simply do that to support off-heap here.

@luke-li-2003
Contributor Author

Yep, the code was tested and the values were correct for all the scenarios I could think of.
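
For readers who want to repeat such a check, here is a minimal sketch of one way to compare library results against a plain scalar loop. It is not the test used for this PR: the class `HashCrossCheck`, its helpers, the length range and the seed are all hypothetical. It assumes a JDK level where `Arrays.hashCode(byte[])` and `String.hashCode()` reach the methods being fast-pathed, and the code has to run hot enough to be JIT-compiled for the fast path to be exercised at all.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical harness; the class and method names are illustrative only.
final class HashCrossCheck {
    // Scalar reference for Arrays.hashCode(byte[]): starts at 1, then 31 * h + element.
    static int referenceHash(byte[] a) {
        int h = 1;
        for (byte b : a) {
            h = 31 * h + b;
        }
        return h;
    }

    // Scalar reference for String.hashCode(): starts at 0, then 31 * h + char.
    static int referenceHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Builds a random string with chars drawn from [base, base + 26).
    // Base 'a' gives Latin-1 text; base 0x4E00 gives text that needs UTF-16.
    static String randomString(Random rnd, int len, char base) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) (base + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    // Counts disagreements between library results and the references
    // for every length from 0 to maxLength.
    static int countMismatches(int maxLength, long seed) {
        Random rnd = new Random(seed);
        int mismatches = 0;
        for (int len = 0; len <= maxLength; len++) {
            byte[] bytes = new byte[len];
            rnd.nextBytes(bytes);
            if (Arrays.hashCode(bytes) != referenceHash(bytes)) mismatches++;

            String latin1 = randomString(rnd, len, 'a');
            if (latin1.hashCode() != referenceHash(latin1)) mismatches++;

            String utf16 = randomString(rnd, len, (char) 0x4E00);
            if (utf16.hashCode() != referenceHash(utf16)) mismatches++;
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(256, 42L));
    }
}
```

Sweeping every length, rather than a few sizes, is meant to cover both the short serial path and the head and tail handling around the vector loop.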

@zl-wang
Contributor

zl-wang commented Feb 12, 2025

i can come back to review this later, but i would start by suggesting you read some of the vectorized fast-path implementations, e.g. String.equal/compareTo or String.indexOf. At least, you are able to handle the misaligned part much better on POWER10 (there are vector load/store with length instructions ... or look at arrayCopy helper code on POWER10).

@luke-li-2003 luke-li-2003 force-pushed the FastPathHashCode branch 5 times, most recently from 625603d to 7804d8c Compare February 14, 2025 22:04
@luke-li-2003
Contributor Author

I will clean up the commits later; so far this build is the fastest overall.

Contributor

@rmnattas rmnattas left a comment

LGTM

@luke-li-2003 luke-li-2003 force-pushed the FastPathHashCode branch 2 times, most recently from 3d6d88f to 8b2a5bc Compare March 10, 2025 15:17
@luke-li-2003
Contributor Author

tlog.vecHash.txt

An example of the generated instructions:

    0x73bbf8ac5dc8 0000004c [    0x73bbf04330a0] 3bff0008          56 	addi 	gr31, gr31, 8
    0x73bbf8ac5dcc 00000050 [    0x73bbf0433140] 5484103a          56 	rlwinm 	gr4, gr4, 0000000000000002, FFFFFFFFFFFFFFFC
    0x73bbf8ac5dd0 00000054 [    0x73bbf04331f0] 5400103a          56 	rlwinm 	gr0, gr0, 0000000000000002, FFFFFFFFFFFFFFFC
    0x73bbf8ac5dd4 00000058 [    0x73bbf04332a0] 7fff2214          56 	add 	gr31, gr31, gr4
    0x73bbf8ac5dd8 0000005c [    0x73bbf0433340] 7c9f0214          56 	add 	gr4, gr31, gr0
    0x73bbf8ac5ddc 00000060 [    0x73bbf04333e0] 7ca52a78          56 	xor 	gr5, gr5, gr5
    0x73bbf8ac5de0 00000064 [    0x73bbf0433480] 38c00000          56 	li 	gr6, 0000000000000000
    0x73bbf8ac5de4 00000068 [    0x73bbf0433520] 2c000040          56 	cmpwi 	cr0, gr0, 64
    0x73bbf8ac5de8 0000006c [    0x73bbf04335c0] 418000d4          56 	blt 	cr0, Label L0097
    0x73bbf8ac5dec 00000070 [    0x73bbf0433710] e8703590          56 	ld 	gr3, [gr16, 13712]
    0x73bbf8ac5df0 00000074 [    0x73bbf04337b0] 10a52cc4          56 	vxor 	vr5, vr5, vr5
    0x73bbf8ac5df4 00000078 [    0x73bbf0433850] 10631cc4          56 	vxor 	vr3, vr3, vr3
    0x73bbf8ac5df8 0000007c [    0x73bbf04338f0] 10831d04          56 	vnor 	vr4, vr3, vr3
    0x73bbf8ac5dfc 00000080 [    0x73bbf0433990] 38a0fff0          56 	li 	gr5, FFFFFFFFFFFFFFF0
    0x73bbf8ac5e00 00000084 [    0x73bbf0433a30] 7c802838          56 	and 	gr0, gr4, gr5
    0x73bbf8ac5e04 00000088 [    0x73bbf0433ad0] 7ca201e7          56 	mtvsrwz 	vsr37, gr2
    0x73bbf8ac5e08 0000008c [    0x73bbf0433b70] 10a32a2c          56 	vsldoi 	vsr37, vr3, vsr37, 0000000000000008
    0x73bbf8ac5e0c 00000090 [    0x73bbf0433c20] 10a51b2c          56 	vsldoi 	vsr37, vsr37, vr3, 000000000000000C
    0x73bbf8ac5e10 00000094 [    0x73bbf0433cd0] 38a0000f          56 	li 	gr5, 000000000000000F
    0x73bbf8ac5e14 00000098 [    0x73bbf0433d70] 7fe52838          56 	and 	gr5, gr31, gr5
    0x73bbf8ac5e18 0000009c [    0x73bbf0433e10] 2c050000          56 	cmpwi 	cr0, gr5, 0
    0x73bbf8ac5e1c 000000a0 [    0x73bbf0433eb0] 4182005c          56 	beq 	cr0, Label L0101
    0x73bbf8ac5e20 000000a4 [    0x73bbf0434000] 7c3f30ce          56 	lvx 	vr1, [gr31, gr6]
    0x73bbf8ac5e24 000000a8 [    0x73bbf04340a0] 38a0000f          56 	li 	gr5, 000000000000000F
    0x73bbf8ac5e28 000000ac [    0x73bbf0434140] 7fe52838          56 	and 	gr5, gr31, gr5
    0x73bbf8ac5e2c 000000b0 [    0x73bbf04341e0] 54a51838          56 	rlwinm 	gr5, gr5, 0000000000000003, FFFFFFFFFFFFFFF8
    0x73bbf8ac5e30 000000b4 [    0x73bbf0434290] 7c450167          56 	mtvsrd 	vsr34, gr5
    0x73bbf8ac5e34 000000b8 [    0x73bbf0434330] 1043122c          56 	vsldoi 	vsr34, vr3, vsr34, 0000000000000008
    0x73bbf8ac5e38 000000bc [    0x73bbf04343e0] 1044140c          56 	vslo 	vr2, vr4, vr2
    0x73bbf8ac5e3c 000000c0 [    0x73bbf0434480] 10211404          56 	vand 	vr1, vr1, vr2
    0x73bbf8ac5e40 000000c4 [    0x73bbf0434520] 7ca528f8          56 	nor 	gr5, gr5, gr5
    0x73bbf8ac5e44 000000c8 [    0x73bbf04345c0] 38a50081          56 	addi 	gr5, gr5, 129
    0x73bbf8ac5e48 000000cc [    0x73bbf0434660] 7ca51e70          56 	srawi 	gr5, gr5, 3
    0x73bbf8ac5e4c 000000d0 [    0x73bbf0434700] 38a50010          56 	addi 	gr5, gr5, 16
    0x73bbf8ac5e50 000000d4 [    0x73bbf04347a0] 10452c84          56 	vor 	vr2, vr5, vr5
    0x73bbf8ac5e54 000000d8 [    0x73bbf04348f0] 7c032e19          56 	lxvw4x 	vsr32, [gr3, gr5]
    0x73bbf8ac5e58 000000dc [    0x73bbf0434990] 10420089          56 	vmuluwm 	vr2, vr2, vr0
    0x73bbf8ac5e5c 000000e0 [    0x73bbf0434a30] 10a10c04          56 	vand 	vr5, vr1, vr1
    0x73bbf8ac5e60 000000e4 [    0x73bbf0434ad0] 10a51080          56 	vadduwm 	vr5, vr5, vr2
    0x73bbf8ac5e64 000000e8 [    0x73bbf0434b70] 3bff000f          56 	addi 	gr31, gr31, 15
    0x73bbf8ac5e68 000000ec [    0x73bbf0434c10] 38a0fff0          56 	li 	gr5, FFFFFFFFFFFFFFF0
    0x73bbf8ac5e6c 000000f0 [    0x73bbf0434cb0] 7fff2838          56 	and 	gr31, gr31, gr5
    0x73bbf8ac5e70 000000f4 [    0x73bbf0434d50] 7c3f0000          56 	cmpd 	cr0, gr31, gr0
    0x73bbf8ac5e74 000000f8 [    0x73bbf0434df0] 40800020          56 	bge 	cr0, Label L0103
    0x73bbf8ac5e78 000000fc [    0x73bbf0434e90]                   56 	Label L0101:	
    0x73bbf8ac5e78 000000fc [    0x73bbf0434fd0] 7c033619          56 	lxvw4x 	vsr32, [gr3, gr6]
    0x73bbf8ac5e7c 00000100 [    0x73bbf0435070]                   56 	Label L0102:	
    0x73bbf8ac5e7c 00000100 [    0x73bbf04351b0] 7c3f30ce          56 	lvx 	vr1, [gr31, gr6]
    0x73bbf8ac5e80 00000104 [    0x73bbf0435250] 10a50089          56 	vmuluwm 	vr5, vr5, vr0
    0x73bbf8ac5e84 00000108 [    0x73bbf04352f0] 10a50880          56 	vadduwm 	vr5, vr5, vr1
    0x73bbf8ac5e88 0000010c [    0x73bbf0435390] 3bff0010          56 	addi 	gr31, gr31, 16
    0x73bbf8ac5e8c 00000110 [    0x73bbf0435430] 7c3f0000          56 	cmpd 	cr0, gr31, gr0
    0x73bbf8ac5e90 00000114 [    0x73bbf04354d0] 4180ffec          56 	blt 	cr0, Label L0102
    0x73bbf8ac5e94 00000118 [    0x73bbf0435570]                   56 	Label L0103:	
    0x73bbf8ac5e94 00000118 [    0x73bbf0435600] 7c1f2050          56 	subf 	gr0, gr31, gr4
    0x73bbf8ac5e98 0000011c [    0x73bbf04356a0] 38630010          56 	addi 	gr3, gr3, 16
    0x73bbf8ac5e9c 00000120 [    0x73bbf04357f0] 7c033619          56 	lxvw4x 	vsr32, [gr3, gr6]
    0x73bbf8ac5ea0 00000124 [    0x73bbf0435890] 10a50089          56 	vmuluwm 	vr5, vr5, vr0
    0x73bbf8ac5ea4 00000128 [    0x73bbf0435930] 10232a2c          56 	vsldoi 	vr1, vr3, vr5, 0000000000000008
    0x73bbf8ac5ea8 0000012c [    0x73bbf04359e0] 10a50880          56 	vadduwm 	vr5, vr5, vr1
    0x73bbf8ac5eac 00000130 [    0x73bbf0435a80] 10232b2c          56 	vsldoi 	vr1, vr3, vr5, 000000000000000C
    0x73bbf8ac5eb0 00000134 [    0x73bbf0435b30] 10a50880          56 	vadduwm 	vr5, vr5, vr1
    0x73bbf8ac5eb4 00000138 [    0x73bbf0435bd0] 10251a2c          56 	vsldoi 	vr1, vr5, vr3, 0000000000000008
    0x73bbf8ac5eb8 0000013c [    0x73bbf0435c80] 7c2200e7          56 	mfvsrwz 	gr2, vsr33
    0x73bbf8ac5ebc 00000140 [    0x73bbf0435d20]                   56 	Label L0097:	
    0x73bbf8ac5ebc 00000140 [    0x73bbf0435db0] 2c000008          56 	cmpwi 	cr0, gr0, 8
    0x73bbf8ac5ec0 00000144 [    0x73bbf0435e50] 41820078          56 	beq 	cr0, Label L0100
    0x73bbf8ac5ec4 00000148 [    0x73bbf0435ef0] 3804fff4          56 	addi 	gr0, gr4, -12
    0x73bbf8ac5ec8 0000014c [    0x73bbf0435f90]                   56 	Label L0099:	
    0x73bbf8ac5ec8 0000014c [    0x73bbf0436020] 7c3f0000          56 	cmpd 	cr0, gr31, gr0
    0x73bbf8ac5ecc 00000150 [    0x73bbf04360c0] 4080004c          56 	bge 	cr0, Label L0098
    0x73bbf8ac5ed0 00000154 [    0x73bbf0436160] 54452834          56 	rlwinm 	gr5, gr2, 0000000000000005, FFFFFFFFFFFFFFE0
    0x73bbf8ac5ed4 00000158 [    0x73bbf0436210] 7c422850          56 	subf 	gr2, gr2, gr5
    0x73bbf8ac5ed8 0000015c [    0x73bbf0436360] 80bf0000          56 	lwz 	gr5, [gr31, 0]
    0x73bbf8ac5edc 00000160 [    0x73bbf0436400] 7c422a14          56 	add 	gr2, gr2, gr5
    0x73bbf8ac5ee0 00000164 [    0x73bbf04364a0] 54452834          56 	rlwinm 	gr5, gr2, 0000000000000005, FFFFFFFFFFFFFFE0
    0x73bbf8ac5ee4 00000168 [    0x73bbf0436550] 7c422850          56 	subf 	gr2, gr2, gr5
    0x73bbf8ac5ee8 0000016c [    0x73bbf04366a0] 80bf0004          56 	lwz 	gr5, [gr31, 4]
    0x73bbf8ac5eec 00000170 [    0x73bbf0436740] 7c422a14          56 	add 	gr2, gr2, gr5
    0x73bbf8ac5ef0 00000174 [    0x73bbf04367e0] 54452834          56 	rlwinm 	gr5, gr2, 0000000000000005, FFFFFFFFFFFFFFE0
    0x73bbf8ac5ef4 00000178 [    0x73bbf0436890] 7c422850          56 	subf 	gr2, gr2, gr5
    0x73bbf8ac5ef8 0000017c [    0x73bbf04369e0] 80bf0008          56 	lwz 	gr5, [gr31, 8]
    0x73bbf8ac5efc 00000180 [    0x73bbf0436a80] 7c422a14          56 	add 	gr2, gr2, gr5
    0x73bbf8ac5f00 00000184 [    0x73bbf0436b20] 54452834          56 	rlwinm 	gr5, gr2, 0000000000000005, FFFFFFFFFFFFFFE0
    0x73bbf8ac5f04 00000188 [    0x73bbf0436bd0] 7c422850          56 	subf 	gr2, gr2, gr5
    0x73bbf8ac5f08 0000018c [    0x73bbf0436d20] 80bf000c          56 	lwz 	gr5, [gr31, 12]
    0x73bbf8ac5f0c 00000190 [    0x73bbf0436dc0] 7c422a14          56 	add 	gr2, gr2, gr5
    0x73bbf8ac5f10 00000194 [    0x73bbf0436e60] 3bff0010          56 	addi 	gr31, gr31, 16
    0x73bbf8ac5f14 00000198 [    0x73bbf0436f00] 4bffffb4          56 	b 	Label L0099	
    0x73bbf8ac5f18 0000019c [    0x73bbf0436f90]                   56 	Label L0098:	
    0x73bbf8ac5f18 0000019c [    0x73bbf0437020] 7c3f2000          56 	cmpd 	cr0, gr31, gr4
    0x73bbf8ac5f1c 000001a0 [    0x73bbf04370c0] 40800044          56 	bge 	cr0, Label L0104
    0x73bbf8ac5f20 000001a4 [    0x73bbf0437160] 54452834          56 	rlwinm 	gr5, gr2, 0000000000000005, FFFFFFFFFFFFFFE0
    0x73bbf8ac5f24 000001a8 [    0x73bbf0437210] 7c422850          56 	subf 	gr2, gr2, gr5
    0x73bbf8ac5f28 000001ac [    0x73bbf0437360] 7cbf302e          56 	lwzx 	gr5, [gr31, gr6]
    0x73bbf8ac5f2c 000001b0 [    0x73bbf0437400] 7c422a14          56 	add 	gr2, gr2, gr5
    0x73bbf8ac5f30 000001b4 [    0x73bbf04374a0] 3bff0004          56 	addi 	gr31, gr31, 4
    0x73bbf8ac5f34 000001b8 [    0x73bbf0437540] 4bffffe4          56 	b 	Label L0098	
    0x73bbf8ac5f38 000001bc [    0x73bbf04375d0]                   56 	Label L0100:	
    0x73bbf8ac5f38 000001bc [    0x73bbf0437660] 54452834          56 	rlwinm 	gr5, gr2, 0000000000000005, FFFFFFFFFFFFFFE0
    0x73bbf8ac5f3c 000001c0 [    0x73bbf0437710] 7c422850          56 	subf 	gr2, gr2, gr5
    0x73bbf8ac5f40 000001c4 [    0x73bbf0437860] 7cbf302e          56 	lwzx 	gr5, [gr31, gr6]
    0x73bbf8ac5f44 000001c8 [    0x73bbf0437900] 7c422a14          56 	add 	gr2, gr2, gr5
    0x73bbf8ac5f48 000001cc [    0x73bbf04379a0] 3bff0004          56 	addi 	gr31, gr31, 4
    0x73bbf8ac5f4c 000001d0 [    0x73bbf0437a40] 54452834          56 	rlwinm 	gr5, gr2, 0000000000000005, FFFFFFFFFFFFFFE0
    0x73bbf8ac5f50 000001d4 [    0x73bbf0437af0] 7c422850          56 	subf 	gr2, gr2, gr5
    0x73bbf8ac5f54 000001d8 [    0x73bbf0437c40] 7cbf302e          56 	lwzx 	gr5, [gr31, gr6]
    0x73bbf8ac5f58 000001dc [    0x73bbf0437ce0] 7c422a14          56 	add 	gr2, gr2, gr5
    0x73bbf8ac5f5c 000001e0 [    0x73bbf0437d80] 48000004          56 	b 	Label L0104	
    0x73bbf8ac5f60 000001e4 [    0x73bbf0437fa0]                   56 	Label L0104:	
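
Reading the scalar loops in the listing above (labels L0099, L0098 and L0100), each element appears to be folded in with a shift left by 5 (`rlwinm`), a subtract (`subf`) and an add, which is `31 * h` written as `(h << 5) - h`. The loop at L0099 handles four 4-byte elements per iteration and L0098 handles the remainder one at a time. The sketch below models that reading in Java for an `int[]` input; the class `ShiftSubtractHash` and its method names are hypothetical, and this is an interpretation of the listing, not code from the PR.

```java
import java.util.Arrays;

// Hypothetical scalar model of the serial loops in the listing.
final class ShiftSubtractHash {
    // 31 * h computed as (h << 5) - h; both wrap identically in 32-bit arithmetic.
    static int times31(int h) {
        return (h << 5) - h;
    }

    // Hash over int elements: a loop unrolled by four, then a one-at-a-time remainder.
    static int hash(int initial, int[] a) {
        int h = initial;
        int i = 0;
        // Mirrors the "end - 12 bytes" bound: a[i..i+3] is in range while i < length - 3.
        int unrolledEnd = a.length - 3;
        for (; i < unrolledEnd; i += 4) {
            h = times31(h) + a[i];
            h = times31(h) + a[i + 1];
            h = times31(h) + a[i + 2];
            h = times31(h) + a[i + 3];
        }
        for (; i < a.length; i++) {
            h = times31(h) + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        int[] data = {1, -2, 3, 4, 5, 6, 7};
        System.out.println(hash(1, data) == Arrays.hashCode(data));
    }
}
```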

            break;
        default:
            TR_ASSERT_FATAL(false, "Unsupported hashCodeHelper elementType");
    }
Contributor

need to review carefully whether you need this many registers live at the same time.

            TR_ASSERT_FATAL(false, "Unsupported hashCodeHelper elementType");
    }
    generateTrg1Src2Instruction(cg, TR::InstOpCode::add, node, hashReg, hashReg, tempReg);
    generateLabelInstruction(cg, TR::InstOpCode::b, node, endLabel);
Contributor

need to understand if this sequence is optimal ... it will take some time.

    if (cg->getSupportsInlineVectorizedHashCode())
    {
        resultReg = inlineVectorizedHashCode(node, cg);
        return resultReg != NULL;
Contributor

why is this return conditional? whereas the previous one isn't.

@zl-wang
Contributor

zl-wang commented Mar 13, 2025

at least, it sounds like it can be improved from the perspective of how many registers are used in parallel. the number of GPRs is close to the limit available on x and z (architecturally 16 in total ... 13 or 14 available at most for codegen. how many did they use in their inlining?)

@luke-li-2003
Contributor Author

luke-li-2003 commented Mar 13, 2025

It seems Z used 12 registers while this implementation uses 14-19.

The Z implementation uses far fewer registers because its load facility can load an arbitrary length from memory, while the vector load on P is fixed at 128 bits. This means that for bytes I have to handle 16 elements at once, as opposed to 4 on Z.

I can probably reduce the registers required at a cost of parallelism, not sure if that's a worthy trade-off.

@zl-wang
Contributor

zl-wang commented Mar 13, 2025

> I can probably reduce the registers required at a cost of parallelism, not sure if that's a worthy trade-off.

without looking at the exact sequence, i would think it must be worth doing (minimizing register footprint), since parallelism (in core) likely is not what you think. written registers are dynamically renamed at the instruction dispatch stage, such that Read/Write and Write/Write don't cause dependencies or block parallel instruction issuing.

@luke-li-2003
Contributor Author

I am just worried because the original implementation used 2 registers, which has got to be for a good reason.

@luke-li-2003
Contributor Author

I got a new implementation with 14-16 registers (compared to 12 on Z) but the performance did dip:

compressed string

| Build | 16 | 128 |
| --- | --- | --- |
| master | 63M | 5M |
| 16-way parallel | 82M | 21M |
| 4-way parallel | 77M | 12M |

decompressed string (where the master build has existing fast-pathing)

| Build | 16 | 128 |
| --- | --- | --- |
| master | 82M | 18M |
| 8-way parallel | 88M | 19M |
| 4-way parallel | 82M | 14M |

In conclusion, having fewer accumulating registers does impact the performance, but the performance is still better than the baseline, except for decompressed strings, since the baseline has an existing intrinsic that uses more registers.
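
To make "n-way parallel" concrete, here is a scalar Java model of accumulating the hash in several independent lanes. It is only a sketch of the arithmetic: the class `LaneHash` and the `lanes` parameter are hypothetical, it is not the PR's code generation, and it says nothing about performance. Each lane multiplies by `31^lanes` per block, the initial value is seeded into the last lane, and the lanes are combined at the end with weights `31^(lanes-1)` down to `31^0`.

```java
import java.util.Arrays;

// Hypothetical scalar model of lane-parallel hash accumulation.
final class LaneHash {
    static int hash(int initial, byte[] a, int lanes) {
        // 31^lanes with int wraparound: the per-block multiplier shared by all lanes.
        int blockMul = 1;
        for (int k = 0; k < lanes; k++) {
            blockMul *= 31;
        }

        int[] acc = new int[lanes];
        acc[lanes - 1] = initial; // the last lane has weight 31^0 in the final reduction

        // Process whole blocks of `lanes` elements; every lane is independent.
        int vectorEnd = a.length - (a.length % lanes);
        for (int i = 0; i < vectorEnd; i += lanes) {
            for (int j = 0; j < lanes; j++) {
                acc[j] = acc[j] * blockMul + a[i + j];
            }
        }

        // Horizontal reduction: lane j contributes acc[j] * 31^(lanes - 1 - j).
        int h = 0;
        for (int j = 0; j < lanes; j++) {
            h = 31 * h + acc[j];
        }

        // Serial tail for the elements that do not fill a whole block.
        for (int i = vectorEnd; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] data = new byte[37];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 7 - 100);
        }
        System.out.println(hash(1, data, 4) == Arrays.hashCode(data));
        System.out.println(hash(1, data, 16) == Arrays.hashCode(data));
    }
}
```

More lanes mean more independent multiply-add chains per block and more accumulator registers held live, which is the trade-off discussed in this thread.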

@luke-li-2003
Contributor Author

The Java source code:

    private static int vectorizedHashCode(Object array, int fromIndex, int length, int initialValue,
                                          int basicType) {
        return switch (basicType) {
            case T_BOOLEAN -> unsignedHashCode(initialValue, (byte[]) array, fromIndex, length);
            case T_CHAR -> array instanceof byte[]
                    ? utf16hashCode(initialValue, (byte[]) array, fromIndex, length)
                    : hashCode(initialValue, (char[]) array, fromIndex, length);
            case T_BYTE -> hashCode(initialValue, (byte[]) array, fromIndex, length);
            case T_SHORT -> hashCode(initialValue, (short[]) array, fromIndex, length);
            case T_INT -> hashCode(initialValue, (int[]) array, fromIndex, length);
            default -> throw new IllegalArgumentException("unrecognized basic type: " + basicType);
        };
    }

    private static int hashCode(int result, byte[] a, int fromIndex, int length) {
        int end = fromIndex + length;
        for (int i = fromIndex; i < end; i++) {
            result = 31 * result + a[i];
        }
        return result;
    }

The hashCode methods for the different data types are basically the same code for a different array type.

    TR::Register* hashReg = NULL;

    switch (node->getChild(4)->getConstValue())
    {
Contributor

did the other codegen(s) also use these hard-coded constants instead of symbolic constants? how about the VM side, since there might be symbolic constants defined already?

            hashReg = hashCodeHelper(node, cg, TR::Int32, node->getChild(3), true);
            break;
    }
    if (hashReg != NULL)
Contributor

why not take care of children's referenceCount consistently in an evaluator? instead of doing this conditionally.

@luke-li-2003 luke-li-2003 force-pushed the FastPathHashCode branch 2 times, most recently from 1fd1fc0 to 6943ad3 Compare March 17, 2025 17:45

    // Skip header of the array
    intptr_t hdrSize = TR::Compiler->om.contiguousArrayHeaderSizeInBytes();
    generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, valueReg, valueReg, hdrSize);

Contributor

high-level comment: the vector path should only be executed under certain conditions (i.e. conditions under which we know it is generally beneficial for performance). some more work/investigation is needed in this area to come up with the right condition. you might choose easy/straightforward over best-performing here.

        1, 31, 961, 29791, 923521, 28629151, 887503681, 0x67E12CDF, 0, 0, 0, 0};
    static uint32_t multiplierVectors_be32[12] = {923521, 923521, 923521, 923521,
        29791, 961, 31, 1, 0, 0, 0, 0};
    static uint32_t multiplierVectors_le32[12] = {923521, 923521, 923521, 923521,
Contributor

this made me uneasy at least. these constants are set up as C/C++ static data in the libj9jit29.so module. what happens if that module is unloaded? better to put them in PersistentMemory like the pseudo TOC. Also, this must be the reason it is disabled for AOT and JITServer.
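
For context on what these tables contain: the literals in the snippet above are successive powers of 31 reduced modulo 2^32 (for example 923521 is 31^4, and 0x67E12CDF is 31^7 after 32-bit wraparound). The short Java sketch below reproduces them; the class `MultiplierTable` is hypothetical and only illustrates the arithmetic, not how the PR builds or stores the arrays.

```java
// Hypothetical helper that regenerates the multiplier constants.
final class MultiplierTable {
    // Returns 31^0 .. 31^(count-1), each reduced modulo 2^32 by ordinary int overflow.
    static int[] powersOf31(int count) {
        int[] powers = new int[count];
        int value = 1;
        for (int i = 0; i < count; i++) {
            powers[i] = value;
            value *= 31;
        }
        return powers;
    }

    public static void main(String[] args) {
        for (int p : powersOf31(8)) {
            // Print as unsigned hex so wrapped values are easy to compare with the C literals.
            System.out.println(p + " = 0x" + Integer.toHexString(p).toUpperCase());
        }
    }
}
```

Because the values are fully determined by the hash definition, they could in principle be regenerated wherever they need to live, which is independent of the storage question raised in this comment.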

@luke-li-2003 luke-li-2003 force-pushed the FastPathHashCode branch 3 times, most recently from 3d6b9b7 to 9cdde9e Compare March 17, 2025 20:11
    generateTrg1ImmInstruction(cg, TR::InstOpCode::li, node, constant0Reg, 0x0);

    // using the serial loop is faster if there are less than 16 items
    generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::cmpi4, node, condReg, vendReg, 16*elementSize);
Contributor

i naturally have questions about whether this condition is optimal (verified) for the different scenarios (byte/char/short/int).

Contributor

you are still checking for no fewer than 16 elements here, instead of 16 bytes as you indicated in #21081 (comment), so i take it that your changes haven't made it into this PR.
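
The difference between the two conditions can be shown with a small Java sketch. The class `VectorThreshold` and its method names are hypothetical, and treating the byte threshold as inclusive is an assumption; the first method models the quoted comparison against `16*elementSize`, the second the 16-byte rule described in the discussion.

```java
// Hypothetical model of the two candidate thresholds for entering the vector loop.
final class VectorThreshold {
    // As in the quoted snippet: the byte length is compared against 16 * elementSize,
    // so the vector path needs at least 16 elements whatever the element size.
    static boolean vectorPathByElements(int length, int elementSize) {
        return length * elementSize >= 16 * elementSize;
    }

    // As described in the discussion: at least 16 bytes, i.e. one 128-bit vector.
    // Whether the boundary is inclusive is an assumption made here.
    static boolean vectorPathByBytes(int length, int elementSize) {
        return length * elementSize >= 16;
    }

    public static void main(String[] args) {
        int[] elementSizes = {1, 2, 4}; // byte, char/short, int
        int[] lengths = {4, 8, 16};
        for (int size : elementSizes) {
            for (int length : lengths) {
                System.out.println("elementSize=" + size + " length=" + length
                        + " byElements=" + vectorPathByElements(length, size)
                        + " byBytes=" + vectorPathByBytes(length, size));
            }
        }
    }
}
```

For an `int[8]` array (32 bytes), the element-based check stays on the serial loop while the byte-based check takes the vector loop.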

    generateTrg1ImmInstruction(cg, TR::InstOpCode::li, node, tempReg, 0xFFFFFFF0);
    generateTrg1Src2Instruction(cg, TR::InstOpCode::AND, node, vendReg, endReg, tempReg);

    // load the initial value
Contributor

there are pipeline nuances re short-latency vector instructions on POWER7 and POWER8. most permute instructions are short-latency instructions (2-cycle latency). if there are long-latency instructions (6-7 cycle latency) before them, it might cause a performance problem (result-bus collisions leading to so-called starvation of short-latency instructions). that could be the reason you observed inconsistent performance on POWER8. to avoid it, you have to regularly insert a group-ending no-op (ori r2, r2, 0, i believe). we might need to come back to this issue later. not now though.

@luke-li-2003 luke-li-2003 force-pushed the FastPathHashCode branch 4 times, most recently from 2227a39 to 1f03701 Compare March 19, 2025 16:16
Fast-path ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed
methods on Power. The String.hashCodeImplDecompressed method has already
been fast-pathed. Since the other methods use the same logic, the existing
code can be modified to recognise and accommodate them.

Signed-off-by: Luke Li <[email protected]>
@luke-li-2003
Contributor Author

The new implementation I just pushed uses lxvw4x to avoid having to deal with misaligned data. It is performing about the same for large arrays, but much better for smaller arrays, prompting me to make it so that the vector loop is engaged for any array longer than 16 bytes.

There is just one caveat: it tends to confuse the microbenchmark I am using without tinkering. For example, when run with an int[8] array, this is the throughput:

              Target	Est	Uncert%	MaxPeak	Peak	Peak%	%paused
    0.0s:  >! 110	400.0K	 40.0	110	110	470.0	 98.3
    0.6s:  <  480.0K	435.2K	 24.0	110	110	470.0	 57.6
    1.5s:  <! 382.9K	375.7K	 28.8	110	110	470.0	 53.4
    2.4s:  >  321.6K	370.3K	 17.3	321.6K	321.6K	1268.1	 52.9
    3.4s:  <  402.3K	371.3K	 10.4	321.6K	321.6K	1268.1	 52.7
    4.4s:  >  352.1K	384.7K	  6.2	352.1K	352.1K	1277.2	 53.6
    4.6s:  >! 396.6K	9.673M	 40.0	396.6K	396.6K	1289.1	 91.8
    4.7s:  >! 11.61M	140.8M	 40.0	11.61M	11.61M	1626.7	 88.8
    6.7s:  <  169.0M	141.3M	 24.0	11.61M	11.61M	1626.7	 87.5
    7.1s:  >  124.4M	132.5M	 14.4	124.4M	124.4M	1863.9	 58.5
    8.0s:  <  142.0M	85.05M	 33.4	124.4M	124.4M	1863.9	 24.2
    8.9s:  >  70.85M	85.15M	 20.0	124.4M	124.4M	1863.9	 24.4
   10.0s:  <  93.68M	85.36M	 12.0	124.4M	-inf	--	 24.0

The peak throughput of 140M is well beyond the baseline, but it dips back to less than 100M over time. I checked the vlog, and it seems the regression happens when the method gets recompiled at scorching (from very-hot). I was able to get rid of the regression by setting -Xjit:compilationDelayTime=5 or -Xjit:disableInterpreterProfiling or -Xjit:count=0.

My guess is that it had something to do with profiling, and something went wrong with branch prediction when the recompilation was taking place.

I would argue this is more of a fluke with the microbenchmark, and is not relevant to its real performance.

@zl-wang
Contributor

zl-wang commented Mar 20, 2025

agreed. a change for the good really ...

@luke-li-2003
Contributor Author

The latest commit loads the static arrays into the persistent memory similar to how it's done for the pseudo TOC. It doesn't really make it work for AOT or JITServer though, since the TOC doesn't work for AOT and JITServer.

I am not quite sure how I can eventually free that memory.

@luke-li-2003
Contributor Author

The only way I can think of to make the code relocatable is to load the arrays into the vec registers by just using a lot of immediate loads instead. It worked, but the performance did suffer.


    // use a similar concept to the TableOfConstants to load the multiplierPtr into the memory
    // TOC uses relocation data, so we use the same here
    uint32_t *multiplierPtr = (uint32_t*) fej9->allocateRelocationData(comp, mvSize * sizeof(uint32_t));
Contributor

this appears to be the wrong approach, since it can cause a gradual memory leak. imagine you have thousands of these calls in a java program: you would allocate this constant array thousands of times during JIT-ing. it certainly has performance implications too. you still need to set up these arrays statically.


3 participants