Fast-Path String and Vector Hash Code Methods on Power #21081
Conversation
Weirdly, the baseline build is outperforming the fast path implementation on small arrays. Vectors:
Strings:
Some updated string data with compressed and uncompressed strings. Compressed:
Decompressed:
It seems like …
fyi @zl-wang
Thank you, Luke. I have one area of the code that I'm not sure about, but the rest are suggestions or things to bring to your attention.
Also, wondering if you tested the code by comparing hashcode output with and without the fast-path to make sure the hashing is the same.
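For illustration, a check along these lines (a hypothetical harness, not part of this PR; the class and names are made up) would exercise both the compressed and decompressed string layouts and compare the result against the scalar definition of the hash:

```java
// Hypothetical correctness harness (not part of this PR): compares
// String.hashCode(), which may take the JIT fast path once compiled,
// against the scalar definition h = 31*h + c.
public class HashCheck {
    static int scalarHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(42);
        for (int iter = 0; iter < 1_000_000; iter++) {
            int len = rnd.nextInt(64);
            // alternate between Latin-1-only and full-char-range strings so that
            // both the compressed and decompressed layouts are covered
            int bound = (iter % 2 == 0) ? 0x80 : 0x10000;
            StringBuilder sb = new StringBuilder(len);
            for (int i = 0; i < len; i++) {
                sb.append((char) rnd.nextInt(bound));
            }
            String s = sb.toString();
            if (s.hashCode() != scalarHash(s)) {
                throw new AssertionError("hash mismatch for length " + len);
            }
        }
    }
}
```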
// Skip header of the array
intptr_t hdrSize = TR::Compiler->om.contiguousArrayHeaderSizeInBytes();
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, valueReg, valueReg, hdrSize);
Seems to me that adding a load of the dataAddr pointer instead of adding headerSize here is the only thing holding back enabling this for OffHeap, @zl-wang. Not sure if it's worth doing separately here or together with the other platforms later.
Given that we're in the final stages of enabling OffHeap, I think deferring this would be better so as not to introduce new OffHeap issues. I added it to the OffHeap TODO list.
yes, we can simply do that to support off-heap here.
Yep, the code was tested and the values were correct for all the scenarios I could think of.
i can come back to review this later, but i would start by suggesting you read some of the vectorized fast-path implementations, e.g. String.equal/compareTo or String.indexOf. At least, you are able to handle the misaligned part much better on POWER10 (there are vector load/store with length instructions ... or look at arrayCopy helper code on POWER10).
I will clean up the commits later; so far this build is the fastest overall.
LGTM
An example of the generated instructions:
break;
default:
TR_ASSERT_FATAL(false, "Unsupported hashCodeHelper elementType");
}
need to review carefully if you need this many registers alive in parallel at the same time.
TR_ASSERT_FATAL(false, "Unsupported hashCodeHelper elementType");
}
generateTrg1Src2Instruction(cg, TR::InstOpCode::add, node, hashReg, hashReg, tempReg);
generateLabelInstruction(cg, TR::InstOpCode::b, node, endLabel);
need to understand if this sequence is optimal ... it will take some time.
if (cg->getSupportsInlineVectorizedHashCode())
   {
   resultReg = inlineVectorizedHashCode(node, cg);
   return resultReg != NULL;
why is this return conditional, whereas the previous one isn't?
at least, it sounds like it can be improved from the perspective of how many registers are used in parallel. the number of GPRs is close to the limit available on x and z (architecturally 16 in total ... 13 or 14 available at most for codegen). how many did they use in their inlining?
It seems Z used 12 registers while this implementation uses 14-19. The Z implementation uses far fewer registers because their loading facility can load from memory of arbitrary length, while the vector load on P is fixed to 128 bits. This means I can probably reduce the registers required at a cost of parallelism; not sure if that's a worthwhile trade-off.
without looking at the exact sequence, i would think it must be worth doing (minimizing register footprint), since parallelism (in core) likely is not what you think. written registers are dynamically renamed at the instruction dispatch stage, such that Read/Write and Write/Write don't cause dependencies or block parallel instruction issuing.
I am just worried because the original implementation used 2 registers, which has got to be for a good reason.
I got a new implementation with 14-16 registers (compared to 12 on Z), but the performance did dip. Compressed string:
Decompressed string (where the master build has existing fast-pathing):
In conclusion, having fewer accumulating registers does impact the performance, but it is still better than the baseline, except for decompressed strings, which already have an existing intrinsic that uses more registers.
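To make the trade-off concrete, here is a scalar sketch in plain Java (illustration only, not the generated Power code; the class and method names are made up) of how splitting the polynomial hash across independent accumulators trades one live register per lane for one more independent dependency chain. The 31^4 = 923521, 31^3 = 29791 and 31^2 = 961 factors are the same ones that appear in the multiplier tables later in the diff, and the result equals the plain h = 31*h + v[i] loop starting from 0.

```java
final class PolynomialHashSketch {
    // Scalar illustration of the accumulator-splitting idea behind the vector
    // fast path (not the generated code). Each of the 4 accumulators is scaled
    // by 31^4 = 923521 per step and carries every 4th element; the lanes are
    // combined with 31^3, 31^2, 31 and 1 only at the end.
    static int hashWith4Accumulators(int[] v) {
        int h0 = 0, h1 = 0, h2 = 0, h3 = 0;
        int i = 0;
        for (; i + 4 <= v.length; i += 4) {
            h0 = 923521 * h0 + v[i];
            h1 = 923521 * h1 + v[i + 1];
            h2 = 923521 * h2 + v[i + 2];
            h3 = 923521 * h3 + v[i + 3];
        }
        int h = 29791 * h0 + 961 * h1 + 31 * h2 + h3;
        // residual elements, same as the serial loop
        for (; i < v.length; i++) {
            h = 31 * h + v[i];
        }
        return h;
    }
}
```

Doubling the number of lanes shortens the critical multiply-add chain but needs proportionally more live accumulators, which is exactly the register-footprint versus parallelism knob being discussed.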
The Java source code:
TR::Register* hashReg = NULL;

switch (node->getChild(4)->getConstValue())
   {
other codegen(s) also used these hard-coded constants instead of symbolic constants? how about VM side, since there might be symbolic constants defined already?
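For reference, symbolic names for these codes do exist on the Java side; a sketch reproduced from memory of jdk.internal.util.ArraysSupport (JDK 21+, worth verifying against the exact JDK level being built), with a hypothetical helper added only to make the snippet self-contained:

```java
// Sketch only: the basicType codes taken by ArraysSupport.vectorizedHashCode
// (values reproduced from memory; they match the standard JVMS BasicType
// codes), so child(4) of the call node should be one of these.
final class VectorizedHashCodeTypes {
    static final int T_BOOLEAN = 4;
    static final int T_CHAR    = 5;
    static final int T_BYTE    = 8;
    static final int T_SHORT   = 9;
    static final int T_INT     = 10;

    // Hypothetical helper, not in the JDK: maps a basicType code to its element size.
    static int elementSizeInBytes(int basicType) {
        switch (basicType) {
            case T_BOOLEAN:
            case T_BYTE:  return 1;
            case T_CHAR:
            case T_SHORT: return 2;
            case T_INT:   return 4;
            default: throw new IllegalArgumentException("unsupported basicType " + basicType);
        }
    }
}
```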
hashReg = hashCodeHelper(node, cg, TR::Int32, node->getChild(3), true);
break;
}
if (hashReg != NULL)
why not take care of the children's referenceCount consistently in an evaluator, instead of doing this conditionally?
// Skip header of the array
intptr_t hdrSize = TR::Compiler->om.contiguousArrayHeaderSizeInBytes();
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, valueReg, valueReg, hdrSize);
high-level comment: the vector path should only be executed under a certain condition (i.e. a condition under which we know it is generally beneficial for performance). some more work/investigation is needed in this area to come up with the right condition. you might choose easy/straightforward over best-performing here.
1, 31, 961, 29791, 923521, 28629151, 887503681, 0x67E12CDF, 0, 0, 0, 0};
static uint32_t multiplierVectors_be32[12] = {923521, 923521, 923521, 923521,
29791, 961, 31, 1, 0, 0, 0, 0};
static uint32_t multiplierVectors_le32[12] = {923521, 923521, 923521, 923521,
this made me uneasy at least. these constants are set up as C/C++ static data in the libj9jit29.so module. what happens if that module is unloaded? better to put them in PersistentMemory like the pseudo TOC. Also, this must be the reason it is disabled for AOT and JITServer.
generateTrg1ImmInstruction(cg, TR::InstOpCode::li, node, constant0Reg, 0x0);

// using the serial loop is faster if there are less than 16 items
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::cmpi4, node, condReg, vendReg, 16*elementSize);
i naturally have questions re whether this condition is optimal (verified) for the different scenarios (byte/char/short/int).
you are still checking for no fewer than 16 elements, instead of 16 bytes here, as you indicated in #21081 (comment), so i took it that your changes haven't made it into this PR.
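For what it's worth, a trivial sketch of the distinction (hypothetical names, not the PR's code): the check as written scales the cutoff with the element size, whereas the suggested check is a fixed 16-byte cutoff, i.e. one 128-bit vector register's worth of data.

```java
// Illustration only (not the PR's code): the two readings of the cutoff that
// decides between the serial loop and the vector path.
final class SerialCutoffSketch {
    // lengthInElements * elementSize is the array length in bytes;
    // elementSize is 1 (byte/boolean), 2 (char/short) or 4 (int).
    static boolean serialPathAsWritten(int lengthInElements, int elementSize) {
        return lengthInElements * elementSize < 16 * elementSize;   // fewer than 16 elements
    }

    static boolean serialPathAsSuggested(int lengthInElements, int elementSize) {
        return lengthInElements * elementSize < 16;                 // fewer than 16 bytes (one 128-bit vector)
    }
}
```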
generateTrg1ImmInstruction(cg, TR::InstOpCode::li, node, tempReg, 0xFFFFFFF0);
generateTrg1Src2Instruction(cg, TR::InstOpCode::AND, node, vendReg, endReg, tempReg);

// load the initial value
there are pipeline nuances re short-latency vector instructions on POWER7 and POWER8. most permute instructions are short-latency instructions (with 2-cycle latency). if there are long-latency instructions (6-7 cycle latency) before them, it might cause a performance problem (a result-bus collision leading to so-called starvation of short-latency instructions). that could be the reason you observed inconsistent performance on POWER8. to avoid it, you have to regularly insert a group-ending no-op (ori r2, r2, 0, i believe). we might need to come back to this issue later, not now though.
Fast-path ArraysSupport.vectorizedHashCode and String.hashCodeImplCompressed methods on Power. The String.hashCodeImplDecompressed method has already been fast-pathed. Since the other methods use the same logic, the existing code can be modified to recognise and accommodate them. Signed-off-by: Luke Li <[email protected]>
The new implementation I just pushed uses […]. There is just one caveat: it tends to confuse the microbenchmark I am using without tinkering. For example, when run with an int[8] array, this is the throughput:
The peak throughput of 140M is well beyond the baseline, but it dips back to less than 100M over time. I checked the vlog, and it seems the regression happens when the method gets recompiled to scorching (from very-hot). I was able to get rid of the regression by setting […]. My guess is that it had something to do with profiling, and something went wrong with branch prediction when the recompilation was taking place. I would argue this is more of a fluke with the microbenchmark, and is not relevant to its real performance.
agreed. a change for the good really ...
The latest commit loads the static arrays into persistent memory, similar to how it's done for the pseudo TOC. It doesn't really make it work for AOT or JITServer though, since the TOC doesn't work for AOT and JITServer. I am not quite sure how I can eventually free that memory.
The only way I can think of to make the code relocatable is to load the arrays into the vector registers using a lot of immediate loads instead. It worked, but the performance did suffer.
// use a similar concept to the TableOfConstants to load the multiplierPtr into the memory
// TOC uses relocation data, so we use the same here
uint32_t *multiplierPtr = (uint32_t*) fej9->allocateRelocationData(comp, mvSize * sizeof(uint32_t));
this appears to be the wrong approach, since it can cause a gradual memory leak. imagine you have thousands of these calls in a java program; you would allocate this constant array thousands of times during JIT-ing. it certainly has performance implications too. you still need to set up these arrays statically.