Skip to content

Conversation

@RagnarGrootKoerkamp
Copy link
Contributor

@RagnarGrootKoerkamp RagnarGrootKoerkamp commented Apr 6, 2023

This encodes horizontal input/output deltas using Phin and Mhin indicator words with the lowest bit set if hin is +1 or -1 respectively. This gives up to 20% speed gains on large inputs where the bitpacking block computation takes relatively longer than the accounting for band-size.

Some small alignments get up to 2% slower but I suspect this is just noise.
image

@maxbachmann
Copy link
Contributor

maxbachmann commented Apr 18, 2023

I was wondering why my own implementation (https://github.com/maxbachmann/RapidFuzz) was faster for very dissimilar sequences at some point. And indeed similar to this PR I store the deltas separately, so I do not have to split them. I was unaware this saved me so much time 👍

After this PR they are pretty much have the same performance. E.g. for completely different strings with length 1m I get:

  • edlib_old: 21.7s
  • edlib_new: 16.6s
  • rapidfuzz: 16.9s

The implementation for small sequences in edlib is far from optimal anyways, since in this case the whole accounting for the band size becomes way to expensive.

@Martinsos
Copy link
Owner

I was wondering why my own implementation (https://github.com/maxbachmann/RapidFuzz) was faster for very dissimilar sequences at some point. And indeed similar to this PR I store the deltas separately, so I do not have to split them. I was unaware this saved me so much time +1

After this PR they are pretty much have the same performance. E.g. for completely different strings with length 1m I get:

  • edlib_old: 21.7s
  • edlib_new: 16.6s
  • rapidfuzz: 16.9s

The implementation for small sequences in edlib is far from optimal anyways, since in this case the whole accounting for the band size becomes way to expensive.

Ok this will be great then!

Yup, for smaller sequences the housekeeping becomes too much. I think the best approach is to use different algorithm (like LandauVishkin) for smaller sequences. Idea was to put this in Edlib and have Edlib choose the algorithm based on the sequence length, but haven't implemented that yet. It is a bit more work because it needs to support everything the current algorithm does. But that seems like the most obvious way to go to me.

@maxbachmann
Copy link
Contributor

maxbachmann commented Apr 19, 2023

Yup, for smaller sequences the housekeeping becomes too much. I think the best approach is to use different algorithm (like LandauVishkin) for smaller sequences. Idea was to put this in Edlib and have Edlib choose the algorithm based on the sequence length, but haven't implemented that yet. It is a bit more work because it needs to support everything the current algorithm does. But that seems like the most obvious way to go to me.

I personally use a couple implementations for the standard Levenshtein distance:

  • if the allowed edits are <= 3 use a brute force approach testing all possible edit combination

  • if sequence length < 64 use hyrroes/myers algorithm, but without blocks

  • if the band size is < 64 use an implementation calculating the bit vectors on the fly

  • if both length and band size is greater use the blockwise implementation

  • I provide a way to cache the bitvectors. While the creation of bitvectors is relatively cheap to create for long sequences, this can mean a significant speedup when comparing multiple shorter sequences.

  • when using short sequences < 64 characters I provide a simd implementation using sse2 and avx2 to compare multiple sequences in parallel

All of this is done inside the library, as long as the user calls it in an appropriate way. E.g. I can only use the simd implementation if the user calls the library with multiple strings.

@RagnarGrootKoerkamp
Copy link
Contributor Author

@Martinsos I just came across this again during the development of sassy.

Would you merge this? I think it's pretty safe and simple?

Otherwise I'd have to decide whether to bench edlib with or without this patch.
If I include it it's confusing because it's not edlib master, but if I exclude it, you loose some perf (well good for me, but edlib could just be better.)

@Martinsos
Copy link
Owner

Hey @RagnarGrootKoerkamp , I will be looking into this in the following days! My apologies for not handling it sooner -> I had hard time finding time for bigger Edlib changes due to other obligations, but I should be able to carve some time now. I expect it to be a relatively straightforward merge, but I want to make sure I study it properly.

@RagnarGrootKoerkamp
Copy link
Contributor Author

Sounds great!

Copy link
Owner

@Martinsos Martinsos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RagnarGrootKoerkamp reviewed the changes, please take a look!

In general, all makes sense, I love the idea. I mostly commented on smaller stuff + some code reorganization for consistency / readability. Let me know how you like it, and if you are able to apply those changes.

Comment on lines +408 to +409
* @param [out] Phout Bitset, 00..01 when hout == +1.
* @param [out] Mhout Bitset, 00..01 when hout == -1.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a single output, a pair, not two outputs, so let's show it as such here, as a single @param?

Comment on lines +404 to +405
* @param [in] Phin Bitset, 00..01 when hin == +1.
* @param [in] Mhin Bitset, 00..01 when hin == -1.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param [in] Phin Bitset, 00..01 when hin == +1.
* @param [in] Mhin Bitset, 00..01 when hin == -1.
* @param [in] Phin Bitset, 00..01 when hin == +1, otherwise 00..00.
* @param [in] Mhin Bitset, 00..01 when hin == -1, otherwise 00..00.

Comment on lines +408 to +409
* @param [out] Phout Bitset, 00..01 when hout == +1.
* @param [out] Mhout Bitset, 00..01 when hout == -1.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param [out] Phout Bitset, 00..01 when hout == +1.
* @param [out] Mhout Bitset, 00..01 when hout == -1.
* @param [out] Phout Bitset, 00..01 when hout == +1, otherwise 00..00.
* @param [out] Mhout Bitset, 00..01 when hout == -1, otherwise 00..00.

Comment on lines 413 to 416
// hin can be 1, -1 or 0.
// 1 -> 00...01
// 0 -> 00...00
// -1 -> 11...11 (2-complement)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not that meaningful any more, we can maybe rewrite it like this:

Suggested change
// hin can be 1, -1 or 0.
// 1 -> 00...01
// 0 -> 00...00
// -1 -> 11...11 (2-complement)
// (Phin, Mhin) can have one of three possible values:
// - (1, 0) if hin is 1.
// - (0, 1) if hin is -1.
// - (0, 0) if hin is 0.

int hout = 0;
// This is instruction below written using 'if': if (Ph & HIGH_BIT_MASK) hout = 1;
hout = (Ph & HIGH_BIT_MASK) >> (WORD_SIZE - 1);
Word Phout = Ph >> (WORD_SIZE - 1);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you removed & HIGH_BIT_MASK, and I was trying to figure why, but now that I am looking at it, it seems it was never useful, was it? So this was always redundant not a change caused by the other changes here, right?
Same for Mhout below.

int bestScore = -1;
vector<int> positions; // TODO: Maybe put this on heap?
const int startHout = mode == EDLIB_MODE_HW ? 0 : 1; // If 0 then gap before query is not penalized;
const Word PstartHout = mode == EDLIB_MODE_HW ? 0 : 1; // If 0 then gap before query is not penalized;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now is a bit harder to track what is start state, since only P part of hout is set at start, and what it the relationship between the start hout and Phout and Mhout.

What if we did this instead:

    int bestScore = -1;
    vector<int> positions; // TODO: Maybe put this on heap?
    // PMhout is (Phout, Mhout): (1,0) if hout is 1, (0,1) if hout is -1, (0,0) if hout is 0.
    // Also, hout == Phout - Mhout.
    const pair<Word,Word> startPMhout = mode == EDLIB_MODE_HW ? {0,0} : {1,0}; // If 0 then gap before query is not penalized;
    const unsigned char* targetChar = target;
    for (int c = 0; c < targetLength; c++) { // for each column
        const Word* Peq_c = Peq + (*targetChar) * maxNumBlocks;

        //----------------------- Calculate column -------------------------//
        pair<Word,Word> PMhout = startPMhout;
        bl = blocks + firstBlock;
        Peq_c += firstBlock;
        for (int b = firstBlock; b <= lastBlock; b++) {
            PMhout = calculateBlock(bl->P, bl->M, *Peq_c, PMhout.first, PMhout.second, bl->P, bl->M);
            bl->score += PMhout.first - PMhout.second; // PMhout.first - PMhout.second == hout
            bl++; Peq_c++;
        }
        bl--; Peq_c--;
        const int hout = PMhout.first - PMhout.second;
        //------------------------------------------------------------------//

So now we have startPMhout at the start, as a single value that describes starting condition, then we have PMhout to track how hout develops as we calculate block by block, and I avoided recalculating hout in every step, we don't really need it in that format, PMhout is actually the format we need. Only at the very end we actually calculate hout, because it is useful for the readability further below.

Comment on lines +612 to +613
pair<Word, Word> PMhout = calculateBlock(bl->P, bl->M, *Peq_c, Phout, Mhout, bl->P, bl->M);
bl->score = (bl - 1)->score - hout + WORD_SIZE + PMhout.first - PMhout.second;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we apply the changes I suggested above, then we also need to adjust this part a bit:

Suggested change
pair<Word, Word> PMhout = calculateBlock(bl->P, bl->M, *Peq_c, Phout, Mhout, bl->P, bl->M);
bl->score = (bl - 1)->score - hout + WORD_SIZE + PMhout.first - PMhout.second;
pair<Word, Word> nextPMhout = calculateBlock(bl->P, bl->M, *Peq_c, PMhout.first, PMhout.second, bl->P, bl->M);
const int nextHout = nextPMhout.first - nextPMhout.second;
bl->score = (bl - 1)->score - hout + WORD_SIZE + nextHout;


//----------------------- Calculate column -------------------------//
int hout = 1;
Word Phout = 1;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If changes I suggested above make sense, we could do the same thing here -> have only PMhout here, no Phout, Mhout, and hout, till we are done with calculating blocks, only then we calculate hout, and below we go with nextPMhout and nextHout.
Oh I see now I used newHout here already -> but let's rename that to nextHout, that is a better name (since it is "next" block).

@RagnarGrootKoerkamp
Copy link
Contributor Author

RagnarGrootKoerkamp commented May 15, 2025

(I'll look more tomorrow, but in the meantime, feel free to push to the branch directly (you should have permission).)

@Martinsos
Copy link
Owner

(I'll look more tomorrow, but in the meantime, feel free to push to the branch directly (you should have permission).)

No worries take your time, and I do want you to take a look anyway before the changes I suggested are applied!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants