Is long vector support necessary? #3

@ahl27

Description

Noting down here a potential future issue that I encountered while working on Biostrings.

A current limitation of Biostrings is that the width field of the XStringSet I/O functions is capped at 200,002 because of internal I/O buffer sizing. Users have expressed a desire for unlimited line widths when reading/writing FASTA files, so I'm working on removing the cap. Edit: this aspect of Biostrings has been resolved.

The internal functions rely on variants of the Ocopy functions to copy values from the XStringSet (XVector) object to the I/O buffer. These functions appear to expect int offsets as input, e.g.:

void _vector_Ocopy(SEXP out, int out_offset, SEXP in, int in_offset, int nelt,
                   SEXP lkup, int reverse, int Omode)

On most systems int compiles to a 32-bit integer (int32_t), whose maximum value is 2^31 - 1 (~2.1 billion). I can envision a user having a 2 Gbp sequence -- it would be extremely rare, but not outside the realm of possibility. Attempting to write such a sequence with writeXStringSet would cause integer overflow errors due to the int offset indexing.
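
To make the overflow concrete, here's a minimal standalone C illustration (not Biostrings code) of what happens when an offset past 2^31 - 1 gets forced into an int -- the exact printed value is implementation-defined, but this is what typical two's-complement systems produce:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    long long seq_len = 2500000000LL;  /* a hypothetical 2.5 Gbp sequence */
    int offset = (int) seq_len;        /* truncated to 32 bits */
    printf("INT_MAX: %d\n", INT_MAX);  /* 2147483647 */
    printf("offset:  %d\n", offset);   /* typically -1794967296 */
    return 0;
}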

Upon looking into it more, it turns out that such a vector would be unlikely to be supported in the first place, since all the XVector-derived functions seem to use int indexing internally:

library(Biostrings)
y <- paste(sample(DNA_BASES, 128, replace=TRUE), collapse='')
x <- DNAString(y)
for(i in seq_len(23))
  x <- xscat(x, x)
x ## length 1,073,741,824

## would be length 2,147,483,648, but fails because of overflow
xscat(x, x)

## Error in .Call2("XString_xscat", args, PACKAGE = "Biostrings") : 
##  negative length vectors are not allowed
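
That "negative length" message is the int overflow surfacing: the combined length is computed in 32-bit int arithmetic, and 2^31 wraps around to a negative value. A minimal standalone illustration (signed overflow is technically undefined behavior in C, but this is the typical two's-complement result):

#include <stdio.h>

int main(void)
{
    int len = 1073741824;   /* length of x above */
    int total = len + len;  /* signed overflow: UB in C, */
    printf("%d\n", total);  /* but typically wraps to -2147483648 */
    return 0;
}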

And this extends to all XVector classes:

XInteger(3e9)
## Error in SharedInteger(length = length, val = val) : 
##   'length' must be a single non-negative integer
## In addition: Warning message:
## In .local(.Object, ...) : NAs introduced by coercion to integer range

Note that R itself does support long vectors (since R 3.0.0), using doubles for indexing:

## this uses a lot of RAM, be careful
x <- rep(1L, 3e9)
length(x)
## [1] 3e+09
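
At the C level, base R handles this with R_xlen_t (a ptrdiff_t on builds with long-vector support) and XLENGTH(), where the older LENGTH() is only safe up to 2^31 - 1. A minimal sketch of the pattern, with a hypothetical helper name:

#include <R.h>
#include <Rinternals.h>

/* Sketch: iterate over a possibly-long integer vector.
   XLENGTH() returns the full length as R_xlen_t. */
SEXP sum_ints(SEXP x)
{
    R_xlen_t n = XLENGTH(x);  /* safe past 2^31 - 1 elements */
    double s = 0;
    for (R_xlen_t i = 0; i < n; i++)
        s += INTEGER(x)[i];
    return ScalarReal(s);
}

(Compiled via R CMD SHLIB and invoked with .Call, this works on vectors longer than 2^31 - 1 because no index ever passes through an int.)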

The obvious solution is to move to the more standard size_t or long (or (u)int64_t). size_t and uint64_t are unsigned, which may break processing that relies on negative values for internal checks in other XVector/IRanges/Biostrings methods. Some functions do already use long values internally, e.g.:

if (lkup == R_NilValue && reverse == 0 && Omode == 0) {
    copy_vector_block(out, (long long int) out_offset,
                      in, (long long int) in_offset,
                      (long long int) nelt);
    return;
}

So that makes me think it may be as simple as converting int to long int, e.g.:

void _vector_Ocopy(SEXP out, long int out_offset, SEXP in, long int in_offset,
                   long int nelt, SEXP lkup, int reverse, int Omode)
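
One wrinkle: plain long is not a portable fix, since it is still 32 bits on 64-bit Windows (LLP64). R's own R_xlen_t might be the safer drop-in; a hypothetical version of the same signature:

/* Hypothetical signature using R's long-vector index type,
   which is ptrdiff_t (64-bit on LP64 and LLP64 alike): */
void _vector_Ocopy(SEXP out, R_xlen_t out_offset, SEXP in, R_xlen_t in_offset,
                   R_xlen_t nelt, SEXP lkup, int reverse, int Omode)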

But there would be a ton of places to change this, and I'm not sure what impact it would have downstream.

My larger question is: is this actually needed, and if so, how urgently? It's not clear to me how often users work with extremely large vector objects, or whether they will become more common in the future. At the very least, it's probably worth adding a warning to the documentation somewhere and/or providing more informative error messages to catch the XInteger(3e9) case. It might also be worth asking on the Bioconductor Slack.
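
For the error-message piece, a hypothetical guard at the C level could fail loudly instead of letting the length silently coerce to NA -- a sketch, not actual XVector code:

#include <R.h>
#include <Rinternals.h>
#include <limits.h>

/* Hypothetical guard: reject lengths past int range with a clear message */
SEXP check_xvector_length(SEXP length)
{
    double len = asReal(length);
    if (len > INT_MAX)
        error("XVector classes do not support lengths above 2^31 - 1 (got %.0f)",
              len);
    return ScalarInteger((int) len);
}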
