Is long vector support necessary? #3

@ahl27

Description

Noting down here a potential future issue that I encountered while working on Biostrings.

A current limitation of Biostrings is that the width field of the XStringSet I/O functions is capped at 200,002 because of internal I/O buffer sizing. Users have expressed a desire for unlimited line widths when reading/writing FASTA files, so I'm working on removing the cap. Edit: this aspect of Biostrings has been resolved.

The internal functions rely on variants of the Ocopy functions to copy values from the XStringSet (XVector) object to the I/O buffer. These functions appear to expect int offsets as input, e.g.:

void _vector_Ocopy(SEXP out, int out_offset, SEXP in, int in_offset, int nelt,
                   SEXP lkup, int reverse, int Omode)

On most systems int compiles to a 32-bit integer (int32_t), whose maximum value is 2^31 - 1 (~2.1 billion). I can envision a user having a 2 Gbp sequence -- it would be extremely rare, but not outside the realm of possibility. Attempting to write such a sequence with writeXStringSet would cause integer overflow errors due to the int offset indexing.
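
To make the overflow concrete, here's a minimal standalone C illustration (not Biostrings code) of what happens when an offset past 2^31 - 1 gets forced into an int -- the exact printed value is implementation-defined, but this is what typical two's-complement systems produce:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    long long seq_len = 2500000000LL;  /* a hypothetical 2.5 Gbp sequence */
    int offset = (int) seq_len;        /* truncated to 32 bits */
    printf("INT_MAX: %d\n", INT_MAX);  /* 2147483647 */
    printf("offset:  %d\n", offset);   /* typically -1794967296 */
    return 0;
}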

Upon looking into it more, it turns out that such a vector would be unlikely to be supported in the first place, since all the XVector-derived functions seem to use int indexing internally:

library(Biostrings)
y <- paste(sample(DNA_BASES, 128, replace=TRUE), collapse='')
x <- DNAString(y)
for(i in seq_len(23))
  x <- xscat(x, x)
x ## length 1,073,741,824

## would be length 2,147,483,648, but fails because of overflow
xscat(x, x)

## Error in .Call2("XString_xscat", args, PACKAGE = "Biostrings") : 
##  negative length vectors are not allowed
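
That "negative length" message is the int overflow surfacing: the combined length is computed in 32-bit int arithmetic, and 2^31 wraps around to a negative value. A minimal standalone illustration (signed overflow is technically undefined behavior in C, but this is the typical two's-complement result):

#include <stdio.h>

int main(void)
{
    int len = 1073741824;   /* length of x above */
    int total = len + len;  /* signed overflow: UB in C, */
    printf("%d\n", total);  /* but typically wraps to -2147483648 */
    return 0;
}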

And this extends to all XVector classes:

XInteger(3e9)
## Error in SharedInteger(length = length, val = val) : 
##   'length' must be a single non-negative integer
## In addition: Warning message:
## In .local(.Object, ...) : NAs introduced by coercion to integer range

Note that R itself does support long vectors (since R 3.0.0), using doubles for indexing:

## this uses a lot of RAM, be careful
x <- rep(1L, 3e9)
length(x)
## [1] 3e+09
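
At the C level, base R handles this with R_xlen_t (a ptrdiff_t on builds with long-vector support) and XLENGTH(), where the older LENGTH() is only safe up to 2^31 - 1. A minimal sketch of the pattern, with a hypothetical helper name:

#include <R.h>
#include <Rinternals.h>

/* Sketch: iterate over a possibly-long integer vector.
   XLENGTH() returns the full length as R_xlen_t. */
SEXP sum_ints(SEXP x)
{
    R_xlen_t n = XLENGTH(x);  /* safe past 2^31 - 1 elements */
    double s = 0;
    for (R_xlen_t i = 0; i < n; i++)
        s += INTEGER(x)[i];
    return ScalarReal(s);
}

(Compiled via R CMD SHLIB and invoked with .Call, this works on vectors longer than 2^31 - 1 because no index ever passes through an int.)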

The obvious solution is to move to the more standard size_t or long (or (u)int64_t). size_t and uint64_t are unsigned, which may break processing that relies on negative values for internal checks in other XVector/IRanges/Biostrings methods. Some functions do already use long values internally, e.g.:

if (lkup == R_NilValue && reverse == 0 && Omode == 0) {
    copy_vector_block(out, (long long int) out_offset,
                      in, (long long int) in_offset,
                      (long long int) nelt);
    return;
}

So that makes me think it may be as simple as converting int to long int, e.g.:

void _vector_Ocopy(SEXP out, long int out_offset, SEXP in, long int in_offset,
                   long int nelt, SEXP lkup, int reverse, int Omode)
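
One wrinkle: plain long is not a portable fix, since it is still 32 bits on 64-bit Windows (LLP64). R's own R_xlen_t might be the safer drop-in; a hypothetical version of the same signature:

/* Hypothetical signature using R's long-vector index type,
   which is ptrdiff_t (64-bit on LP64 and LLP64 alike): */
void _vector_Ocopy(SEXP out, R_xlen_t out_offset, SEXP in, R_xlen_t in_offset,
                   R_xlen_t nelt, SEXP lkup, int reverse, int Omode)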

But there would be a ton of places to change this, and I'm not sure what impact it would have downstream.

My larger question is: is this actually needed, and if so, how urgently? It's not clear to me how often users work with extremely large vector objects, or whether they will become more common in the future. At the very least, it's probably worth adding a warning to the documentation somewhere and/or providing more informative error messages to catch the XInteger(3e9) case. It might also be worth asking on the Bioconductor Slack.
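
For the error-message piece, a hypothetical guard at the C level could fail loudly instead of letting the length silently coerce to NA -- a sketch, not actual XVector code:

#include <R.h>
#include <Rinternals.h>
#include <limits.h>

/* Hypothetical guard: reject lengths past int range with a clear message */
SEXP check_xvector_length(SEXP length)
{
    double len = asReal(length);
    if (len > INT_MAX)
        error("XVector classes do not support lengths above 2^31 - 1 (got %.0f)",
              len);
    return ScalarInteger((int) len);
}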
