adds the `nth` function for iterables #56580
base/iterators.jl
Outdated
"""
nth(itr, n::Integer)

Get the `n`th element of an iterable collection. Return `nothing` if not existing.
Returning `nothing` makes it impossible to distinguish between "the nth element was `nothing`" and "there was no nth element". Perhaps return `Union{Nothing, Some}`?
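The ambiguity can be seen in a small sketch. The helper name `nth_some` is hypothetical and is not the PR's implementation; it only illustrates what wrapping hits in `Some` would buy.

```julia
# Hypothetical helper (not from the PR): wrap a found element in `Some`,
# return a bare `nothing` only when the iterator is too short.
function nth_some(itr, n::Integer)
    for (i, x) in enumerate(itr)
        i == n && return Some(x)
    end
    return nothing
end

items = [1, nothing, 3]

hit  = nth_some(items, 2)   # Some(nothing): the element exists and is `nothing`
miss = nth_some(items, 5)   # nothing: there is no fifth element

# `something` unwraps a `Some`; it would throw on a bare `nothing`.
first_item = something(nth_some(items, 1))
```

With a plain `nothing` return value, `hit` and `miss` would be indistinguishable.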
Fair point.
Should it be `Union{Nothing, Some}` even in those cases where we know there can't be a `nothing` value in the iterator (for the sake of a uniform API)? I.e. the `Count` iterator, or `Repeated` (with its element different from `nothing`), or `AbstractRange`s.
I think it should, otherwise it would be too confusing.
I would just throw an error if there is no `n`th element. There could also be a `default` argument as in `get`, where a user can pass a value that should be returned if no `n`th element exists.
I don't really follow the logic that the spirit of iterators is to return `nothing` in such cases?
Agree `nothing` is weird, your iterator can produce that. `Some` seems a bit technical & unfriendly? An error seems fine. Matches what `first([])` does.
I suppose it can't literally be a method of `get` since it goes by enumeration not keys:
julia> first(Dict('a':'z' .=> 'A':'Z'), 3)
3-element Vector{Pair{Char, Char}}:
'n' => 'N'
'f' => 'F'
'w' => 'W'
julia> nth(Dict('a':'z' .=> 'A':'Z'), 3)
'w' => 'W'
base/iterators.jl
Outdated
```
"""
nth(itr, n::Integer) = _nth(IteratorSize(itr), itr, n)
nth(itr::AbstractArray, n::Integer) = n > length(itr) ? nothing : itr[n]
This assumes one-based indexing. Perhaps do `itr[begin + n - 1]`.
You are absolutely correct.
Would something like `getindex(itr, nth(eachindex(IndexLinear(), itr), n))` be overkill?
And adding a specialization with `nth(itr::AbstractRange, n::Integer) = getindex(itr, n)`?
I went with the probably overkill approach; if it's too much I'll revert back to your suggestion.
`AbstractRange`s are not always one-based either, so that approach runs into the same issue.
From what I could gather that is included in the `getindex` already, since it ends up calling
unsafe_getindex(v::AbstractRange{T}, i::Integer) where T = convert(T, first(v) + (i - oneunit(i))*step_hp(v))
which should pretty much be the same as `[begin + n - 1]`, unless I'm missing the point completely?
The line `nth(itr::AbstractRange, n) = getindex(itr, n)` will for sure fail on the axes of an OffsetArray. (In fact, it will first be ambiguous, as `n::Any` is less specific.)
I was overthinking it, I'll just stick with `[begin + n - 1]`. Sorry.
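To make the one-based assumption concrete, here is a minimal sketch using only Base. `ZeroBasedVector` and `nth_by_begin` are hypothetical names invented for illustration; OffsetArrays.jl is the real-world analogue mentioned in the thread.

```julia
# Illustrative zero-based wrapper (an assumption for this sketch, not part of the PR).
struct ZeroBasedVector{T} <: AbstractVector{T}
    data::Vector{T}
end

Base.size(v::ZeroBasedVector) = size(v.data)
# Valid indices are 0:length-1 instead of 1:length.
Base.axes(v::ZeroBasedVector) = (Base.IdentityUnitRange(0:length(v.data) - 1),)
Base.IndexStyle(::Type{<:ZeroBasedVector}) = IndexLinear()
Base.getindex(v::ZeroBasedVector, i::Int) = v.data[i + 1]

# Same body as the one-liner suggested in the review: `begin` lowers to `firstindex(v)`.
nth_by_begin(v::AbstractArray, n::Integer) = v[begin + n - 1]

z = ZeroBasedVector([10, 20, 30])

first_elem = nth_by_begin(z, 1)   # first element in iteration order
wrong_elem = z[1]                 # second element, because indices start at 0
```

Indexing with a raw `n` silently returns the wrong element here, while `begin + n - 1` follows whatever first index the array reports.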
how would this compare to a more naive implementation like
?
Seems like a lot of code. I reproduced the above benchmark here:
No strong position on whether this needs a name or not, but perhaps this first PR can focus on that, and let the implementation be just:
nth(itr, n::Integer) = first(Iterators.drop(itr, n-1))
nth(itr::AbstractArray, n::Integer) = itr[begin-1+n]
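As a rough sketch of how that generic fallback behaves, the following uses a hypothetical name, `naive_nth`, so that it does not clash with anything the PR defines. `eachsplit` assumes Julia 1.8 or later.

```julia
# Hypothetical name; same body as the generic one-liner suggested above.
naive_nth(itr, n::Integer) = first(Iterators.drop(itr, n - 1))

third = naive_nth(10:10:50, 3)                      # element of a range
field = naive_nth(eachsplit("a\tb\tc", '\t'), 2)    # works on lazy iterators too

# Asking past the end: `drop` yields an empty iterator and `first` throws.
out_of_bounds = try
    naive_nth(1:3, 4)
catch err
    err
end

# n = 0 asks `drop` for a negative count, which throws as well.
negative_drop = try
    naive_nth(1:3, 0)
catch err
    err
end
```

Both failure modes surface as an `ArgumentError` raised by `first` or `drop` rather than by `nth` itself, which is the error behaviour discussed later in the thread.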
A lot of the code is for optimizing out-of-bounds checking. If we go with davidanthoff's suggestion of letting
I disagree with throwing an error. In cases where you don't know if an nth element exists, that forces a try-catch which is both slow and brittle. I would imagine that most ordered iterators with a known length support indexing, so this would probably mostly be used precisely when the length is unknown.
I think another consideration here is consistency: the other functions we have that take an individual element from an iterator are
I agree with @jakobnissen that in some situations being able to handle this without an exception would be nice, but on the flip side, I can also see scenarios where an error seems much better, in particular in interactive sessions where I might be playing around with some data and this function could be very useful. And especially in an interactive scenario it would be super inconvenient if
Maybe the best design would be to allow for both scenarios. Say something like
nth(itr, n, nothrow=false)
So the default would be that an exception is thrown if the
We could also opt for relying on the
Although I see the similarity with
the error in
lastindex(a::AbstractArray) = (@inline; last(eachindex(IndexLinear(), a))) # equals to last(OneTo(0))
Similarly, both
From this my idea that in principle iterators are non throwing by default, any throwing should be done one level higher and not at the iterator level itself (like how
I have to admit, I think that is the option I like least of all of the proposed options so far :) It would make it very tricky to write generic code that uses the
To me
Agreed, but the whole difference between
I still think that my proposal with an argument like
Is there any precedent for a
We could also follow
I think it's already hard to write generic code that covers both generic collections and
Not really, I don't have particularly hard opinions about it. In the original issue I had proposed something similar with
My proposal for
throwing together some "PR literature review" for cross reference since I think this PR can depend/interact on/with these:
EDIT:
Having thought about it, I do have some sympathy for the argument of @davidanthoff that it should behave like
I do see myself wanting to use it in code like:
fourth_field = @something nth(eachsplit(line, '\t'), 4) throw(FormatError(
    lazy"Line $lineno does not contain four tab-separated fields"
))
Which would now instead be
fourth_field = first(@something iterate(drop(eachsplit(line, '\t'), 3)) throw(FormatError(
    lazy"Line $lineno does not contain four tab-separated fields"
)))
That's certainly doable (especially since, for iterators of unknown length, most of the clever tricks that
What is the semantic difference between this function and
matching
Yes, agreed! Having two distinct functions probably also helps with type stability. Another naming scheme I thought about is
Julia has a bunch of patterns for handling this already, so one has some freedom to choose "consistent with what?" :)
I see 4 ways of handling errors in Julia:
Personally I'd be happy with Base having both
The 5th option is Union{T,S} where you supply S -- like
It takes the iteration count, not the index. (Same on Vector, different on Dict, or OffsetArray.)
That one's not as bad, as it's either an index or
I see, thanks.
The problem with "just assume users do what you expect" is that (1) nobody ever documents what they expect and (2) even when documented, it increases the complexity of usage. No library function using
Might be nice to accept an
But I agree with
Nonscalar indexing
I presume the point of How well
Bump.
added nth to export list
add docs explaining interaction with Stateful iterators change test to be Any vectors instead of tuples (actually way faster as well)
chore: small reordering. feat: nth is only throwing. with explicit errors.
Co-authored-by: adienes <[email protected]>
Co-authored-by: Steven G. Johnson <[email protected]>
the only non-explicitly-triage-approved part I can see is the inclusion of the curried 1-arg method, personally, I don't have a particularly strong opinion in magnitude but I do agree in direction with @stevengj that it's a bit inconsistent given that there are basically no other 1-arg methods like this in
does it need another triage discussion? probably not --- maybe just a show of 👍👎 somewhere suffices?
end

# finite cycle: in reality a Flatten{Take{Repeated{O}}} iterator
function nth(itr::Flatten{Take{Repeated{O}}}, n::Integer) where {O}
Is there a clear need for all these specialisations? They seem quite exotic... do they show up? Are they really almost always faster? Are they in fact all well-tested?
When I looked at the benchmarks earlier, the total time was driven by one case, so didn't answer these questions.
The `Flatten{Take{Repeated{O}}}` is the actual type of a finite cycle, as in `Iterators.cycle(itr, n)`.
In your benchmark the total time was driven by the infinite cycle, the second biggest contribution was the finite cycle by a 36x factor (72ns vs 2ns) and allocating vs non allocating.
To be more concrete, can you please post benchmarks for each of the methods in the present PR, comparing to the most basic implementation? Ideally at varying lengths etc. to see if small/large is different.
> `Flatten{Take{Repeated{O}}}` is the actual type of a finite cycle, as in `Iterators.cycle(itr, n)`.

That is also what the comment should say, not repeat the signature.
If you refer to the `Count`, `Repeated` and `Take{<:Repeated}` specializations, at the time I had some small testing and the performance was always matching the naive or better.
I've rerun some tests and now only the `Take{<:Repeated}` specialization has somewhat of a slightly better performance. But yes, probably also that doesn't warrant a specialization anymore.
julia> @benchmark nth($repeateditr, 9235)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
Range (min … max): 3.042 ns … 8.750 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.167 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.172 ns ± 0.097 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅ █ ▅ ▃ ▁
▃▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▇ █
3.04 ns Histogram: log(frequency) by time 3.29 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark _nth($repeateditr, 9235)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
Range (min … max): 3.333 ns … 14.166 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.458 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.468 ns ± 0.331 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ █ █ ▃ ▂▃ ▂ ▁ ▁ ▂
▃▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁██▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▇▁▁▁▁▇ █
3.33 ns Histogram: log(frequency) by time 3.75 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
I refer to every single specialisation. Each one needs to be individually justified. Again, on long cases, and short cases, and weird types (cycle of a string? cycle of a tuple?). And ideally in some file others can easily try out on other machines.
Those that survive also need to be individually tested. Again, with weird cases (cycle of an empty vector?) poking for weak points. `if IteratorSize(I) isa Union{HasShape, HasLength}` means you must invent cases taking both paths. Testing only-just out-of-bounds in both directions. And written in a way that makes it obvious to the reader of the test which specialisation is being tested, and where the true values are coming from.
Should the specialization errors behave like the simple case? As in: throwing an `ArgumentError("Drop length must be non-negative")` on negative indices and `ArgumentError("collection must be non-empty")` when "out of bounds". It is a bit artificial since the non-empty error stems from the drop iterator returning an empty collection if called out of bounds. We're basically simulating what the error would be if the `first(drop())` implementation was used.
Thanks this looks great, will read slowly.
Haven't thought hard re errors, it does seem friendly to have something uniform & deliberate. But how much complexity it's worth I don't know.
Cycles of strings might need an extra specialization since the issue is `length(::String)`; it would probably require doing some heuristic using `ncodeunits(::String)` to decide when to start using the specialization.
But at least we already know that a String `HasLength`, so it's basically a "complexity swap"...
I am not sure if it is worth it given the relatively niche usage of these iterators.
But... for longer strings the difference is pretty big (ms vs ns)
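For context on why `length(::String)` is the sticking point: Julia strings are UTF-8 encoded, so counting characters requires scanning the string, while the number of code units is stored. A small illustration follows (the sample string is arbitrary):

```julia
s = "αβγ"                   # three characters, each two UTF-8 code units

nchars = length(s)          # counts characters by walking the string
nunits = ncodeunits(s)      # number of code units, available without a scan

# Character positions and byte indices differ for non-ASCII text.
second_char_index = nextind(s, firstindex(s))
```

Any modular shortcut for a cycle of a string would therefore pay for a `length` scan up front, which is the "complexity swap" described above.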
I wonder if the performance could be matched by specializing `first` and/or `drop` instead of `nth` on these types from `Iterators`. What do you think?
The cycle index recalculations could indeed be specialized for `drop` directly. That's a good point.
Although cycles of strings would still need some extra specialization.
Upon further reflection, I don't think it can work. `drop` must return an iterator that generates all but the first `n` elements. So although a finite cycle iterates through the same shorter iterable, `drop` must still produce all elements after those we skip. The short-circuit we do only works because we don't care about the rest of the drop iterator, only its first element.
As an example, if we have `it = cycle([1,2,3], 3)`, that is equivalent to `[1,2,3,1,2,3,1,2,3]`, and although `it[2] == it[5]` we can't say the same about `Drop(it, 1)` and `Drop(it, 4)`.
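The argument can be checked with a short sketch. It builds the finite cycle from `flatten` and `repeated` (the `Flatten{Take{Repeated}}` shape named earlier in the thread) so that it does not depend on the two-argument `cycle`; `shortcut_nth` is a hypothetical name, and it deliberately skips the bounds check a real specialization would need.

```julia
v = [1, 2, 3]
it = Iterators.flatten(Iterators.repeated(v, 3))   # 1,2,3,1,2,3,1,2,3

# The 2nd and 5th elements coincide...
a = first(Iterators.drop(it, 1))
b = first(Iterators.drop(it, 4))

# ...but the dropped iterators differ in everything that follows.
rest1 = collect(Iterators.drop(it, 1))   # 8 remaining elements
rest4 = collect(Iterators.drop(it, 4))   # 5 remaining elements

# Hypothetical shortcut for a one-based Vector: wrap the index instead of iterating.
# It ignores how many repetitions exist, so it is not a complete `nth`.
shortcut_nth(v::Vector, n::Integer) = v[mod1(n, length(v))]
```

The index wrapping is valid for picking one element, but `drop` cannot reuse it because its result has to carry the correct tail.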
RFC: should this PR introduce the single argument version of
👍 or 👎 this comment
I propose a compromise where only
nth(n::Integer) = Fix2(nth, n)
is defined (mind the type annotation). This would allow for most (all?) of the use cases of getting different elements of the same iterables while still leaving room for getting the
I am not sure I understand what you are proposing. The single argument version is already implemented in this PR and it accept only integers.
The single argument version would be useful to define a function that returns the
But maybe I missed your point completely.
Thanks, I hadn't seen this. However, the point still stands: The question should not be whether to define
There seem to be two useful definitions for a single argument
The problem of the second version is that we do not have an
The second version might be useful in cases where the iteration count is the data in your data flow and the iterator is the thing which this data operates on, like in this contrived example
function abs_sum_some_selected(itr, select::Function)
    select() |> filter(>(2)) .|> nth(itr) .|> abs |> sum
end
julia> abs_sum_some_selected(-10:-1:-15, ()->(2, 3, 4))
25
I am not proposing to add the second version to Base with this PR. I am proposing to add the first version constrained to
There was never the intention of adding the second version you mention. There is no ambiguity in
Although I seem to have difficulties to write down my point in an understandable way and the method discussed here still does not match the implemented method, I think the important point is that we agree about what is currently proposed in this PR as an implementation.
nth(n::Integer) = Fix2(nth, n)
and consider this part of the discussion as resolved.
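For readers unfamiliar with the curried form, here is a minimal sketch of what a `Fix2`-based one-argument method does. `my_nth` is a hypothetical stand-in with the naive body, used so the example runs without the PR.

```julia
# Hypothetical stand-in for the two-argument function.
my_nth(itr, n::Integer) = first(Iterators.drop(itr, n - 1))

# Curried form: fixes the second argument and returns a callable.
my_nth(n::Integer) = Base.Fix2(my_nth, n)

take_second = my_nth(2)

# The same callable can be applied to several different iterables.
seconds = map(take_second, [[1, 2, 3], (4, 5), 7:9])
```

Constraining the single argument to `Integer` is what keeps this from being read as "fix the iterator and supply the count later", the second interpretation debated above.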
Hi,
I've turned the open ended issue #54454 into an actual PR.
Tangentially related to #10092 ?
This PR introduces the `nth(itr, n)` function to iterators to give a `getindex` type of behaviour.
I've tried my best to optimize as much as possible by specializing on different types of iterators.
In the spirit of iterators any OOB access returns `nothing`. (edit: instead of throwing an error, i.e. `first(itr, n)` and `last(itr, n)`)
here is the comparison of running the testsuite (~22 different iterators) using generic `nth` and specialized `nth`: