Unpack jv values directly#50
Conversation
| cdef int i | ||
| cdef jv ik | ||
| cdef jv iv |
There was a problem hiding this comment.
I try to avoid using variable names like i. Are there more descriptive names that could be used?
There was a problem hiding this comment.
Sure, no problem. Changed those to idx/idx_key/idx_value. Tell me if that's not clear enough.
| while True: | ||
| if not jv_object_iter_valid(v, i): | ||
| break |
There was a problem hiding this comment.
Would while jv_object_iter_valid(v, i): do the same thing?
There was a problem hiding this comment.
D'oh, of course. Forgot to tidy it up after converting jv_object_foreach to Python. Replaced.
| results.append(iterator._next_string()) | ||
| except StopIteration: | ||
| return "\n".join(results) | ||
| return "\n".join(json.dumps(v) for v in self) |
There was a problem hiding this comment.
Is using _jv_to_python and json.dumps faster than using jv_dump_string?
There was a problem hiding this comment.
I haven't tested it, but very likely not. However, it's simpler. Tell me if you'd prefer performance over simplicity here. Provided there's performance benefit.
There was a problem hiding this comment.
I think it'd be useful to know what the performance implications are. I (perhaps naively!) think that it shouldn't introduce too much complexity -- instead of just _next_string, we could have something like _next_result which returns a jv, and then the two different methods (text and __next__) can handle the result appropriately.
There was a problem hiding this comment.
I'll try to throw together a little test for this, but meanwhile I pushed the change and it took me 25 lines:
diff --git a/jq.pyx b/jq.pyx
index f5642fe..54f024a 100644
--- a/jq.pyx
+++ b/jq.pyx
@@ -254,7 +254,13 @@ cdef class _ProgramWithInput(object):
return _ResultIterator(self._jq_state_pool, self._bytes_input)
def text(self):
- return "\n".join(json.dumps(v) for v in self)
+ iterator = self._make_iterator()
+ results = []
+ while True:
+ try:
+ results.append(iterator._next_string())
+ except StopIteration:
+ return "\n".join(results)
def all(self):
return list(self)
@@ -288,7 +294,25 @@ cdef class _ResultIterator(object):
return self
def __next__(self):
+ cdef jv value
+ self._next_jv(&value)
+ return _jv_to_python(value)
+
+ cdef unicode _next_string(self):
cdef int dumpopts = 0
+ cdef jv value
+ cdef jv dump
+
+ self._next_jv(&value)
+ dump = jv_dump_string(value, dumpopts)
+ try:
+ string = jv_string_value(dump).decode("utf8")
+ finally:
+ jv_free(dump)
+
+ return string
+
+ cdef int _next_jv(self, jv *presult) except 1:
while True:
if not self._ready:
self._ready_next_input()
@@ -296,7 +320,8 @@ cdef class _ResultIterator(object):
result = jq_next(self._jq)
if jv_is_valid(result):
- return _jv_to_python(result)
+ presult[0] = result
+ return 0
elif jv_invalid_has_msg(jv_copy(result)):
error_message = jv_invalid_get_msg(result)
message = jv_string_value(error_message).decode("utf8")| try: | ||
| arr.append(_jv_to_python(iv)) | ||
| finally: | ||
| jv_free(iv) |
There was a problem hiding this comment.
Would changing the contract of _jv_to_python to consume its input simplify some of this code?
There was a problem hiding this comment.
Yep. I considered this before in context of memory usage, but couldn't wrap my head around it. Did some experiments and amended it. Should be better now.
Also, I love your usage of "contract" 😀 Nobody would understand me if I tried talking like that at my place 😞.
Add tests verifying various value types and combinations are preserved after passing through the program.
Even though "message-less" invalid values don't technically need "freeing", doing so is supported by the library, and could be good for uniformity and future-proofing.
4ec74ff to
8a5769a
Compare
| cdef jv idx_key | ||
| cdef jv idx_value |
There was a problem hiding this comment.
I don't think these are indices?
There was a problem hiding this comment.
I started them with idx_ to indicate those are key/value at the idx index. As in "index'es key" and "index'es value". What would work better for you?
There was a problem hiding this comment.
Perhaps property_name and property_value?
There was a problem hiding this comment.
Sure. However, this way we would be assigning the return value of jv_object_iter_key() to property_name, which I think would be confusing. Like this:
property_name = jv_object_iter_key(value, idx)So, for the start I renamed them to property_key and property_value. Tell me if you would prefer property_name regardless, and I'll change that.
There was a problem hiding this comment.
I don't think it makes too much difference. I think "name" matches the JSON spec, whereas "key" matches jq.
There was a problem hiding this comment.
Ah, now I see where property_ comes from. Would it be OK that we store array elements in property_value as well?
Anyway, just thought I caught a mismatch here and tried to make it better. I think these are such minor details that it wouldn't matter in the end. I'll just put there whatever you'd like :)
There was a problem hiding this comment.
Ah, sorry, didn't notice you merged this already :D Thanks a lot 🎉!
| results.append(iterator._next_string()) | ||
| except StopIteration: | ||
| return "\n".join(results) | ||
| return "\n".join(json.dumps(v) for v in self) |
There was a problem hiding this comment.
I think it'd be useful to know what the performance implications are. I (perhaps naively!) think that it shouldn't introduce too much complexity -- instead of just _next_string, we could have something like _next_result which returns a jv, and then the two different methods (text and __next__) can handle the result appropriately.
8a5769a to
bf8dafe
Compare
|
Well, whaddaya know: nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m18.325s
user 0m17.004s
sys 0m1.353s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m30.572s
user 0m29.437s
sys 0m1.246sThat's processing a (space-less) 266MB JSON file. First one is the yesterday's version, second one is the one I just pushed. |
|
That's not counting the effect of |
|
Now with a little bit of nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m20.934s
user 0m20.038s
sys 0m1.089s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m32.288s
user 0m31.270s
sys 0m1.172sThis time it's a (loosely-spaced) 716KB JSON file, repeated 500 times. |
|
So, I think the simple version wins, but I'll keep the more complex one posted for now. Tell me what you'd prefer, and also how would you like me to call those |
|
Here are the results with the current master branch for reference: nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m29.568s
user 0m28.468s
sys 0m1.189s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m31.145s
user 0m30.328s
sys 0m0.991s |
|
@mwilliamson, please don't hesitate to tell me if anything else is bothering you about this PR, even if it's just difficult to read/review. I'd be glad to improve it to your satisfaction 😃 I really need this stream parsing implementation to start pushing more data to the new Linux Kernel CI result database 😁 |
|
So using |
bf8dafe to
30531bb
Compare
It seems so. The Python implementation has more users, I suppose, and is more optimized. I pushed the version using Do you want me to rename the |
Instead of dumping and parsing JSON, convert JQ's "jv" structures into Python values directly by recursively walking them. The naive implementation is still twice as fast.
30531bb to
7f6137b
Compare
|
Thanks, merged. |
This adds unpacking jv values into Python values directly, bypassing JSON re-parsing, which is roughly twice as fast.
Compared to #49, this adds some tests checking the various value types and combinations are preserved. This also names the unpacking function
_jv_to_python()instead of_jv_unpack().