Unpack jv values directly by spbnick · Pull Request #50 · mwilliamson/jq.py

spbnick · 2020-09-09T10:22:37Z

This adds unpacking jv values into Python values directly, bypassing JSON re-parsing, which is roughly twice as fast.

Compared to #49, this adds some tests checking the various value types and combinations are preserved. This also names the unpacking function _jv_to_python() instead of _jv_unpack().
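As a rough illustration of what "unpacking directly" means, here is a pure-Python model of the recursive walk. The `(kind, payload)` tuples and the `tagged_to_python` name are hypothetical stand-ins for jq's C-level jv values and are not taken from jq.py; returning integral numbers as `int` is an assumption made for this sketch only.

```python
# Pure-Python model of a recursive "tagged value -> Python value" walk.
# The (kind, payload) tuples are a hypothetical stand-in for jq's jv values.

def tagged_to_python(value):
    """Convert a (kind, payload) tagged value into a plain Python value."""
    kind, payload = value
    if kind == "null":
        return None
    if kind in ("true", "false"):
        return kind == "true"
    if kind == "number":
        # Assumption for this sketch: integral numbers come back as int.
        return int(payload) if float(payload).is_integer() else payload
    if kind == "string":
        return payload
    if kind == "array":
        # Recurse into every element.
        return [tagged_to_python(element) for element in payload]
    if kind == "object":
        # Keys are plain strings; only the values need converting.
        return {key: tagged_to_python(item) for key, item in payload.items()}
    raise ValueError("unknown kind: " + repr(kind))


sample = ("object", {
    "name": ("string", "jq"),
    "tags": ("array", [("number", 1.0), ("number", 2.5), ("null", None)]),
    "ok": ("true", None),
})
converted = tagged_to_python(sample)
```

No JSON text is produced or parsed anywhere in this walk, which is the step the PR removes.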

mwilliamson · 2020-09-09T10:32:13Z

+    cdef int i
+    cdef jv ik
+    cdef jv iv


I try to avoid using variable names like i. Are there more descriptive names that could be used?

Sure, no problem. Changed those to idx/idx_key/idx_value. Tell me if that's not clear enough.

mwilliamson · 2020-09-09T10:33:14Z

+        while True:
+            if not jv_object_iter_valid(v, i):
+                break


Would while jv_object_iter_valid(v, i): do the same thing?

D'oh, of course. Forgot to tidy it up after converting jv_object_foreach to Python. Replaced.
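For reference, the two loop shapes discussed here behave identically. The sketch below uses hypothetical Python stand-ins (`object_iter`, `object_iter_valid`, `object_iter_next`) that only mimic the valid/next shape of an index-based object iterator; they are not the real jq bindings.

```python
# Hypothetical stand-ins for an index-based object iterator.
ITER_FINISHED = -1

def object_iter(pairs):
    return 0 if pairs else ITER_FINISHED

def object_iter_valid(pairs, idx):
    return idx != ITER_FINISHED

def object_iter_next(pairs, idx):
    return idx + 1 if idx + 1 < len(pairs) else ITER_FINISHED

def collect_with_break(pairs):
    # Original shape: infinite loop with an explicit break.
    result = {}
    idx = object_iter(pairs)
    while True:
        if not object_iter_valid(pairs, idx):
            break
        key, value = pairs[idx]
        result[key] = value
        idx = object_iter_next(pairs, idx)
    return result

def collect_with_condition(pairs):
    # Suggested shape: the validity check is the loop condition.
    result = {}
    idx = object_iter(pairs)
    while object_iter_valid(pairs, idx):
        key, value = pairs[idx]
        result[key] = value
        idx = object_iter_next(pairs, idx)
    return result

pairs = [("a", 1), ("b", 2)]
```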

mwilliamson · 2020-09-09T10:35:09Z

-                results.append(iterator._next_string())
-            except StopIteration:
-                return "\n".join(results)
+        return "\n".join(json.dumps(v) for v in self)


Is using _jv_to_python and json.dumps faster than using jv_dump_string?

I haven't tested it, but very likely not. However, it's simpler. Tell me if you'd prefer performance over simplicity here, provided there actually is a performance benefit.

I think it'd be useful to know what the performance implications are. I (perhaps naively!) think that it shouldn't introduce too much complexity -- instead of just _next_string, we could have something like _next_result which returns a jv, and then the two different methods (text and __next__) can handle the result appropriately.

I'll try to throw together a little test for this, but meanwhile I pushed the change and it took me 25 lines:

diff --git a/jq.pyx b/jq.pyx
index f5642fe..54f024a 100644
--- a/jq.pyx
+++ b/jq.pyx
@@ -254,7 +254,13 @@ cdef class _ProgramWithInput(object):
         return _ResultIterator(self._jq_state_pool, self._bytes_input)
 
     def text(self):
-        return "\n".join(json.dumps(v) for v in self)
+        iterator = self._make_iterator()
+        results = []
+        while True:
+            try:
+                results.append(iterator._next_string())
+            except StopIteration:
+                return "\n".join(results)
 
     def all(self):
         return list(self)
@@ -288,7 +294,25 @@ cdef class _ResultIterator(object):
         return self
 
     def __next__(self):
+        cdef jv value
+        self._next_jv(&value)
+        return _jv_to_python(value)
+
+    cdef unicode _next_string(self):
         cdef int dumpopts = 0
+        cdef jv value
+        cdef jv dump
+
+        self._next_jv(&value)
+        dump = jv_dump_string(value, dumpopts)
+        try:
+            string = jv_string_value(dump).decode("utf8")
+        finally:
+            jv_free(dump)
+
+        return string
+
+    cdef int _next_jv(self, jv *presult) except 1:
         while True:
             if not self._ready:
                 self._ready_next_input()
@@ -296,7 +320,8 @@ cdef class _ResultIterator(object):
 
             result = jq_next(self._jq)
             if jv_is_valid(result):
-                return _jv_to_python(result)
+                presult[0] = result
+                return 0
             elif jv_invalid_has_msg(jv_copy(result)):
                 error_message = jv_invalid_get_msg(result)
                 message = jv_string_value(error_message).decode("utf8")
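The structure being discussed, one low-level method that fetches a raw result and two thin consumers on top of it, can be modelled in plain Python. This is only a sketch: `ResultIteratorSketch` and its constructor argument are hypothetical, and `json.dumps` stands in for whichever serialiser the text path ends up using.

```python
import json

class ResultIteratorSketch:
    """Hypothetical model: one low-level fetch, two thin consumers."""

    def __init__(self, raw_results):
        # raw_results stands in for the values the jq program would produce.
        self._raw = iter(raw_results)

    def _next_result(self):
        # The single place that fetches the next raw result. Raises
        # StopIteration once the underlying source is exhausted.
        return next(self._raw)

    def __iter__(self):
        return self

    def __next__(self):
        # Consumer 1: hand the value back as a Python object.
        return self._next_result()

    def _next_string(self):
        # Consumer 2: serialise the same raw result to JSON text.
        return json.dumps(self._next_result())

    def text(self):
        lines = []
        while True:
            try:
                lines.append(self._next_string())
            except StopIteration:
                return "\n".join(lines)
```

Both `__next__` and `text` share the fetching and error handling in `_next_result`, so the extra cost of supporting two output forms is limited to the two small consumer methods.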

mwilliamson · 2020-09-09T10:40:23Z

+            try:
+                arr.append(_jv_to_python(iv))
+            finally:
+                jv_free(iv)


Would changing the contract of _jv_to_python to consume its input simplify some of this code?

Yep. I considered this before in context of memory usage, but couldn't wrap my head around it. Did some experiments and amended it. Should be better now.

Also, I love your usage of "contract" 😀 Nobody would understand me if I tried talking like that at my place 😞.
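To make the "consume its input" contract concrete, here is a small Python model. `FakeHandle` and `free` are hypothetical stand-ins for a manually managed value and its release function; the point is only where the cleanup lives, not how jq manages memory.

```python
class FakeHandle:
    """Hypothetical stand-in for a manually managed value."""

    def __init__(self, payload):
        self.payload = payload
        self.freed = False

def free(handle):
    assert not handle.freed, "double free"
    handle.freed = True

def to_python_borrowing(handle):
    # Borrowing contract: the caller still owns `handle` afterwards.
    return handle.payload

def to_python_consuming(handle):
    # Consuming contract: this function releases `handle`, even on error.
    try:
        return handle.payload
    finally:
        free(handle)

def convert_all_borrowing(handles):
    out = []
    for handle in handles:
        # Every call site needs its own try/finally.
        try:
            out.append(to_python_borrowing(handle))
        finally:
            free(handle)
    return out

def convert_all_consuming(handles):
    # No try/finally at the call site: ownership moved into the callee.
    return [to_python_consuming(handle) for handle in handles]
```

With the consuming contract the cleanup is written once, inside the converter, instead of being repeated around every recursive call.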

mwilliamson · 2020-09-09T17:45:24Z

+    cdef jv idx_key
+    cdef jv idx_value


I don't think these are indices?

I prefixed them with idx_ to indicate that they are the key and value at the idx index, as in "index's key" and "index's value". What would work better for you?

Perhaps property_name and property_value?

Sure. However, this way we would be assigning the return value of jv_object_iter_key() to property_name, which I think would be confusing. Like this:

property_name = jv_object_iter_key(value, idx)

So, for a start, I renamed them to property_key and property_value. Tell me if you would prefer property_name regardless, and I'll change that.

I don't think it makes too much difference. I think "name" matches the JSON spec, whereas "key" matches jq.

Ah, now I see where property_ comes from. Would it be OK that we store array elements in property_value as well?
Anyway, just thought I caught a mismatch here and tried to make it better. I think these are such minor details that it wouldn't matter in the end. I'll just put there whatever you'd like :)

Ah, sorry, didn't notice you merged this already :D Thanks a lot 🎉!

spbnick · 2020-09-10T10:11:18Z

Well, whaddaya know:

nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m18.325s
user    0m17.004s
sys     0m1.353s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m30.572s
user    0m29.437s
sys     0m1.246s

That's processing a (space-less) 266MB JSON file. The first run is yesterday's version (the simple json.dumps one); the second is the version I just pushed.

spbnick · 2020-09-10T10:14:16Z

That's not counting the effect of "\n".join(...), of course.

spbnick · 2020-09-10T10:21:39Z

Now with a little bit of join, similar results:

nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m20.934s
user    0m20.038s
sys     0m1.089s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m32.288s
user    0m31.270s
sys     0m1.172s

This time it's a (loosely-spaced) 716KB JSON file, repeated 500 times.

spbnick · 2020-09-10T10:24:04Z

So, I think the simple version wins, but I'll keep the more complex one posted for now. Tell me which you'd prefer, and also what you'd like me to call those idx_key and idx_value vars. Thanks :)

spbnick · 2020-09-10T11:39:19Z

Here are the results with the current master branch for reference:

nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m29.568s
user    0m28.468s
sys     0m1.189s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m31.145s
user    0m30.328s
sys     0m0.991s
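To compare the three sets of `real` timings at a glance, the ratios can be computed directly from the numbers posted above. The dictionary keys are labels chosen for this sketch, and it assumes that the first run in each pair is the json.dumps version and the second the jv_dump_string version, as the "simple version wins" comment suggests.

```python
# `real` wall-clock seconds copied from the timings in this thread.
# The keys are labels chosen for this sketch, not names from the PR.
timings = {
    "compressed.json": {
        "json.dumps": 18.325,
        "jv_dump_string": 30.572,
        "master": 29.568,
    },
    "sample.json x500": {
        "json.dumps": 20.934,
        "jv_dump_string": 32.288,
        "master": 31.145,
    },
}

def slowdown_vs_simple(run):
    """Express every timing as a multiple of the json.dumps timing."""
    base = run["json.dumps"]
    return {name: seconds / base for name, seconds in run.items()}

ratios = {workload: slowdown_vs_simple(run) for workload, run in timings.items()}

for workload, by_version in ratios.items():
    for name, ratio in by_version.items():
        print(f"{workload:18} {name:15} {ratio:.2f}x")
```

These are single runs on one machine, so the ratios are indicative rather than precise.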

spbnick · 2020-09-11T09:48:14Z

@mwilliamson, please don't hesitate to tell me if anything else is bothering you about this PR, even if it's just difficult to read/review. I'd be glad to improve it to your satisfaction 😃

I really need this stream parsing implementation to start pushing more data to the new Linux Kernel CI result database 😁

mwilliamson · 2020-09-13T14:21:46Z

So using json.dumps is faster than jv_string_value? If so, then that seems like the right way to go.

spbnick · 2020-09-13T14:29:45Z

> So using json.dumps is faster than jv_string_value? If so, then that seems like the right way to go.

It seems so. The Python implementation has more users, I suppose, and is more optimized. I pushed the version using json.dumps now.

Do you want me to rename the idx_key and idx_value vars into something else?

mwilliamson · 2020-09-13T15:22:40Z

Thanks, merged.

spbnick mentioned this pull request Sep 9, 2020

Add parser support #49

Open

mwilliamson reviewed Sep 9, 2020

spbnick added 2 commits September 9, 2020 13:45

Add tests for data preservation

52564dd

Add tests verifying various value types and combinations are preserved after passing through the program.

Free invalid values

a24ce4c

Even though "message-less" invalid values don't technically need "freeing", doing so is supported by the library, and could be good for uniformity and future-proofing.

spbnick force-pushed the unpack_jv_values_directly branch from 4ec74ff to 8a5769a Compare September 9, 2020 14:04

mwilliamson reviewed Sep 9, 2020

spbnick force-pushed the unpack_jv_values_directly branch from 8a5769a to bf8dafe Compare September 10, 2020 09:50

spbnick force-pushed the unpack_jv_values_directly branch from bf8dafe to 30531bb Compare September 13, 2020 14:28

Unpack jv values into Python values directly

7f6137b

Instead of dumping and parsing JSON, convert JQ's "jv" structures into Python values directly by recursively walking them. The naive implementation is still twice as fast.

spbnick force-pushed the unpack_jv_values_directly branch from 30531bb to 7f6137b Compare September 13, 2020 15:04

mwilliamson merged commit 006b54f into mwilliamson:master Sep 13, 2020
