33Big data with Python
44====================
55
6-
7- dfiifdio djfoidj oid mjisjfsdfiasdf
8-
96.. admonition :: "Learning outcomes"
107
118 Learners
@@ -16,6 +13,7 @@ dfiifdio djfoidj oid mjisjfsdfiasdf
1613 - know where to learn more
1714
1815.. admonition :: "For teacher"
16+ :class: dropdown
1917
2018 Preliminary timings. Starting at 13.00
2119
@@ -688,13 +686,22 @@ dask.arrays
688686
689687.. admonition :: Chunks
690688
691- - Dask divides arrays into many small pieces (chunks), as small as necessary to
692- fit it into memory.
689+ - Dask divides arrays into many small pieces (chunks), each small enough to fit into memory.
693690 - Operations are delayed (**lazy computing**), i.e.
694691
695- - tasks are queue and no computation is performed until you actually ask values to be computed (for instance print mean values).
692+ - Tasks are queued, and no computation is performed until you actually ask for values to be computed (for instance, by printing mean values).
696693 - Then data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.
697- - and data is gathered in the end.
694+ - Finally, the results are gathered.
695+ - Tools like Dask and xarray handle "chunking" automatically.
696+ - Note that the number of chunks does not need to equal the number of cores.
697+
698+ Big file → split into chunks → parallel workers → results combined.
699+
700+ .. admonition :: To think about
701+
702+ - The size and number of chunks affect performance because of the overhead (administration) of chunking and combining the results.
703+ - Briefly explain what happens when a Dask job runs on multiple cores.
704+
698705
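The pipeline "big file → split into chunks → parallel workers → results combined" can be sketched with the Python standard library alone. This is a minimal illustration, not how Dask is implemented: the helper names (``split_into_chunks``, ``partial_sum``, ``chunked_mean``) and the sizes are hypothetical, and the thread pool only shows the structure of the work, since threads in CPython do not speed up pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Yield consecutive slices of `data`, each at most `chunk_size` long."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def partial_sum(chunk):
    """Work done on one chunk: return (sum, count) so means can be combined."""
    return sum(chunk), len(chunk)


def chunked_mean(data, chunk_size, n_workers=2):
    """Mean of `data`, computed chunk by chunk on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker handles one chunk at a time; results come back in order.
        partials = list(pool.map(partial_sum, split_into_chunks(data, chunk_size)))
    # Combine step: gather the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# A small list stands in for a "big file" (illustrative sizes only).
data = list(range(1, 101))
mean = chunked_mean(data, chunk_size=30, n_workers=2)
print(mean)
```

Here the data is split into four chunks but processed by two workers, which shows that the number of chunks and the number of workers are independent choices.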
699706.. discussion :: Example from dask.org
700707
@@ -708,19 +715,7 @@ dask.arrays
708715 # It runs using multiple threads on one node.
709716 # It could also be distributed to multiple nodes
710717
711- Chunking
712- ::::::::
713-
714- Big file → split into chunks → parallel workers → results combined.
715-
716- - Tools like Dask and xarray handle "chunking" automatically.
717- - Note that number of chunks does not need to be equal to number of cores.
718-
719- .. admonition :: To think of
720-
721- - chunk size and number of them affect the performance due to overhad/administration of the chunking and combination.
722718
723- - Briefly explain what happens when a Dask job runs on multiple cores.
724719
725720 Polars package
726721..............
@@ -745,9 +740,8 @@ Polars package
745740
746741 https://pola.rs/
747742
748- Exercises
749- ---------
750-
743+ Exercises: Packages
744+ -------------------
751745
752746.. challenge :: Chunk sizes in Dask
753747
@@ -863,9 +857,6 @@ Exercises
863857Summary
864858-------
865859
866- Workflow
867- ........
868-
869860.. discussion :: Follow-up discussion
870861
871862 - New learnings?
@@ -874,30 +865,23 @@ Workflow
874865 - Data-chunking as technique if not enough RAM
875866 - Is Xarray/Polars/Dask useful for you?
876867
877- .. admonition :: Sum up
878-
879- - Load Python modules and activate virtual environments.
880- - Request appropriate memory and runtime in SLURM.
881- - Store temporary data in local scratch ($SNIC_TMP).
882- - Check job memory usage with ``sacct `` or ``sstat ``.
883-
884- Data source → Format choice → Load/Chunk → Process → Write
885-
886868.. keypoints ::
887869
870+ - Allocate more RAM by asking for
871+ - Several cores
872+ - Nodes with more RAM
873+ - Check job memory usage with ``sacct`` or ``sstat``. Check your documentation!
888874 - File formats
889875 - No format fits all requirements
890- - HDF5 and NetCDF good for Big data
876+ - HDF5 and NetCDF are good for big data since they allow loading parts of a file into memory
877+ - Store temporary data in local scratch ($SNIC_TMP).
891878 - Packages
892879 - xarray
893880 - can deal with 3D-data and higher dimensions
894881 - Dask
895882 - uses lazy execution
896883 - Only use for processing very large amounts of data
897- - Allocate more RAM by asking for
898- - Several cores
899- - Nodes will more RAM
900-
884+ - Chunking: Data source → Format choice → Load/Chunk → Process → Write
901885
902886.. seealso ::
903887