Commit 40171ed

big_data.rst summary

1 parent aa4d808 commit 40171ed
File tree

1 file changed: +23 −39 lines changed

docs/day3/big_data.rst

Lines changed: 23 additions & 39 deletions
@@ -3,9 +3,6 @@
 Big data with Python
 ====================
 
-
-dfiifdio djfoidj oid mjisjfsdfiasdf
-
 .. admonition:: "Learning outcomes"
 
    Learners
@@ -16,6 +13,7 @@ dfiifdio djfoidj oid mjisjfsdfiasdf
    - know where to learn more
 
 .. admonition:: "For teacher"
+   :class: dropdown
 
    Preliminary timings. Starting at 13.00
 

@@ -688,13 +686,22 @@ dask.arrays
 
 .. admonition:: Chunks
 
-   - Dask divides arrays into many small pieces (chunks), as small as necessary to
-     fit it into memory.
+   - Dask divides arrays into many small pieces (chunks), as small as necessary to fit into memory.
    - Operations are delayed (**lazy computing**) e.g.
 
-     - tasks are queue and no computation is performed until you actually ask values to be computed (for instance print mean values).
+     - tasks are queued and no computation is performed until you actually ask for values to be computed (for instance, printing mean values).
      - Then data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.
-     - and data is gathered in the end.
+     - And data is gathered in the end.
+   - Tools like Dask and xarray handle "chunking" automatically.
+   - Note that the number of chunks does not need to equal the number of cores.
+
+   Big file → split into chunks → parallel workers → results combined.
+
+.. admonition:: To think of
+
+   - Chunk size and the number of chunks affect performance due to the overhead of administering the chunking and combining the results.
+   - Briefly explain what happens when a Dask job runs on multiple cores.
+
 
 .. discussion:: Example from dask.org
 
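The chunk → parallel workers → combine flow described in the Chunks admonition above can be sketched with only the Python standard library. The names below (`chunked`, `partial_sum`) are illustrative helpers invented for this sketch, not Dask's API:

```python
# A minimal, library-free sketch of the idea behind Dask's chunked
# computation: split the data, process chunks in parallel, combine results.
# This illustrates the concept only; it is NOT Dask's actual API.
from concurrent.futures import ThreadPoolExecutor

def chunked(data, chunk_size):
    """Split `data` into pieces small enough to process independently."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

def partial_sum(chunk):
    """The work one worker does on one chunk."""
    return sum(chunk)

data = list(range(1_000_000))          # stand-in for a "big file"
chunks = list(chunked(data, 100_000))  # 10 chunks; need not equal core count

# Workers process chunks in parallel; partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

mean = total / len(data)
```

With Dask the same pattern would be expressed declaratively (e.g. building a `dask.array` with a `chunks=` argument and calling `.mean().compute()`), with graph building and scheduling handled for you.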

@@ -708,19 +715,7 @@ dask.arrays
    # It runs using multiple threads on one node.
    # It could also be distributed to multiple nodes
 
-Chunking
-::::::::
-
-Big file → split into chunks → parallel workers → results combined.
-
-- Tools like Dask and xarray handle "chunking" automatically.
-- Note that number of chunks does not need to be equal to number of cores.
-
-.. admonition:: To think of
-
-   - chunk size and number of them affect the performance due to overhad/administration of the chunking and combination.
 
-   - Briefly explain what happens when a Dask job runs on multiple cores.
 
 Polars package
 ..............
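The **lazy computing** model mentioned in this section can be illustrated with a small pure-Python sketch: building an expression only queues tasks, and nothing runs until a result is requested. The `Lazy` class below is a toy invented for illustration, not Dask's implementation:

```python
# Toy sketch of lazy (deferred) evaluation, the model Dask uses:
# constructing nodes only records a task graph; compute() triggers the work.
# Illustrative only -- not Dask's actual implementation.

class Lazy:
    def __init__(self, func, *deps):
        self.func = func
        self.deps = deps          # other Lazy nodes this task depends on

    def compute(self):
        # Only now are dependencies evaluated, then this task runs.
        args = [d.compute() if isinstance(d, Lazy) else d for d in self.deps]
        return self.func(*args)

# Build a small task graph: no computation happens on these three lines.
a = Lazy(lambda: list(range(10)))
b = Lazy(lambda xs: [x * x for x in xs], a)
m = Lazy(lambda xs: sum(xs) / len(xs), b)

result = m.compute()   # the whole chain executes only here
```

Real schedulers additionally cache shared intermediate results and run independent branches of the graph in parallel, which is where the multi-core speedup comes from.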
@@ -745,9 +740,8 @@ Polars package
 
 https://pola.rs/
 
-Exercises
----------
-
+Exercises: Packages
+-------------------
 
 .. challenge:: Chunk sizes in Dask
 
@@ -863,9 +857,6 @@ Exercises
 Summary
 -------
 
-Workflow
-........
-
 .. discussion:: Follow-up discussion
 
    - New learnings?
@@ -874,30 +865,23 @@ Workflow
    - Data-chunking as technique if not enough RAM
    - Is Xarray/Polars/Dask useful for you?
 
-.. admonition:: Sum up
-
-   - Load Python modules and activate virtual environments.
-   - Request appropriate memory and runtime in SLURM.
-   - Store temporary data in local scratch ($SNIC_TMP).
-   - Check job memory usage with ``sacct`` or ``sstat``.
-
-Data source → Format choice → Load/Chunk → Process → Write
-
 .. keypoints::
 
+   - Allocate more RAM by asking for
+     - several cores
+     - nodes with more RAM
+   - Check job memory usage with ``sacct`` or ``sstat``; check your cluster's documentation!
    - File formats
      - No format fits all requirements
-     - HDF5 and NetCDF good for Big data
+     - HDF5 and NetCDF are good for big data since they allow loading parts of a file into memory
+   - Store temporary data in local scratch ($SNIC_TMP).
    - Packages
      - xarray
        - can deal with 3D-data and higher dimensions
      - Dask
        - uses lazy execution
        - Only use for processing very large amount of data
-   - Allocate more RAM by asking for
-     - Several cores
-     - Nodes will more RAM
-
+   - Chunking: Data source → Format choice → Load/Chunk → Process → Write
 
 .. seealso::
 