There are three types of aggregate functions: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_.
## Tile Mean Example
We can illustrate aggregate differences by computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_. The _tiles_ will contain normally distributed cell values with the first row's mean at 1.0 and the second row's mean at 3.0. For details on use of the `Tile` class see @ref:[the page on numpy interoperability](numpy-pandas.md).

```python, create_tile1
from pyrasterframes.rf_types import Tile, CellType
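import numpy as np

# The tile construction itself is elided in this diff. A minimal sketch,
# assuming 5x5 float64 tiles of normally distributed cells centered at
# 1.0 and 3.0 (the names t1 and t5 are referenced below; the exact
# construction is an assumption):
t1 = Tile(1.0 + 0.1 * np.random.randn(5, 5), CellType('float64'))
t5 = Tile(3.0 + 0.1 * np.random.randn(5, 5), CellType('float64'))
```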
Create a Spark DataFrame from the Tile objects.

```python, create_dataframe
import pyspark.sql.functions as F
from pyspark.sql import Row

rf = spark.createDataFrame([
    Row(id=1, tile=t1),
    Row(id=2, tile=t5)
]).orderBy('id')
```
We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is about 1.0 and the second mean is about 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
```python, tile_mean
rf.select(F.col('id'), rf_tile_mean(F.col('tile')))
```

We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages values across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
```python, agg_mean
rf.agg(rf_agg_mean(F.col('tile')))
```
We use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. For this aggregation, we compute the mean of the cell values at each position across the two _tiles_: one value near 1.0 and one near 3.0. This happens twenty-five times, once for each position in the _tile_.
To compute an element-wise local aggregate, _tiles_ need to have the same dimensions. In this case, both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.
```python, local_mean
rf.agg(rf_agg_local_mean('tile')) \
    .first()[0].cells.data  # display the contents of the Tile array
```
## Cell Counts Example
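The body of this example is elided from the diff. A minimal sketch, assuming the cell-count functions listed in the @ref:[function reference](reference.md); note that the _tiles_ here deliberately differ in dimensions, and the second uses a cell type with 0.0 designated as no-data:

```python, cell_counts
import numpy as np

# Hypothetical DataFrame of two tiles with different dimensions; the
# off-diagonal zeros of the second tile are treated as no-data cells.
rf_counts = spark.createDataFrame([
    Row(id=1, tile=Tile(np.ones((3, 3)), CellType('float64'))),
    Row(id=2, tile=Tile(np.diag([1.0, 2.0, 3.0, 4.0]), CellType('float64ud0')))
])

# Per-row counts of data and no-data cells in each tile...
rf_counts.select(rf_data_cells('tile'), rf_no_data_cells('tile')).show()

# ...and the same counts aggregated over the whole DataFrame.
rf_counts.agg(rf_agg_data_cells('tile'), rf_agg_no_data_cells('tile')).show()
```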
The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
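That code is elided from this diff. A minimal sketch, assuming a DataFrame `rf` of equally sized _tiles_ in a column named `tile`, and that the returned struct exposes a `mean` field (both assumptions here):

```python, local_stats
# Element-wise local statistics across the tile column; the result is a
# single row holding a struct of tiles (min, max, mean, variance, ...).
rf.agg(rf_agg_local_stats('tile').alias('stats')) \
    .select('stats.mean') \
    .first()[0].cells.data
```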
RasterFrames® provides a DataFrame-centric view over arbitrary Earth-observation (EO) data, enabling spatiotemporal queries, map algebra raster operations, and compatibility with the ecosystem of [Apache Spark](https://spark.apache.org/docs/latest/) [ML](https://spark.apache.org/docs/latest/ml-guide.html) algorithms. It provides APIs in @ref:[Python, SQL, and Scala](languages.md), and can scale from a laptop computer to a large distributed cluster, enabling _global_ analysis with satellite imagery in a wholly new, flexible, and convenient way.
## Context
RasterFrames introduces georectified raster imagery to Spark SQL.
As shown in the figure below, a "RasterFrame" is a Spark DataFrame with one or more columns of type @ref:[_tile_](concepts.md#tile). A _tile_ column typically represents a single frequency band of sensor data, such as "blue" or "near infrared", but can also be quality assurance information, land classification assignments, or any other raster spatial data. Along with _tile_ columns there is typically an @ref:[`extent`](concepts.md#extent) specifying the geographic location of the data, the map projection of that geometry (@ref:[`crs`](concepts.md#coordinate-reference-system--crs-)), and a `timestamp` column representing the acquisition time. These columns can all be used in the `WHERE` clause when filtering.
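For instance, a minimal sketch of such a filter, assuming a RasterFrame `rf` with a `timestamp` column (names illustrative):

```python, where_filter
import pyspark.sql.functions as F

# Hypothetical filter on acquisition time; the extent and crs columns
# can be used in spatial predicates in the same way.
rf.filter(F.col('timestamp') > F.to_timestamp(F.lit('2018-06-01')))
```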
RasterFrames also includes support for working with vector data, such as [GeoJSON][GeoJSON]. RasterFrames vector data operations let you filter with geospatial relationships like contains or intersects, mask cells, convert vectors to rasters, and more.
From `pyrasterframes/src/main/python/docs/vector-data.pymd`:
Since it is a geometry we can do things like this:

```python
the_first['geometry'].wkt
```
You can also write user-defined functions that take geometries as input, output, or both, via user defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple **but inefficient** example of a user-defined function that uses both a geometry input and output to compute the centroid of a geometry. Observe in a sample of the data that the geometry columns print as well-known text (WKT).
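The example code is elided from this diff. A minimal sketch, assuming Shapely-backed geometry values and the `PointUDT` type from `geomesa_pyspark.types` (the function name `inefficient_centroid` is illustrative):

```python, udf_centroid
from pyspark.sql.functions import udf
from geomesa_pyspark.types import PointUDT

@udf(PointUDT())
def inefficient_centroid(g):
    # Each geometry is deserialized into a Shapely object, processed in
    # Python, and serialized back; that round trip is what makes this
    # approach inefficient relative to the native functions below.
    return g.centroid
```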
As documented in the @ref:[function reference](reference.md), various user-defined functions implemented by GeoMesa are also available for use. The example below uses a GeoMesa user-defined function to compute the centroid of a geometry. It is logically equivalent to the example above, but more efficient.
```python, native_centroid
from pyrasterframes.rasterfunctions import st_centroid

# The rest of this block is elided in the diff; assuming a DataFrame
# `df` with a `geometry` column (an assumption here), usage would be:
df.select(st_centroid('geometry').alias('centroid'))
```

The RasterFrames vector functions and GeoMesa functions also provide a variety of spatial relations that are useful in combination with the geometric properties of projected rasters. In this example, we use the @ref:[built-in Landsat catalog](raster-catalogs.md#using-built-in-experimental-catalogs) which provides an extent. We will convert the extent to a polygon and filter to those within approximately 500 km of a selected point.
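The code that follows is elided from this diff. A hedged sketch of the approach, assuming the catalog format name `aws-pds-l8-catalog`, a `bounds_wgs84` extent column, and the availability of the GeoMesa `st_bufferPoint` function (all assumptions):

```python, spatial_filter
from pyspark.sql.functions import lit
from pyrasterframes.rasterfunctions import (
    st_geometry, st_point, st_intersects, st_bufferPoint)

# Load the built-in Landsat catalog (format name assumed).
l8 = spark.read.format('aws-pds-l8-catalog').load()

# Convert each scene's extent into a polygon geometry.
l8 = l8.withColumn('geom', st_geometry('bounds_wgs84'))

# Keep scenes whose footprint intersects a ~500 km buffer around a point.
pt = st_point(lit(-93.27), lit(44.98))
near = l8.filter(st_intersects(l8.geom, st_bufferPoint(pt, lit(500000.0))))
```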