Skip to content

Commit c04f3b5

Browse files
added pyspark UDF example
1 parent 6a2d4ab commit c04f3b5

32 files changed

+57
-0
lines changed

tutorial/.DS_Store

100644100755
File mode changed.

tutorial/add-indices/add-indices.txt

100644100755
File mode changed.

tutorial/basic-average/basic-average.txt

100644100755
File mode changed.

tutorial/basic-filter/basic-filter.txt

100644100755
File mode changed.

tutorial/basic-join/basicjoin.txt

100644100755
File mode changed.

tutorial/basic-map/basic-map.txt

100644100755
File mode changed.

tutorial/basic-multiply/basic-multiply.txt

100644100755
File mode changed.

tutorial/basic-sort/sort-by-key.txt

100644100755
File mode changed.

tutorial/basic-sum/basic-sum.txt

100644100755
File mode changed.

tutorial/basic-union/basic-union.txt

100644100755
File mode changed.

tutorial/bigrams/bigrams.txt

100644100755
File mode changed.

tutorial/cartesian/cartesian.txt

100644100755
File mode changed.

tutorial/combine-by-key/README.md

100644100755
File mode changed.

tutorial/combine-by-key/combine-by-key.txt

100644100755
File mode changed.

tutorial/combine-by-key/distributed_computing_with_spark_by_Javier_Santos_Paniego.pdf

100644100755
File mode changed.

tutorial/combine-by-key/spark-combineByKey.md

100644100755
File mode changed.

tutorial/combine-by-key/spark-combineByKey.txt

100644100755
File mode changed.

tutorial/combine-by-key/standard_deviation_by_combineByKey.md

100644100755
File mode changed.

tutorial/dna-basecount/README.md

100644100755
File mode changed.

tutorial/dna-basecount/dna-basecount.md

100644100755
File mode changed.

tutorial/dna-basecount/dna-basecount2.md

100644100755
File mode changed.

tutorial/dna-basecount/dna-basecount3.md

100644100755
File mode changed.

tutorial/dna-basecount/dna_seq.txt

100644100755
File mode changed.

tutorial/map-partitions/README.md

100644100755
File mode changed.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,57 @@
1+
$SPARK_HOME/bin/pyspark
2+
Python 3.8.9 (default, Nov 9 2021, 04:26:29)
3+
Welcome to
4+
____ __
5+
/ __/__ ___ _____/ /__
6+
_\ \/ _ \/ _ `/ __/ '_/
7+
/__ / .__/\_,_/_/ /_/\_\ version 3.2.0
8+
/_/
9+
10+
Using Python version 3.8.9 (default, Nov 9 2021 04:26:29)
11+
Spark context Web UI available at http://10.0.0.232:4040
12+
Spark context available as 'sc' (master = local[*], app id = local-1641011178190).
13+
SparkSession available as 'spark'.
14+
15+
>>> from pyspark.sql import Row
16+
17+
>>> data = spark.createDataFrame(
18+
... [Row(zip_code='94087', city='Sunnyvale'),
19+
... Row(zip_code='94088', city='Cupertino'),
20+
... Row(zip_code='95055', city='Santa Clara'),
21+
... Row(zip_code='95054', city='Palo Alto')])
22+
23+
>>>
24+
>>> data.show()
25+
+--------+-----------+
26+
|zip_code| city|
27+
+--------+-----------+
28+
| 94087| Sunnyvale|
29+
| 94088| Cupertino|
30+
| 95055|Santa Clara|
31+
| 95054| Palo Alto|
32+
+--------+-----------+
33+
34+
>>> from pyspark.sql.functions import udf
35+
>>> from pyspark.sql import types as T
36+
>>>
37+
>>> @udf(T.MapType(T.StringType(), T.StringType()))
38+
... def create_structure(zip_code, city):
39+
... return {zip_code: city}
40+
...
41+
>>> data.withColumn('structure', create_structure(data.zip_code, data.city)).toJSON().collect()
42+
[
43+
'{"zip_code":"94087","city":"Sunnyvale","structure":{"94087":"Sunnyvale"}}',
44+
'{"zip_code":"94088","city":"Cupertino","structure":{"94088":"Cupertino"}}',
45+
'{"zip_code":"95055","city":"Santa Clara","structure":{"95055":"Santa Clara"}}',
46+
'{"zip_code":"95054","city":"Palo Alto","structure":{"95054":"Palo Alto"}}'
47+
]
48+
49+
>>> data.withColumn('structure', create_structure(data.zip_code, data.city)).show(truncate=False)
50+
+--------+-----------+----------------------+
51+
|zip_code|city |structure |
52+
+--------+-----------+----------------------+
53+
|94087 |Sunnyvale |{94087 -> Sunnyvale} |
54+
|94088 |Cupertino |{94088 -> Cupertino} |
55+
|95055 |Santa Clara|{95055 -> Santa Clara}|
56+
|95054 |Palo Alto |{95054 -> Palo Alto} |
57+
+--------+-----------+----------------------+

tutorial/split-function/README.md

100644100755
File mode changed.

tutorial/top-N/top-N.txt

100644100755
File mode changed.

tutorial/wordcount/README.md

100644100755
File mode changed.

tutorial/wordcount/word_count.py

100644100755
File mode changed.

tutorial/wordcount/word_count_ver2.py

100644100755
File mode changed.

tutorial/wordcount/wordcount-shorthand.txt

100644100755
File mode changed.

tutorial/wordcount/wordcount.txt

100644100755
File mode changed.

0 commit comments

Comments
 (0)