@@ -8,112 +8,114 @@ A good starting point for new users is our [`WordCount`](https://github.com/Goog
88example, which runs over the provided input text file(s) and computes how many
99times each word occurs in the input.
1010
11- Besides WordCount, the following examples are included:
11+ Besides ` WordCount ` , the following examples are included:
1212
1313 <ul >
1414 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/AutoComplete.java " >AutoComplete</a >
15- &mdash ; An example that computes the most popular hash tags for a for every
15+ &mdash ; An example that computes the most popular hash tags for every
1616 prefix, which can be used for auto-completion. Demonstrates how to use the
1717 same pipeline in both streaming and batch, combiners, and composite
1818 transforms.</li >
1919 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/BigQueryTornadoes.java " >BigQueryTornadoes</a >
20- &mdash ; An example that reads the public samples of weather data from Google
20+ &mdash ; An example that reads the public samples of weather data from Google
2121 BigQuery, counts the number of tornadoes that occur in each month, and
2222 writes the results to BigQuery. Demonstrates reading/writing BigQuery,
2323 counting a <code >PCollection</code >, and user-defined <code >PTransforms</code >.</li >
2424 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/CombinePerKeyExamples.java " >CombinePerKeyExamples</a >
25- &mdash ; An example that reads the public &quot;Shakespeare&quot; data, and for
25+ &mdash ; An example that reads the public &quot;Shakespeare&quot; data, and for
2626 each word in the dataset that exceeds a given length, generates a string
2727 containing the list of play names in which that word appears. Output is saved
2828 in a Google BigQuery table. Demonstrates the <code >Combine.perKey</code >
2929 transform, which lets you combine the values in a key-grouped
30- <code >PCollection</code >; also how to use an <code >Aggregator</code > to track
30+ <code >PCollection</code >; also how to use an <code >Aggregator</code > to track
3131 information in the Google Developers Console.
3232 </li >
3333 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/DatastoreWordCount.java " >DatastoreWordCount</a >
34- &mdash ; An example that shows you how to use Google Cloud Datastore IO to read
35- from Cloud Datastore.</li >
34+ &mdash ; An example that shows you how to read from Google Cloud Datastore.</li >
3635 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/DeDupExample.java " >DeDupExample</a >
37- &mdash ; An example that uses Shakespeare's plays as plain text files, and
36+ &mdash ; An example that uses Shakespeare's plays as plain text files, and
3837 removes duplicate lines across all the files. Demonstrates the
3938 <code >RemoveDuplicates</code >, <code >TextIO.Read</code >,
40- <code >RemoveDuplicates</code >, and <code >TextIO.Write</code > transforms, and
41- how to wire transforms together.
39+ and <code >TextIO.Write</code > transforms, and how to wire transforms together.
4240 </li >
4341 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/FilterExamples.java " >FilterExamples</a >
44- &mdash ; An example that shows different approaches to filtering, including
42+ &mdash ; An example that shows different approaches to filtering, including
4543 selection and projection. It also shows how to dynamically set parameters
4644 by defining and using new pipeline options, and how to use a value derived
4745 by a pipeline. Demonstrates the <code >Mean</code > transform,
4846 <code >Options</code > configuration, and using pipeline-derived data as a side
4947 input.
5048 </li >
5149 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/JoinExamples.java " >JoinExamples</a >
52- &mdash ; An example that shows how to do a join on two collections. It uses a
50+ &mdash ; An example that shows how to join two collections. It uses a
5351 sample of the <a href =" http://goo.gl/OB6oin " >GDELT &quot;world event&quot;
5452 data</a >, joining the event <code >action</code > country code against a table
55- that maps country codes to country names. Demonstrated the <code >Join</code >
53+ that maps country codes to country names. Demonstrates the <code >Join</code >
5654 operation, and using multiple input sources.
5755 </li >
58- <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/MaxPerKeyExamples.java " >MaxPerKeyExamples</a >&mdash ; An example that reads the public samples of weather data from BigQuery,
56+ <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/MaxPerKeyExamples.java " >MaxPerKeyExamples</a >
57+ &mdash ; An example that reads the public samples of weather data from BigQuery,
5958 and finds the maximum temperature (<code >mean_temp</code >) for each month.
60- Demonstates the <code >Max</code > statistical combination transform, and how to
59+ Demonstrates the <code >Max</code > statistical combination transform, and how to
6160 find the max-per-key group.
6261 </li >
63- <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/PubsubFileInjector.java " >PubsubFileInjector</a >&mdash ; A batch Cloud Dataflow pipeline for injecting a set of Cloud Storage
62+ <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/PubsubFileInjector.java " >PubsubFileInjector</a >
63+ &mdash ; A batch Cloud Dataflow pipeline for injecting a set of Cloud Storage
6464 files into a Google Cloud Pub/Sub topic, line by line. This example can be
6565 useful for testing streaming pipelines.
6666 </li >
67- <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/StreamingWordExtract.java " >StreamingWordExtract</a >&mdash ; An streaming pipeline example that inputs lines of text from a Cloud
67+ <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/StreamingWordExtract.java " >StreamingWordExtract</a >
68+ &mdash ; A streaming pipeline example that inputs lines of text from a Cloud
6869 Pub/Sub topic, splits each line into individual words, capitalizes those
6970 words, and writes the output to a BigQuery table.
7071 </li >
7172 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/TfIdf.java " >TfIdf</a >
72- &mdash ; An example that computes a basic TF-IDF search table for a directory or
73+ &mdash ; An example that computes a basic TF-IDF search table for a directory or
7374 Cloud Storage prefix. Demonstrates joining data, side inputs, and logging.
7475 </li >
7576 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/TopWikipediaSessions.java " >TopWikipediaSessions</a >
76- &mdash ; An example that reads Wikipedia edit data from Cloud Storage and
77+ &mdash ; An example that reads Wikipedia edit data from Cloud Storage and
7778 computes the user with the longest string of edits separated by no more than
7879 an hour within each month. Demonstrates using Cloud Dataflow
7980 <code >Windowing</code > to perform time-based aggregations of data.
8081 </li >
8182 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/TrafficMaxLaneFlow.java " >TrafficMaxLaneFlow</a >
82- &mdash ; A streaming Cloud Dataflow example using BigQuery output in the
83+ &mdash ; A streaming Cloud Dataflow example using BigQuery output in the
8384 <code >traffic sensor</code > domain. Demonstrates the Cloud Dataflow streaming
8485 runner, sliding windows, Cloud Pub/Sub topic ingestion, the use of the
8586 <code >AvroCoder</code > to encode a custom class, and custom
8687 <code >Combine</code > transforms.
8788 </li >
8889 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/TrafficRoutes.java " >TrafficRoutes</a >
89- &mdash ; A streaming Cloud Dataflow example using BigQuery output in the
90+ &mdash ; A streaming Cloud Dataflow example using BigQuery output in the
9091 <code >traffic sensor</code > domain. Demonstrates the Cloud Dataflow streaming
9192 runner, <code >GroupByKey</code >, keyed state, sliding windows, and Cloud
9293 Pub/Sub topic ingestion.
9394 </li >
9495 <li ><a href =" https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/WindowingWordCount.java " >WindowingWordCount</a >
95- &mdash ; An example that applies windowing to &quot;Shakespeare&quot; data in a
96- wordcount pipeline.
96+ &mdash ; An example that applies windowing to &quot;Shakespeare&quot; data in a
97+ ` WordCount ` pipeline.
9798 </li >
9899 </ul >
99100
100- ## How to Run the Examples
101+ ## Running the Examples
101102
102- After building and installing the Cloud Dataflow ` SDK ` and ` Examples ` modules as
103- explained in this [ README] ( https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/README.md ) ,
103+ After building and installing the ` SDK ` and ` Examples ` modules, as explained in this
104+ [ README] ( https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/README.md ) ,
104105you can execute the ` WordCount ` and other example pipelines using the
105106` DirectPipelineRunner ` on your local machine:
106107
107108 mvn exec:java -pl examples \
108109 -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
109110 -Dexec.args="--input=<INPUT FILE PATTERN> --output=<OUTPUT FILE>"
110111
111- If you have been whitelisted for Alpha access to the Cloud Dataflow Service and
112- followed the [ developer setup] ( https://cloud.google.com/dataflow/java-sdk/getting-started#DeveloperSetup )
113- steps, you can use the ` BlockingDataflowPipelineRunner ` to execute the
114- ` WordCount ` example in the Google Cloud Platform. In this case, you specify your
115- project name, pipeline runner, and the staging location in
116- [ Google Cloud Storage] ( https://cloud.google.com/storage/ ) , as follows:
112+ You can use the ` BlockingDataflowPipelineRunner ` to execute the ` WordCount ` example on
113+ Google Cloud Dataflow Service using managed resources in the Google Cloud Platform.
114+ Start by following the general Cloud Dataflow
115+ [ Getting Started] ( https://cloud.google.com/dataflow/getting-started ) instructions.
116+ You should have a Google Cloud Platform project that has the Cloud Dataflow API enabled,
117+ a Google Cloud Storage bucket that will serve as a staging location, and an installed and
118+ authenticated Google Cloud SDK. In this case, invoke the example as follows:
117119
118120 mvn exec:java -pl examples \
119121 -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
@@ -123,9 +125,7 @@ project name, pipeline runner, and the staging location in
123125
124126Your Cloud Storage location should be entered in the form of
125127` gs://bucket/path/to/staging/directory ` . The Cloud Platform project refers to
126- its name (not number), which has been whitelisted for Cloud Dataflow. Refer to
127- [ Google Cloud Platform] ( https://cloud.google.com/ ) for general instructions on
128- getting started with Cloud Platform.
128+ its name (not number).
129129
130130Alternatively, you may choose to bundle all dependencies into a single JAR and
131131execute it outside of the Maven environment. For example, after building and
@@ -155,11 +155,7 @@ Note that when running Maven on Microsoft Windows platform, backslashes (`\`)
155155under the ` Dexec.args ` parameter should be escaped with another backslash. For
156156example, input file pattern of ` c:\*.txt ` should be entered as ` c:\\*.txt ` .
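
As a rough illustration of the escaping rule above, the following shell sketch doubles each backslash in a Windows-style file pattern before it is placed in `-Dexec.args`. It assumes a POSIX-style shell with `sed` available (for example, Git Bash on Windows); the helper name `escape_backslashes` is hypothetical and is not part of the SDK or Maven.

```shell
#!/bin/sh
# Hypothetical helper (not part of the SDK or Maven): doubles every
# backslash so that a Windows path survives inside -Dexec.args.
escape_backslashes() {
  printf '%s\n' "$1" | sed 's/\\/\\\\/g'
}

# The unescaped pattern from the note above.
pattern='c:\*.txt'

# Build the argument that would be passed via -Dexec.args.
arg="--input=$(escape_backslashes "$pattern")"
printf '%s\n' "$arg"
```

For the pattern `c:\*.txt`, this prints `--input=c:\\*.txt`, which matches the escaped form given in the note.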
157157
158- <p class =" note " ><b >Note:</b > We are working on improving the experience around
159- running some of our streaming examples. Please stay tuned for much easier
160- instructions in the near future!</p >
161-
162- ### Running the "Traffic" Streaming Examples###
158+ ### Running the "Traffic" Streaming Examples
163159
164160The ` TrafficMaxLaneFlow ` and ` TrafficRoutes ` pipelines, when run in
165161streaming mode (with the ` --streaming=true ` option), require the
@@ -187,11 +183,11 @@ This file contains real traffic sensor data from San Diego freeways. See
187183<a href =" http://storage.googleapis.com/aju-sd-traffic/freeway_detector_config/Freeways-Metadata-2010_01_01/copyright(san%20diego).txt " >this file</a >
188184for copyright information.
189185
190- You may override the default ' --inputFile' with an alternative complete
186+ You may override the default ` --inputFile ` with an alternative complete
191187data set (~ 2GB). It is provided in the Google Cloud Storage bucket
192- ' gs://dataflow-samples/traffic_sensor/Freeways-5Minaa2010-01-01_to_2010-02-15.csv' .
188+ ` gs://dataflow-samples/traffic_sensor/Freeways-5Minaa2010-01-01_to_2010-02-15.csv ` .
193189
194- You may also set ' --inputFile' to an empty string, which will disable
190+ You may also set ` --inputFile ` to an empty string, which will disable
195191the automatic Pub/Sub injection, and allow you to use a separate tool to control
196192the input to this example. Example code, which publishes traffic sensor data
197193to a Pub/Sub topic, is provided in [ ` traffic_pubsub_generator.py ` ] ( https://github.com/GoogleCloudPlatform/cloud-pubsub-samples-python/tree/master/gce-cmdline-publisher )