Distributed Data Generation

The current scxmlpoc branch contains a rewrite of Data Generator to use SCXML as an input format. This will soon become "Data Generator 2.0". As part of the rewrite, we would also like to develop the capability to search our input models in parallel.

Design Approach

In 1.0, we used an in-memory breadth-first search to explore the model. This works efficiently for small models, but as models grow in complexity, this approach requires a great deal of memory.

In our current implementation of 2.0, we use a depth-first exploration instead. In this implementation, memory use is a function of model depth rather than overall complexity, which is much more efficient for more complex models. The depth-first approach on its own is difficult to parallelize evenly due to the use of backtracking.

In order to realize even better performance, we propose the use of a hybrid "breadth-first, then depth-first" exploration strategy. Under the hybrid approach, a depth-limited breadth-first search will be executed on the initial model, producing a collection of nodes N explored to the current depth. At that point, multiple depth-first searches will be performed in parallel, one for each node in N.

Current Implementation

Currently, ChartExec executes a depth-first search in memory, passing in a queue for results. The search produces a Map<String, String> to the queue for each data set found in the model. As the search is running, a separate thread polls the queue for results. For each result found, this thread calls consume(...), where the implementation of consume(...) depends on the DataConsumer implementation chosen by the user.

Proposed Design

We require that our parallel DFS design support execution on various architectures.

For example, if a user is running Data Generator on a PC, he/she should be able to use simple Java threads to run in parallel. Alternatively, for large problems, Data Generator needs to support distributed architectures (e.g., HDFS) for parallelization.

The work for an abstract "parallel worker" is: DG gives you a PossibleState, initial params, and Queue. You produce data to this Queue in the form of variable assignments, Map<String, String>.

Question: Should producing output stay decoupled from search?

Question: Should ChartExec maintain control of the process for producing output?

Question: If ChartExec maintains control, should it use more than one thread?

Question: If ChartExec does not maintain control, who does?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Distributed Data Generation

Design Approach

Current Implementation

Proposed Design

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally