IncrementalUpdateFeature

vinayaugustine edited this page Oct 18, 2012 · 4 revisions

Definition

Feature Overview

The Incremental Update feature introduces a new type of srcML archive that can be updated incrementally. The two existing types of srcML archive are individual srcML files (where one srcML file represents one source file) and multi-file archives (where one srcML file represents multiple source files).

A srcML file that represents an individual source file is structured like this:

<unit filename="myfile.c"><!-- srcML for myfile.c --></unit>

There is a single unit element that then contains the srcML version of the source code. This is primarily useful when you want to analyze a single file. Because it represents a single source file, it is very easy to re-generate when the file changes.

A multi-file archive is structured like this:

<unit>
    <unit filename="myfile.c"><!-- srcML for myfile.c --></unit>
    <unit filename="myheader.h"><!-- srcML for myheader.h --></unit>
</unit>

There is a single root unit that contains other sub-units. Each of these sub-units represents a complete source file. This structure is very useful for querying the project or doing a large-scale transformation. Because the "filename" attribute is stored with each unit, it is also very easy to write all of the source code out to disk.

A very common query when using the ABB.SrcML framework looks like this:

var archive = new SrcMLFile("pathToXml");
var newOps = from fileUnit in archive.FileUnits
             from op in fileUnit.Descendants(OP.Operator)
             where op.Value == "new"
             select op;

This iterates over all of the files in the archive and looks for instances of the new operator. This makes it easy to do searches or make changes to an entire development project instead of file-by-file.

The incrementally updating srcML archive combines these two features: single-file storage of srcML and iteration over an entire project. This srcML archive also responds to file updates (addition, deletion, modification).

Users & Use cases

Users

Clients that wish to get up-to-date srcML representations of source code.

Use cases

Here are several use cases that inform the incremental update feature.

Sando

Sando is a Visual Studio plugin that updates its index whenever it detects that a source file has changed. The ABB.SrcML framework should be able to take over the monitoring of source code and the generation of srcML.

Once srcML is generated, Sando should be notified that a new srcML file is available so that it can update its index.

Directory Monitor

A service that monitors a directory needs to detect changes to the source code files in it and then generate srcML for them. This may be used by a third-party text editor or IDE that does not have srcML integrated into it.

Experiments

Pat is developing a new tool on top of ABB.SrcML and would like to use the new srcML archive without the directory monitoring component. Pat would like to create an archive composed of individual srcML files and run experiments on it as if it were a multi-file srcML archive. Pat does not expect the source code to change and therefore does not need to monitor the files for updates.

Dependencies

This feature has no dependencies.

Design

The incrementally updating archive combines the best of the two existing archive types (individual srcML files and multi-file archives). Individual source files are represented by individual srcML files. However, the srcML files are grouped together in a directory, which allows code built on top of ABB.SrcML to iterate over them via the FileUnits property.
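To illustrate the idea, here is a minimal sketch of how a directory of single-file srcML documents could be exposed as a sequence of file units. The class name DirectoryArchiveSketch and the assumption that every srcML file in the directory ends in .xml are ours for illustration; this is not the actual ABB.SrcML implementation.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Linq;

// Hypothetical helper, not part of ABB.SrcML: treats a directory of
// single-file srcML documents as one logical archive.
public static class DirectoryArchiveSketch
{
    // Lazily loads each *.xml file in archiveDirectory and yields its root element.
    public static IEnumerable<XElement> FileUnits(string archiveDirectory)
    {
        foreach (string xmlPath in Directory.EnumerateFiles(archiveDirectory, "*.xml"))
        {
            // PreserveWhitespace keeps the original source formatting inside the unit.
            yield return XElement.Load(xmlPath, LoadOptions.PreserveWhitespace);
        }
    }
}
```

Because the units are loaded lazily, only one srcML file needs to be in memory at a time, and a query written against a sequence of file units can run unchanged over the directory.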

Project Interface

The project interface (IProject) connects the representation of the "project" to the SrcMLArchive. It must provide a few key members:

GetListOfFiles

This function gets a list of files from the client. This list of files is what we are monitoring for changes.

In the simplest case, the list of files is just all of the source files in a directory. A more complicated case is a Visual Studio solution or project. In this case, we would parse the project file or query Visual Studio for the list of source files.

GetListOfFiles should always return the latest collection of files as reported by the client.

The client may cache the list of files for use in monitoring.
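For the simplest case described above (all source files in a directory), a sketch might look like the following. The class name DirectoryFileList and the set of extensions treated as source code are assumptions; the page does not specify which file types count as source files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper illustrating the directory case of GetListOfFiles.
public static class DirectoryFileList
{
    // Assumed set of source extensions; adjust to the languages being analyzed.
    private static readonly HashSet<string> SourceExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".c", ".h", ".cpp", ".hpp", ".cs", ".java" };

    // Re-scans the directory tree on every call, so the result is always current.
    public static List<string> GetListOfFiles(string rootDirectory)
    {
        return Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories)
                        .Where(path => SourceExtensions.Contains(Path.GetExtension(path)))
                        .ToList();
    }
}
```

Re-scanning on each call satisfies the requirement that the latest collection is always returned; a client that needs speed can cache the result and refresh it when a change is reported.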

SourceChangedEvent

The project should raise this event whenever it detects a change to the source code being monitored. Changes include:

  • Creation
  • Deletion
  • Modification

StartMonitoring

This function tells the project to start monitoring the list of source files for changes. It can do this either by:

  • Subscribing to events (for example: FileSystemWatcher)
  • Periodically crawling the directory and comparing its contents to the contents of the archive directory

When a change is encountered, the project should raise the SourceChangedEvent.
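The crawling option can be sketched as a comparison of two snapshots. The names CrawlChange and CrawlDiff are hypothetical, and the use of last-write timestamps to detect modification is an assumption; the page only says that directory contents are compared to archive contents.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical change categories, mirroring the list under SourceChangedEvent.
public enum CrawlChange { Created, Deleted, Modified }

public static class CrawlDiff
{
    // archived: source path -> last write time of its srcML file.
    // source:   source path -> last write time of the source file itself.
    public static Dictionary<string, CrawlChange> Compare(
        IDictionary<string, DateTime> archived,
        IDictionary<string, DateTime> source)
    {
        var changes = new Dictionary<string, CrawlChange>();
        foreach (var entry in source)
        {
            DateTime archivedTime;
            if (!archived.TryGetValue(entry.Key, out archivedTime))
                changes[entry.Key] = CrawlChange.Created;   // no srcML yet
            else if (entry.Value > archivedTime)
                changes[entry.Key] = CrawlChange.Modified;  // source is newer than its srcML
        }
        foreach (string path in archived.Keys)
        {
            if (!source.ContainsKey(path))
                changes[path] = CrawlChange.Deleted;        // srcML exists, source is gone
        }
        return changes;
    }
}
```

Each entry in the result would correspond to one SourceChangedEvent raised by the project.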

StopMonitoring

When called, the project object should stop monitoring the list of source files. This means it should either unsubscribe from the events it is listening to or stop crawling the directory.

After StopMonitoring is called, the SourceChangedEvent should no longer be raised.
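Taken together, the four members could be sketched as follows, with a directory-based implementation built on FileSystemWatcher. The member signatures, the SourceChangedEventArgs and SourceChangeType types, and the body of DirectoryProject are all assumptions made for illustration; the real ABB.SrcML types may differ. Rename notifications are not handled in this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public enum SourceChangeType { Created, Deleted, Modified }

// Hypothetical event payload: which file changed, and how.
public class SourceChangedEventArgs : EventArgs
{
    public SourceChangedEventArgs(string sourcePath, SourceChangeType changeType)
    {
        SourcePath = sourcePath;
        ChangeType = changeType;
    }
    public string SourcePath { get; private set; }
    public SourceChangeType ChangeType { get; private set; }
}

// Assumed shape of the project interface described above.
public interface IProject
{
    event EventHandler<SourceChangedEventArgs> SourceChangedEvent;
    IEnumerable<string> GetListOfFiles();
    void StartMonitoring();
    void StopMonitoring();
}

public class DirectoryProject : IProject
{
    private readonly FileSystemWatcher watcher;

    public DirectoryProject(string directory)
    {
        watcher = new FileSystemWatcher(directory) { IncludeSubdirectories = true };
        watcher.Created += (s, e) => Raise(e.FullPath, SourceChangeType.Created);
        watcher.Deleted += (s, e) => Raise(e.FullPath, SourceChangeType.Deleted);
        watcher.Changed += (s, e) => Raise(e.FullPath, SourceChangeType.Modified);
    }

    public event EventHandler<SourceChangedEventArgs> SourceChangedEvent;

    public IEnumerable<string> GetListOfFiles()
    {
        return Directory.EnumerateFiles(watcher.Path, "*", SearchOption.AllDirectories);
    }

    // The watcher only delivers notifications while EnableRaisingEvents is true.
    public void StartMonitoring() { watcher.EnableRaisingEvents = true; }
    public void StopMonitoring() { watcher.EnableRaisingEvents = false; }

    private void Raise(string path, SourceChangeType type)
    {
        var handler = SourceChangedEvent;
        if (handler != null) handler(this, new SourceChangedEventArgs(path, type));
    }
}
```

Toggling EnableRaisingEvents keeps the subscriptions in place, so monitoring can be started and stopped repeatedly on the same object.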

The srcML archive class

The archive class is the primary interface for the incrementally updating archive. It supports common archive operations such as:

  • iterating over the source files
  • exporting the archive to source code
  • getting information about the archive (root attributes, etc.)

The primary feature of this class, though, is that it implements the IProject interface. This means that it can monitor a list of source files for changes. It also wraps an IProject object that is used to do the actual monitoring of the source code.

A common usage looks like this:

IProject directory = new DirectoryProject("/path/to/source/code");
SrcMLArchive archive = new SrcMLArchive(directory);
archive.StartMonitoring(); // causes directory.StartMonitoring() to execute

When the archive is notified of a change, it creates, updates, or removes the related srcML file and then raises its own SourceChangedEvent.
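The ordering matters: subscribers to the archive's event should only be notified once the srcML on disk is current. Here is a rough sketch of that handler. The class ArchiveUpdateSketch and both delegates are hypothetical stand-ins (one maps a source path to its srcML path, the other runs the source-to-srcML conversion), and the standard FileSystemEventArgs type is reused as the event payload purely for brevity.

```csharp
using System;
using System.IO;

// Hypothetical sketch of the update-then-notify step; not the real SrcMLArchive.
public class ArchiveUpdateSketch
{
    private readonly Func<string, string> xmlPathFor;        // source path -> srcML path (assumed)
    private readonly Action<string, string> generateSrcML;   // (source path, srcML path) (assumed)

    public ArchiveUpdateSketch(Func<string, string> xmlPathFor, Action<string, string> generateSrcML)
    {
        this.xmlPathFor = xmlPathFor;
        this.generateSrcML = generateSrcML;
    }

    // Raised only after the srcML on disk reflects the change.
    public event EventHandler<FileSystemEventArgs> SourceChangedEvent;

    // Intended to be subscribed to the wrapped project's change notification.
    public void OnProjectSourceChanged(object sender, FileSystemEventArgs e)
    {
        string xmlPath = xmlPathFor(e.FullPath);
        if (e.ChangeType == WatcherChangeTypes.Deleted)
        {
            File.Delete(xmlPath);               // source removed: drop its srcML
        }
        else
        {
            generateSrcML(e.FullPath, xmlPath); // created or modified: (re)generate
        }

        // Forward the notification now that the archive is up to date.
        var handler = SourceChangedEvent;
        if (handler != null) handler(this, e);
    }
}
```

A client such as Sando would subscribe to the archive's event rather than the project's, so it never reads a stale srcML file.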

Storing the srcML on disk

Given the following directory tree:

+ myCppProject/
|-- main.cpp
|-- component/
|   |-- component.cpp
|   +-- component.h
+-- mainHeader.h

We should get the following structure in the archive directory:

+ .srcml
|-- <hash of main.cpp>.xml
|-- <hash of component/component.cpp>.xml
|-- <hash of component/component.h>.xml
+-- <hash of mainHeader.h>.xml

There are a number of ways to hash the file paths. The key requirement here is that the "hash" be reversible (strictly speaking, it is an encoding rather than a hash). If the path to my source file is c:\path\to\me.cpp and the relevant path in the archive is c:\path\to\archive\<hash of c:\path\to\me.cpp>.xml, I should be able to convert the source path into the archive path and vice versa.

Some options for implementing this "hash" include:

  • Base64
  • Base32: Longer than Base64 but good for case-insensitive filesystems.

The reason for doing this is that development projects (such as Visual Studio projects) may include files from unrelated folders. Rather than recreate the entire path on disk in the archive directory, we encode each path using one of the above options.
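As a sketch of the Base64 option: the standard Base64 alphabet includes '/' and '+', and '/' cannot appear in a file name, so the example below swaps in the URL-safe alphabet from RFC 4648 ('-' and '_'). The class name PathCodecSketch, the choice of UTF-8, and the static method signatures are assumptions. Note that Base64 output is case-sensitive, which is the reason Base32 is listed as the alternative for case-insensitive file systems.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical reversible path encoding; not the actual ABB.SrcML implementation.
public static class PathCodecSketch
{
    // Encodes a source path as a file-name-safe Base64 string.
    public static string Encode(string sourcePath)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(sourcePath);
        return Convert.ToBase64String(bytes).Replace('+', '-').Replace('/', '_');
    }

    // Reverses Encode by restoring the standard alphabet before decoding.
    public static string Decode(string encoded)
    {
        byte[] bytes = Convert.FromBase64String(encoded.Replace('-', '+').Replace('_', '/'));
        return Encoding.UTF8.GetString(bytes);
    }

    // Source path -> srcML path inside the archive directory.
    public static string GetXmlPathForSourcePath(string archivePath, string sourcePath)
    {
        return Path.Combine(archivePath, Encode(sourcePath) + ".xml");
    }

    // srcML path -> original source path, recovered from the file name alone.
    public static string GetSourcePathForXmlPath(string xmlPath)
    {
        return Decode(Path.GetFileNameWithoutExtension(xmlPath));
    }
}
```

Because the full source path is recoverable from the file name, the archive directory can stay flat even when a project pulls in files from unrelated folders. Very long source paths produce long file names, so path length limits on the target file system are worth checking.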

Required Methods & Properties

Properties

  • Project: The archive is a wrapper around an IProject (called project here). The archive monitors the source files in project.
  • ArchivePath: The full path on disk to the directory where this archive stores its srcML files.
  • SourceChangedEvent (from IProject): SrcMLArchive raises SourceChangedEvent only after it has updated the related srcML file.

Methods

  • From IProject:
    • GetListOfFiles: Queries Project for the list of files.
    • StartMonitoring: Executes Project.StartMonitoring().
    • StopMonitoring: Executes Project.StopMonitoring().
  • GetXmlPathForSourcePath: Gets the full path to the srcML file for the given source file.
  • GetSourcePathForXmlPath: Gets the full path to the source file for the given srcML file.