Designing, coding and debugging Java Rich Client and Java EE Applications

Sunday, 23 March 2008

Paralellizing image filter operations, part 2

Continued from my last post, let's take a closer look at how the parallelizing of image filter operations is done in the demo.

How to parallelize image operations?

First for the general idea: The image is split up into two or four tiles. Then each of these tiles is processed by it's own thread. When all threads have finished execution the tiles
are joined together to form a whole image again.
This approach has two advantages:
Fist, there are no issues with resource locking, since each thread operates on it's own piece of image data.
And second, from the callers perspective the multithreading is completely transparent: A simple image is passed in and a simple image is returned.

When splitting up and joining the image we have to deal with the fact that some image filter operation like sharpening or blur (which I used for this demo) have a radius of operation.
This means, when the image is split up, this has to be taken into account as an overlap between the different tiles.
Also we must create a frame with the width of the radius around the image, so we still have something to calculate when we reach the edge.

Splitting an image up into four tiles looks like this:




Joining the tiles again is just some copy operation, taking care of the overlap.

Class design

I took the time to think a little bit about the structure of the demo application, instead of just slamming all of the code into a JPanel.
First, as I wrote earlier, Beans Binding and Swing Application Framework features where used to get a clean separation of view and model. I don't want to go into detail now concerning the view part, though.

For now we focus on the model part.

The image to filter, the resulting image and all the settings to use for this operation (such as how many threads to use and the radius) are encapsulated in a Domain Model style ImageOpModel class.
The filter operation is encapsulated by the Interface ImageOperation. That way the "operation" can be just about any kind of image operation that can be performed on a BufferedImage.
For this demo GaussianBlurImageOp is the only implementation. It performs, well of course, the Gaussian blur filter.
Now, ImageOpModel executes the ImageOperation directly if no multithreading is to be used. Otherwise it uses an implementation of ImageOpExecuter to perform the ImageOperation.
The point here was having the overhead associated with the multithreaded approach only in case that multiple threads are really used.
After all we want to play fair with the single threaded approach and retain it's advantage: Less overhead.
For ImageOpExecuter there is a single implementation: MultithreadedImageOpExec.
This class takes care of parallelizing the ImageOpertion into 2 or 4 threads. And that's pretty much what the whole demo was about in the first place.

So, the design of the model is straightforward:

Implementation
Let's take a look at the implementation.
First I'd like to go a little bit into the necessary utility (or rather boilerplate?) code of the demo, starting with the Domain Model style ImageOpModel.

Here are the attribute declarations of ImageOpModel:

public class ImageOpModel {
//Path to the image resource to load
private String imagePath;
//Running in threaded mode?
private boolean multithreaded=false;
//The number of threads to use in multithreaded mode
private int numThreads=2;
//The radius for the image operation, if any
private int radius=25;
//The source image to perform the operation on
private BufferedImage sourceImage=null;
//The finished, filtered image
private volatile BufferedImage processedImage=null;
//Should the processed image or the source image be visible in the view(s)
private volatile boolean processedImageReady=false;
//How long did the last image operation take?
private double lastProcessingTime=0;
//The image operation to perform on sourceImage resulting in
//processedImage
private ImageOperation imageOp;
//The Executer for the multithreaded variant
private ImageOpExecuter executer;


Ok, thanks to the unusual generous inline commentary the attributes should be more or less clear.
Multithreaded, numThreads and radius are coupled to the GUI Form via Beans Binding. The sourceImage gets initialized by a method called loadImage() which is as straightforward as the name suggests.
ImageOp and executer are initialized the old fashioned way in the constructor:

public ImageOpModel(String imagePath) {
this.imagePath=imagePath;
this.imageOp=new GausianBlurImageOp(radius);
this.executer=new MultithreadedImageOpExec();
}

This would have been a good point to try out Google Guice as well, but I left this to next version of the demo.
The interesting part is what happens when the user presses the "Execute Operation" button in the GUI. Through the magic of an Application Framework Action the method execOp() in our model class gets executed.
Inside execOp() the following is going on:

public void execOp(){
//Show the resulting image only when we are finished
this.processedImageReady=false;
//Keep track of the time we need to execute the operation
long tStart=System.nanoTime();
//Create a copy of the image with a "radius" wide border for the operation
this.processedImage=new BufferedImage(sourceImage.getWidth() + 2 * this.getRadius(),
sourceImage.getHeight() + 2 * this.getRadius(),
BufferedImage.TYPE_INT_ARGB);
Graphics2D g2 = this.processedImage.createGraphics();
g2.drawImage(
sourceImage, this.getRadius(), this.getRadius(), null);
//If we are running in multithreaded mode, we use the executer
if(this.isMultithreaded()){
this.executer.executeOperation(this.processedImage, this.getImageOp(),
this.getNumThreads(), this.getRadius());
}else{
//otherwise we just perform the operation directly
this.processedImage=this.imageOp.executeOperation(this.processedImage);
}
//We are finished. Showtime!
this.processedImageReady=true;
//The last thing to do: See how long it took.
long tDelta=System.nanoTime()-tStart;
this.lastProcessingTime=(tDelta/1000000000d);
}
So basically what we do here is executing the ImageOp directly if we don't want to do multithreading, or using the executer to do so for the multithreaded mode.
Around these calls we have some utility code to keep track of the time used for the operation and setting A flag to indicate if an processed image is available.

Implementation of the multithreading
Now for the real interesting part of the demo: And I have to say it looks obvious as well. But that's not always a bad thing, right?

Here's the class with it's only attribute:

public class MultithreadedImageOpExec implements ImageOpExecuter {
private static ExecutorService executor = Executors.newCachedThreadPool();
...
The executor is an ExecuterService which is a concept introduced with Java 5 to create a new level of abstraction for multithreading.
An ExecuterService, as the name suggests, is a service that executes a piece of code. That "piece of code" has to be wrapped by either a Runnable or Callable interface.
Runnable has been around since Java 1.0 and it contains just one simple method to implement: void run().
Callable however is a new addition since Java 5 and a lot more useful. It's parametrized with generics
(Callable )and has only one method as well: V call().
Okay, that doesn't look too spectacular, but it means, that a Callable can actually return a value in a (type)safe way.
We use the Factory style Executors class to obtain a CachedTreadPool. This very sophisticated implementation not only takes care of creating new threads as necessary, but also
caches the threads for reuse. Creating a new thread is an expansive operation, so caching makes a lot of sense. It's a neat thing that this taken care of entirely by the Executer
framework.

Here's the method that executes the ImageOperation using the executor:
    public BufferedImage executeOperation(final BufferedImage sourceImage,
final ImageOperation operation, final int numThreads, final int overlap) {

try {
//Split the source image in a list of tiles.
List tiles = splitIntoTiles(sourceImage, numThreads, overlap);
//Create a worker object that executes the operation for every tile
List workers = new ArrayList(numThreads);
//Intitialize the workers with their coresponding tiles
for (BufferedImage curTile : tiles) {
workers.add(new ImageOpWorker(curTile,operation));
}
//Let the executor framework invoke all workers and handle the
//task of coordination: When invokeAll returns, all Futures will
//have been processed
List<future> results=executor.invokeAll(workers);
//Clear the List of source tiles before reusing it for the processed tiles
tiles.clear();
//Transfer the processed tiles to the simple list of BufferedImages
//so we don't have to deal with the nature of Futures during joining
for(Future\ curFuture:results){
tiles.add(curFuture.get());
}
//rejoin the processed tiles to the sourceImage
joinTiles(tiles, sourceImage, overlap);
} catch (InterruptedException ex) {
..
} catch (ExecutionException eE) {
..
}
return sourceImage;
}

The ImageOperation and the image to process are passed into the method. First the image is split into 2 or 4 tiles by the method splitIntoTiles().
This method is actually quite ugly and long at the moment, so I won't show it here (You can check it out in the source code, though).
Next for every tile an instance of ImageOpWorker is constructed.
ImageOpWorker is an inner class implementing Callable:

    public class ImageOpWorker implements Callable {

private volatile BufferedImage sourceTileImage;
private final ImageOperation operation;

public ImageOpWorker(BufferedImage image, final ImageOperation operation) {
this.sourceTileImage = image;
this.operation = operation;
}

public BufferedImage call() throws Exception {
return this.operation.executeOperation(sourceTileImage);
}
}


As you can see it's really a thin wrapper around the code that's supposed to be executed be a separate thread.

Back to the executeOperation() Method:
Now that we constructed a list of ImageOpWorkers for all the tiles we want to process, we just call invokeAll() on the executer and
pass the list of
ImageOpWorkers. What the executer does for us now is the following:
  • For each ImageOpWorker it either creates a new thread or reuses an existing one
  • It executes the call() method on each worker and wraps the returned image into a Future
  • It returns only when all workers have terminated execution

This means that the executor not only caters for the creation of threads, but also does the thread synchronisation for us. Otherwise we
would have to use a CountDownLatch or something similar to wait for all threads to complete.

The returned Future Objects wrap computational results that, well, come from THE FUTURE! Unfortunately it doesn't have anything to
do with Michael J. Fox's flux compensator. Rather it's purpose is simply to return the result of an asynchronous computation through it's
get() method. The trick is, that get() will block until the asynchronous computation is finished and the result is available. In our case however we
already know that all our Futures are ready, anyway.
The next step is to pull the results out of the Future(s) and store the images in a simple list.
We do that because we want to keep the amount of multithreading aware code as small as possible. The joinTiles() method should not
need to deal with Futures. After joining the tiles again we return the manipulated image as a normal BufferedImage again and
we are done!


Remember to try the webstart version of the demo.
Also you can have a closer look at the sourcecode: MultithreadingDemoSrc.zip (1.49Mb)

Friday, 21 March 2008

Paralellizing image filter operations, part 1

On last December's Javapolis I attended the "filthy makeover" track by Chet Haase.
Chet showed some cool performance hacks for Swing, among them a nasty one for speeding up an image blur filter. While the trick he used was a cool way of cutting corners a little bit for performance sake, I kept on wondering:
Isn't image processing exactly the kind of computation that would hugely benefit from a multithreaded approach? And after all parallelizing such an operation should be fairly straight forward: Just split the image up into tiles and process each one independently. Plus: The new (well not that new anymore) concurrency features of Java 5 should make the job quite a bit easier.
I decided to try it out and developed a little demo in form of a Swing application.
For building this demo I used NetBeans 6. Unfortunately I couldn't resist using some the new features of NetBeans for building Swing Applications: I used
the new Swing Application Framework and Beans Binding. I found that these new frameworks around Swing and the support for them in NetBeans show great potential for making a Swing developers life easier. Even if there are still some rough edges. But I leave that for some other blog entry.
For the demo I used a Gaussian blur filter that I sto...hmm...cop...aah.. well that I got from a Filthy Rich Client example by Romain Guy. I sure hope he doesn't mind that I used his code snipped instead of digging it up myself. Anyway, the filter is applied to a really large image, so the benefits from the parallelizing will be clearly visible and the involved overhead plays less of a role.
Here's what the (for now) finished demo looks like:

The image used is a 3836x2560 pixel photo that a friend of mine took at Frankfurt airport. It is shown either in "scale to fit" view, or in a "1:1" view (the view can be switched by double clicking on it).
Also, the radius for blur operation can be changed through the GUI. The larger the radius is, the more the image gets blurred and the more time-consuming the operation becomes.
If the blur is applied with a 25 pixel radius the results are the following when viewed in the "1:1" zoomed in view:

Okay, but what about the performance now?
To be honest, I wasn't 100 percent sure if the multithreading would really boost the performance, considering the added overhead of the tiling.

So, I was delighted to see the following results on my 2.2 Ghz Core Duo 2 Windows XP Laptop:
MultithreadingRadius 10Radius 25Radius 50
None2.75 sec5.986 sec11.121 sec
2 Threads2.082 sec3.515 sec6.597 sec
4 Threads2.202 sec3.601 sec7.151 sec

While the operation benefits just a little for a small radius of 10 pixels, the performance boost is really there for a 25 and 50 pixels radius! Using 2 Threads the performance is 1.7 times higher than using no multithreading at all. A twofold increase in in performance would be the theoretical maximum. Considering that the approach with one thread avoids the overhead of tiling and creating threads all together, the 1.7 fold increase is not so bad, really.
Using 4 threads is slower than using 2 on a machine with 2 cores, but the performance hit seems not so severe.
So, what happens if I let the demo run on my new 2.33 Ghz quad core desktop machine?
Here are the results:
MultithreadingRadius 10Radius 25Radius 50
None2.561 sec5.706 sec10.829 sec
2 Threads1.624 sec3.17 sec5.826 sec
4 Threads1.42 sec2.652 sec4.786 sec


Damn! It is a little bit faster with 4 threads but not as much as I hoped for. I guess I have to investigate that a bit in profiling.
Oh, here are also some results from my old single core 1,5 Ghz Pentium M
laptop running Ubuntu 7.10:
MultithreadingRadius 10Radius 25Radius 50
None8.581 sec18.968 sec35.674 sec
2 Threads9.128 sec19.745 sec36.257 sec
4 Threads9.089 sec19.949 sec36.888 sec


If you want to try the demo out for yourself you can webstart it up directly from here!

The complete source code is available here: MultithreadingDemoSrc.zip (1.49Mb)
It's in the form of NetBeans 6 project.
The rather large download size is mostly due to the contained high resolution photo, that weights in at 1.33 Mb.

In the next post I'll go further into the details of the demo, posting some code fragments and a little bit of design background...