rss

Exploring other concurrency models

6

Category : C#, English, Google Summer of Code 2008, Mono, Programming

So, as you know ParallelFx helps programmers getting their software working nicely with multi core systems. To that end, it gives you several way to parallelize your code.

Nowadays in ParallelFx

The most simple and barebone approach is the Task API which is more or less a transcription of the standard .NET threading API (with some cool twist like continuations). Contrary to the ThreadPool or standard Thread, Tasks try to alleviate the work of the OS scheduler by using the ideal amount of workers. Thanks to that, thread context switches are avoided and it allows to get the more out of your processor cores in term of parallelism.

Closely building on that part, you have another nice abstraction called futures which let you think in term of computations rather than acting on results. Indeed, instead of directly manipulating results (and incidentally being bound by their computing time) you are encouraged to express how you would chain up result manipulation together and actually let the thing computes itself when it is really needed. That way, it allows everything to be done in background and be returned when ready.

Then, there are the Parallel class constructs that mimics classical imperative loops among which you find For, ForEach, While, etc. This way, imperative programmers feels at home and can get their loops parallelized easily (supposing there are no nasty side-effects and unprotected shared state manipulations of course).

Finally, PLinq allows query fans to get their Linq expressions running in parallel. Again, provided that there are the same conditions of thread-safety, this layer allows quick migration of existing code (basically some s/Enumerable/ParallelEnumerable/ and the addition of AsParallel() operator).

The Big Picture

The realization that starts to arise from the current work done on parallel system (especially those applied in imperative language) is that shared state is evil.

The problem with shared-state is that it inherently hurt down parallel performances. When using lock for instance, you prevent your code from scaling when the number of processor core increase as you waste CPU cycle to spin waiting for the locked resource to be freed (locked operations are also flawed in the sense that they can’t be composed together but that’s another subject).

Atomic primitives, like CAS, aren’t a panacea either as they perturb the caching mechanism of the CPU (forcing cores to flush their internal cache and go back fetching the new values) thus bringing poorer performances.

All in one, there is no miracle solution to this problem. It’s a consensus that we still have to maintain a certain amount of sharing mechanism cause doing a completely « pure » software with no side effects or mutation is either awkward or abnormal for the usual imperative programmer.

In fact there is an existing « solution » but it’s rather extreme : isolation. In a perfect world, all your parallel task would work on their stuff in a completely isolated fashion with walls around the task to prevent any kind of cross-thread access. That way no lock, no cache messing, etc can get in the way. Utopia he ?

Today in Mono.Threading.Extensions

Inheriting from the previous conclusions, there is an interest today in creating hybrid systems where isolation is the default while still allowing from time to time communication between each isolation domains.

At the moment, the two most popular in my sense are actors (popularized by Erlang) and software-transactional-memory.

The former rely on message passing between independent object to achieve concurrency while the latter works like a MCC database where each thread acts on its own copy of objects with the changes going back later on in the shared memory and in a transactional manner.

Brainstorming on these ideas, I implemented really simple version of the two mentioned systems, both relying on existing ParallelFx components.

Actors

My actor implementation uses the Task API and the existing work-stealing scheduler. To allow some sort of preemptive scheduling (and thus avoid deadlock when actors are waiting for message from another actor) some operation are abstracted through combinator methods like Loop, LoopWhile, AndThen, etc…

Following is an example of a process ring using 10 000 actors each passing around a message token to its neighbor seven times :

using System;
using System.Threading;
using Mono.Threading.Actors;

namespace ActorTest
{
  class Process
  {
	IActor actor;
	int round = 0;
	int id = sharedId++;

	const int MaxRound = 7;
	static int sharedId = 0;

	public Process(Func<process> next)
	{
	  actor = Actor.Create (() => {
		  Combinators.Loop (() => {
			  if (actor.Receive ((m) => next().actor.Send(null, m.Message))) {
				round++;
				if (id % 10 == 0)
				  Console.WriteLine(id.ToString() + " : " + round.ToString ());
			  }
			  if (round > MaxRound) {
				Console.WriteLine("Finish");
				Environment.Exit (0);
			  }
			});
		});
	}

	public void Start()
	{
	  Console.WriteLine("Starting");
	  actor.Send(null, MessageArgs.Empty);
	}
  }

  public class MainClass
  {
	static Process mainProc;

	public static void Main()
	{
	  InitProcess(0);
	  mainProc.Start();
	  Thread.Sleep(50000);
	}

	static Process InitProcess(int count)
	{
	  if (count == 10000)
		return new Process(() => mainProc);
	  if (count > 0) {
		Process temp = InitProcess(count + 1);
		return new Process(() => temp);
	  }

	  Process tmp = InitProcess(count + 1);
	  mainProc = new Process(() => tmp);
	  return null;
	}
  }
}

While there are 10 000 actors here, only two threads are doing the actual processing on my dual-core system.

STM

The software-transactional-memory implementation manage copies of the shared objects that are passed to the transaction and later on uses MCas (multi compare-and-swap) to update the shared locations. MCas is itself based on an almost wait-free algorithm where threads try to cooperate rather than wait idly.

Again following is an example of transaction showing the traditional bank account problem with concurrent deposit/withdraw operations :

using System;
using System.Threading;
using System.Threading.Tasks;

using Mono.Threading.Transactions;

namespace StmTests
{
  public class StmTest
  {
	const int NbTimes = 100;

	class BankAccount : ICloneable
	{
	  int amount;

	  public BankAccount (int amount)
		{
		  this.amount = amount;
		}

	  public int Amount {
		get {
		  return amount;
		} set {
		  amount = value;
		}
	  }

	  public object Clone ()
	  {
		return new BankAccount(amount);
	  }
	}

	StmObject<bankAccount> amount =
	  new StmObject<bankAccount> (new BankAccount (100));

	public void Withdrawn (int number)
	{
	  Transaction.Create(OpeningMode.Write, amount, (acc) => {
		  acc.Amount -= number;
		}).Execute();
	  // Simulate activity
	  Thread.Sleep(0);
	  Console.WriteLine("Removed {0}€", number);
	}

	public void Deposit (int number)
	{
	  Transaction.Create(OpeningMode.Write, amount, (acc) => {
		  acc.Amount += number;
		}).Execute();
	  // Simulate activity
	  Thread.Sleep(0);
	  Console.WriteLine("Added {0}€", number);
	}

	public int Amount {
	  get {
		return amount.Value.Amount;
	  }
	}

	public static void Main ()
	{
	  StmTest test = new StmTest();

	  Console.WriteLine ("Base amount : " + test.Amount);

	  Task t1 = Task.StartNew(_ => Repeat ((o) => test.Withdrawn (5)));
	  Task t2 = Task.StartNew(_ => Repeat ((o) => test.Deposit (5)));

	  Task.WaitAll (t1, t2);

	  Console.WriteLine ("Final amount : " + test.Amount);
	}

	static void Repeat (Action<object> action)
	{
	  for (int i = 0; i < NbTimes; i++) {
		Task.StartNew (action);
	  }
	}
  }
}

The Future

The parallel team at Microsoft is already working some of the stuff I described above for inclusion in ParallelFx. The STM group now has its own blog and there are experimentations going on in concurrency-friendly DSL.

You may find yet more exciting stuff of what is brainstormed in this video of Joe Duffy and Erik Meijer.

Conclusion

Thoughts ? Comments ? Use cases sharing ? You can go wild in the comments ;-) .

February previsions

Category : C#, English, General, Google Summer of Code 2008, Life, Mono, Programming, School

Mono

Thanks to the change of school semester at my school, I’m going to have a break in February of almost 4 weeks during which we are supposed to participate to some activities chosen among a list.

The cool thing is that I managed to make « contributing to Mono » one of those activities. Thus, I will have 4 weeks to work on Mono and integrating SoC’s ParallelFx work into the mainline tree with the blessing of my school. Yay :-) !

FOSDEM

There are also a bunch of other nice things I will be able to do this February. Going to FOSDEM is one of those (I took the train tickets this afternoon) which mean I can now wear the pretty badge too :

I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting

Unless we have trains problems on the way, we should be there for the Fosdem’s Beer Event too, so if you want to discuss Mono, ParallelFx or whatever around an (excellent) Belgian beer be sure to come by ;-) .

Prologin

Finally, I was qualified to go into the semi-finals of Prologin, the French algorithmic contest for young people, this year again. It will be a good occasion to see old faces.

For the first time there are also proposing C# as a qualification language. Hopefully, it will mean no on-the-fly C++ learning for me this time :-) .

Review time !

Category : C#, English, Google Summer of Code 2008, Mono, Programming

Now that Google Summer of Code is drawing to an end, it’s time to summarize what has been achieved during these two months. Before anything else I would to thank again everybody (and most noticeably the folks from Mono and Google) for this summer, it has been awesome :) !

Base API

First of all let’s see how we compare to Microsoft implementation with the output of corcompare. As you can see, a lot of the reported missings come from boilerplate attributes and such, so we are pretty much feature complete (except for PLinq).

In addition, I showed throughout this blog different examples coming from Microsoft CTP releases and running on our implementation. Each one showcased a different part of the available API with every time a good performance boost. In order we have parallelized ray tracer, various samples and PLinq stress test.

As remarked, PLinq lags a little behind and will requires a good deal of love to complete operators and get some internal reworking until it’s satisfactory.

Extensions

Even though most of the additions are still draft, we already have some interesting bits.

System.Threading namespace got two new classes : AtomicBoolean and InterlockedExtensions. AtomicBoolean is a simple wrapper that allows the same sort of operations provided by Interlocked.Exchange and Interlocked.CompareAndExchange but on a boolean type.

InterlockedExtensions only contains a work-in-progress MCAS instruction at the moment. MCAS (Multiple Compare-and-Swap) is a handy primitive that can be used in a lot of situation. It has a semantic close to the traditional Compare-and-swap instruction (Interlocked.CompareExchange on .NET) except that it works with multiple target. Think of it as a series of CAS executed atomically and failing if any of the CAS failed. A naive implementation could be :

public static bool MCas<t>(T[] locations, T[] expected, T[] newValues)
{
    for (int i = 0; i < locations.Length; i++)
        if (locations[i] != expected[i])
            return false;
    for (int i = 0; i < locations.Length; i++)
        locations[i] = newValues[i];

    return true;
}

It’s particularly useful with linked-list like data structures that maintain more than one reference to nodes.

In System.Threading.Collections we also have a first draft of a concurrent skip-list that can be used both as a standard list (like List) and as a hash map (like Dictionary). Skip list are very interesting structures that possess a lot of the advantages of balanced trees but with much less complexity. The current version uses fine-grained locking in some place but I plan to use the aforementioned MCAS primitive to remove them completely.

Finally we have the new System.Threading.Actors namespace which is also much like a playground to my experimentations for the moment. It will allows actors-based concurrency with .NET, a feature that is already present in concurrent languages like Erlang and Java-ish languages like Scala and Clojure. For the moment the model is close to a teared-down version of the one used in Scala but I have to look into Clojure’s model which seems lighter and doesn’t require pattern matching to be usable.

I still have more ideas that could be useful for concurrent programming like more lock-free data structures and a STM manager. STM, which stands for Software Transactional Memory, is a concept close to the transaction model you have in database for example. The principle is that, when you want to do a set of operations concurrently on shared data, you encapsulate them in a transaction where read/write are logged. If, in the middle of your operations, the data is changed by another thread you are able to ‘rollback’ the transaction and try again. If no thread got in the way the modification can be ‘committed’ to the shared data and thus be visible by other threads. In practice it should be possible to adopt a helping scheme where threads that are colliding with an existing transaction will help it finish before doing their own work.

In conclusion, concurrency is becoming a first-class citizens these days and just cannot be ignored anymore by programmers. Big actors like Microsoft are already pushing for that with ParallelFx and, probably, future evolution of C# itself (C#4 ?) so we mustn’t lag behind ;) .

Call for ParallelFx test

4

Category : C#, English, Google Summer of Code 2008, Mono, Programming, Quick note

Now that Summer of Code deadline is getting nearer, it’s a good time to give a try to ParallelFx. For that purpose I just issued updated test packages @ http://netherilshade.free.fr/mono/parallelfx/. Among them you will find stuff I already showed here like the raytracer, sudoku, or image colorizer.

Each of these package contains the most recent version of the System.Threading assembly (so no need to compile yourself trunk) and a README file explaining succinctly what the program do and how to launch it. In addition, feel free to test the library against other existing code making use of ParallelFx.

If you encounter any bug or bad performance, please send me an email at jeremie [DOT] laval [AT] gmail [DOT] com with your processor info (for Linux users the content of /proc/cpuinfo will do the trick), description of the problem and debug stack trace/reproducible test case (if any). Also, try to make the mail’s subject self-explanatory by adding a [Pfx] tag for example. I will triage them when I get back from holidays.

Happy testing :-) .

Funny parallelism

4

Category : C#, English, Google Summer of Code 2008, Linux, Mono, Programming

For those who have some free leisure time this weekend and would like to explore a bit ParallelFx in a didactic manner I ported to Linux a Sudoku game coming from ParallelFx CTP samples which optionally uses ParallelFx to generate Sudoku grids.

For screenshot fans that’s how it looks like :

Sudoku main interface

Sudoku main interface

You can enable the use of ParallelFx via the two options « Use multiple processors to generate puzzles » and « Use speculative puzzle generation » found in the Game settings box :

Game settings window

Game settings window

The first one uses Parallel.For instead of a standard for to generate grids and the other allows the background generation of further grids using Future.

LINQ Raytracer but with a P twist

1

Category : C#, English, Google Summer of Code 2008, Mono, Programming

Following Marek’s annoucement about Mono’s compiler getting full C# 3.0 support and his proof-of-concept which none other than Luke Hoban LINQ-ified Ray tracer, I’m happy to say that by using Mono’s PLINQ and with as little modifications as this :

--- linq-sequential.cs	2008-07-25 11:38:53.000000000 +0200
+++ plinq-parallel.cs	2008-07-25 11:39:04.000000000 +0200
@@ -1,9 +1,9 @@
-internal void RenderSequential(Scene scene)
+internal void RenderParallel(Scene scene)
 {
     int[] rgb = new int[screenWidth * screenHeight];

     var pixelsQuery =
-        from y in Enumerable.Range(0, screenHeight)
+        from y in ParallelEnumerable.Range(0, screenHeight)
         let recenterY = -(y - (screenHeight / 2.0)) / (2.0 * screenHeight)
         select from x in Enumerable.Range(0, screenWidth)
                let recenterX = (x - (screenWidth / 2.0)) / (2.0 * screenWidth)
@@ -65,14 +65,14 @@
                select new { X = x, Y = y, Color = traceRay(new TraceRayArgs(ray, scene, 0)) };

     int rowsProcessed = 0;
-    foreach (var row in pixelsQuery)
+    pixelsQuery.ForAll(row =>
     {
         foreach (var pixel in row)
         {
             rgb[pixel.X + (pixel.Y * screenWidth)] = pixel.Color.ToInt32();
         }
-        rowsProcessed++;
-        if (rowsProcessed % rowsPerUpdate == 0 ||
-            rowsProcessed >= screenHeight) updateImageHandler(rgb);
-    }
+        int processed = Interlocked.Increment(ref rowsProcessed);
+        if (processed % rowsPerUpdate == 0 ||
+            processed >= screenHeight) updateImageHandler(rgb);
+    });
 }

… the code runs approximately 1,5x faster on my dual-core computer. Actually it’s rather good although there is still room for improvement (OrderBy fallbacks to the traditional LINQ operator for instance). The Ray tracer here is particularly well suited to parallelization because calling MoveNext() involves a good deal of computation with no-constant time as it depends on how much reflection there is, that’s why the processing feels slower on the bottom of the picture.

Following are two screencasts of both sequential and parallel version of the Ray Tracer found in Microsoft ParallelFx CTP samples :

Actually colors are much more good-looking than this, but since GIF format supports 256 colors at most the output is a little fuzzy.

Now off to some multi-threading debug.

Current PLinq : mixed results

Category : C#, English, Google Summer of Code 2008, Mono, Programming

As described in the roadmap, PLinq implementation work has started a week ago. Some good bits are already there like what I consider to be the 3 most important operators (because most of the others can be expressed using these ones) : Select, Where and Aggregate. Following is a discussion on the specificities of PLinq, difficulties they raise and, at the end, some examples/benchmarks.

A thing to be distinguished in PLinq contrary to standard Linq is the way data is processed. With Linq you can pretty much assume that data is processed in order. For example in Mono, Select is implemented like this internally :

static IEnumerable CreateSelectIterator (IEnumerable source, Func selector)
{
	foreach (var element in source)
		yield return selector (element);
}

So if the IEnumerable‘s iterator is built to return data in order (like most standard collections in .NET) this order will be kept. More importantly, some Linq’s operators would lose their purpose if data wasn’t kept in order. Operators such as ElementAt, Skip, First are dependent on indexing and thus can’t work reliably on scrambled data.

Now the problem is that PLinq doesn’t follow this intuitive ordering property and that for a simple reason : performances. Indeed, just like Parallel.For split the domain to parallelize work, PLinq’s queries are run on workers which just take randomly data from the source enumerable and return them back without reordering them (or, put in another way, ordered by which took the less processing time). By doing things like this we can yield optimum performances. On the other hand, most applications rely somehow on data ordering, be it implicitly or with operators like OrderBy, which mean we can’t just ignore that aspect of Linq.

The current way PLinq allows this ordering part is to have two separate extension methods for turning IEnumerable into IParallelEnumerable. The first, AsParallel(), allows the engine to process the data as it wants, so a developer can’t expect in which order its initial data will be given back to him. The second one, AsOrdered(), enforces strict order preservation at each step of the processing and, in fact, allows the class of operators I cited before to work as expected. However if ordering is explicitly turned on, either by AsOrdered() or OrderBy operator, the performance penalty can be important, killing the interest to parallelize queries in the first place (which is why it’s turned off by default).

At the moment only the first mode (unordered processing) is supported in Mono’s PLinq. Adding support for the other mode will be quite challenging with operators like Where because they tend to create gaps in the source enumerable, gaps which are very hard to track in a non-deterministic environment like ParallelFx. I still have no clear idea on how to handle this in Mono’s ParallelFx without too much locking (if anyone has pointers go ahead ;) ) .

Finally to close this post on a lighter touch, I have some examples which works on current implementation. First of all simple query like this one :

ParallelEnumerable.Range(1, 2000)
	.Where(i => i % 10 == 0)
	.Select(i => i + 1)

Yield around 1.5x performance improvement when run on my dual-core system. The actual precise number vary importantly on external things like how much data is queried, how long an operator takes to process one item, how selective are the predicates used …

A second example which yield very little improvement at the moment (an example of why PLinq isn’t a panacea) is this sort-of adaptive image blurrer. It takes a pixel and filter the others according to how much close their color are to the selected pixel’s one. If a pixel doesn’t comply to this predicate it gets blurred with a bloom-like effect :


(In the screenshot blue and white colors are supposed to be more blurred and thus fade in the background)

The parallelized version of this program is actually slower by a little amount than its synchronous counterpart on Linux but a little quicker on Windows for about the same run-time of the synchronous version on both platforms.

Hello Monologue

1

Category : English, General, Google Summer of Code 2008, Mono

After much RSS problems (and there are still some in fact), I’m now on Monologue. Thanks to Miguel and Michael for adding me :) .

Now for the usual presentation : my name is Jérémie Laval, I’m a french student participating in this year edition of Google Summer of Code for the Mono Project. The detail of my submission can be found in an older blog post.

To summarize, I’m doing an open-source implementation of ParallelFx, a framework designed to ease the development of parallel application on CLR platform. You can see the current results of my work here and here.

Apart from Mono and Summer of Code, I’m also working on two pet projects : D-Bus Explorer and Circ.

Finally, for more general informations, you can take a look at my « About » page.

(Late) Zurich followup

Category : English, Google Summer of Code 2008, Mono, Programming

So, as I said previously, I went to Google Zurich last Thursday. It was really fun, I indeed met a cool bunch of people and GSoC colleagues. It was really interesting to discuss with Petr (gda2tiles), Peter (Mercurial mentor), Arthur (aptitude-gtk) and the physicist guy whose name I forgot (so sorry :P ). By the way, thanks a ton to him for the laptop lend ;) .

My presentation went smoothly and people seemed interested (you can’t really tell for sure :) ). At least it sparkled some interesting questions. If you want the slides here they are : ParallelFx presentation. The other talks were both diverse and interesting too.

Of course, Google in itself was a blast. Unfortunately I had to catch the train so I wasn’t able to visit the whole building like the others but, even with the short glimpse I had, I’m pretty sure it must be fun to work at Google ;) . Still, I managed to retrieve a nice-looking Google tee-shirt. It’s rather funny to see that GSoC is the way to go if you want to freshen up your clothes stock ;) .

More parallel benchmarking

1

Category : C#, English, Google Summer of Code 2008, Mono, Programming

This weekend I played with the Microsoft ParallelFx CTP’s samples, adapting and running them on current Mono ParallelFx implementation. I’m going to show you two of them and the results. I used the same computer than previously for running the samples.

Image Colorizer

The first one is an image colorizer. For the test, I used a .jpg file weighting 8,3 Mio and containing 13273085 pixels (4207×3155). The result of the processing is below :

For the same color region, the parallel-enabled processing was 1.77x faster than the synchronous one (parallel time : 4.3368363s, non-parallel time : 7.6746839s).

As can be seen on the two following cpu graphs, the parallel version take advantages of both core in my CPU :

(Parallel version running)


(Non-parallel version running)

Benchmark sample

The second sample is a benchmark program containing a bunch of test. Each test uses one of the available API in ParallelFx. For the run, I enabled the 3 following benchmarks :

  • Matrix multiplication : two random-filled matrix are multiplied together. Both are 500 by 500 matrix.
  • Tree sum : all nodes of a binary-tree are added together. I set tree depth to 24 and each node’s value was randomly chosen.
  • Searching : a random word is searched among all files present in a directory. I used a novel with each chapter split in a distinct .html file. The content of the directory was duplicated so that its total size was 12,6 Mio.

Each of these tests were run 7 times by the program and the time given back was the average of each run with the two extreme values being removed. Following are the results with, in parentheses, the ParallelFx API used by the test.

  • Matrix multiplication : 2,25x speedup (Parallel.For).
  • Tree sum : 1,58x speedup (Future).
  • Searching : 1,37x speedup (Parallel.ForEach).

Conclusion

So we can see that ParallelFx always provide a boost but that the overall improvement seems to depend on the component used and how it is used. It would be interesting to compare these results with the ones yielded by Microsoft implementation and by the Mono implementation but running on .NET. I will probably do that at the end of GSoC as a conclusion of the work put during the summer ;) .

In addition these samples showed some bugs. Indeed, I had to disable two tests in the benchmark sample because one of them returned different results between parallel/non-parallel version and the other was causing a stack overflow in Mono runtime (I still need to find out if it’s Mono fault or mine).