rss

Exploring other concurrency models

6

Category : C#, English, Google Summer of Code 2008, Mono, Programming

So, as you know ParallelFx helps programmers getting their software working nicely with multi core systems. To that end, it gives you several way to parallelize your code.

Nowadays in ParallelFx

The most simple and barebone approach is the Task API which is more or less a transcription of the standard .NET threading API (with some cool twist like continuations). Contrary to the ThreadPool or standard Thread, Tasks try to alleviate the work of the OS scheduler by using the ideal amount of workers. Thanks to that, thread context switches are avoided and it allows to get the more out of your processor cores in term of parallelism.

Closely building on that part, you have another nice abstraction called futures which let you think in term of computations rather than acting on results. Indeed, instead of directly manipulating results (and incidentally being bound by their computing time) you are encouraged to express how you would chain up result manipulation together and actually let the thing computes itself when it is really needed. That way, it allows everything to be done in background and be returned when ready.

Then, there are the Parallel class constructs that mimics classical imperative loops among which you find For, ForEach, While, etc. This way, imperative programmers feels at home and can get their loops parallelized easily (supposing there are no nasty side-effects and unprotected shared state manipulations of course).

Finally, PLinq allows query fans to get their Linq expressions running in parallel. Again, provided that there are the same conditions of thread-safety, this layer allows quick migration of existing code (basically some s/Enumerable/ParallelEnumerable/ and the addition of AsParallel() operator).

The Big Picture

The realization that starts to arise from the current work done on parallel system (especially those applied in imperative language) is that shared state is evil.

The problem with shared-state is that it inherently hurt down parallel performances. When using lock for instance, you prevent your code from scaling when the number of processor core increase as you waste CPU cycle to spin waiting for the locked resource to be freed (locked operations are also flawed in the sense that they can’t be composed together but that’s another subject).

Atomic primitives, like CAS, aren’t a panacea either as they perturb the caching mechanism of the CPU (forcing cores to flush their internal cache and go back fetching the new values) thus bringing poorer performances.

All in one, there is no miracle solution to this problem. It’s a consensus that we still have to maintain a certain amount of sharing mechanism cause doing a completely « pure » software with no side effects or mutation is either awkward or abnormal for the usual imperative programmer.

In fact there is an existing « solution » but it’s rather extreme : isolation. In a perfect world, all your parallel task would work on their stuff in a completely isolated fashion with walls around the task to prevent any kind of cross-thread access. That way no lock, no cache messing, etc can get in the way. Utopia he ?

Today in Mono.Threading.Extensions

Inheriting from the previous conclusions, there is an interest today in creating hybrid systems where isolation is the default while still allowing from time to time communication between each isolation domains.

At the moment, the two most popular in my sense are actors (popularized by Erlang) and software-transactional-memory.

The former rely on message passing between independent object to achieve concurrency while the latter works like a MCC database where each thread acts on its own copy of objects with the changes going back later on in the shared memory and in a transactional manner.

Brainstorming on these ideas, I implemented really simple version of the two mentioned systems, both relying on existing ParallelFx components.

Actors

My actor implementation uses the Task API and the existing work-stealing scheduler. To allow some sort of preemptive scheduling (and thus avoid deadlock when actors are waiting for message from another actor) some operation are abstracted through combinator methods like Loop, LoopWhile, AndThen, etc…

Following is an example of a process ring using 10 000 actors each passing around a message token to its neighbor seven times :

using System;
using System.Threading;
using Mono.Threading.Actors;

namespace ActorTest
{
  class Process
  {
	IActor actor;
	int round = 0;
	int id = sharedId++;

	const int MaxRound = 7;
	static int sharedId = 0;

	public Process(Func<process> next)
	{
	  actor = Actor.Create (() => {
		  Combinators.Loop (() => {
			  if (actor.Receive ((m) => next().actor.Send(null, m.Message))) {
				round++;
				if (id % 10 == 0)
				  Console.WriteLine(id.ToString() + " : " + round.ToString ());
			  }
			  if (round > MaxRound) {
				Console.WriteLine("Finish");
				Environment.Exit (0);
			  }
			});
		});
	}

	public void Start()
	{
	  Console.WriteLine("Starting");
	  actor.Send(null, MessageArgs.Empty);
	}
  }

  public class MainClass
  {
	static Process mainProc;

	public static void Main()
	{
	  InitProcess(0);
	  mainProc.Start();
	  Thread.Sleep(50000);
	}

	static Process InitProcess(int count)
	{
	  if (count == 10000)
		return new Process(() => mainProc);
	  if (count > 0) {
		Process temp = InitProcess(count + 1);
		return new Process(() => temp);
	  }

	  Process tmp = InitProcess(count + 1);
	  mainProc = new Process(() => tmp);
	  return null;
	}
  }
}

While there are 10 000 actors here, only two threads are doing the actual processing on my dual-core system.

STM

The software-transactional-memory implementation manage copies of the shared objects that are passed to the transaction and later on uses MCas (multi compare-and-swap) to update the shared locations. MCas is itself based on an almost wait-free algorithm where threads try to cooperate rather than wait idly.

Again following is an example of transaction showing the traditional bank account problem with concurrent deposit/withdraw operations :

using System;
using System.Threading;
using System.Threading.Tasks;

using Mono.Threading.Transactions;

namespace StmTests
{
  public class StmTest
  {
	const int NbTimes = 100;

	class BankAccount : ICloneable
	{
	  int amount;

	  public BankAccount (int amount)
		{
		  this.amount = amount;
		}

	  public int Amount {
		get {
		  return amount;
		} set {
		  amount = value;
		}
	  }

	  public object Clone ()
	  {
		return new BankAccount(amount);
	  }
	}

	StmObject<bankAccount> amount =
	  new StmObject<bankAccount> (new BankAccount (100));

	public void Withdrawn (int number)
	{
	  Transaction.Create(OpeningMode.Write, amount, (acc) => {
		  acc.Amount -= number;
		}).Execute();
	  // Simulate activity
	  Thread.Sleep(0);
	  Console.WriteLine("Removed {0}€", number);
	}

	public void Deposit (int number)
	{
	  Transaction.Create(OpeningMode.Write, amount, (acc) => {
		  acc.Amount += number;
		}).Execute();
	  // Simulate activity
	  Thread.Sleep(0);
	  Console.WriteLine("Added {0}€", number);
	}

	public int Amount {
	  get {
		return amount.Value.Amount;
	  }
	}

	public static void Main ()
	{
	  StmTest test = new StmTest();

	  Console.WriteLine ("Base amount : " + test.Amount);

	  Task t1 = Task.StartNew(_ => Repeat ((o) => test.Withdrawn (5)));
	  Task t2 = Task.StartNew(_ => Repeat ((o) => test.Deposit (5)));

	  Task.WaitAll (t1, t2);

	  Console.WriteLine ("Final amount : " + test.Amount);
	}

	static void Repeat (Action<object> action)
	{
	  for (int i = 0; i < NbTimes; i++) {
		Task.StartNew (action);
	  }
	}
  }
}

The Future

The parallel team at Microsoft is already working some of the stuff I described above for inclusion in ParallelFx. The STM group now has its own blog and there are experimentations going on in concurrency-friendly DSL.

You may find yet more exciting stuff of what is brainstormed in this video of Joe Duffy and Erik Meijer.

Conclusion

Thoughts ? Comments ? Use cases sharing ? You can go wild in the comments ;-) .

February previsions

Category : C#, English, General, Google Summer of Code 2008, Life, Mono, Programming, School

Mono

Thanks to the change of school semester at my school, I’m going to have a break in February of almost 4 weeks during which we are supposed to participate to some activities chosen among a list.

The cool thing is that I managed to make « contributing to Mono » one of those activities. Thus, I will have 4 weeks to work on Mono and integrating SoC’s ParallelFx work into the mainline tree with the blessing of my school. Yay :-) !

FOSDEM

There are also a bunch of other nice things I will be able to do this February. Going to FOSDEM is one of those (I took the train tickets this afternoon) which mean I can now wear the pretty badge too :

I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting

Unless we have trains problems on the way, we should be there for the Fosdem’s Beer Event too, so if you want to discuss Mono, ParallelFx or whatever around an (excellent) Belgian beer be sure to come by ;-) .

Prologin

Finally, I was qualified to go into the semi-finals of Prologin, the French algorithmic contest for young people, this year again. It will be a good occasion to see old faces.

For the first time there are also proposing C# as a qualification language. Hopefully, it will mean no on-the-fly C++ learning for me this time :-) .

Capharnaüm #6

Category : C#, English, Life, Mono, Programming, School

What’s happening since last time.

Programming

pdc2008brain

Appetizing isn't it ?

PDC is over. Lot of exciting stuff coming down the pipe for .NET and parallel programming. However, no C#4 announcement for parallelism support (Anders mentioned it in the intro of his keynote but no showcasing).

In other news, ParallelFx is now available directly in .NET 4 CTP (sucks that it’s only available in this virtual machine format thing). This new release brings a number of change which will be implemented in Mono when I have more free time (and a somehow formal API documentation). So far I haven’t see any breaking modifications except some API renaming, additional state exposure or adapting some of the stuff I had already coded in advance ;-) .

An interesting discovery with the DSS and CCR framework (next SoC maybe ? :-) ). Enjoyed Miguel keynote too.

Life

richard_stallman_2005_chrys

by chrys

I’m organizing with others a free software conference with Richard Stallman at school. Lot of stuff to work on. Btw it’s open to everybody (though mostly interesting for people located east of France), check out this page[fr] for more for informations.

Also doing some server management and setup for a school event. Things got hard when we tried to get LTSP to work with exotic thin clients (IBM NetVista boxes) but now it’s all ready.

It’s also school project time with a C BigInteger manipulation library and a C# Scheme interpreter (nicknamed Béchamel) to code.

Finally, exams are near the corner again. Not too much of them this time but I better not screw them :-P .

Review time !

Category : C#, English, Google Summer of Code 2008, Mono, Programming

Now that Google Summer of Code is drawing to an end, it’s time to summarize what has been achieved during these two months. Before anything else I would to thank again everybody (and most noticeably the folks from Mono and Google) for this summer, it has been awesome :) !

Base API

First of all let’s see how we compare to Microsoft implementation with the output of corcompare. As you can see, a lot of the reported missings come from boilerplate attributes and such, so we are pretty much feature complete (except for PLinq).

In addition, I showed throughout this blog different examples coming from Microsoft CTP releases and running on our implementation. Each one showcased a different part of the available API with every time a good performance boost. In order we have parallelized ray tracer, various samples and PLinq stress test.

As remarked, PLinq lags a little behind and will requires a good deal of love to complete operators and get some internal reworking until it’s satisfactory.

Extensions

Even though most of the additions are still draft, we already have some interesting bits.

System.Threading namespace got two new classes : AtomicBoolean and InterlockedExtensions. AtomicBoolean is a simple wrapper that allows the same sort of operations provided by Interlocked.Exchange and Interlocked.CompareAndExchange but on a boolean type.

InterlockedExtensions only contains a work-in-progress MCAS instruction at the moment. MCAS (Multiple Compare-and-Swap) is a handy primitive that can be used in a lot of situation. It has a semantic close to the traditional Compare-and-swap instruction (Interlocked.CompareExchange on .NET) except that it works with multiple target. Think of it as a series of CAS executed atomically and failing if any of the CAS failed. A naive implementation could be :

public static bool MCas<t>(T[] locations, T[] expected, T[] newValues)
{
    for (int i = 0; i < locations.Length; i++)
        if (locations[i] != expected[i])
            return false;
    for (int i = 0; i < locations.Length; i++)
        locations[i] = newValues[i];

    return true;
}

It’s particularly useful with linked-list like data structures that maintain more than one reference to nodes.

In System.Threading.Collections we also have a first draft of a concurrent skip-list that can be used both as a standard list (like List) and as a hash map (like Dictionary). Skip list are very interesting structures that possess a lot of the advantages of balanced trees but with much less complexity. The current version uses fine-grained locking in some place but I plan to use the aforementioned MCAS primitive to remove them completely.

Finally we have the new System.Threading.Actors namespace which is also much like a playground to my experimentations for the moment. It will allows actors-based concurrency with .NET, a feature that is already present in concurrent languages like Erlang and Java-ish languages like Scala and Clojure. For the moment the model is close to a teared-down version of the one used in Scala but I have to look into Clojure’s model which seems lighter and doesn’t require pattern matching to be usable.

I still have more ideas that could be useful for concurrent programming like more lock-free data structures and a STM manager. STM, which stands for Software Transactional Memory, is a concept close to the transaction model you have in database for example. The principle is that, when you want to do a set of operations concurrently on shared data, you encapsulate them in a transaction where read/write are logged. If, in the middle of your operations, the data is changed by another thread you are able to ‘rollback’ the transaction and try again. If no thread got in the way the modification can be ‘committed’ to the shared data and thus be visible by other threads. In practice it should be possible to adopt a helping scheme where threads that are colliding with an existing transaction will help it finish before doing their own work.

In conclusion, concurrency is becoming a first-class citizens these days and just cannot be ignored anymore by programmers. Big actors like Microsoft are already pushing for that with ParallelFx and, probably, future evolution of C# itself (C#4 ?) so we mustn’t lag behind ;) .

Funny parallelism

4

Category : C#, English, Google Summer of Code 2008, Linux, Mono, Programming

For those who have some free leisure time this weekend and would like to explore a bit ParallelFx in a didactic manner I ported to Linux a Sudoku game coming from ParallelFx CTP samples which optionally uses ParallelFx to generate Sudoku grids.

For screenshot fans that’s how it looks like :

Sudoku main interface

Sudoku main interface

You can enable the use of ParallelFx via the two options « Use multiple processors to generate puzzles » and « Use speculative puzzle generation » found in the Game settings box :

Game settings window

Game settings window

The first one uses Parallel.For instead of a standard for to generate grids and the other allows the background generation of further grids using Future.

Current PLinq : mixed results

Category : C#, English, Google Summer of Code 2008, Mono, Programming

As described in the roadmap, PLinq implementation work has started a week ago. Some good bits are already there like what I consider to be the 3 most important operators (because most of the others can be expressed using these ones) : Select, Where and Aggregate. Following is a discussion on the specificities of PLinq, difficulties they raise and, at the end, some examples/benchmarks.

A thing to be distinguished in PLinq contrary to standard Linq is the way data is processed. With Linq you can pretty much assume that data is processed in order. For example in Mono, Select is implemented like this internally :

static IEnumerable CreateSelectIterator (IEnumerable source, Func selector)
{
	foreach (var element in source)
		yield return selector (element);
}

So if the IEnumerable‘s iterator is built to return data in order (like most standard collections in .NET) this order will be kept. More importantly, some Linq’s operators would lose their purpose if data wasn’t kept in order. Operators such as ElementAt, Skip, First are dependent on indexing and thus can’t work reliably on scrambled data.

Now the problem is that PLinq doesn’t follow this intuitive ordering property and that for a simple reason : performances. Indeed, just like Parallel.For split the domain to parallelize work, PLinq’s queries are run on workers which just take randomly data from the source enumerable and return them back without reordering them (or, put in another way, ordered by which took the less processing time). By doing things like this we can yield optimum performances. On the other hand, most applications rely somehow on data ordering, be it implicitly or with operators like OrderBy, which mean we can’t just ignore that aspect of Linq.

The current way PLinq allows this ordering part is to have two separate extension methods for turning IEnumerable into IParallelEnumerable. The first, AsParallel(), allows the engine to process the data as it wants, so a developer can’t expect in which order its initial data will be given back to him. The second one, AsOrdered(), enforces strict order preservation at each step of the processing and, in fact, allows the class of operators I cited before to work as expected. However if ordering is explicitly turned on, either by AsOrdered() or OrderBy operator, the performance penalty can be important, killing the interest to parallelize queries in the first place (which is why it’s turned off by default).

At the moment only the first mode (unordered processing) is supported in Mono’s PLinq. Adding support for the other mode will be quite challenging with operators like Where because they tend to create gaps in the source enumerable, gaps which are very hard to track in a non-deterministic environment like ParallelFx. I still have no clear idea on how to handle this in Mono’s ParallelFx without too much locking (if anyone has pointers go ahead ;) ) .

Finally to close this post on a lighter touch, I have some examples which works on current implementation. First of all simple query like this one :

ParallelEnumerable.Range(1, 2000)
	.Where(i => i % 10 == 0)
	.Select(i => i + 1)

Yield around 1.5x performance improvement when run on my dual-core system. The actual precise number vary importantly on external things like how much data is queried, how long an operator takes to process one item, how selective are the predicates used …

A second example which yield very little improvement at the moment (an example of why PLinq isn’t a panacea) is this sort-of adaptive image blurrer. It takes a pixel and filter the others according to how much close their color are to the selected pixel’s one. If a pixel doesn’t comply to this predicate it gets blurred with a bloom-like effect :


(In the screenshot blue and white colors are supposed to be more blurred and thus fade in the background)

The parallelized version of this program is actually slower by a little amount than its synchronous counterpart on Linux but a little quicker on Windows for about the same run-time of the synchronous version on both platforms.

More parallel benchmarking

1

Category : C#, English, Google Summer of Code 2008, Mono, Programming

This weekend I played with the Microsoft ParallelFx CTP’s samples, adapting and running them on current Mono ParallelFx implementation. I’m going to show you two of them and the results. I used the same computer than previously for running the samples.

Image Colorizer

The first one is an image colorizer. For the test, I used a .jpg file weighting 8,3 Mio and containing 13273085 pixels (4207×3155). The result of the processing is below :

For the same color region, the parallel-enabled processing was 1.77x faster than the synchronous one (parallel time : 4.3368363s, non-parallel time : 7.6746839s).

As can be seen on the two following cpu graphs, the parallel version take advantages of both core in my CPU :

(Parallel version running)


(Non-parallel version running)

Benchmark sample

The second sample is a benchmark program containing a bunch of test. Each test uses one of the available API in ParallelFx. For the run, I enabled the 3 following benchmarks :

  • Matrix multiplication : two random-filled matrix are multiplied together. Both are 500 by 500 matrix.
  • Tree sum : all nodes of a binary-tree are added together. I set tree depth to 24 and each node’s value was randomly chosen.
  • Searching : a random word is searched among all files present in a directory. I used a novel with each chapter split in a distinct .html file. The content of the directory was duplicated so that its total size was 12,6 Mio.

Each of these tests were run 7 times by the program and the time given back was the average of each run with the two extreme values being removed. Following are the results with, in parentheses, the ParallelFx API used by the test.

  • Matrix multiplication : 2,25x speedup (Parallel.For).
  • Tree sum : 1,58x speedup (Future).
  • Searching : 1,37x speedup (Parallel.ForEach).

Conclusion

So we can see that ParallelFx always provide a boost but that the overall improvement seems to depend on the component used and how it is used. It would be interesting to compare these results with the ones yielded by Microsoft implementation and by the Mono implementation but running on .NET. I will probably do that at the end of GSoC as a conclusion of the work put during the summer ;) .

In addition these samples showed some bugs. Indeed, I had to disable two tests in the benchmark sample because one of them returned different results between parallel/non-parallel version and the other was causing a stack overflow in Mono runtime (I still need to find out if it’s Mono fault or mine).

Parallel goodness

3

Category : C#, English, Google Summer of Code 2008, Mono, Programming

I think I pretty much arrived to the point where my work is starting to get useful. With yesterday basic Tasks implementation and today Parallel.For work I was able to run a parallelized version of Luke’s C# RayTracer. There was already a version using ParallelFx in the CTP samples but I preferred to made my custom, simpler one :

After some micro-benchmarking on my dual-core system the overall speed improvement is around 42%. Deductively it should be 50% but due to the work-stealing algorithm overhead it’s a little less. Still it’s a good improvement over the classic and synchronous version.

Like the folks at the ParallelFx blog did I added some color mask to show which thread was doing the work :

As we can see the work is almost homogeneous between each of the processor. The small delta is explained by the fact that while one of the thread is in the middle of queuing all the work, the other is already processing it.

The code for the parallel ray tracer is here : http://monoport.com/25053. If you want to test it out grab a copy of the source from here in addition (warning : it’s probably full of bugs and won’t work on your system ;) ). Run the program with --thread-mask for the color mask.

I'm a fanboy

Category : English, General, Mono

So guess my surprise when the postman brought me yesterday a box with « Novell » and « Promotional Tee-shirt » written on it :) :


(Hehe, I’m so grim faced on this one :P )

Now I can hack away with all the goodies and a Rupert boss watching over me :


(When will it be able to bark « wake up », « idiot », « stop coding with your foot » like a real boss ?)

All of this thanks to über shana (who deserved to be a cited ™ person ;) ).

Report for GSoC first two weeks

Category : C#, English, Google Summer of Code 2008, Mono, Programming

Even though GSoC coding period started two weeks ago, I totally forgot to blog about it so here it is.

First week started very slowly as I had a bunch of school exams to take care of. During that time I set up a Windows VM to play with the library and wrote a first batch of stubs with some simple logic (getter and setter for instance).

This week was much more interesting because of two things : the Google Code Camp at my school and the ParallelFX’s team releasing a new CTP (beta) of the library. GCC was an event organized at my school by sam and Dave. The concept was simple : gather some geeks together in a big place for the night, with pizza and big amount of coffee/DarkDog, courtesy of Google, to hack on what we wanted. The association of these two events resulted in some productive things done.

At first I had planned to start the Scheduler part but with the new release (and thus the API additions) I preferred to stub/code the new parts like System.Threading.Collections which contains lock-free implementations of stack and queue, SpinWait/SpinLock which, as the name implies, provide locking facilities based on processor spinning and LazyInit/WriteOnce which, respectively, allows lazy evaluation of expression and readonly-like variable (which can be be set only one time and anytime). I pushed these changes this morning, together with a start of unit tests, in the Mono SoC repository.

The next three weeks will be at a slow pace as I have to prepare school stuff and my final exams in two weeks.