Philippe Milot

This is a brain dump of principles I learned from Haskell that can be applied to most modern programming languages. The examples use PHP, and this is written from the point of view of a developer working at Turbulent.

1. Dealing with state is dealing with the Devil.

Avoid modifying (or even reading from) state unless it is absolutely necessary. A very large number of our bugs are due to undesired state modification. In our context, all of the following are part of the state:

All variables declared outside the current function scope (member variables, globals, and superglobals)
The cache
The database
The filesystem
Other OS functions (time, random numbers!!!)
(In the case of JavaScript) the DOM and other window functions

Things that are NOT part of the state:

Constants (of any scope)
Function parameters, as long as they are not passed by reference
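To see why even the clock counts as state, here is a minimal sketch comparing a function that reads the time itself with one that receives it as an argument. The names isExpiredImpure() and isExpired() are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical example: these names are illustrative only.

// Impure: reads the system clock, so the same input can give different results.
function isExpiredImpure($expires_at) {
	return time() >= $expires_at;
}

// Pure: the current time is an explicit argument.
function isExpired($expires_at, $now) {
	return $now >= $expires_at;
}

var_dump(isExpired(1000, 999));  // bool(false)
var_dump(isExpired(1000, 1000)); // bool(true)

// Only the caller touches the state (the clock).
$expired_now = isExpired(1000, time());
```

The pure version can be tested with any timestamp, including ones in the past or the future.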

Why it’s important: A function which never interacts with the state in any way is called “pure” or “deterministic” (at least in non-lazy languages). These are functions that simply compute a result from an EXPLICIT (passed via arguments) set of data. Pure functions have many interesting qualities:

They are easy to understand.
They can be easily verified with unit tests.
Once pure functions are verified, they usually do not change, so they seldom cause regression bugs.
They can be trivially parallelized in a language that supports it.
The result can be cached/memoized based on its input.
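As a rough sketch of the last point, a generic cache can be wrapped around any pure function. memoize() below is a hypothetical helper written for this illustration, not an existing Heap or PHP function, and it assumes the arguments can be passed to serialize().

```php
<?php
// Hypothetical helper: caches the results of a pure function by its arguments.
function memoize($fn) {
	$cache = array();
	return function () use ($fn, &$cache) {
		$args = func_get_args();
		// The serialized argument list is used as the cache key.
		$key = serialize($args);
		if (!array_key_exists($key, $cache)) {
			$cache[$key] = call_user_func_array($fn, $args);
		}
		return $cache[$key];
	};
}

// The counter is a deliberate impurity, used only to observe the cache.
$calls = 0;
$slow_square = function ($n) use (&$calls) {
	$calls++;
	return $n * $n;
};
$fast_square = memoize($slow_square);

$first = $fast_square(12);  // 144, computed
$second = $fast_square(12); // 144, served from the cache
var_dump($calls);           // int(1)
```

This only works because the wrapped function is pure: if the result depended on anything other than the arguments, the cached value could be stale.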

Of course, dealing with state is inevitable. When you HAVE to deal with state, make small functions that ONLY deal with state. Don’t mix computation code with code that deals with state. For example:

function isLocal() {
	$identity = Centrifuge::getAuthentication()->getIdentity();
	$city = $identity->city;
	return $city == Heap::config('myproject.city');
}

This is bad. isLocal() depends on the current session. Worse, it depends on a DB connection being available. It’s impossible to unit test. It’s impossible to re-use in a scripting context.

function isLocalAccount(HeapAccount $identity) {
	return isLocalCity($identity->city);
}
function isLocalCity($city) {
	return $city == Heap::config('myproject.city');
}

Both of these functions are pure ¹. The computation is encapsulated in isLocalCity(), which is so generic that it can be re-used in any context.

Another example, from CEC:

function validateGroupMembership($group_id, $membership_status=NORMAL) {
	//get current identity
	$ident = Centrifuge::getAuthentication()->getIdentity();
	//load group with all members
	$group = CecGroup::loadWithMembers($group_id);
	//find current identity in list of members
	$found_member = null;
	foreach ($group->members as $member) {
		if ($member->id == $ident->id) {
			$found_member = $member;
			break;
		}
	}
	if (!$found_member) return new HeapError('ErrNotMember');
	//Check membership status
	if ($found_member->get('membership.status') != $membership_status) {
		return new HeapError('ErrBadMembership');
	}
	return new HeapSuccess();
}

What if I need to check two memberships? What if I already loaded the group? This is wasteful. We need to remove extra dependencies.

function validateGroupMembership(	HeapAccount $ident, 
									CecGroup $group, 
									$membership_status ) {
	//find $ident in list of members
	$found_member = null;
	foreach ($group->members as $member) {
		if ($member->id == $ident->id) {
			$found_member = $member;
			break;
		}
	}
	if (!$found_member) return new HeapError('ErrNotMember');
	//Check membership status
	if ($found_member->get('membership.status') != $membership_status) {
		return new HeapError('ErrBadMembership');
	}
	return new HeapSuccess();
}

Better, but it’s not clear that you need to load CecGroup with all members for this function to work. And what if I need to find the member for a reason other than to check membership status? Break it down further:

function findMember($member_id, array $members) {
	foreach ($members as $member) {
		if ($member->id == $member_id) return $member;
	}
	return null;
}

//$member may be null (when findMember() finds nothing), so it cannot be type-hinted as HeapAccount
function hasMembership($member, $membership_status) {
	if (is_null($member)) return false;
	return $member->get('membership.status') == $membership_status;
}

//This is just a convenience function for the most common use case now
function validateMembership(	HeapAccount $identity, 
								CecGroup $group, 
								$membership_status) {
	return hasMembership(findMember($identity->id, $group->members), $membership_status);
}
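Once broken down this way, the two small functions can be checked without a session or a database. The sketch below assumes a hypothetical FakeMember stub standing in for HeapAccount, and repeats findMember() and hasMembership() without type hints so that the stub can be passed in.

```php
<?php
// Hypothetical stand-in for HeapAccount, for testing only.
class FakeMember {
	public $id;
	private $props;
	public function __construct($id, array $props) {
		$this->id = $id;
		$this->props = $props;
	}
	public function get($key) {
		return isset($this->props[$key]) ? $this->props[$key] : null;
	}
}

// Copies of the functions above, without the HeapAccount type hint.
function findMember($member_id, array $members) {
	foreach ($members as $member) {
		if ($member->id == $member_id) return $member;
	}
	return null;
}
function hasMembership($member, $membership_status) {
	if (is_null($member)) return false;
	return $member->get('membership.status') == $membership_status;
}

// The member list is built once and reused for several checks.
$members = array(
	new FakeMember(1, array('membership.status' => 'NORMAL')),
	new FakeMember(2, array('membership.status' => 'ADMIN')),
);

var_dump(hasMembership(findMember(1, $members), 'NORMAL')); // bool(true)
var_dump(hasMembership(findMember(2, $members), 'NORMAL')); // bool(false)
var_dump(hasMembership(findMember(3, $members), 'NORMAL')); // bool(false), not a member
```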

In general, a typical PHP Controller function will want to:

Call functions which retrieve data
Compute information from data
Render a template based on computed data

Make sure to encapsulate all the logic of the second step (computing information from data) into pure functions and your code will become more reliable.

JavaScript is especially tricky because often, there is very little to compute deterministically; most logic is a reaction based on external state. That’s why JavaScript code tends to get buggy very quickly. Is it hopeless? NO! You just have to think harder about what is being computed vs. what is I/O.

Functions with side effects should ONLY produce that side effect, and bear a name that indicates it.

We usually do this with a verb, like setXXX() or printXXX().

Here is another example.

function getCoupon() {
	$session = Centrifuge::getSession();
	$code = $session->get('coupon');
	try {
		$coupon = Heap::construct('TyDiscount')->loadBy('code', $code);
	}
	catch (CFUObjectNotFoundException $e) {
		$session->clear('coupon');
	}
	return $coupon;
}

This is worse than the isLocal() function of the previous example. The reason is that in addition to depending on outside state, the function has side-effects. You should always encapsulate side effects in a separate function.

function loadCoupon(ICFUSession $session) {
	$code = $session->get('coupon');
	//Let client handle case when coupon doesn't exist
	return Heap::construct('TyDiscount')->loadBy('code', $code);
}
function clearCoupon(ICFUSession $session) {
	$session->clear('coupon');
}

We are attempting to remove more side-effects from Centrifuge. In the future, loadXXX() will only return data, and not modify anything else. This is to avoid bugs like:

$check = HeapAccount::checkPassword(
			$params['password'], 
			$account->loadProperty('password'));
if ($check) {
	Centrifuge::getAuthentication()->setIdentity($account);
	return new HeapSuccess($account); //OOPS! just exposed the password hash!
}

Program state also includes your class-level variables

It’s easy to avoid global variables, and it’s relatively easy to design a good object-oriented class that doesn’t touch external data. In OOP design, we were taught that encapsulating variables in classes is good design. In FP design, that’s not good enough, because methods lose their purity (and therefore lose all benefits outlined above).

class XmlParser {
	private $xml;
	public function __construct($xml) {
		$this->xml = $xml;
	}
	public function setXML($xml) {
		$this->xml = $xml;
	}
	public function getXML() {
		return $this->xml;
	}
	public function parse() {
		// .. custom parsing algorithm for XML ..
		return $parsed_data;
	}
}

This is considered pretty good OOP design. $xml is not public; its internal representation can change without affecting client code. This class can still cause state-related bugs, especially in multithreaded environments.

class XmlParserThread extends Thread {
	private $parser;
	private $output;
	public function __construct(XmlParser $parser) {
		$this->parser = $parser;
	}
	public function run() {
		$this->output = $this->parser->parse();
	}
	public function getResult() {
		return $this->output;
	}
}
$xmlp = new XmlParser($xml1);
//spin new thread which works
$thread1 = new XmlParserThread($xmlp);
$thread1->start();
$xmlp->setXML($xml2);
$thread2 = new XmlParserThread($xmlp);
$thread2->start();
$thread1->join();
$thread2->join();
$parsed1 = $thread1->getResult();
$parsed2 = $thread2->getResult();

I don’t even want to know what happened here. Two threads hold the same reference to XmlParser. There is a race condition between the first parse and the setXML(). Basically, we don’t know if $parsed1 holds the result of parsing $xml1 or $xml2.

In this case, we can solve the problem by removing the setXML() method, making XmlParser immutable. This is the technique we use in CFUAspect. There is no setXXX method. The only methods that affect the internal state of the object will return a COPY of the object with the modified data. If anyone else holds a reference to that object, that code will not suddenly break.

//Imagine if CFUAspect was mutable…
$aspect = CFUAspect::standard();
$obj1 = Heap::construct('HeapAtom')->load(1, $aspect);

//Do something else…

$aspect->addRelation('media', CFUAspect::standard());
$obj2 = Heap::construct('HeapAtom')->load(2, $aspect);

//Do something else…

$obj1->reload(); //AAHH! Reloaded with media!

This kind of problem can never happen because you cannot modify the state of a CFUAspect after it has been constructed. In other words, all class variables can be considered constants from the time of object creation. ²
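The "return a modified copy" technique can be sketched in a few lines. ImmutableAspect and withRelation() are hypothetical names chosen for this illustration; this is not the real CFUAspect implementation.

```php
<?php
// Hypothetical class illustrating the technique; NOT the real CFUAspect.
class ImmutableAspect {
	private $relations;
	public function __construct(array $relations = array()) {
		$this->relations = $relations;
	}
	// Returns a modified COPY; $this is left untouched.
	public function withRelation($name) {
		$copy = clone $this;
		$copy->relations[] = $name;
		return $copy;
	}
	public function getRelations() {
		return $this->relations;
	}
}

$base = new ImmutableAspect();
$with_media = $base->withRelation('media');

var_dump($base->getRelations() === array());              // bool(true)
var_dump($with_media->getRelations() === array('media')); // bool(true)
```

Any code still holding $base keeps seeing exactly the object it was given.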

We are looking into making CFUModel and CFUQuery immutable in the next Heap version as well.

One last thing: sometimes, OOP complicates things needlessly. There is no need for XmlParser to even exist. It can be replaced by:

function parseXml($xml) {
	// .. custom parsing algorithm for XML ..
	return $parsed_data;
}

You don’t always need to wrap everything into a class.

2. Functions should be SHORT and do only ONE thing.

You have heard this countless times before, but in FP it is orders of magnitude more important to observe this rule. This is because in FP, the benefits of writing short, reusable and pure functions increase exponentially, mainly due to higher-order functions (as we will see in section 4).

3. Use well-defined types whenever possible.

If your language has a type system, you need to leverage it to its maximum. Sadly, the three main languages that we use here have relatively poor type systems.

ActionScript: built-in, enable strict-mode typechecking in your compiler.
PHP: Type-hinting; only issues catchable errors, incurs runtime cost, and does not support some built-in type primitives like string (what?!)
JavaScript: OUCH! NO support! Maybe we should seriously consider switching to TypeScript for JS-heavy applications?

NULL values should be a separate type.

Don’t produce a NULL value unless you absolutely must. Have another type that indicates success/failure of an operation, or throw if it’s an unexpected error.

The idea is that the user of a function immediately knows if a function can fail. In Haskell, there are two special “wrapper” types which encapsulate this functionality. They are Maybe and Either.

Maybe T, where T can be any other type, indicates that any object of this type can be either Just T or Nothing. In other words, an Int value can NEVER be null, but Maybe Int can. No type is nullable in Haskell unless it is wrapped in Maybe.

Either T1 T2, where T1 and T2 can be any other type, indicates that any object of this type can be either Left T1 or Right T2. In other words, a function returning Either String Int means that the function returns either Left String (by convention, in case of an error) or Right Int (in case of a success).

Of course, our languages don’t support these facilities, so what can we do? Well, in Heap we already have HeapSuccess and HeapError, so that’s a start.

It’s VERY annoying at first, but it saves you time in the end. If you MUST return null (I’d like to see why), specify it clearly in the function documentation.

Principle: If you have 20 functions taking a parameter of type X, and returning a result of type X, do you want to handle the case when X is null in every function? NO! So your functions should NEVER return NULL (so they can be composed together, as we will see in Section 4).
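A rough PHP approximation of Either is a pair of small result classes. The Success and Failure classes and the parseNonNegativeInt() function below are hypothetical names for this sketch; they are not the real HeapSuccess and HeapError.

```php
<?php
// Hypothetical types, loosely modelled on Haskell's Either.
// They are NOT the real HeapSuccess / HeapError classes.
class Failure {
	public $error;
	public function __construct($error) { $this->error = $error; }
}
class Success {
	public $value;
	public function __construct($value) { $this->value = $value; }
}

// Returns Success (wrapping an int) or Failure (wrapping an error code), never null.
function parseNonNegativeInt($input) {
	if (!is_string($input) || preg_match('/\A[0-9]+\z/', $input) !== 1) {
		return new Failure('ErrNotANumber');
	}
	return new Success((int) $input);
}

$good = parseNonNegativeInt('42');
if ($good instanceof Success) {
	echo $good->value + 1, "\n"; // 43
}

$bad = parseNonNegativeInt('abc');
if ($bad instanceof Failure) {
	echo $bad->error, "\n"; // ErrNotANumber
}
```

The caller has to look at the type of the result before touching the value, which is the point: the possibility of failure is visible at the call site.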

4. Higher-order functions are your friends.

Higher-order functions are functions that either accept other functions as parameters, or return functions themselves, or both.

Traversing data structures is an algorithm completely separated from manipulation of the elements in the structure. Higher-order functions help encapsulate this traversal into a separate function.

Use Map/Reduce to process arrays and lists.

Every modern language has this built-in in some form or another. EVERY pure foreach loop you write can be rewritten this way, and many non-pure foreach also.

//Gives array('banane', 'pomme')
$lowered = array_map('strtolower', array('BaNaNe', 'POMME')); 

function concatpipe($result, $item) {
	return $result . "|" . $item . "|";
}

// Gives the string "|banane||pomme|"
$stringified = array_reduce($lowered, 'concatpipe', '');

function countchars($result, $item) {
	return $result + strlen($item);
}

// gives 11
$strlen = array_reduce($lowered, 'countchars', 0);

// Wait... don't reinvent the wheel.
// same thing, 11
$strlen = array_sum(array_map('strlen', $lowered));
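To illustrate the claim about foreach loops, here is a small sketch that rewrites a filter-and-transform loop with array_filter and array_map. The data and variable names are made up for the example.

```php
<?php
$words = array('banane', 'kiwi', 'pomme', 'fig');

// Loop version: traversal and element logic are mixed together.
$loop_result = array();
foreach ($words as $word) {
	if (strlen($word) > 4) {
		$loop_result[] = strtoupper($word);
	}
}

// Same computation: array_filter keeps matching items, array_map transforms them.
$is_long = function ($word) { return strlen($word) > 4; };
// array_values() re-indexes, because array_filter preserves the original keys.
$mapped_result = array_map('strtoupper', array_values(array_filter($words, $is_long)));

var_dump($loop_result === $mapped_result); // bool(true)
```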

Use partial application to create new functions that fit your API better.

Sometimes there are functions that you want to use, but they don’t “fit” with array_map or array_reduce, because they require more than one parameter. Partial application allows you to make them fit.

//The definition for 'explode' is:
//function explode($delimiter, $string) {}

$explode_on_comma = partial('explode', ',');

//Gives array('a','b','c')
$arr = $explode_on_comma('a,b,c');

//Gives array(array('tag', 'banane'), array('tag', 'pomme'));
array_map($explode_on_comma, array('tag,banane', 'tag,pomme'));
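partial() is not a PHP built-in. A minimal sketch of such a helper, assuming it binds the leading arguments and returns a closure expecting the remaining ones, could look like this (the actual helper in the codebase may differ):

```php
<?php
// Sketch of a partial() helper; an assumption, not the actual library code.
function partial($fn /* , $bound1, $bound2, ... */) {
	// Everything after the function name is bound now.
	$bound = array_slice(func_get_args(), 1);
	return function () use ($fn, $bound) {
		// Bound arguments come first, then the arguments of this call.
		return call_user_func_array($fn, array_merge($bound, func_get_args()));
	};
}

$explode_on_comma = partial('explode', ',');
$parts = $explode_on_comma('a,b,c');
var_dump($parts === array('a', 'b', 'c')); // bool(true)
```

Because the bound arguments are always placed first, this sketch only helps when the parameters you want to fix come before the data parameter.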

Taking into account partial application can help in deciding the order of your functions’ parameters. You should differentiate the arguments used to parametrize the function from the data that the function operates on. The data parameters should come last.

//Bad PHP! haystack is the data, it should come last
function strpos($haystack, $needle) {…}

//This is much better…
function _strpos($needle, $haystack) {
	return strpos($haystack, $needle);
}

//...because I can now do this:
$find_comma = partial('_strpos', ',');
$find_space = partial('_strpos', ' ');

Use combinators / function composition to create new functions.

What happens when you want to apply multiple functions to elements of an array?

$arr_tmp = array_map('strtolower', $_POST);
$arr_final = array_map('CFUStringHelper::purify', $arr_tmp);

That loops over the array twice. Not optimal. We need a new function.

function lowerthenpurify($elem) {
	return CFUStringHelper::purify(strtolower($elem));
}
$arr_final = array_map('lowerthenpurify', $_POST);

Better (only one loop), but very wordy and you may not want to name a function for something you’re only going to do once. Notice how the output of strtolower is fed directly as input to purify(). This is a common pattern known as function composition.

$f = compose('strtolower', 'CFUStringHelper::purify');
$arr_final = array_map($f, $_POST);

Now that’s easy to read! You can compose ad infinitum, so this really becomes useful when doing multiple transformations on the same object.

// This long thing…
function f($elem) {
	$lowered = strtolower($elem);
	$purified = CFUStringHelper::purify($lowered);
	$len = strlen($purified);
	return is_even($len);
}
$is_purified_len_even = f($some_string);

// VS…
$f = compose('strtolower','CFUStringHelper::purify','strlen','is_even');
$is_purified_len_even = $f($some_string);

Composition is guaranteed to work as long as the output type of the preceding function matches the expected type of the following function ³. Example:

String -> strtolower -> String -> purify() -> String
String -> strlen -> Int -> is_even -> Boolean

We already attempt this pattern in two different places in CFU (validator and aspect filters), and each has its own implementation! We need to standardize on compose().
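compose() is not a PHP built-in either. The sketch below assumes left-to-right composition, matching the examples above, where the first function receives all the call arguments and each later function receives the previous result. It is an illustration, not the actual CFU helper.

```php
<?php
// Sketch of a left-to-right compose(); assumes at least one function is given.
function compose(/* $f1, $f2, ... */) {
	$fns = func_get_args();
	return function () use ($fns) {
		// The first function receives all arguments of the call...
		$result = call_user_func_array($fns[0], func_get_args());
		// ...and each later function receives the previous result.
		foreach (array_slice($fns, 1) as $fn) {
			$result = call_user_func($fn, $result);
		}
		return $result;
	};
}

$trimmed_length = compose('trim', 'strtoupper', 'strlen');
var_dump($trimmed_length('  pomme ')); // int(5)
```

Letting the first function take several arguments is a design choice of this sketch; it is what allows a two-parameter function to start a chain.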

Use partial application to compose multi-argument functions

Function composition works for functions taking one parameter only. Use partial application to convert any function to a function of one parameter.

$explode_on_comma = partial("explode", ',');
$explode_and_sort = compose(

	//First explode on comma, which gives an array...
	$explode_on_comma,
	
	//Cast every member of the array to an int
	partial("array_map", "intval"),
	
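	//Sort members of the array in ascending order.
	//array_sort is assumed to be a helper that returns a sorted copy;
	//PHP's built-in sort() sorts in place and returns a boolean, so it cannot be composed directly.
	"array_sort");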

$arr = $explode_and_sort("3.5,2.75,1.0"); // gives array(1,2,3);

Very expressive!! Custom, project-specific code is 3 statements long composed from short re-usable functions.
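One caveat when composing PHP built-ins: sort() modifies its argument in place and returns a boolean, so it does not fit in a chain of functions that each return their result. A minimal sketch of a wrapper that returns a sorted copy instead (sorted_copy() is a hypothetical name):

```php
<?php
// Hypothetical wrapper: returns a sorted copy and leaves the input untouched.
function sorted_copy(array $items) {
	// Arrays are passed by value, so sort() only reorders the local copy.
	sort($items);
	return $items;
}

$original = array(3, 1, 2);
$result = sorted_copy($original);

var_dump($result === array(1, 2, 3));   // bool(true)
var_dump($original === array(3, 1, 2)); // bool(true), input unchanged
```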

5. Applied example

This is a typical problematic function in our controllers:

class BaseController
{
	protected $Section = null;
	protected $Channel = null;
	public function getChannel() {
		$this->Channel = Heap::construct('HeapChannel')->loadBy($this->Section);
		return $this->Channel;
	}
	
	public function populateFeatured($type) {
		$q = Heap::construct('HeapAtom')->getQuery();
		$q->where(array(
			'type' => $type, 
			'channel_id' => $this->Channel->id));
		$res = $q->toResultset();
		foreach($res->resultset as &$obj) {
			//Need to fetch a custom model related to this atom
			$obj->set('related_mymodel', Heap::construct('MyModel')->loadBy('atom_id', $obj->id));
		}
		$this->set('featured', $res);
		return $res;
	}
}

class MainController extends BaseController
{
	protected $Section = 'Home';
	public function index() {
		$channel = $this->getChannel();
		$this->populateFeatured('post', 5);
		return $this->render('home.tpl');
	}
	
	public function credits() {
		//Oops, forgot to load channel! 
		$res = $this->populateFeatured('post', 4); //Oops, we loaded MyModel for no good reason!
		//Here we need the # of comments for each post
		foreach ($res->resultset as &$obj) {
			$q = Heap::construct('HeapReaction')->getQuery();
			$q->where('subject_id', '=', $obj->id);
			$q->where('subject_class', '=', 'HeapAtom');
			$obj->set('reaction_count', $q->toCount());
		}
		$this->set('featured', $res);
		return $this->render('credits.tpl');
	}
}

Here’s how it would look, with all principles taken into account

//No need for a base class, to make these functions as re-usable as possible
function getChannel($section) {
	$c = Heap::construct('HeapChannel')->loadBy($section);
	return $c;
}
function getFeaturedQuery($type, HeapChannel $channel) {
	$q = Heap::construct('HeapAtom')->getQuery();
	$q->where(array(
		'type' => $type, 
		'channel_id' => $channel->id));
	return $q;
}
function addMyModel($obj) {
	$obj->set('related_mymodel', Heap::construct('MyModel')->loadBy('atom_id', $obj->id));
	return $obj;
}
function addReactionCount($obj) {
	$q = Heap::construct('HeapReaction')->getQuery();
	$q->where('subject_id', '=', $obj->id);
	$q->where('subject_class', '=', 'HeapAtom');
	$obj->set('reaction_count', $q->toCount());
	return $obj;
}

class MyController extends HeapController
{
	public function index() {
		$fq = compose(
			'getFeaturedQuery', 
			'CFUQuery::toObjects', 
			partial("CFUResultSet::map", "addMyModel"));
		$featured = $fq('post', getChannel('Home'));
		$this->set('featured', $featured);
		
		return $this->render('home.tpl');
	}
	public function credits() {
		$fq = compose(
			'getFeaturedQuery', 
			'CFUQuery::toObjects', 
			partial("CFUResultSet::map", "addReactionCount"));
		$featured = $fq('post', getChannel('Credits'));
		$this->set('featured', $featured);
		
		return $this->render('credits.tpl');
	}
}

Disadvantages to the FP method

More time-consuming for quick-and-dirty projects unless you are really experienced
Forces you to name more things, which is a difficult thing to do
In PHP, there is a performance cost to partial() and compose(). It’s not enough to become noticeable in normal website routes, but when mapping over large arrays (such as in scripts), it’s better to use native, named functions.

Things I will still miss from Haskell:

Fast performance of compiled language.
Compiler strongly enforces all the principles above (you CAN’T not follow them) through Types!
Type information is preserved in function composition/partial application.
Laziness (values are only computed on-demand).

¹ For the sake of argument, we consider the Heap config to be a series of immutable constants, and therefore not part of the state.
² Note that aspects are among the most stable and bug-free parts of CFU3 (coincidence?).
³ Because functions do not carry type information in PHP, the interpreter will not help you if you try to compose functions with non-matching types; it will only give an error (which may be fatal or not) when you try to execute the composed function.