Real World F# Programming Part 2: Types
Ran into a situation last week that showed some more of the differences facing OO programmers moving to F#.
So I’ve got two directories. The program’s job is to take the files from one directory, do some stuff, then put the new file into the destination directory. This is a fairly common pattern.
To kick things off, I find the files. Then I try to figure out which files are in the source directory but not in the destination directory. Those are the ones I need to process. The code goes something like this:
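What follows is a minimal sketch of that kind of code, not the original listing. It assumes files are matched by name, the directory names `source` and `destination` are placeholders, and `filesIn` and `findFilesToProcess` are hypothetical helpers. The variable names are the ones used in the discussion below.

```fsharp
open System.IO

// Hypothetical helper: list the files in a directory, or nothing if it is missing.
let filesIn (dir: string) : FileInfo[] =
    if Directory.Exists dir then DirectoryInfo(dir).GetFiles() else [||]

// Keeps each candidate for which a file with the same name
// exists among the already-processed files.
let findFilesToProcess (filesThatMightNeedProcessing: FileInfo[]) (filesAlreadyProcessed: FileInfo[]) =
    filesThatMightNeedProcessing
    |> Array.filter (fun candidate ->
        filesAlreadyProcessed
        |> Array.exists (fun processed -> processed.Name = candidate.Name))

// "source" and "destination" are placeholder directory names.
let filesToProcess = findFilesToProcess (filesIn "source") (filesIn "destination")
printfn "%d files to process" filesToProcess.Length
```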
So I plopped this code into a couple of apps I’ll code later, then I went to work on something else for a while. Since it’s all live — but not necessarily visible to anybody — a few days later I took a look to see if the app thought it had any files to process.
It did not.
Now, of course, I can see that my Array.filter is actually backwards. I want to take the filesThatMightNeedProcessing and eliminate the filesAlreadyProcessed. What’s remaining are the filesToProcess. Instead, I check to see if the second set exists in the first. It does not, so the program never thinks there is anything to do. Instead of Array.exists, I really need something like Array.doesNotExist.
So is this a bug?
I’m not trying to be cute here, but I think that’s a matter of opinion. It’s like writing SQL. I described a transform. The computer ran it correctly. Did I describe the correct transform? Nope. But the code itself is acting correctly. I simply don’t know how many files might need processing. There is no way to add a test in here. Tests, in this case, would exist at the Operating System/DevOps level. So let’s put off testing for a bit, because it shouldn’t happen here. If your description of a transform is incorrect, it’s just incorrect.
So I need to take one array and “subtract” out another array — take all the items in the first array and remove those items that exist in the second array. Is there something called Array.doesNotExist?
No there is not.
Ok. What kind of array do I have? Intellisense tells me it’s an array of System.IO.FileInfo.
My first thought: this cannot be something that nobody else has seen. I’m not the first person doing this. This is just basic set operations. So I start googling. After a while, I come across this beautiful class called, oddly enough, “Set”. It’s in Microsoft.FSharp.Collections. Damn, it’s a sweet-looking class. It’s got superset, subset, contains, difference (which I want). It’s got everything.
So, being the “hack it until it works” kind of guy that I am, I look at what I have: an array of these FileInfo things. I look at what I want: a set. Can’t I just pipe one to the other? Something like this?
What the hell? What’s this thing about System.IComparable?
In order for the Set module to work correctly, it needs to be able to compare items inside your set. How can it tell if one thing equals another? All it has is a big bag of whatever you threw in there. Could be UI elements, like buttons. How would you sort buttons? By color? By size? There’s no right way. Integers, sure. Strings? Easy. But complex objects, like FileInfo?
Not so much.
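To make the constraint concrete, here is a small sketch. The compiler rejects a set of FileInfo because that type does not support comparison, while plain strings do. Building a set from the file names is one possible workaround, shown only as an illustration; `namesOf` and the sample paths are hypothetical.

```fsharp
open System.IO

// Set.ofArray needs elements that support comparison. FileInfo does not,
// so the next line is left commented out: it fails to compile.
// let broken = Set.ofArray [| FileInfo("source/a.txt") |]

// Strings do support comparison, so one workaround is to compare by name.
let namesOf (files: FileInfo[]) : Set<string> =
    files |> Array.map (fun f -> f.Name) |> Set.ofArray

// Hypothetical sample data; the paths do not need to exist on disk.
let sourceFiles = [| FileInfo("source/a.txt"); FileInfo("source/b.txt") |]
let processedFiles = [| FileInfo("destination/a.txt") |]

// Names present in the source but not in the destination.
let namesToProcess = Set.difference (namesOf sourceFiles) (namesOf processedFiles)
printfn "%A" namesToProcess
```

The catch is that this hands back names rather than the FileInfo items themselves.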
As it turns out, this is a common pattern. In the OO world, we start with creating a type, say a Plain Old Java Object, or POJO. It’s got a constructor, 2 or 3 private members, some getters and setters, and maybe a few methods. Life is good.
But then we want to do things. Bad things. Things it was never meant to do. Things involving other libraries. Things like serialize our object, compare it to others, add two objects together. It’s not enough that we have a new type. We need to start fleshing out that type by supporting all sorts of standard methods (interfaces). If we support the right interface, our object will work magically with people who write libraries to do things we want.
Welcome to life in the world of I-want-to-make-a-new-type. Remember that class you had with three fields? Say you want to serialize it? You add in the interface IPersist. Now you have a couple more methods to fill out. Have some resources that must be cleaned up? Gotta add in IDisposable. Now you have another method to complete. Handling a list of something somebody else might want to walk? Plop in IEnumerable. Now you have even more methods to complete.
This is life in OO-land and frankly, I like it. There’s nothing as enjoyable as creating a new type and then fleshing it all out with the things needed to make it part of the ecosystem. Copy constructors, operator overloads, implicit conversion constructors. I can spend, and have spent, all day or a couple of days creating a fully-formed, beautiful new type for the world, as good as any of the CLR types. Rock solid stuff.
Funny thing, I’m not actually solving anybody’s problem while I’m doing this. I’m just fulfilling my own personal need to create order in the world. Might be nice for a hobby, but not so much when I’m supposed to stay focused on value.
There’s also the issue of dependencies, which is the basis for much of the pain and suffering in the OO world. Now that my simple POJO has a dozen interfaces and 35 methods, what the hell is going on with the class when I create method Foo and start calling it? Now I’ve got all these new internal fields like isDirty or versionNum that are connected to everything else.
You make complex objects, you gotta do TDD. Otherwise, you’re just playing with fire. Try putting a dozen or so of these things together. It works this time? Yay! Will it work next time? Who knows?
This is the bad part of OO — complex, hidden interdependencies that cause the code to be quite readable but the state of the system completely unknown to a maintenance programmer. (Ever go down 12 levels in an object graph while debugging to figure out what state something is in? Fun times.)
So my OO training, my instinct, and libraries themselves, they all want me to create my own type and start globbing stuff on there. This is simply the way things are done.
DO NOT DO THIS.
Instead, FP asks us a couple of questions: First, do I really need to change my data structures? Because that’s going to be painful.
No. Files are put into directories based on filename. You can’t have two files in the same directory with the same name. So I already have the data I need to sort things out. Just can’t figure out how to get to it.
Second: What is the simplest function I can write to get what I need?
Beats me, FP. Why do you keep asking questions? Look, I need to take what I have and only get part of the list out.
I spent a good hour thrashing here. You get used to this. It’s a quiet time. A time of introspection. I stared out the window at a dog licking its butt. I wanted to go online and find somebody who was wrong and get into a flame war, but I resisted. At some point I may have started drooling.
In OO you’re always figuring out where things go and wiring stuff up. Damn you’re a busy little beaver! Stuff has to go places! Once you do all the structuring and wiring? The code itself is usually pretty simple.
In FP you laser directly in on the hard part: the code needed to fix the problem. Aside from making sure you have the data you need, the hell with structure. That’s for refactoring. But this means that all the parts are there at one time. Let me repeat that. THE ENTIRE PROBLEM IS THERE AT ONE TIME. This is a different kind of feeling for an OO guy used to everything being in its place. You have to think in terms of data structure and function structure at the same time. For the first few months, I told folks I felt like I was carrying a linker around in my head. (I still do at times)
Eventually I was reduced to muttering to myself “Need to break up the set. Need to break up the set.”
So I do what I always do when I’m sitting there with a dumb look on my face and Google has failed me: I start bringing up library classes, hitting the “dot” button, and having the IDE show me what each class can do.
I am not proud of my skills. But they suffice.
Hey look, the Array module also has Array.partition, which splits up an array. Isn’t that what I want? I need to split up an array into two parts: the part I want and the part I do not want. I could have two loops. On the outside loop, I’ll spin through all the files in the input directory. In the inside loop, I’ll see if there’s already a file with the same name in the output directory. The Array.partition function will split my array in two pieces. I only care about those that exist in the input but not the output. Something like this:
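A minimal sketch of that idea, assuming files are matched by name. The function name `splitByProcessed` and the sample paths are hypothetical, and this is not the original listing.

```fsharp
open System.IO

// Splits the source files into (still to process, already processed),
// matching files across the two directories by name.
let splitByProcessed (filesThatMightNeedProcessing: FileInfo[]) (filesAlreadyProcessed: FileInfo[]) =
    filesThatMightNeedProcessing
    |> Array.partition (fun candidate ->
        filesAlreadyProcessed
        |> Array.exists (fun processed -> processed.Name = candidate.Name)
        |> not)

// Hypothetical sample data; the paths do not need to exist on disk.
let sourceFiles = [| FileInfo("source/a.txt"); FileInfo("source/b.txt") |]
let processedFiles = [| FileInfo("destination/a.txt") |]

// Only the first half of the split matters here.
let filesToProcess, _ = splitByProcessed sourceFiles processedFiles
```

Array.partition returns the items that satisfy the predicate first, so the predicate is written as "no file with this name exists in the destination".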
Well I’ll be danged. Freaking A. That’s what I needed all along. I didn’t need a new class and a big honking type system hooked into it. I just needed to describe what I wanted using the stuff I already had available. My instinct to set up structures and start wiring stuff would have led me to OO/FP interop hell. Let’s not go there.
So if I’m not chasing things down to nail them in exactly one spot, how much should I “clean up”, anyway?
First, there’s Don’t Repeat Yourself, or DRY. Everything you write should be functionally decomposed. There’s no free ride here. The real question is not whether to code it correctly, it’s how much to genericize it. All those good programming skills? They don’t go anywhere. In fact, your coding skills are going to get a great workout with FP.
I have three levels of re-use.
First, I’ll factor something out into a local structure/function in the main file I’m working with. I’ll use it there for some time — at least until I’m happy it can handle different callers under different conditions. (Remember it’s pure FP. It’s just describing a transform. Complexity is bare bones here. If you’re factoring out 50-line functions, you’re probably doing something wrong.)
Second, once I’m happy I might use it elsewhere, and it needs more maturing, I’ll move it up to my shared “Utils” module, which lives across all my projects. Then it gets pounded on a lot more, usually telling me things like I should name my parameters better, or handle weird OS error conditions in a reasonable way callers would expect. (You get a very nuanced view of errors as an FP programmer. It’s not black and white.)
Finally, I’ll attach it to a type somewhere. Would that be some kind of special FileInfo subtype that I created to do set operations?
As I mature the function, it becomes generic, so I end up with something that subtracts one kind of thing from another. In fact, let’s do that now, at least locally. That’s an easy refactor. I just need a source array, an array to subtract, and a function that can tell me which items match.
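Here is one way that refactor might look, as a sketch rather than the original code. The name `subtractArrays` and the sample data are hypothetical. The parameters follow the description above: a source array, an array to subtract, and a function that says whether two items match.

```fsharp
open System.IO

// Returns the items of `source` that have no match in `toSubtract`.
// No type annotations: the compiler infers that the two arrays may hold
// different kinds of items, as long as `matches` can compare them.
let subtractArrays source toSubtract matches =
    source
    |> Array.partition (fun item ->
        toSubtract |> Array.exists (fun other -> matches item other))
    |> snd

// Hypothetical usage: FileInfo items on one side, plain file names on the other.
let sourceFiles = [| FileInfo("source/a.txt"); FileInfo("source/b.txt") |]
let processedNames = [| "a.txt" |]

let remaining =
    subtractArrays sourceFiles processedNames (fun (file: FileInfo) name -> file.Name = name)
```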
Note the lack of types. Do I care what kind of array either the source or the one to subtract is? No. I do not. All I care is if I can distinguish the items in them. Hell, for all I care one array can be that System.IO.FileInfo thing, the other array can be filenames. What does it matter to the problem I’m solving?
What’s that sound? It’s the sound of some other FP guy busy at his computer, sending me a comment about how you could actually do what I wanted in 1 line of code. That’s fine. That’s the way these things work — and it’s why you don’t roll things up into types right away. Give it time. The important thing was that I stayed pure FP — no new data, no mutable fields, no for/next loops. I didn’t even use closures. As long as I stay clean, the code will continue to “collapse down” as it matures. Fun stuff. A different kind of fun than OO.
So where would this code end up, assuming it lives to become something useful and re-usable? In the array type, of course. Over time, functions migrate up into CLR types. If I want a random item from an array? I just ask it for one. Here’s the code for that.
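As a rough illustration only, here is a simplified standalone sketch of a random-item helper. It is written as a plain function rather than attached to the array type, and both the name `randomItem` and the choice to pass in the `System.Random` instance and return an option are assumptions made for this example.

```fsharp
// Hypothetical helper: picks one item at random, or None for an empty array.
// The random number generator is passed in so the caller controls seeding.
let randomItem (rnd: System.Random) (items: 'a[]) =
    if Array.isEmpty items then None
    else Some (items.[rnd.Next(items.Length)])

// Usage: the result is always one of the items in the array.
let rnd = System.Random()
let picked = randomItem rnd [| "a.txt"; "b.txt"; "c.txt" |]
```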
Let me tell you, that was a painful function to work through! Happy I don’t have to ever worry about it again. Likewise, if I need to know how many times one string is inside another? I’ve got a string method for that. Basically anything I need to use a lot, I’ve automated it.
Over time, this gives me 40-50 symbols to manipulate in my head to solve any kind of problem. So while the coding part makes my brain hurt more with FP, maintenance and understanding of existing code is actually much, much easier. And with pure FP, everything I need is right there coming into the function. No dependency hell when I debug. It’s all right there in the IDE. Not that I debug using the IDE that much.
So does that mean I never create new types? Not at all! But that’s a story for another day…
September 15, 2014