What’s a good metric for programming language usage?

The whole “which programming language is most popular” debate was kicked off in my mind today by a tweet from @kellabyte.  She tweeted

I was outraged that a well-respected blogger/tweeter such as kellabyte would tweet horrific lies of this sort. “This is exactly”, I thought, “the problem with our industry – too many people corrupted by fame and supporting their own visual basic.net related agendas.” Of course I was wrong: kellabyte has no interest in VB and her numbers were not wrong.

I have always relied on TIOBE’s measurement of programming language popularity to give me an idea of what the top languages are. I think this is likely kellabyte’s source also. The methodology used is quite extensively outlined at http://www.tiobe.com/index.php/content/paperinfo/tpci/tpci_definition.htm. If you don’t fancy reading all that, the gist is that they use a series of search engines and count the number of results. The ebb and flow of these numbers is what makes up the rankings.

Obviously there are a number of flaws in this methodology:

  1. The algorithms used by the search engines are not static
  2. Not all programming languages are equally likely to be written about
  3. Languages and technologies are often conflated

Let’s look at each one of those. The search engine market is a constantly changing landscape. Google and Bing are always working towards improving ranking and how results are reported, so there is going to be some necessary churn around ranking changes. TIOBE averages out a number of search engines in the hope of normalizing that problem. They use 23 different search engines, which is a good number, but many of them are very specialized search engines such as Deviant Art. Certain search engines are also given higher weighting; Google, for instance, contributes 28% of the final score. In fact the top 3 search engines account for 69% of the score. I’m no statistician but that doesn’t seem like a good distribution. Interestingly 4 out of the top 5 sources are Google properties, with the 5th (Wikipedia) being heavily sponsored by Google.
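
To put some numbers on that concentration problem, here is a quick C# sketch of how lopsided the weighting is. Other than Google’s 28% and the top three adding up to 69% (the figures mentioned above), every engine name and weight below is invented for illustration.

```csharp
// Hypothetical per-engine weights. Only Google's 28% and the top three
// summing to 69% come from the figures mentioned above; the rest are invented.
using System;
using System.Collections.Generic;
using System.Linq;

static class WeightConcentration
{
    static void Main()
    {
        var weights = new Dictionary<string, double>
        {
            ["Google"] = 0.28, ["Google Blogs"] = 0.23, ["YouTube"] = 0.18,
            ["Bing"] = 0.08, ["Wikipedia"] = 0.07, ["Remaining 18 engines"] = 0.16
        };

        // Walk the engines from heaviest to lightest and show how quickly
        // the cumulative share of the final score piles up.
        double cumulative = 0;
        foreach (var entry in weights.OrderByDescending(w => w.Value))
        {
            cumulative += entry.Value;
            Console.WriteLine($"{entry.Key,-25} {entry.Value,6:P0}   cumulative {cumulative:P0}");
        }
    }
}
```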

The second point is that programming languages are not all equally likely to be written about. My feeling is that newer languages and “cooler” languages gain an unfair advantage here. People are much more likely to be blogging about them than about something boring like VBA. I would say half the code I’ve written in the last 6 months has been VBA but I don’t believe I have more than 2 blog posts on that topic.

The third point is that languages and technologies are often conflated, and I’m guilty of this: when I talk about .net in most cases I’m really talking about C#. Equally when people talk about Rails they’re talking about Ruby. I’m not convinced that this distinction is well captured in TIOBE. It is a difficult problem because a search for “rails” is likely to return far more hits than just those related to programming. Context is important and without some natural language processing capabilities I don’t see how TIOBE can be accurate.

The alternatives to TIOBE are not particularly promising. James McKay suggested that looking at job postings and GitHub projects would be a better metric. He specifically mentioned the job aggregator http://www.itjobswatch.co.uk/. I’ve been thinking about this and it seems like a pretty good metric. The majority of development is likely done inside companies, so looking at a job site gives a window into their inner workings. Where it falls down is with companies too small to post jobs and with open source software. The counterbalance to that is found in GitHub statistics. These statistics are likely to have the opposite bias, favoring upstart languages and open source contributions. I think we’re at the point where if you’re running an open source project you’re running it on GitHub, which makes it an invaluable source of data.

To the mix I would add Stack Overflow as a source of numbers. It is a large enough question and answer site now to be a great source of data. I’m not sure what the biases would be there – C# perhaps?

Combining these statistics would be an interesting exercise – perhaps one for a quickly approaching winter’s day.
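
If I did sit down to it on that winter’s day, the exercise would look roughly like this: turn each source into a share of its own total and then blend the shares. Everything below – the languages, the counts, the equal weighting – is made up for the sake of the sketch.

```csharp
// A blended popularity score from job postings, GitHub repositories and
// Stack Overflow questions. All of the counts below are invented; the equal
// one-third weighting is just a starting point.
using System;
using System.Collections.Generic;
using System.Linq;

static class BlendedPopularity
{
    record LanguageStats(string Name, int JobPostings, int GitHubRepos, int StackOverflowQuestions);

    static void Main()
    {
        var stats = new List<LanguageStats>
        {
            new("C#",   12_000, 40_000, 500_000),
            new("Ruby",  3_000, 80_000, 150_000),
            new("VBA",   1_500,  2_000,  20_000),
        };

        double jobTotal  = stats.Sum(s => s.JobPostings);
        double repoTotal = stats.Sum(s => s.GitHubRepos);
        double soTotal   = stats.Sum(s => s.StackOverflowQuestions);

        foreach (var s in stats)
        {
            // Each source is reduced to a share of its own total so that the
            // biases of one source can't swamp the others by sheer volume.
            double score = (s.JobPostings / jobTotal
                          + s.GitHubRepos / repoTotal
                          + s.StackOverflowQuestions / soTotal) / 3.0;
            Console.WriteLine($"{s.Name}: {score:P1}");
        }
    }
}
```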

Document control and DDD/CQRS – solving similar problems

I had the good fortune to have a two-hour introduction to the world of document control the other day. It was refreshing to see that we programmers aren’t the only ones who don’t have things figured out yet. The entire document control process is an exercise in managing the flow and ownership of data. I spent a lot of time thinking about how closely the document control problem and the data flow problem mirror each other.

Document control is really interested in documents and doesn’t care at all about their contents. Their concerns are largely around:

  • who owns this document?
  • what is the latest version of this document?
  • how is this document identified?
  • how long do I have to keep this document?
  • is this document superseded by some other document?

These sound a lot like issues we deal with when using DDD. Document ownership is simply a problem of knowing in which aggregate root a document belongs. Document versioning is similar to maintaining an event stream. Document identification is typically done through numbering – however, the flow of documents is slow enough that sequential numbering isn’t a problem, so there is no need for a randomly generated GUID.
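
To make that mapping concrete, here is a bare-bones C# sketch of a document as a DDD-style aggregate with an event stream. The type and event names are my own invention, not anything from a real framework or from an actual document control system.

```csharp
// Events form the version history of the document, much like revisions in a
// document control system. The names are illustrative only.
using System;
using System.Collections.Generic;

abstract record DocumentEvent(DateTime OccurredAt);
record DocumentCreated(DateTime OccurredAt, string Title, Guid OwnerId) : DocumentEvent(OccurredAt);
record DocumentRevised(DateTime OccurredAt, int NewRevision) : DocumentEvent(OccurredAt);

class Document
{
    private readonly List<DocumentEvent> _events = new();

    // The identity is a meaningless surrogate; a "document number" with
    // business meaning would live alongside it as an ordinary field.
    public Guid Id { get; } = Guid.NewGuid();
    public Guid OwnerId { get; private set; }          // who owns this document?
    public int Revision { get; private set; }          // what is the latest version?
    public IReadOnlyList<DocumentEvent> History => _events;

    public Document(string title, Guid ownerId) =>
        Apply(new DocumentCreated(DateTime.UtcNow, title, ownerId));

    public void Revise() =>
        Apply(new DocumentRevised(DateTime.UtcNow, Revision + 1));

    private void Apply(DocumentEvent e)
    {
        _events.Add(e);
        switch (e)
        {
            case DocumentCreated c: OwnerId = c.OwnerId; Revision = 1; break;
            case DocumentRevised r: Revision = r.NewRevision; break;
        }
    }
}

static class DocumentDemo
{
    static void Main()
    {
        var doc = new Document("Piping diagram", ownerId: Guid.NewGuid());
        doc.Revise();
        Console.WriteLine($"{doc.Id} is at revision {doc.Revision} after {doc.History.Count} events");
    }
}
```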

Document retention isn’t something we spend much time on in CQRS land. Storage is cheap so we just keep every version around, or at least we’re able to generate every version through event sourcing. Perhaps the most congruent concept is taking snapshots of aggregates, but we’re typically only interested in the most recent version of the aggregate. With document control there is always some degree of manual intervention with documents, so there is a significant cost to retaining all documents indefinitely. I’m only talking about digital copies of documents here; Zuul protect you if you need to track paper copies of things too. I can’t even keep track of my keys let alone tens of thousands of documents. My strategy for paper documents would be to burn them as soon as I got them and refer people to the digital version.
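
For comparison, the snapshot idea amounts to something like this: rebuild state from the latest snapshot plus only the events that came after it. Again, the types here are invented for the example.

```csharp
// Rebuilding current state from the latest snapshot plus the events that
// arrived after it, instead of replaying the full history.
using System;
using System.Collections.Generic;
using System.Linq;

record RevisionAdded(int Version, string Summary);
record Snapshot(int Version, string Summary);

static class SnapshotRebuild
{
    static (int Version, string Summary) Rebuild(Snapshot snapshot, IEnumerable<RevisionAdded> events)
    {
        var state = (snapshot.Version, snapshot.Summary);

        // Only events newer than the snapshot need to be applied.
        foreach (var e in events.Where(e => e.Version > snapshot.Version).OrderBy(e => e.Version))
            state = (e.Version, e.Summary);

        return state;
    }

    static void Main()
    {
        var snapshot = new Snapshot(3, "Data sheet, revision 3");
        var events = new[]
        {
            new RevisionAdded(2, "Data sheet, revision 2"),   // already folded into the snapshot
            new RevisionAdded(4, "Data sheet, revision 4"),
            new RevisionAdded(5, "Data sheet, revision 5"),
        };

        Console.WriteLine(Rebuild(snapshot, events));         // (5, Data sheet, revision 5)
    }
}
```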

Superseding documents also doesn’t seem like a problem we typically have in CQRS. In document control one or more documents may be superseded by one or more other documents. For instance we may have a lot of temporary documents created by the business – things like requests to move offices. They have value, but only in a transitory way. Every week the new office seating chart is built from these office move documents and the documents are discarded. Their purpose is complete and we no longer care about them as we have a summary document.

Many documents become one. I call it a Voltron operation.

In the opposite operation a document can be replaced by a series of documents. This activity is prevalent when adding detail: a single data sheet may become several documents as it is examined in more depth.

Reverse Voltron? Fan-out? The name may need some work.

This was originally going to be a post about how much we in the DDD/CQRS community have to learn from document control. I imagined that document control was a pretty old and well-defined problem and that there would surely be well-defined solutions. I did not get that impression.

The problem of the canonical source of truth, or “who owns the data”, is a very difficult one in document control. We’re spoiled in DDD because it is rare indeed that the owner of a piece of data changes during its lifespan. Typically the data would remain within an aggregate root (AR) and never be updated without the involvement of the AR. With document control it is probable that responsibility could jump from your AR to some other, possibly unknown, AR. It could then jump back. At any point in time it would be impossible to know, without querying every AR, who had control of the data. Of course with a distributed system, like many people working on a document, it is possible that there will be disagreement about which AR has responsibility at any one time. Yikes!

What we can learn from document control

I think that looking at document control gives us a window into what can happen when you relax some of the constraints around DDD. Data life-cycle is well defined in DDD and we know who owns data. If you don’t, you end up in trouble knowing who the source of truth is. Document control must solve this problem constantly, and it can only be done by going out and asking stakeholders a lot of questions – a time-consuming exercise.

The introduction of splitting and combining documents, or in our case aggregates, over their lifetime is disastrous. You lose the history of information and knowing where to apply events becomes difficult. Instead we should keep aggregates as unchanged as possible (in terms of what fields they have; obviously the data can change) and rely on projections of the data to create different views of the information. This is basically impossible to apply to formatted documents as you would have in document control.
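
By projections I mean something like the following: the events are never touched and each view is just a different fold over the same stream. The office-move events and both views are invented for illustration.

```csharp
// Two different read-side views built from the same untouched event stream.
using System;
using System.Collections.Generic;
using System.Linq;

record OfficeMoveRequested(string Employee, string ToDesk, DateTime When);

static class Projections
{
    // Projection one: the current seating chart.
    static Dictionary<string, string> SeatingChart(IEnumerable<OfficeMoveRequested> events) =>
        events.OrderBy(e => e.When)
              .Aggregate(new Dictionary<string, string>(), (chart, e) =>
              {
                  chart[e.Employee] = e.ToDesk;   // the latest move wins
                  return chart;
              });

    // Projection two, over the same events: how often each person has moved.
    static Dictionary<string, int> MoveCounts(IEnumerable<OfficeMoveRequested> events) =>
        events.GroupBy(e => e.Employee)
              .ToDictionary(g => g.Key, g => g.Count());

    static void Main()
    {
        var events = new[]
        {
            new OfficeMoveRequested("Alice", "B7", new DateTime(2013, 10, 1)),
            new OfficeMoveRequested("Bob",   "A1", new DateTime(2013, 10, 2)),
            new OfficeMoveRequested("Alice", "C9", new DateTime(2013, 10, 8)),
        };

        Console.WriteLine(string.Join(", ", SeatingChart(events).Select(kv => $"{kv.Key} -> {kv.Value}")));
        Console.WriteLine(string.Join(", ", MoveCounts(events).Select(kv => $"{kv.Key}: {kv.Value}")));
    }
}
```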

What I think would help out document control

The first thing which comes to mind as being directly applicable to document control is removing meaning from document identifiers. The documents that document control manages tend to be numbered, and the temptation to add meaning to a document number is too strong to turn down. For instance you might get a number like

P334E-TT-6554

In our imaginary scheme all documents which start with P are piping diagrams. The 334 denotes the system to which it belongs, E the operating pressure and TT the substance inside the pipe. The final digits are just incrementally assigned. The problem is just what you would expect: things change. When they do, a decision must be made either to leave the number intact and damage its reliability or to renumber the document and lose the history. Instead document control would do well to maintain an identifier whose sole purpose is to identify the document. The number can be retained, but only as a field.
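
In code the suggestion is as small as it sounds. This is only a sketch with made-up names, not a prescription for any real document management system: the GUID exists purely to identify, and renumbering touches nothing but a field.

```csharp
// The identifier exists only to identify; the meaningful document number is
// an ordinary field that can be corrected without losing the document's history.
using System;

class ControlledDocument
{
    public Guid Id { get; } = Guid.NewGuid();       // never changes, carries no meaning
    public string DocumentNumber { get; private set; }

    public ControlledDocument(string documentNumber) => DocumentNumber = documentNumber;

    // Renumbering touches only the field; anything keyed on Id stays valid.
    public void Renumber(string newNumber) => DocumentNumber = newNumber;
}

static class RenumberDemo
{
    static void Main()
    {
        var doc = new ControlledDocument("P334E-TT-6554");
        var id = doc.Id;

        doc.Renumber("P334E-WW-6554");              // the substance in the pipe changed
        Console.WriteLine(id == doc.Id);            // True: identity is unaffected
    }
}
```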

A more controversial assertion is that document control should retain all documentation. We retain a full history of the messages used to build an entity, even if that history is moved offline and a snapshot used in its place. I believe that document control should do the same thing. Merging and splitting documents is problematic and complicated. It is easier to just create a new document and reference the source documents. Ideally the generation of these new documents can be treated as a projection and the original documents retained.
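
Something along these lines is what I am picturing: the weekly seating chart from earlier becomes a new document that references the move requests it supersedes, and the originals stay put. All of the type names are invented.

```csharp
// The summary is a brand new document that references the documents it
// supersedes; the originals are retained rather than merged away.
using System;
using System.Collections.Generic;
using System.Linq;

record SourceDocument(Guid Id, string Title);
record SummaryDocument(Guid Id, string Title, IReadOnlyList<Guid> SupersededDocumentIds);

static class Supersede
{
    static SummaryDocument BuildWeeklySeatingChart(IEnumerable<SourceDocument> moveRequests) =>
        new(Guid.NewGuid(),
            "Seating chart, week of Oct 7",
            moveRequests.Select(d => d.Id).ToList());   // the originals stay retrievable by Id

    static void Main()
    {
        var moves = new[]
        {
            new SourceDocument(Guid.NewGuid(), "Move request: Alice"),
            new SourceDocument(Guid.NewGuid(), "Move request: Bob"),
        };

        var chart = BuildWeeklySeatingChart(moves);
        Console.WriteLine($"{chart.Title} supersedes {chart.SupersededDocumentIds.Count} documents");
    }
}
```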

In the end it is interesting to see how similar problem domains are solved by different people. That’s the beauty of learning a new development language: every language has different features and practices. I’m not, however, prepared to be the guy who learns document control in depth to bring that knowledge back to the community.

The City of Calgary doesn’t get open data

It seems that the City of Calgary has updated its open data portal. I was alerted to it not by some sort of announcement but by a tweet from Grant Neufeld, who isn’t a city employee and shouldn’t be my source of information on open data in Calgary.

The new site is better than the old one. They have done away with the concept of having to add data to a shopping cart and then check out with it. They have also made the data sets more obvious by putting them all in one table. They have also opened up an app showcase, which is a fantastic feature: it can’t hurt to cross-promote apps which make use of your data. There are also a few links to Google and Bing maps which integrate with the city’s provided KML files. As I’ve said before I’m not a GIS guy so most of that is way over my head.

It is a big step forward… well, it is a step forward. I know the city is busy with more important things than open data, but the improvements to the site are a couple of days’ worth of work at best. What frustrates me about the process is that despite having several years of lead time on this stuff the city is still not sure what open data is. I draw your attention to the FoIP requests CSV. The first thing you’ll notice is that despite being listed as a CSV it isn’t; it is an Excel document. Second, the format is not machine readable, at least not without some painful parsing of different rows. Third, the data is a summary and not the far more useful raw data. I bet there is some supposed reason that they can’t release detailed information. However, if FoIP requests aren’t public knowledge then I don’t know what would be.

Open data is not that difficult. I’ve reproduced here the 8 principles of open data from http://www.opengovdata.org/home/8principles

1. Data Must Be Complete

All public data are made available. Data are electronically stored information or recordings, including but not limited to documents, databases, transcripts, and audio/visual recordings. Public data are data that are not subject to valid privacy, security or privilege limitations, as governed by other statutes.

2. Data Must Be Primary

Data are published as collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.

3. Data Must Be Timely

Data are made available as quickly as necessary to preserve the value of the data.

4. Data Must Be Accessible

Data are available to the widest range of users for the widest range of purposes.

5. Data Must Be Machine processable

Data are reasonably structured to allow automated processing of it.

6. Access Must Be Non-Discriminatory

Data are available to anyone, with no requirement of registration.

7. Data Formats Must Be Non-Proprietary

Data are available in a format over which no entity has exclusive control.

8. Data Must Be License-free

Data are not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed as governed by other statutes.

The city is failing to meet a number of these. The principles are so simple that I just don’t get what they’re missing. The city employees aren’t stupid, so all I can conclude is that there is either a great deal of resistance to open data somewhere in the government or nobody is really convinced of its value yet. In either case we need a good push from the top to get going.

So I wrote a book

I’ve been pretty quiet on the old blog front as of late. This is largely attributable to me being busy with other things, the most interesting of which, in my mind, is that I wrote a book. It isn’t a very long book and it isn’t a very exciting book, but I’m still proud of having written the little guy. This post is less about the book itself and more about what it was like to write a book.

First off, it is a lot of work. Far more work than I was originally expecting. I’ve written lengthy things before, most notably about 100 pages during my master’s. This was different because I didn’t feel like I knew the content as well as I did for the master’s paper. Having restrictions on the length of the chapters was the most difficult part. Due to some confusion about the margins for a page I started by writing the equivalent of 15 pages for a 10-page chapter. I did this for 4 chapters before my editor caught it. I agreed to cut the content down and get back on track in accordance with the outline.

This was a mistake. It was really hard to cut content to that degree. A few words here or there was easy enough, but what amounted to a third of the chapter content? Tough. Later in the project I realized that keeping within the outline’s page counts was not nearly as important as I had been led to believe. After throwing the limits out the window the writing process became much easier.

In order to treat writing a book with the same agile approach one might use for developing software, it seems crucial not to involve page counts at all. A page count is a poor metric and I have no idea why one would optimize for it. Obviously there should be some rough guidelines for the whole thing: you don’t want to end up with a 1000-page book when you only set out to write 200, nor do you want 200 pages when you set out to write 1000. But writing to within 50% of the target length is reasonable.

To put too much emphasis on length is to lose sight of the goals of the book. These are much more along the lines of education or entertainment or something like that. The goal isn’t to kill X trees.

Are books still relevant?

Umm, <mumble> <mumble>. I don’t know, to be honest. I don’t read many programming books these days; I spend my time reading blogs and tutorials instead. I think there is still a space for paper-form technical books even in a fast moving world like computer programming. There is certainly a place for books about techniques or styles or about the craft of programming in general. I have some well-thumbed copies of Code Complete and The Pragmatic Programmer and even Clean Code. I do not, however, think there is a place for technology-specific paper books. That target moves too quickly.

The long-form technical document is not dead; it just needs to remain spry. If you’re going to publish a longer, book-style document then publishing it in a form which can be changed and updated easily is key. This is where wikis and services like Leanpub come into their own. As an author you need to keep updating the book or open it to a community which will do updates for you.

Would I do it again?

Not at the moment. Not through a traditional publisher. Not on my own.

I’ve had enough of writing books for now. I’m going to take a break from that, likely a long break. I might come back to it in a year or two but no sooner. I think I can understand why authors frequently have long breaks between their books. It is an exhausting slog, a death march really.

There was nothing wrong, per se, with Packt Publishing. They did pretty good work and I liked my editor, or editors, or many many editors… I’m not sure how many edits we had on some chapters. Frequently a chapter would be edited by person X and then those edits reversed by person Y. There didn’t seem to be an overall guiding hand responsible for ensuring a quality product. Good editing has to be the selling feature for publishers, the way they attract both authors and purchasers. It is the only thing which sets them apart from self-publishers.

Self-publishing and micro-printing are coming into their own now. By micro-printing I mean being able to produce small runs of books economically rather than printing in very small text. If I were to do it again I would take this route. I would also hire a top notch editor who would stay with the project the whole time, somebody like @SocksOnBackward (she would tell me that top notch should be top-notch).

I would also like to work with somebody. Writing alone is difficult because there is nobody off of whom you can bounce ideas. I certainly could have reached out to random people I know in the community but it is a lot to ask of them. I would be much happier having somebody who could share the whole endeavour with me.

I guess watch this spot to see if I end up writing another book.