The Mythical Man-Month

Featured image credit to Patrick Tomasso.

The Mythical Man-Month was one of those seminal texts that I had never got around to reading. First published in 1975, it’s often mentioned as one of those books which “should be on every Developer’s bookshelf”. So when I got a copy earlier this week, I set about righting that wrong.

Overall I found the book to be well written and the content to be staggeringly relevant. That’s no mean feat considering the book is over four decades old. That being said, there were more than a few arguments which I found somewhat controversial and some I disagreed with altogether. A unique characteristic of the anniversary edition is that the author revisits the points made in the original edition, reflecting on them and providing some additional context and justification.

When I read a book such as this I like to identify a small number of takeaways: things worth bearing in mind when doing the day job, as they may well make things a little better. Here are my top five for The Mythical Man-Month:

  1. Senior people must be kept technically and emotionally ready to jump back into the code (Architects, I’m looking at you)
  2. Each team should have its own set of specialised tools in order to do its job efficiently
  3. Programming productivity can be increased many times when using a suitably high-level language
  4. Delegate power down to the individuals — creativity in the small
  5. Adding more people to an already late project (often) makes it much later

Should you read the Mythical Man-Month? Yes. If not for the timeless pearls of software wisdom, then for a first-hand insight into the wild west that was the 1970s software landscape.

Geospatial Software Development – The Geometries

Featured image credit to OpenStreetMap contributors, full copyright information available here.

For the past few years I’ve been working in the Geospatial domain, developing various Geographic Information Systems (GIS). Although some consider GIS a bit of a dark art (and in some ways they’re right), it’s actually fairly straightforward to get to grips with the underlying concepts.

The Data

Geographic data is broadly found in two different representations. Firstly you have raster data, which are images. If you bought a map in a shop you would likely be looking at a raster representation of some geographic data. The other representation is vector data. I’m sure there’s probably a super-accurate mathsy definition of vector data, but you’re really just dealing with geometries — shapes. It’s common for geospatial data to be stored in vector form and then used to create rasters. We’ll focus on the vector representation of geospatial data here.

The Geometries

We need a standard way of representing the vector data. Just like the “basic” programming types such as Boolean, String, Integer and so on, there is a set of geometry types which you’ll see across languages and GIS projects. The geometries available tend to differ slightly depending on what you’re developing with, so we’ll cover the more common ones here. We’ll also focus on the 2D geometries, although 3D geometries do exist.

There is a multitude of ways that these geometries can be represented. For example, you have the ESRI Shapefile, GML, KML, GeoJSON and many more. For familiarity and simplicity, we’ll use GeoJSON examples.

Point

Let’s say you take a map and stick a pin in it: you’ve just created a Point. This Point has an x value and a y value, and that’s it.

{
    "type": "Point",
    "coordinates": [
        102.0, // Here's your x value
        0.5    // Here's your y value
    ]
 }

MultiPoint

A MultiPoint is a collection of Points. To keep up the analogy, let’s say you decided to stick three pins in your map, perhaps to represent places you would like to visit. You now have a MultiPoint.

{
 "type": "MultiPoint",
 "coordinates": [
     [             // Here's your first Point
         100.0,
         0.0
     ],
     [             // Here's your second
         101.0,
         1.0
     ],
     [             // Here's your third
         102.0,
         1.0
     ]
 ]
} 

LineString

A LineString is a line drawn on a map. Let’s say you place two pins on a map, one to represent where you are and another to represent where you are travelling to. You link the pins with some string. There you have it, a LineString.

{
  "type": "LineString",
  "coordinates": [
    [
      100.0,
      0.0
    ],
    [
      101.0,
      1.0
    ]
  ]
}

You may have noticed that the structure of the LineString example is very similar to the MultiPoint example. This makes sense, as we are dealing with a bunch of Points in both cases; it just so happens that in the LineString case we join the Points up to create a line.

MultiLineString

You’ve probably guessed by now, but a MultiLineString is a collection of LineStrings. Let’s say you were planning a trip for multiple people and needed to plot each of their travel routes on a map, just like you did in the LineString example. Multiple lines in a single geometry: a MultiLineString.

{
 "type": "MultiLineString",
 "coordinates": [
     [               // First line
         [
             100.0,
             0.0
         ],
         [
             101.0,
             1.0
         ]
     ],
     [              // Second line
         [
             102.0,
             2.0
         ],
         [
             103.0,
             3.0
         ]
     ]
 ]
}

Polygon

A Polygon is a shape drawn on a map in which the first Point and the last Point are identical — it’s a “closed” shape. Let’s say you wanted to plot the area of your house on a map: you would stick a pin at each corner of your house and join up the pins with string, visiting each pin in turn to create a shape. A Polygon can be as simple or as complex as it needs to be; it just needs to close.

{
  "type": "Polygon",
  "coordinates": [
    [
      [          // Here's your first Point
        100.0,
        0.0
      ],
      [
        101.0,
        0.0
      ],
      [
        101.0,
        1.0
      ],
      [
        100.0,
        1.0
      ],
      [         // Here's the last Point, note that they're the same
        100.0,
        0.0
      ]
    ]
  ]
}

It is valid for a Polygon to contain more than one ring. For example, let’s say you wanted to plot a donut-like shape on a map: you would have a single Polygon with one ring for the outer boundary and another ring for the inner hole.
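
As a rough sketch, such a donut Polygon could look something like the following in GeoJSON (the coordinate values are just as arbitrary as in the other examples). The first ring is the exterior boundary; any further rings describe holes:

{
  "type": "Polygon",
  "coordinates": [
    [                // The outer ring
      [100.0, 0.0],
      [103.0, 0.0],
      [103.0, 3.0],
      [100.0, 3.0],
      [100.0, 0.0]
    ],
    [                // The inner ring (the hole in the donut)
      [101.0, 1.0],
      [101.0, 2.0],
      [102.0, 2.0],
      [102.0, 1.0],
      [101.0, 1.0]
    ]
  ]
}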

MultiPolygon

For when a single Polygon is just not enough. Let’s say you were plotting a University campus, or some other place made up of multiple buildings: all the buildings are “the campus”, but they are also separate buildings. We could represent each of the buildings as a Polygon and then wrap them in a MultiPolygon so that they all belong to the same geometry.

{
  "type": "MultiPolygon",
  "coordinates": [
    [               // Polygon 1
      [            
        [
          102.0,
          2.0
        ],
        [
          103.0,
          2.0
        ],
        [
          103.0,
          3.0
        ],
        [
          102.0,
          3.0
        ],
        [
          102.0,
          2.0
        ]
      ]
    ],
    [                  // Polygon 2
      [
        [
          100.0,
          0.0
        ],
        [
          101.0,
          0.0
        ],
        [
          101.0,
          1.0
        ],
        [
          100.0,
          1.0
        ],
        [
          100.0,
          0.0
        ]
      ]
    ]
  ]
}

Yes, with a MultiPolygon it would even be possible to plot a University campus made entirely of donut-shaped buildings.
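
If you’re curious, here’s a rough sketch of how that might look: a MultiPolygon containing two “donut” buildings, where each Polygon carries its outer ring followed by an inner ring for the hole (coordinates arbitrary, as always):

{
  "type": "MultiPolygon",
  "coordinates": [
    [                // Building 1
      [              // Its outer ring
        [100.0, 0.0],
        [103.0, 0.0],
        [103.0, 3.0],
        [100.0, 3.0],
        [100.0, 0.0]
      ],
      [              // Its inner ring (the hole)
        [101.0, 1.0],
        [101.0, 2.0],
        [102.0, 2.0],
        [102.0, 1.0],
        [101.0, 1.0]
      ]
    ],
    [                // Building 2
      [
        [105.0, 0.0],
        [108.0, 0.0],
        [108.0, 3.0],
        [105.0, 3.0],
        [105.0, 0.0]
      ],
      [
        [106.0, 1.0],
        [106.0, 2.0],
        [107.0, 2.0],
        [107.0, 1.0],
        [106.0, 1.0]
      ]
    ]
  ]
}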

Wrap-up

For more information on what sorts of geometries are available, check out the OGC’s “Simple Features” standard or the ESRI Shapefile technical description.

You may have noticed that the numeric values we’ve been using for the geometry examples have been pretty arbitrary, and that’s because they are. To get a handle on how we put a shape in a certain place on the planet, we’ll need to discuss Spatial Reference Systems (SRS), which we’ll look at in the next post.

Comments? Suggestions? Drop me a comment below.

Takeaways from QCon London 2017 – Day 3

Here’s Day 3. Day 1 can be found here and Day 2 can be found here.

The Talks

  1. Avoiding Alerts Overload From Microservices with Sarah Wells
  2. How to Backdoor Invulnerable Code with Josh Schwartz
  3. Spotify’s Reliable Event Delivery System with Igor Maravic
  4. Event Sourcing on the JVM with Greg Young
  5. Using FlameGraphs To Illuminate The JVM with Nitsan Wakart
  6. This Will Cut You: Go’s Sharper Edges with Thomas Shadwell

Avoiding Alerts Overload From Microservices

  • Actively slim down your alerts to only those for which action is needed
  • “Domino alerts” are a problem in a microservices environment — one service goes down and all dependent services fire alerts
  • Uses Splunk for log aggregation
  • Dashing mentioned for custom dashboards
  • Graphite and Grafana mentioned for metrics
  • Use transaction IDs (uses UUIDs) in the headers of requests to tie them all together
  • Each service to report own health with a standard “health check endpoint”
  • All errors in a service are logged and then graphed
  • Rank the importance of your services – Should you be woken up when service X goes down?
  • Have “Ops Cops” — Developers charged with checking alerts during the day
  • Deliberately break things to ensure alerts are triggered
  • Only services containing business logic should alert

How to Backdoor Invulnerable Code

A highly enjoyable talk of infosec war stories. 

Spotify’s Reliable Event Delivery System

  • The Spotify clients generate an event for each user interaction
  • The system is built on guaranteed message delivery
  • Runs on Google Cloud Platform
  • Hadoop and Hive used on the backend
  • Events are dropped into hourly “buckets”
  • Write it, run it culture
  • System monitoring for:
    • Data monitors – message timeliness SLAs
    • Auditing – 100% delivery
  • Microservices based system
  • Uses Elasticsearch + Kibana
  • Uses CPU based autoscaling with Docker
  • All services are stateless — cloud pub/sub
  • Machines are built with Puppet for legacy reasons
  • Apparently, Spotify experienced a lot of problems with Docker — at least once an hour
  • Services are written in Python
  • Looking to investigate Rocket in future

Event Sourcing on the JVM

  • Event sourcing is inherently functional
  • A single data model is almost never appropriate; event sourcing can feed many and keep them in sync, e.g.:
    • RDBMS
    • NoSQL
    • GraphDB
  • Kafka can be used as an event store by configuring it to persist data for a long time; however, this isn’t what it is currently intended for
  • Event Store mentioned
  • Axon Framework mentioned
    • Mature
  • Eventuate mentioned
    • Great for distributed environments/geolocated data
  • Akka.persistence
    • Great, but needs other Akka libraries
  • Reactive Streams will be a big help when dealing with event sourcing

Using FlameGraphs To Illuminate The JVM

  • Base performance on requirements
  • Flamegraphs come out of Netflix
  • Visualisation of profiled software
  • First must collect Java stacks
  • Java VisualVM mentioned
  • Linux Perf mentioned

This Will Cut You: Go’s Sharper Edges

  • It is possible, in some cases, to cause Go to crash by reading input (JSON, XML, etc.) without closing tags — it just tries to read forever (a DoS attack)
  • Go doesn’t have an upload size limit; put your Go servers behind a proxy with an upload size limit to mitigate this, e.g. NGINX or Apache HTTP Server
  • Go doesn’t have CSRF protection built in; this must be added manually
  • DNS rebinding attacks may be possible against Go servers

That about wraps it up for my summary of QCon London 2017.

Takeaways from QCon London 2017 – Day 2

Now for Day 2. If you haven’t caught up with Day 1, check it out here; Day 3 can be found here.

The Talks

  1. Deliver Docker Container Continuously in AWS with Philipp Garbe
  2. Continuous Delivery the Hard Way with Kubernetes with Luke Marsden
  3. Low Latency Trading Architecture at LMAX Exchange with Sam Adams
  4. Why We Chose Erlang over Java, Scala, Go, C with Colin Hemmings
  5. Scaling Instagram Architecture with Lisa Guo
  6. Deep Learning @ Google Scale: Smart Reply In Inbox with Anjuli Kannan

Deliver Docker Container Continuously in AWS

  • A big pro for Amazon EC2 Container Service (ECS) over other container orchestrators is that you don’t have to worry about the cluster state as this is managed for you
  • AWS CloudFormation is the suggested way to manage an ECS cluster, although there was also a mention of Hashicorp’s Terraform
  • Suggested to use Amazon’s Docker registry when using ECS
  • AWS CloudFormation or the CLI suggested for deployment
  • There are 2 load balancers to choose from, the Application Load Balancer (ALB) and the Classic Load Balancer (ELB)
    • The ALB only uses HTTP, but has more features
    • The ELB does HTTPS, but only allows for static port mapping — this only allows for one service per port per VM
    • ALB required for scaling on the same VM
  • Suggested to use load balancing rules based on memory and CPU usage
  • ECS does not yet support newer Docker features, such as health checks
  • The Elastic Block Store (EBS) volume is per VM and doesn’t scale that well
  • The Elastic File System (EFS) scales automatically and is suggested
  • You can have granular access controls in ECS by using AWS Identity and Access Management (IAM)
  • Challenges currently exist when using the EC2 metadata service in a container
  • ECS does not support the Docker Compose file
  • ECS does not natively support Docker volumes

Continuous Delivery the Hard Way with Kubernetes (@Weaveworks)

  • Weaveworks use their git master branch to represent production
  • Weave use Gitlab for their delivery pipeline
  • Katacoda was used to give a Kubernetes live demo and it all worked rather well
  • Plug for Weaveworks Flux release manager

Low Latency Trading Architecture at LMAX Exchange

  • Manages to get impressively low latency using “plain Java”
  • Makes great use of the Disruptor Pattern — lots of Ring Buffers
  • Focus on message passing
  • Minimise the amount of network hops
  • Uses in-house hardware, not the cloud
  • Uses async pub-sub using UDP
    • Low latency
    • Scalable
    • Unreliable
  • Mention of Javassist
  • Stores a lot of things in memory, not the database
  • Describes using an Event Logging approach
  • Java primitives over objects for performance/memory management reasons
  • Makes use of Java Type annotations for type safety
  • Mention of the fastutil library
  • Mention of using the @Contended annotation
  • Uses the commercial JVM Zing for improved garbage collection and performance
  • Mentions manually mapping Java threads to CPU cores using JNI/JNA for increased performance

Why We Chose Erlang over Java, Scala, Go, C

  • Develops Outlyer monitoring system
  • Version 1 was a MEAN stack monolith with a Python data collection agent
  • Version 2 was microservices
  • Focus on separating stateless behavior services from stateful services for scaling reasons
  • Uses RabbitMQ for async communication between services
  • Uses Riak for timeseries data
  • Added a Redis cache layer to improve performance
  • Erlang the movie?
  • Erlang uses the Actor Model, let it crash!
  • Erlang has pre-canned “behaviours”
  • Erlang has an observer GUI — allows for tracing and interaction with a live application
  • Erlang offers live code reload
  • Erlang has a supposedly weird syntax
  • Elixir is a newer language that runs on the Erlang VM — its syntax is supposedly more like Ruby
  • Mentions using DalmatinerDB
  • Outlyer blog for more information

Scaling Instagram Architecture

  • Instagram runs on AWS
  • Python + Django, Cassandra, Postgres, Memcache and RabbitMQ.
    • Postgres for users, media and friendship
    • Cassandra for user feeds and activities
  • Memcache is used to avoid constantly hitting the DB
  • Uses perf_event on Linux
  • Focus on reducing latency
  • Sometimes migrates regularly used code from Python to C for improved performance — Cython mentioned
  • Always-on custom performance analysis in production – a performance hit, but for a lot of insight
  • Disabled the Python garbage collector for performance reasons
  • Uses Python asyncio where there were previously sequential service calls
  • Uses Tao linked DB
  • 40-60 deploys per day to 20,000+ servers with each deploy taking about 10 minutes!

Deep Learning @ Google Scale: Smart Reply In Inbox

  • Google Inbox’s smart reply learns all responses from data
  • Explains deep learning concepts using a feed forward network which uses logistic regression and gradient descent
  • Data fed into the network must be a vector of floats
    • All words in a dictionary are given a numerical index
    • A string of words is converted into a vector of its equivalent numerical representation
    • Dimensionality reduction is employed to produce an “embedded vector” — essentially a lossy compression algorithm
  • Deep Learning models allow for automatic feature learning
  • See the paper Distributed Representations of Words and Phrases and their Compositionality
  • A Recurrent Neural Network is used by this project and by natural language processing in general
  • See the paper A Neural Conversational Model
  • The project can’t match tone — yet.
  • A whitelist is used to prevent bad language
  • The current approach doesn’t always give diverse answers
  • Google Translate uses a very similar model
  • Mentions colah.github.io for more information

That wraps up Day 2; here’s Day 3.

Takeaways from QCon London 2017 – Day 1

As a longtime reader of InfoQ, I’ve always had QCon on my radar, and this year I finally managed to attend QCon London, now in its 11th year. For those that haven’t heard of QCon, it’s a top-notch software development conference, held all around the world. It’s spread over three days with each day broken into tracks. Each track has a particular focus, such as “Architecting for Failure”, “Containers: State of the Art” or my personal favourite, “Security: Lessons Learned From Being Pwned”.

Overall, I found the conference to be of a superbly high quality. The talks were relevant and well delivered, the venue ideal and the food abundant. What follows are my takeaways from day 1.

The Talks

  1. Strategic Code Deletion with Michael Feathers
  2. Using Quality Views To Tackle Tech Debt with Colin Breck of Tesla
  3. Continuously Delivering Security In The Cloud with Casey West
  4. From Microliths To Microsystems with Jonas Bonér
  5. Building a Data Science Capability From Scratch with Victor Hu
  6. Crushing Tech Debt Through Automation at Coinbase with Rob Witoff

Strategic Code Deletion

  • If you don’t already have it, a test coverage metric is a great place to start measuring tech debt
  • Mutation testing is an even better way of measuring true test coverage, but the tooling is pretty lacking
  • If code has very little value, consider removing it
  • Mentions using Martin Fowler’s “Strangler Application” approach

Using Quality Views To Tackle Tech Debt

I recommend that you check out Colin’s blog post for the details on Quality Views.

Continuously Delivering Security In The Cloud

  • A moving target is harder to hit so regularly create and destroy containers — this helps to ensure an attacker does not gain persistence
  • Secrets can be rotated using a secrets manager like Hashicorp’s Vault

From Microliths To Microsystems

A summary of the talk can be found on InfoQ.

  • Reactive Design Principles help when developing microservice architectures
  • Separate stateless behaviour from stateful entities in a microservice architecture — this helps with scaling
  • References to Pat Helland’s paper “Data on the Outside versus Data on the Inside”
  • Use Domain-Driven Design, but don’t focus on the things (nouns); instead focus on what happens (events)
  • CQRS and Event Sourcing mentioned as useful techniques

Building a Data Science Capability From Scratch

Although a good talk consisting of first-hand experiences, there wasn’t a key takeaway for me.

Crushing Tech Debt Through Automation at Coinbase

  • We need to move fast to survive; technical debt slows us down
  • Build in automated guardrails to ensure that we can move fast without bringing down production
  • Everyone at Coinbase can deploy a tested master branch to production
  • Focus on building a blameless culture
  • Create archetype projects to speed up creation of new services
  • Do security checks in the build pipeline
  • Hashicorp’s Vault mentioned again here
  • Coinbase uses Docker — no container lasts for more than 30 days

That wraps it up for Day 1. Day 2 can be found here and Day 3 can be found here.

Moving from Subversion to Git

Most of the teams I’ve worked with in the past have moved repositories from Subversion to Git, almost always because they started off with Subversion but wanted to follow a workflow that makes it easier for larger teams to work together. They wanted to “scale”.

Did Git deliver on this? Well, yeah actually. Now, it’s worth mentioning that both Subversion and Git are solid version control systems. Does Subversion really fall that far short of Git? The later versions of Subversion do solve some of the pain points of the earlier versions, but time and time again I’ve seen choices made to go with Git’s distributed paradigm over Subversion’s centralised paradigm.

Subversion on the left, Git on the right.

In fact, just last week I worked with a team who had traditionally used Subversion, but had made the choice to start their greenfield project using Git. I delivered a short, one-hour workshop with the intention of helping a Developer with a background in Subversion make the move to Git.

The Workshop

The workshop can be found here.

The workshop started with a ten-minute exercise in which the team represented their current Subversion workflow using cards and arrows stuck onto a whiteboard. Blank cards were provided for any steps not already covered, but in this case the pre-printed cards covered all the steps the team usually takes. This was a pretty interesting activity to observe, and it gave us a workflow to compare against the proposed Git workflow later in the presentation.

Following the exercise I talked through the presentation, breaking for questions and the practical exercises. Something to note here: when it comes to the practical exercises, be well acquainted with the permissions of your Git repository! Unfortunately, during this delivery not everyone was able to access the repository that had been created, so some were unable to complete the practical exercises on their own machines. The demo on the projector sufficed, but was nowhere near as engaging as actually “doing the thing”. Fortunately, as the presentation is freely available, everyone has the opportunity to come back to it later.

Questions

Some questions asked after the presentation:

What’s the best Git repository naming convention?

Although there’s no real right answer here, I like to use lowercased, hyphen-separated names. As a general rule I also like names to start high level and, where it makes sense, get more detailed. For example:

${PROJECT_NAME}-${REPO_PURPOSE}-${LANGUAGE}

practice-ml-linear-regression-python
workshop-svn-to-git
workshop-intro-to-scala

This has the benefit that you can scan through a list of projects pretty quickly and find what you are looking for. When you list the contents of the directory containing your cloned projects, all the like projects are grouped together.

What should I name my branches?

Again, there’s no right answer. However, if you’re using a bug tracker like Github’s issue tracker, Atlassian’s Jira, etc. you will likely have an issue number like “abc-123”; this makes a perfect branch name. Some people like to provide more context and name the branch for their feature, such as “fix-request-timeout”. Some like to do both: “abc-123-fix-request-timeout”.

How long should feature branches live for?

This one is a tough one, because a feature branch can be periodically updated from and merged into a main branch, but still exist without causing much of a problem. A good answer might be “for as long as it makes sense”. Merged your feature branch and closed your ticket? Delete the feature branch. Merged your feature branch but not yet completely done implementing your feature? No harm in keeping it open to work on later.

How often should I update my feature branch from the main branch?

I would say right before raising a pull request, at a minimum. Again this depends on how often your team commits to the main branch. Churning out five closed pull requests an hour? Pull very often, like every 15 minutes. Closing two or three pull requests a day? Pull perhaps once an hour, or less frequently. Remember, frequent pulls help to avoid nasty merge conflicts!

The diagram shows a master, dev and feature branch. What’s the difference between a master branch and a dev branch?

You mean this diagram?

(Diagram: master, dev and feature branches)

This, again, depends on your workflow. In a team doing Continuous Deployment you might say that everything that gets onto the master branch gets into production. So, by having a dev branch we allow the team to integrate and test their features before the big push. What if you take cuts from a main branch and release that? You could get away with just having a master branch.

Conclusion

Overall the workshop was well received and got the ball rolling for a room of Developers to learn Git. Feel free to use the material to deliver your own workshop. Found a bug or typo? Please raise a pull request in Github. Any questions or feedback? Leave a comment below.

Intro to Scala

A little while ago I designed and delivered a short workshop with the goal of introducing the Scala programming language to Developers from a Java background.

As far as programming languages go, Scala is in my top 5. The designers of Scala saw the value in the Functional Programming (FP) paradigm and looked to marry it with the Object-Oriented (OO) paradigm from day one. This approach has only recently been adopted in Java 8 through the Streams API.

So if Java’s going functional, why learn Scala at all? There are a lot of advanced language features that Scala offers that Java does not. Scala is also completely interoperable with your Java projects. Not only that, but Scala is a nice entry point for some of the “purer” functional languages such as Haskell and Erlang.

The code for the workshop is available here.

Think you found a bug in the code? Feel free to raise a Github issue. Have questions? Feel free to leave a comment below!