Takeaways from QCon London 2017 – Day 3

Here’s day 3. Day 1 can be found here and Day 2 can be found here.

The Talks

  1. Avoiding Alerts Overload From Microservices with Sarah Wells
  2. How to Backdoor Invulnerable Code with Josh Schwartz
  3. Spotify’s Reliable Event Delivery System with Igor Maravic
  4. Event Sourcing on the JVM with Greg Young
  5. Using FlameGraphs To Illuminate The JVM with Nitsan Wakart
  6. This Will Cut You: Go’s Sharper Edges with Thomas Shadwell

Avoiding Alerts Overload From Microservices

  • Actively slim down your alerts to only those for which action is needed
  • “Domino alerts” are a problem in a microservices environment — one service goes down and all dependent services fire alerts
  • Uses Splunk for log aggregation
  • Dashing mentioned for custom dashboards
  • Graphite and Grafana mentioned for metrics
  • Use transaction IDs (UUIDs in their case) in request headers to tie a request’s log entries together
  • Each service should report its own health via a standard “health check endpoint” (see the sketch after this list)
  • All errors in a service are logged and then graphed
  • Rank the importance of your services – Should you be woken up when service X goes down?
  • Have “Ops Cops” — Developers charged with checking alerts during the day
  • Deliberately break things to ensure alerts are triggered
  • Only services containing business logic should alert
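
Since the health check and transaction ID points are concrete, here is a minimal sketch of what they can look like together, using only the JDK’s built-in HTTP server. The path, header name and response body are my own assumptions, not the speaker’s actual conventions.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class HealthCheckServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/health", exchange -> {
            // Reuse the caller's transaction ID, or mint one at the edge.
            String txId = exchange.getRequestHeaders().getFirst("X-Transaction-Id");
            if (txId == null) {
                txId = UUID.randomUUID().toString();
            }
            // Log with the transaction ID so entries across services tie together.
            System.out.println("transaction_id=" + txId + " path=/health status=ok");

            byte[] body = "{\"ok\": true}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("X-Transaction-Id", txId);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```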

How to Backdoor Invulnerable Code

A highly enjoyable talk full of infosec war stories.

Spotify’s Reliable Event Delivery System

  • The Spotify clients generate an event for each user interaction
  • The system is built on guaranteed message delivery
  • Runs on Google Cloud Platform
  • Hadoop and Hive used on the backend
  • Events are dropped into hourly “buckets” (see the sketch after this list)
  • A “write it, run it” culture
  • System monitoring for:
    • Data monitors – message timeliness SLAs
    • Auditing – 100% delivery
  • Microservices based system
  • Uses Elasticsearch + Kibana
  • Uses CPU based autoscaling with Docker
  • All services are stateless; queuing between them is handled by Cloud Pub/Sub
  • Machines are built with Puppet for legacy reasons
  • Apparently, Spotify experienced a lot of problems with Docker: at least one issue an hour
  • Services are written in Python
  • Looking to investigate Rocket in future
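
The hourly bucketing idea is simple enough to sketch. A toy version (names mine, not Spotify’s; I’m using Java for all code sketches in these posts even though Spotify’s services are Python):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Each event is filed under the hour in which it occurred, so a completed
// bucket can be audited for 100% delivery and handed to the Hadoop/Hive backend.
public class HourlyBucket {

    // Truncate a timestamp to the start of its hour, e.g. 2017-03-08T14:00:00Z
    static String bucketFor(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS).toString();
    }

    public static void main(String[] args) {
        System.out.println(bucketFor(Instant.parse("2017-03-08T14:37:21Z")));
        // prints 2017-03-08T14:00:00Z
    }
}
```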

Event Sourcing on the JVM

  • Event sourcing is inherently functional: current state is a fold over past events (see the sketch after this list)
  • A single data model is almost never appropriate; event sourcing can feed many and keep them in sync, e.g.:
    • RDBMS
    • NoSQL
    • GraphDB
  • Kafka can be used as an event store by configuring it to retain data for a long time, though that isn’t what it’s currently intended for
  • Event Store mentioned
  • Axon Framework mentioned
    • Mature
  • Eventuate mentioned
    • Great for distributed environments/geolocated data
  • Akka.persistence
    • Great, but needs other Akka libraries
  • Reactive Streams will be a big help when dealing with event sourcing
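
To make the “inherently functional” point concrete, here is a minimal sketch of state as a left fold over an event stream. The account and event types are invented for illustration; frameworks like Axon, Eventuate and Akka Persistence supply the plumbing around this idea (stores, snapshots, projections).

```java
import java.util.List;

public class EventSourcingSketch {

    interface Event {}
    record Deposited(long amount) implements Event {}
    record Withdrawn(long amount) implements Event {}

    // State is immutable; applying an event is a pure (state, event) -> state function.
    record Account(long balance) {
        Account apply(Event e) {
            if (e instanceof Deposited d) return new Account(balance + d.amount());
            if (e instanceof Withdrawn w) return new Account(balance - w.amount());
            return this;
        }
    }

    public static void main(String[] args) {
        List<Event> history = List.of(new Deposited(100), new Withdrawn(30), new Deposited(5));

        // Rebuild current state by folding over the full history from the event store.
        Account account = new Account(0);
        for (Event e : history) {
            account = account.apply(e);
        }
        System.out.println(account.balance()); // 75

        // The same stream can also feed an RDBMS projection, a NoSQL view and a
        // graph DB, keeping all of those read models in sync.
    }
}
```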

Using FlameGraphs To Illuminate The JVM

  • Base performance targets on requirements
  • Flamegraphs come out of Netflix
  • Visualisation of profiled software
  • First you must collect Java stacks
  • Java VisualVM mentioned
  • Linux Perf mentioned

This Will Cut You: Go’s Sharper Edges

  • It is possible, in some cases, to make Go’s parsers (JSON, XML, etc.) read forever on input that is missing its closing tags, tying the server up (a DoS attack)
  • Go doesn’t impose an upload size limit; to mitigate this, put your Go servers behind a proxy that does, e.g. NGINX or Apache HTTP Server
  • Go doesn’t have CSRF protection built in; this must be added manually
  • DNS rebinding attacks may be possible against Go servers

That about wraps it up for my summary of QCon London 2017.

Takeaways from QCon London 2017 – Day 2

Now for day 2. If you haven’t caught up with Day 1, check it out here; Day 3 can be found here.

The Talks

  1. Deliver Docker Container Continuously in AWS with Philipp Garbe
  2. Continuous Delivery the Hard Way with Kubernetes with Luke Marsden
  3. Low Latency Trading Architecture at LMAX Exchange with Sam Adams
  4. Why We Chose Erlang over Java, Scala, Go, C with Colin Hemmings
  5. Scaling Instagram Architecture with Lisa Guo
  6. Deep Learning @ Google Scale: Smart Reply In Inbox with Anjuli Kannan

Deliver Docker Container Continuously in AWS

  • A big pro for Amazon EC2 Container Service (ECS) over other container orchestrators is that you don’t have to worry about the cluster state as this is managed for you
  • AWS CloudFormation is the suggested way to manage an ECS cluster, although there was also a mention of Hashicorp’s Terraform
  • Suggested to use Amazon’s Docker registry when using ECS
  • AWS CloudFormation or the CLI suggested for deployment (see the sketch after this list)
  • There are 2 load balancers to choose from, the Application Load Balancer (ALB) and the Classic Load Balancer (ELB)
    • The ALB works at the HTTP layer only, but has more features
    • The ELB does HTTPS, but only allows static port mapping, which limits you to one service per port per VM
    • An ALB is required for dynamic port mapping, i.e. running multiple instances of a service on the same VM
  • Suggested to use load balancing rules based on memory and CPU usage
  • ECS does not yet support newer Docker features, such as health checks
  • The Elastic Block Store (EBS) volume is per VM and doesn’t scale that well
  • The Elastic File System (EFS) scales automatically and is suggested
  • You can have granular access controls in ECS by using AWS Identity and Access Management (IAM)
  • Challenges currently exist when using the EC2 metadata service in a container
  • ECS does not support the Docker Compose file
  • ECS does not natively support Docker volumes
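
The talk suggested CloudFormation or the CLI; purely for illustration, the same rolling update can also be triggered programmatically. A sketch with the AWS SDK for Java (v1); the cluster, service and task definition names are placeholders.

```java
import com.amazonaws.services.ecs.AmazonECS;
import com.amazonaws.services.ecs.AmazonECSClientBuilder;
import com.amazonaws.services.ecs.model.UpdateServiceRequest;

// Point the service at a new task definition revision and ECS rolls the
// containers for you, since the cluster state is managed by AWS.
public class EcsDeploy {
    public static void main(String[] args) {
        AmazonECS ecs = AmazonECSClientBuilder.defaultClient();
        ecs.updateService(new UpdateServiceRequest()
                .withCluster("my-cluster")           // placeholder
                .withService("my-service")           // placeholder
                .withTaskDefinition("my-task:42"));  // new revision triggers a rolling deploy
    }
}
```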

Continuous Delivery the Hard Way with Kubernetes (@Weaveworks)

  • Weaveworks use their Git master branch to represent production
  • Weaveworks use GitLab for their delivery pipeline
  • Katacoda was used to give a Kubernetes live demo and it all worked rather well
  • Plug for Weaveworks Flux release manager

Low Latency Trading Architecture at LMAX Exchange

  • LMAX manages to get impressively low latency using “plain Java”
  • Makes great use of the Disruptor pattern: lots of ring buffers (see the sketch after this list)
  • Focus on message passing
  • Minimise the amount of network hops
  • Uses in-house hardware, not the cloud
  • Uses async pub-sub over UDP
    • Low latency
    • Scalable
    • Unreliable
  • Mention of Javassist
  • Stores a lot of things in memory, not the database
  • Describes using an Event Logging approach
  • Java primitives over objects for performance/memory management reasons
  • Makes use of Java Type annotations for type safety
  • Mention of the fastutil library
  • Mention of using the @Contended annotation
  • Uses the commercial JVM Zing for improved garbage collection and performance
  • Mentions manually mapping Java threads to CPU cores using JNI/JNA for increased performance
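
The Disruptor pattern deserves a sketch. Assuming the open-source com.lmax.disruptor library (which LMAX published; this is not their exchange code), a minimal producer/consumer setup looks roughly like this. The event type and values are invented.

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorSketch {

    // A mutable event slot, preallocated once per ring buffer entry so the hot
    // path allocates nothing (the same motivation as the primitives point above).
    static final class TradeEvent {
        long price;
    }

    public static void main(String[] args) {
        // Ring buffer size must be a power of two.
        Disruptor<TradeEvent> disruptor =
                new Disruptor<>(TradeEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        // Consumer: invoked for each published slot, in sequence order.
        disruptor.handleEventsWith(
                (event, sequence, endOfBatch) -> System.out.println("price=" + event.price));
        disruptor.start();

        // Producer: claim a slot, write into it, publish. No locks, no garbage.
        RingBuffer<TradeEvent> ring = disruptor.getRingBuffer();
        for (long p = 100; p < 105; p++) {
            final long price = p;
            ring.publishEvent((event, sequence) -> event.price = price);
        }

        disruptor.shutdown(); // waits until the consumer has drained the buffer
    }
}
```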

Why We Chose Erlang over Java, Scala, Go, C

  • The speaker’s company develops the Outlyer monitoring system
  • Version 1 was a MEAN stack monolith with a Python data collection agent
  • Version 2 was microservices
  • Focus on separating stateless behaviour services from stateful services for scaling reasons
  • Uses RabbitMQ for async communication between services
  • Uses Riak for timeseries data
  • Added a Redis cache layer to improve performance
  • A reference to “Erlang: The Movie”?
  • Erlang uses the Actor Model: “let it crash!”
  • Erlang has pre-canned “behaviours”
  • Erlang has an observer GUI — allows for tracing and interaction with a live application
  • Erlang offers live code reload
  • Erlang has a supposedly weird syntax
  • Elixir is a newer language that runs on the Erlang VM — supposedly has a syntax more like Ruby
  • Mentions using DalmatinerDB
  • See the Outlyer blog for more information

Scaling Instagram Architecture

  • Instagram runs on AWS
  • Python + Django, Cassandra, Postgres, Memcache and RabbitMQ.
    • Postgres for users, media and friendship
    • Cassandra for user feeds and activities
  • Memcache is used to avoid constantly hitting the DB
  • Uses perf_event on Linux
  • Focus on reducing latency
  • Sometimes migrates regularly used code from Python to C for improved performance — Cython mentioned
  • Always-on custom performance analysis in production – a performance hit, but for a lot of insight
  • Disabled the Python garbage collector for performance reasons
  • Uses Python asyncio where there were previously sequential service calls (see the sketch after this list)
  • Uses TAO (Facebook’s graph data store) as the linked DB
  • 40-60 deploys per day to 20,000+ servers with each deploy taking about 10 minutes!
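
Instagram did this with Python’s asyncio; as a language-neutral illustration (in Java, like the other sketches in these posts), overlapping previously sequential service calls looks like the following. The service names and latency are invented.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ConcurrentCalls {

    // Stand-in for a remote service call with some latency.
    static CompletableFuture<String> callService(String name) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            return name + "-response";
        });
    }

    public static void main(String[] args) {
        // Sequentially, the latencies would add up; here both calls are in
        // flight at once, so total latency is roughly that of the slowest call.
        CompletableFuture<String> feed = callService("feed");
        CompletableFuture<String> ads = callService("ads");
        CompletableFuture.allOf(feed, ads).join();
        System.out.println(List.of(feed.join(), ads.join()));
    }
}
```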

Deep Learning @ Google Scale: Smart Reply In Inbox

  • Google Inbox’s Smart Reply learns all responses from data
  • Explains deep learning concepts via logistic regression, gradient descent and feed-forward networks
  • Data fed into the network must be a vector of floats
    • All words in a dictionary are given a numerical index
    • A string of words is converted into a vector of its equivalent numerical indices (see the sketch after this list)
    • Dimensionality reduction is employed to produce an “embedded vector” — essentially a lossy compression algorithm
  • Deep Learning models allow for automatic feature learning
  • See the paper Distributed Representations of Words and Phrases and their Compositionality
  • A Recurrent Neural Network is used by this project and by natural language processing in general
  • See the paper A Neural Conversational Model
  • The project can’t match tone — yet.
  • A whitelist is used to prevent bad language
  • The current approach doesn’t always give diverse answers
  • Google Translate uses a very similar model
  • Mentions colah.github.io for more information
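
The word-to-index step is easy to make concrete. A toy sketch; the dictionary is invented and real vocabularies are vastly larger, with the resulting indices feeding an embedding layer rather than being used as raw floats.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class WordIndexing {
    public static void main(String[] args) {
        // Every word in the dictionary gets a numerical index; unknown words
        // share an out-of-vocabulary index.
        Map<String, Integer> dictionary = Map.of(
                "are", 0, "we", 1, "meeting", 2, "today", 3, "<unk>", 4);

        List<String> sentence = List.of("are", "we", "meeting", "tomorrow");

        float[] vector = new float[sentence.size()];
        for (int i = 0; i < sentence.size(); i++) {
            vector[i] = dictionary.getOrDefault(sentence.get(i), dictionary.get("<unk>"));
        }

        System.out.println(Arrays.toString(vector)); // [0.0, 1.0, 2.0, 4.0]
    }
}
```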

That wraps up day 2; here’s day 3.

Takeaways from QCon London 2017 – Day 1

As a longtime reader of InfoQ, I’ve always had QCon on my radar. This year, its 11th, I managed to attend QCon London. For those that haven’t heard of QCon, it’s a top-notch software development conference, held all around the world. It’s spread over three days with each day broken into tracks. Each track has a particular focus, such as “Architecting for Failure”, “Containers: State of the Art” or my personal favourite, “Security: Lessons Learned From Being Pwned”.

Overall, I found the conference to be of a superbly high quality. The talks were relevant and well delivered, the venue ideal and the food abundant. What follows are my takeaways from day 1.

The Talks

  1. Strategic Code Deletion with Michael Feathers
  2. Using Quality Views To Tackle Tech Debt with Colin Breck of Tesla
  3. Continuously Delivering Security In The Cloud with Casey West
  4. From Microliths To Microsystems with Jonas Bonér
  5. Building a Data Science Capability From Scratch with Victor Hu
  6. Crushing Tech Debt Through Automation at Coinbase with Rob Witoff

Strategic Code Deletion

  • If you don’t already have it, a test coverage metric is a great place to start measuring tech debt
  • Mutation testing is an even better way of measuring true test coverage, but the tooling is pretty lacking (see the miniature example after this list)
  • If code has very little value, consider removing it
  • Mention of Martin Fowler’s “Strangler Application” approach
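
For anyone who hasn’t met mutation testing, here it is in miniature (PIT is the usual JVM tool; this toy example is mine):

```java
public class MutationDemo {

    static boolean isAdult(int age) {
        return age >= 18;   // a typical mutation flips this to: age > 18
    }

    // Run with: java -ea MutationDemo
    public static void main(String[] args) {
        // This test passes for the original AND the mutant, so line coverage
        // says we're fine, yet the mutant survives:
        assert isAdult(30);

        // Adding the boundary case kills the mutant, proving the test suite
        // actually pins the behaviour down:
        assert isAdult(18);
    }
}
```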

Using Quality Views To Tackle Tech Debt

I recommend that you check out Colin’s blog post for the details on Quality Views.

Continuously Delivering Security In The Cloud

  • A moving target is harder to hit, so regularly create and destroy containers — this helps to ensure an attacker does not gain persistence
  • Secrets can be rotated using a secrets manager like Hashicorp’s Vault (see the sketch after this list)
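
A sketch of the Vault side, reading a secret at runtime over Vault’s HTTP API (KV v2 engine assumed; the address, token variable and secret path are placeholders). Because the secret is fetched rather than baked into the image, it can be rotated in Vault without redeploying.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class VaultRead {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8200/v1/secret/data/myapp/db")) // placeholder path
                .header("X-Vault-Token", System.getenv("VAULT_TOKEN"))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON; the secret sits under data.data
    }
}
```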

From Microliths To Microsystems

A summary of the talk can be found on InfoQ.

  • Reactive Design Principles help when developing microservice architectures
  • Separate stateless behaviour from stateful entities in a microservice architecture — this helps with scaling
  • References to Pat Helland’s paper “Data on the Outside versus Data on the Inside”
  • Use Domain-Driven Design, but don’t focus on the things (nouns), instead focus on what happens (events)
  • CQRS and Event Sourcing mentioned as useful techniques

Building a Data Science Capability From Scratch

Although a good talk consisting of first-hand experiences, there wasn’t a key takeaway for me.

Crushing Tech Debt Through Automation at Coinbase

  • We need to move fast to survive; technical debt slows us down
  • Build in automated guardrails to ensure that we can move fast without bringing down production
  • Everyone at Coinbase can deploy a tested master branch to production
  • Focus on building a blameless culture
  • Create archetype projects to speed up creation of new services
  • Do security checks in the build pipeline
  • Hashicorp’s Vault mentioned again here
  • Coinbase uses Docker — no container lasts for more than 30 days

That wraps it up for Day 1. Day 2 can be found here and Day 3 can be found here.