When All You Have is an Elephpant

Gemma Lynn / @ellotheth

Hi, I'm Gemma

I write software for a living.

  • Cornell University's NASA research station
  • Postal mailing industry
  • Health research

I'm a WonderMinion

  • WonderProxy is a network of at least 208 proxy/VPN servers spread around the world.
  • Our co-founder literally wrote the book on PHP Web APIs.

A Brilliant Idea

What if people could run network diagnostics from our servers?

The Where's It Up? API

Launched in March 2013 with four tests:

  • lookup
  • ping
  • dig
  • trace

https://wheresitup.com

The Where's It Up? API

The Where's It Up? Front-End

The founders were unhappy with Lithium:

  • It was heavy
  • It was confusing
  • It was poorly documented

Is there a better option?

BulletPHP!

The Where's It Up? Task Queue

We're happy with Gearman, and we're still using it today.

The Where's It Up? Workers

Oh the workers. Those little rascals.

Wascally Workers

The Where's It Up? Workers

  • Pulls tests directly from Gearman
  • Processes one test at a time

The Where's It Up? Workers


$gmworker = new GearmanWorker();
$gmworker->addServer(); // no arguments: local job server, default port
$gmworker->addFunction("dig", "gearman_dig");
// three more of these for lookup, ping and trace

$m = new Mongo();
$collection = $m->wheresitup->results;

// block until a job arrives, run its callback, repeat
while ($gmworker->work()) {
  if ($gmworker->returnCode() != GEARMAN_SUCCESS) {
    echo "return_code: " . $gmworker->returnCode() . "\n";
    break;
  }
}

The Where's It Up? Workers


function gearman_dig($job) {
  global $collection;

  list($server, $url, $workID) = unserialize($job->workload());
  // escape both arguments before they reach the shell
  $result = shell_exec("scripts/whereisitup-dig.sh "
    . escapeshellarg($server) . " " . escapeshellarg($url));

  $collection->update(/* add $result to the $workID job */);
}
                        

The Where's It Up? Workers


#!/bin/sh

ssh "$1" dig "$2"
                        
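For comparison, here is a minimal Go sketch of the "run a test script and capture its output" step, using the standard library's os/exec instead of shell_exec. The helper name runTest is hypothetical, and the ssh/dig call is only shown as a comment; echo stands in for it so the example runs anywhere.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runTest runs an external command and returns its standard output.
// Arguments go straight into the child's argv, so no local shell parses
// them. Note that ssh joins its remote arguments into one string for the
// remote shell, so values like the target host still need validating.
func runTest(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).Output()
	return string(out), err
}

func main() {
	// Hypothetical equivalent of `scripts/whereisitup-dig.sh $server $url`:
	//   runTest("ssh", server, "dig", url)
	// echo stands in for ssh so the example runs without a remote host.
	out, err := runTest("echo", "example.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out) // example.com
}
```

Skipping the intermediate shell script also removes one layer of quoting to get wrong.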

The Where's It Up? Workers


[program:wheresitup]
command=/usr/bin/php /var/local/wheresitup/worker.php
numprocs=25
process_name=%(program_name)s_%(process_num)02d
directory=/tmp
stdout_logfile=/var/log/supervisor/wheresitup.log
autostart=true
autorestart=true
                        

The Where's It Up? Workers

Evolution

  • Multiplexed SSH connections
  • More efficient MongoDB storage
  • Test monitors
  • More tests!

Worker Limitations

Restarting the workers was ugly.


... wait for a lull ...
$ ps ax|grep wheresitup
$ kill -9 [pid] [pid] [pid] ...
                        

Worker Limitations

25 workers was not cutting it.


* 9a229d5 (2014-04-03 01:43:08 +0000) Will Roberts
|   and bump to 190 workers
* 152f3a7 (2014-04-03 00:07:00 +0000) Will Roberts
    bump number of workers up to 70
* df6ea55 (2013-07-18 19:20:40 +0000) Will Roberts
    bump max hops to 50
* d8ec934 (2013-04-20 20:24:28 +0000) Will Roberts
    bump to 50 workers
                        

Worker Limitations

Memory became a problem after 50 workers.


+  $done = 0;
 
   while( $gmworker->work() ) {
+    $done++;
     if( $gmworker->returnCode() != GEARMAN_SUCCESS ) {
       echo "return_code: " . $gmworker->returnCode() . "\n";
       break;
+    } else if( $done > 200 ) {
+      echo "quitting after 200 jobs\n";
+      break;
     }
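The diff above is a safety valve: exit after a fixed number of jobs and let the supervisor start a fresh process, which caps how far a slow leak can grow. A Go worker rarely needs this, but the same idea is a few lines; workBounded is a hypothetical name and the limit is passed in rather than hard-coded.

```go
package main

import "fmt"

// workBounded handles at most maxJobs jobs, then returns so the process can
// exit and be restarted by its supervisor. It also returns early if the
// jobs channel is closed.
func workBounded(jobs <-chan int, maxJobs int, handle func(int)) int {
	done := 0
	for done < maxJobs {
		job, ok := <-jobs
		if !ok {
			break
		}
		handle(job)
		done++
	}
	return done
}

func main() {
	jobs := make(chan int, 5)
	for i := 1; i <= 5; i++ {
		jobs <- i
	}
	close(jobs)

	n := workBounded(jobs, 3, func(j int) { fmt.Println("job", j) })
	fmt.Println("quitting after", n, "jobs")
}
```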
                        

Worker Limitations

Supervisord was limited to 1024 file descriptors.

1024 file descriptors = ~200 workers

(This may not be a problem anymore!)

Worker Limitations

  • No cleanup on exit
  • No scaling past ~200 concurrent processes
  • Mystery memory leak

Some of this is solvable!

The Non-Blocking Worker

  • Launched April 2014
  • Used non-blocking streams for concurrent processing

The Non-Blocking Worker


class worker {
    public $process;
    public $readPipe;
    function run($cmd) {
        $descriptorspec = array(
            array("pipe", "r"),                         // STDIN
            array("pipe", "w"),                         // STDOUT
            array("file", "/tmp/error-output.txt", "a") // STDERR
        );
        $this->process = proc_open($cmd, $descriptorspec, $pipes);
        $this->readPipe = $pipes[1];
        stream_set_blocking($this->readPipe, 0);
    }
}
                        

The Non-Blocking Worker


$pipes = array();
foreach($workers as $k => $worker) { // global array of workers
    $pipes[$k] = $worker->readPipe;
}

$write = $except = null; // only watching for readable pipes
stream_select($pipes, $write, $except, 1, 0);
if ($pipes) { // readable pipes, re-indexed (thanks, PHP)
    foreach($pipes as $k => $stdout) {
        foreach($workers as $index => $worker) {
            if ($stdout === $worker->readPipe) {
                // munge and save the result
                // release the worker and the pipe
            }
        }
    }
}
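For contrast, here is roughly what the same fan-out looks like with goroutines: each command gets its own goroutine doing a plain blocking read, and every result carries the index of the command that produced it, so there is no matching of readable pipes back to workers. This is an illustrative sketch; runAll and result are hypothetical names, and echo stands in for the real test scripts.

```go
package main

import (
	"fmt"
	"os/exec"
)

// result records which command produced the output, so the collector never
// has to search for the owner of a pipe.
type result struct {
	index  int
	output string
	err    error
}

// runAll starts every command concurrently, one goroutine each, and
// collects the results from a single channel as they finish.
func runAll(cmds [][]string) []result {
	results := make(chan result)
	for i, argv := range cmds {
		go func(i int, argv []string) {
			out, err := exec.Command(argv[0], argv[1:]...).Output()
			results <- result{index: i, output: string(out), err: err}
		}(i, argv)
	}

	collected := make([]result, len(cmds))
	for range cmds {
		r := <-results
		collected[r.index] = r
	}
	return collected
}

func main() {
	for _, r := range runAll([][]string{{"echo", "ping"}, {"echo", "dig"}}) {
		fmt.Printf("%d: %q %v\n", r.index, r.output, r.err)
	}
}
```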
                        

The Non-Blocking Worker

Concurrency makes everything faster! Yay!

The Non-Blocking Worker

A bunch of the old problems still exist. Boo.

  • Still limited to ~200 concurrent processes
  • Mystery memory leak is still mysterious
  • Ugly restarts are still ugly

We hit some walls really hard.

Perhaps We Need A New Hammer

I was not enthusiastic about introducing another language into our stack.

Meanwhile, I Toured Go

https://tour.golang.org

  • Goroutines!
  • Channels!

Every other "beginning Go" tutorial was "how to build a worker queue"


package main

import "log"

func main() {
    pipe := make(chan int)

    go doThing(pipe)

    pipe <- 7           // blocks until doThing reads it
    log.Println(<-pipe) // blocks until doThing writes; logs 8
}

func doThing(pipe chan int) {
    i := <-pipe // blocks until written
    i++
    pipe <- i
}
                        
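For reference, the worker queue those tutorials build usually has this shape: a fixed number of goroutines ranging over one jobs channel and writing to a results channel. This is a generic minimal sketch; workerPool and its parameters are made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// workerPool starts n goroutines that pull jobs from one channel and push
// results to another. Results arrive in completion order, not input order.
func workerPool(n int, inputs []int, work func(int) int) []int {
	jobs := make(chan int)
	results := make(chan int, len(inputs)) // buffered so workers never block on send

	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- work(j)
			}
		}()
	}

	for _, in := range inputs {
		jobs <- in
	}
	close(jobs) // ends each worker's range loop
	wg.Wait()
	close(results)

	var out []int
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	out := workerPool(3, []int{1, 2, 3, 4}, func(i int) int { return i * i })
	sum := 0
	for _, v := range out {
		sum += v
	}
	fmt.Println(len(out), sum) // 4 30
}
```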

Why Go Instead Of X?

Because X is terrible.

JUST KIDDING

Why Go Instead Of X?

  • Goroutines and channels seemed like a good fit
  • I didn't know Rust, and it wasn't stable anyway
  • Paul and Will hate NodeJS with the burning fire of a thousand supernovas
  • The lack of binary dependencies was attractive for deployment
  • Go worked when I prototyped it

The Go Worker

April 2015: Initial prototype

The Go Worker


func NewManager(maxConcurrent int, timeout int) *Manager {
    manager := &Manager{}

    manager.isFinished = make(chan bool)
    manager.stop = make(chan bool)
    manager.newTasks = make(chan Taskable, maxConcurrent)
    manager.doneTasks = make(chan Taskable, maxConcurrent)

    go manager.start(timeout)
    go manager.finish()

    return manager
}
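NewManager gives newTasks and doneTasks a buffer of maxConcurrent. As a quick illustration of what that capacity means: up to that many sends succeed with nobody receiving, and the next send would block. The helper fillWithoutReceiver is hypothetical, and int stands in for Taskable.

```go
package main

import "fmt"

// fillWithoutReceiver sends into ch until a send would block, using select
// with a default case, and reports how many sends succeeded.
func fillWithoutReceiver(ch chan int, attempts int) int {
	sent := 0
	for i := 0; i < attempts; i++ {
		select {
		case ch <- i:
			sent++
		default:
			return sent // buffer full and nobody is receiving
		}
	}
	return sent
}

func main() {
	maxConcurrent := 3 // stand-in for NewManager's parameter
	newTasks := make(chan int, maxConcurrent)
	fmt.Println(fillWithoutReceiver(newTasks, 10)) // 3
	fmt.Println(len(newTasks), cap(newTasks))      // 3 3

	unbuffered := make(chan int)
	fmt.Println(fillWithoutReceiver(unbuffered, 10)) // 0
}
```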
                        

The Go Worker


func (mgr *Manager) start(taskTimeout int) {

    // pull tests off the job queue
    for task := range mgr.newTasks {
        // non-blocking: set up stdout and stderr, start the process
        task.Start()
        // blocking: read stdout and stderr, wait for finish
        go task.Process(taskTimeout, mgr.doneTasks)
    }

    // when there are no more tests, stop
    mgr.stop <- true
}
                        

Results

The More Things Change...

  • Tests are 2-3 seconds faster across the board
  • The Go worker uses less memory than a single PHP worker did
  • Throughput is significantly higher

...the More They Stay the Same

  • The API front-end is still PHP
  • Gearman still works as the job queue
  • Supervisor still manages the workers!

Gophers Among Our Elephpants

Go is now part of WonderProxy's toolkit.

  • We have other backend services in Go!
  • Linguistic cross-pollination
  • Better understanding of potential Go API consumers

Recognize non-nail-like problems.

Embrace them!

Find the right tool, and they'll make you better.

Questions?

Gemma Lynn / @ellotheth