JavaScript in Biotech

One of the fun things about working at CrowdProcess is meeting the amazing developers who work on bleeding-edge computational challenges. One of them, Bruno Vieira, has been working on bringing bioinformatics into the world of JavaScript.


Besides helping us with genomics in the browser, he has also been working on bionode.io, a Node.js library for client- and server-side bioinformatics.


His reasoning is that having bioinformatics methods that run in the browser could immediately benefit biological visualisation projects like BioJS or web genome browsers. It would also allow those methods to run on browser-based distributed computing grids like CrowdProcess. And thanks to Node.js, the same code could run on the server to perform tasks currently handled by other bio libraries, such as Biopython or BioRuby. In this sense, one can think of bionode as an underscore.js for bioinformatics.


We fully encourage developers interested in collaborating with Bruno on bionode (lovers of JavaScript + Biology) to get in touch with him directly (project submissions close this Friday), and remind everyone that CrowdProcess is free for awesome scientists and researchers like Bruno.

Free for scientists and researchers

Over a year ago we had a vision. A world where scientists, researchers, and those working for the benefit of humanity could have access to supercomputing resources, for free. 

Since then, a lot has happened:

  • We connected millions of different devices, ran huge computations, and helped developers work on incredible projects, from catastrophe simulation to genomics.

  • We launched an enterprise version, and started working with companies on leveraging their own resources.

  • We were joined by some of the most awesome investors, who have become core members of our young team (more on this very soon).

But at a personal level, few things can compare to being able to say the following:
We welcome scientists and researchers into the CrowdProcess platform. For free.

Genetic Algorithms on CrowdProcess

“Computer programs that evolve in ways that resemble natural selection can solve complex problems even their creators do not fully understand.” - John Henry Holland

A few months ago, one of our users took on the fascinating task of implementing a genetic algorithm for optimization of job scheduling on CrowdProcess.

Flowshop Scheduling Problem

In a rather crude nutshell, genetic algorithms act the same way as nature: they take a group of candidate solutions and allow them to mutate, reproduce, and cross over. They then keep the best results (as defined by a fitness function) from one generation to the next. For more on genetic algorithms, we recommend this book by David E. Goldberg.
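To make that concrete, here is a minimal generation step in JavaScript. It is purely illustrative (the candidate representation and the fitness, crossover and mutate helpers are assumptions), not the implementation described in this post:

// One GA generation: keep the fittest half, breed the rest.
// `fitness` returns a cost, so lower is better.
function nextGeneration(population, fitness, crossover, mutate) {
   var ranked = population.slice().sort(function (a, b) {
      return fitness(a) - fitness(b);
   });
   var parents = ranked.slice(0, Math.ceil(ranked.length / 2));
   var next = parents.slice();
   while (next.length < population.length) {
      var mother = parents[Math.floor(Math.random() * parents.length)];
      var father = parents[Math.floor(Math.random() * parents.length)];
      next.push(mutate(crossover(mother, father)));
   }
   return next;
}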

Genetic algorithms are, by their nature, well suited to parallelism. Going back to the parallel (no pun intended) with nature, the more candidates and generations you have, the more likely one of them will reach a better optimum. In nature, that optimum is simply the ability to survive.

So how does this overlap with a distributed computing platform made of thousands of web browsers? Extremely well, it turns out. The solution is to make each web browser an independent population, run each simulation in isolation from all others, and return the best result from each one. These can then be compared locally, and the “best of the best” chosen as the optimum.
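Sketched in code, and reusing the nextGeneration helper above, that island model looks roughly like this (randomPopulation, fitness and the generation count are again illustrative assumptions):

// Runs inside each browser: evolve an isolated population, return its champion
function Run() {
   var population = randomPopulation(100); // assumed helper
   for (var g = 0; g < 500; g++) {
      population = nextGeneration(population, fitness, crossover, mutate);
   }
   return population.reduce(function (best, candidate) {
      return fitness(candidate) < fitness(best) ? candidate : best;
   });
}

// Locally: compare the returned champions and keep the best of the best
var optimum = results.reduce(function (best, candidate) {
   return fitness(candidate) < fitness(best) ? candidate : best;
});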

In the case of the current problem, the objective was to find the order of jobs that gave the fastest completion time of the whole production cycle. The problem is essential for production management, as proper planning of jobs can lead to significant savings in production costs.
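For a flowshop problem, the fitness of a candidate job order is its makespan: the time at which the last job leaves the last machine. A compact, illustrative way to compute it, assuming times[machine][job] holds the processing times:

// Makespan of a job order in a permutation flowshop.
// times[m][j] = processing time of job j on machine m.
function makespan(order, times) {
   var machines = times.length;
   var finish = [];
   for (var m = 0; m < machines; m++) finish[m] = 0;
   order.forEach(function (job) {
      for (var m = 0; m < machines; m++) {
         // a job starts when both the previous machine and this machine are free
         var ready = (m === 0) ? finish[0] : Math.max(finish[m - 1], finish[m]);
         finish[m] = ready + times[m][job];
      }
   });
   return finish[machines - 1]; // the GA minimizes this value
}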

Applications such as telecommunications routing, financial forecasting and fleet logistics already use genetic algorithms, and we hope that the CrowdProcess platform can bring them into more common use.

The source code is on this link, and we encourage developers to try it out and run genetic algorithms on thousands of web browsers in parallel.

Our heartfelt Thank You to Jerzy Duda from AGH University of Science and Technology in Krakow, Poland, for providing us with the initial GA code and the test problems.

We are very interested in supporting developers who would like to run genetic algorithms on CrowdProcess for their own use cases, and potentially add functionality such as editable fitness functions and control over parameters like population size, crossover type, and mutation type.

If you are working with genetic algorithms, feel free to get in touch.

A few vaguely interesting numbers

CrowdProcess is a very particular distributed computing platform. It has browsers connecting and disconnecting all the time, and their number varies considerably during the day.

So how are job times affected by these variations? We decided to run a very basic experiment, which gives a (non-scientific) feel for this. If you want to know how we did it, read on. If you are more interested in the cool graphs, scroll down.

Here is what we did: 
We took a simple Run function, which uses a Monte Carlo simulation to estimate pi with 1,000,000,000 points and returns only the time it took to calculate. In Node.js, each run takes about 12 and a half seconds.
   

function Run() {
   var inQuarterCircle = 0,
       n = 1000000000,
       i = n;
   var timer = Date.now();
   // count random points that fall inside the unit quarter circle
   while (i--) {
      if (Math.pow(Math.random(), 2) + Math.pow(Math.random(), 2) < 1) {
         inQuarterCircle++;
      }
   }
   var pi = 4 * inQuarterCircle / n; // the estimate itself is discarded
   return (Date.now() - timer) / 1000; // seconds taken
}

Next, we made four JSON files and called them (to be very original) small, medium, large and huge. Each one got a different number of empty objects, corresponding to the number of tasks to be run (our function does not take an input in this case). A sketch of how such files can be generated follows the list.

small.json:   2,080 tasks
medium.json: 10,000 tasks
large.json:  20,000 tasks
huge.json:   60,000 tasks
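
Files like these take only a few lines of Node.js to generate; this is a sketch assuming each task is simply an empty object in a JSON array:

var fs = require('fs');

// Write a JSON array of `count` empty objects, one per task
function makeJobFile(name, count) {
   var tasks = [];
   for (var i = 0; i < count; i++) tasks.push({});
   fs.writeFileSync(name, JSON.stringify(tasks));
}

makeJobFile('small.json', 2080);
makeJobFile('medium.json', 10000);
makeJobFile('large.json', 20000);
makeJobFile('huge.json', 60000);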

Excellent. Now it remained only to run them on the platform. During the different runs (across multiple days), the number of browsers varied considerably, as follows:

[Figure: number of browsers connected during the different runs]

So it was important to control for number of browsers, and each experiment was run multiple times. 

Each job we sent was returned with the times that each task took to execute. The interesting thing was to determine how different browsers took different times to run the same task. 

The average time for the local run, in Node.js, on a Toshiba L50 (Intel® Core™ i5-3230M Dual Core) was 12.771 seconds.

On the browsers, the distribution looked more like this:

[Figure: distribution of task run times across browsers]

This is an example from a small.json run. Interestingly, a full 27.85% of browsers outperformed the local run (72.12% took longer). If you are wondering about the long tail, 3.3% crossed the one-minute mark, and none passed the two-minute mark (because of a platform timeout at 2 minutes).

Interesting… now on to the experiments themselves! If a single task takes 12.771 seconds and small.json has 2,080 objects, then the expected sequential run would take a bit over 7 hours (2,080 × 12.771 s ≈ 7.4 hours). To the platform!

[Figure: small.json job times vs. number of browsers]

Not bad: the worst result was a 172x speedup, and the best was 288x, with time going down pleasantly linearly with the number of browsers.

Beyond, to the medium jobs! 10k tasks, expected time 35.7 hours. 

[Figure: medium.json job times vs. number of browsers]

Again, beautifully linear, as expected, with speedups ranging between 240x and 517x. Not much of a challenge for a distributed computing platform, though…

So up again, to 20k tasks (expected time: 70.9 hours).

[Figure: large.json job times vs. number of browsers]

All linear, except for a massive outlier. The most likely explanation is that the platform was being shared by multiple developers running different tasks at the same time. Speedups sat at a respectable minimum of 238x and a maximum of 643x.

Finally, the run corresponding to the “huge” file, with 60k tasks (“huge” is a major overstatement, as the platform has run jobs orders of magnitude larger, but it sounded like the natural thing after “large”).

Expected time for the “huge” file was 212 hours (a bit over a week sequentially).
[Figure: huge.json job times vs. number of browsers]

Max speedup at 755x, and a minimum at 268x. One question springs to mind: what happens if we plot speedup against the number of browsers?

Well, this happens: 

[Figure: speedup vs. number of browsers, for all job sizes]

Interestingly, the job sizes have a clear impact on speedup, and not only on total computation time, even though the number of tasks often exceeds the number of nodes by more than an order of magnitude.

Which raises the question… What happens to speedup per browser as the number of browsers grows?

[Figure: speedup per browser vs. number of browsers]

Speedup per browser went up to almost 0.5 with a large number of tasks on a small number of nodes, but clearly decreases as the number of nodes increases. It seems consistently higher on larger jobs.

So what’s next? Well, today we ran a job with 1 million tasks in 13.11 minutes, with a speedup of over 3000x. We will be publishing more on that next week, so follow the blog, or try out the platform yourself (which is probably an even better idea).

Scheduler improvements

CrowdProcess began February with a very considerable improvement in the scheduler, and consequent improvements in platform performance. With this, we have been able to reduce the processing time per job by over 29%.

[Figure: accumulated average job time, old vs. new scheduler]
Now that we’ve told you the good news, let us explain how we came to this result, and how we compare results on a browser-powered distributed computer:

As you might know, the number of browsers connected to the CrowdProcess platform is volatile by design. We therefore ran our experiments when the number of browsers was fairly stable, ranging between 1435 and 1588.

We used a sample of more than 300 jobs for each experiment: one using the old scheduler, the other using the new one. Next, we calculated the accumulated average time per job as the number of jobs increased.

We used a dummy job in which each task would wait 2 seconds before returning 1; each job had 1,500 tasks (more about jobs and tasks on CrowdProcess). We cleaned the data of outliers, defined as results more than 2 standard deviations from the mean, and then plotted the accumulated average job time, which is what you can see on the graph.
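
For the curious, the logic is simple enough to sketch; the busy-wait and the outlier helper below are illustrative, not the exact code we ran:

// Dummy task: one possible way to wait 2 seconds inside a browser worker
function Run() {
   var start = Date.now();
   while (Date.now() - start < 2000) {} // busy-wait for 2 seconds
   return 1;
}

// Outlier filter: drop results more than 2 standard deviations from the mean
function removeOutliers(times) {
   var mean = times.reduce(function (a, b) { return a + b; }, 0) / times.length;
   var sd = Math.sqrt(times.reduce(function (acc, t) {
      return acc + Math.pow(t - mean, 2);
   }, 0) / times.length);
   return times.filter(function (t) { return Math.abs(t - mean) <= 2 * sd; });
}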

In the process, we have been collecting data that will enable us to further improve the CrowdProcess platform. We will be sharing new findings here, so follow our blog or register at our platform so that we can keep you updated.

Helping you debug yourself

We are currently working with two experts in bioinformatics: Prof. Jonas Almeida, a researcher at the University of Alabama at Birmingham (UAB), and Prof. Alexandre Francisco from INESC in Portugal. Together, we developed a browser-powered version of a Microbiome Sequence Alignment tool.

"Microbiome Sequence Alignment?! What’s that? Please tell me, CrowdProcess, I want to know!”, you say.

So, imagine you put the contents of your stomach into a sequencing machine - a machine that translates genetic material into genetic code (ACGTs) - and you get a “book” of your stomach contents. A book written using only four letters (A, C, G, T). Gibberish.

You then get your dictionary of ACGT-ish - a database that matches known organisms with their respective sequences - and read it to find out what microorganisms, bacteria and bugs are populating your stomach.
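
In code, the naive version of that dictionary lookup is just substring matching. Treat the sketch below (and its made-up sequences) purely as an illustration; real aligners tolerate mismatches and use clever indexes:

// Toy "dictionary": organism names mapped to made-up reference sequences
var database = {
   organismA: 'ACGTACGGTTACCA',
   organismB: 'TTACGGACGTAACG'
};

// Naive identification: which reference sequences contain this read?
function identify(read) {
   return Object.keys(database).filter(function (organism) {
      return database[organism].indexOf(read) !== -1;
   });
}

identify('ACGGTT'); // -> ['organismA']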

Now that you know what Microbiome Sequence Alignment is, you might be wondering what a distributed computing platform powered by web browsers has to do with it.

Here’s your answer: in the course of the work we are doing with Prof. Jonas Almeida and Prof. Alexandre Francisco, we’ve made a simple demo just to show what’s possible with CrowdProcess.

CrowdProcess Genomes

We made this because we believe that massively parallel computing obtained through web browsers can pave the way to a future where distributed computing is accessible to everyone and anyone. That, combined with the ease of use of JavaScript, can truly bring (computing) power to the people. Additionally, by using CrowdProcess behind the firewall, we want to help hospitals and clinics bring genome sequencing to everybody.

Check the demo and let us know what you think in the comments. Alternatively feel free to get in touch with us directly if you want to use it.

It was never so easy to compute on 2000+ nodes

We want to recycle the internet’s wasted computing power to help solve humanity’s toughest challenges, and that’s only going to happen if we all do it together.

That means that every developer should be able to use CrowdProcess. How can we accomplish that? Well, a simple API sure is a start, and even better, an extremely easy-to-use Node.js module:
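
As a rough sketch of the idea, assuming an async.map-style interface (the module name, setup and signature here are illustrative, not the actual API):

// Illustrative only -- names and signatures are assumptions
var crowdprocess = require('crowdprocess');
var cp = crowdprocess({ token: 'YOUR-API-TOKEN' }); // hypothetical setup

// This function is shipped to, and runs inside, each browser
function Run(n) {
   return n * n;
}

// async.map-style call: an array in, buffered results out
cp.map([1, 2, 3, 4], Run, function (err, results) {
   if (err) throw err;
   console.log(results); // -> [1, 4, 9, 16]
});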

Pretty simple, right? Especially if you’re familiar with async’s map. It does the same thing, except instead of using your computer, it runs on more than 2000 browsers at the same time.

Besides taking an array and calling back with the buffered results, it’s also a handy Duplex stream:
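
Along the same hypothetical lines, being a Duplex stream means task inputs can be piped in and results piped out; cp.job is an assumed name (reusing the cp and Run from the sketch above), not the module’s documented interface:

var fs = require('fs');
var JSONStream = require('JSONStream');

fs.createReadStream('huge.json')
   .pipe(JSONStream.parse('*'))   // one task per array element
   .pipe(cp.job(Run))             // assumed Duplex: tasks in, results out
   .pipe(JSONStream.stringify())
   .pipe(fs.createWriteStream('results.json'));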

I particularly love line 32.

It was never so simple and easy to run something on 2000+ nodes. And we’ll keep trying to democratize access to computing power. It’s now up to you to use it for the greater good.

Have an idea? Go run it on 2000+ nodes!

A Merry Christmas from CrowdProcess

To our users, our clients, our partners, our investors, our advisors, our startup fellows at Seedcamp and Startup Lisboa, to all our friends:

One of these days, Pedro Afonso, one of our awesome developers, said that we were “making the future happen sooner”, by speeding up distributed computing applications such as healthcare diagnostic tools, so that “our friends could live longer and we could spend more time with them”.

That quite sums up our mission: making the future happen sooner and giving everyone more time with their loved ones.

But our mission only has purpose if we all enjoy the time we have together.


So, with that in mind, enjoy the time you have with your loved ones, wherever you are.

Merry Christmas,


The CrowdProcess Team

• • •

This post was written by Tiago dos Santos Carlos, Communications Strategist at CrowdProcess.


Follow @correcto & @CrowdProcess on Twitter for more stuff like this and the occasional random tweet about everything and nothing in particular.

• • •

You know what uses more processing power from your browser than CrowdProcess?

Playing a video about it.

Let’s Make The Future Happen Sooner


A few months ago we shared with you a vision: a world where a tiny bit of every browser’s processing power could be brought together to achieve a greater good.

By connecting our users’ browsers to our platform, we gathered enough power to develop amazing real-life problem-solving applications like genome sequencing and forest fire behavioral prediction.

We used these applications to stress test the platform, and we now believe it is ready for anyone to try out.

For developers, the platform could become the world’s most powerful, accessible and versatile supercomputer.

As of today, November 15, every developer who wants to use the platform will be given 48 hours of parallel processing time - browser hours - to try it out.

For free.

Let’s make the future happen sooner.