8 de agosto de 2019

An anecdote: streams, node, cloud and binary search

This was originally posted in [Portuguese]({% post_url 2019-05-09-streamings-node-nuvem-e-busca-binaria %}) in this site. Since this is the first post in English, I’m moving some furniture around. If you see something weird or still not translated, feel free to @ me on twitter or leave a comment.

I’ll tell a story from the day that knowing that algorithm from college (binary search) saved me a lot of time. And beyond that, to provide context, this post became an introduction to streams in node, AWS S3 and lambda. All that using a real-world example, where things are not so cute. If you already know promises, come with me.

Lambda

At my current job, we use the lambda functions from AWS. It’s for a process with a beginning, a middle and an end, and it doesn’t need to be up 24 hours a day. Given an input, it gives back an output (with the occasional side effects) and then it’s over. Amazon charges for each 100ms executed. In our case, we receive a legacy file from our partner with millions of lines with financial transactions to do a series of operations for each one. Once a day. Since we don’t need the all-day availability of a standard server, it’s cheaper to use a lambda function.

Obs: the idea here is not to advocate for AWS specifically, there are equivalent services from other big companies. I chose Amazon because it’s what I know better and use at work.

Imagine when receiving the file, an event is fired running our handler function. So, in the code below, we’re using the function fileEvent() that “understands” the fired event object and exports the URL from the file in a string. Then, it awaits the execution of a promise that handles the file until finally, it returns a success message.

exports.handler = async (event) => {
    const fileURL = fileEvent(event);
    await fileHandler(fileURL);
    return 'Success';
};

Simple Storage Service, or just S3

The idea behind S3 is basically that you have a Hard Drive available on the internet. It can be used to store assets, backups, reports… Well, anything. In our case, as mentioned, a partner uploads a file from a batch to our bucket. This action is the trigger that fires our lambda function. So this file will be in the event object for the lambda function handler argument.

The fileHandler function receives a string argument: the URL from the batch file inside the bucket, something like s3-sa-east-1.amazonaws.com/bucket-name/path/to/file.txt. We use an SDK provided from Amazon to download the file, using the promise getObject:

// access bucket S3 class:
const { S3 } = require('aws-sdk');

const fileHandler = async (fileURL) => {
    const contents = await new S3().getObject(fileURL);
    contents.split('\n').map(async (line) => {
        // do a bunch of network operations for each line in the file
    });
    return 'success';
};

Streams

Imagine now that this file is really big. In the above example, we wait for the promise to be resolved until we iterate over the lines from the file and worst of all, it’s allocating the entire file in memory to do that. To solve that, we need a reading stream: an object that fires an event every time it receives some data. Fortunately, the function getObject from Amazon SDK have another member function called createReadStream that does exactly what we need. If we rewrite our code, it would look something like this:

// this is not exactly a readLine, more will be explained later
const readlineS3 = fileURL => new S3().getObject(fileURL).createReadStream();

// creating a promise that resolves when the stream is done reading data
const fileHandler = (fileURL) => new Promise((resolve) => {
    const rl = readlineS3(fileURL);
    rl.on('data', async (chunk) => {
        // do a bunch of network operations for each line in the file
    });
    rl.on('close', () => {
        console.log('Sucessfully read the file and made the necesary operations');
        resolve('success');
    });
});

But this code doesn’t work, because the stream event from AWS doesn’t exactly provide a line from the file, it just provides a data chunk. Since streams can be mixed together, this can be solved with an npm package called linebyline:

// creates a stream that converts 'chunks' into file lines
const byline = require('linebyline');

const readlineS3 = fileUrl => byline(new S3().getObject(fileUrl)
    .createReadStream());
const fileHandler = (fileUrl) => new Promise((resolve) => {
    const rl = readlineS3(fileUrl);
    rl.on('line', async (line) => {
        // now we have in fact a line from the file
    });
    rl.on('close', () => {
        //...
        resolve('success');
    });
});

Not everything is roses

This being a real-world working example, things can get ugly. You might have noticed that in the line processing bit of code, there was a comment saying “do a bunch of network operations for each line in the file”. The catch is that AWS has a limit for HTTP requests opened at the same time, and each line event is not guaranteed to enter the system in the same order or being sequentially executed in the node js event loop. When we changed to streams, it theoretically would run way faster. Though the file was in fact big enough to cause this limit error. Because of that, we needed to create a maximum size of data to be processed each time (Actually we struggled a little until we got this idea, but here you get only the fun part).

One more time, an npm package came to save us. stream-throttle takes any stream and limits the publication of it’s events according to a rate param, so rewriting our function readlineS3 again:

const { Throttle } = require('stream-throttle');
const readlineS3 = fileRul => byline(new S3().getObject(fileRul)
    .createReadStream()
    .pipe(new Throttle({ rate: 2 * 1024 }))); // 2KB throttle

OK, but what about the binary search?

The 2KB limit above was totally arbitrarily. We can treat it like an inferior limit, for it is a low number and it takes longer to execute. It would be at least 2KB. However, even that this was already working, we needed to minimize the costs, since AWS charges for each 100ms executed. So we needed to discover what was our superior limit, the biggest number that we could safely make all the requests without an error. With always the same amount of data, we ran tests. Multiplying 2KB by 10 (arbitrary again) we got 20KB and tested it. The error happened. FYI with the 2KB throttling the test was taking about 2 minutes to complete.

Now we have exactly the statement of a binary search. We knew an interval that can be treated as an ordered “list”. After this, we could just test the middle of the way to find a new start or end for the “list”. And then, divide and conquer: repeat the process “recursively” with the new values until we find a value that is more cost-effective without testing every possible value.

In code, a really simplified version of it would be:

const binarySearch = (start, end, list, value) => {
    const middle = Math.floor((start + end) / 2);
    if (list[middle] === value) return middle;
    if (list[middle] > value) return binarySearch(middle, end, list, value);
    return binarySearch(start, middle, list, value);
}

And adapting to this crazy idea, since that isn’t a list and who will “run” this search will be my head, would be a little like the code below, within a precision criteria, since there is no stopping condition.

const specialSearch = (start, end) => {
    const middle = ((end - start) / 2) + start;
    if( runWithoutErrors(middle) ) return specialSearch(middle, end);
    // precision criteria...
    return specialSearch(start, middle);
}

The middle of the way between 2 and 20 would be ((20KB - 2KB) / 2) + 2KB = 11KB. Still got errors. This is a new end. ((11KB - 2KB) / 2) + 2KB = 6.5KB and it worked. This was a new start, and so on… 8.75 was getting to an error and 7.25 not. I could go on with more precision, but it was good enough and I put it on the code: { rate: 7.25 * 1024 }. I found an optimized number in 4 attempts. It took about 5 minutes. Imagine how many attempts it would be if I was raising the number manually 2.1KB, 2.2KB…

Most programming blogs around the webs give you super formal tutorials with abstract examples. I’m not saying this is bad, but I miss a more “storytelling” approach, like a happy hour conversation. With real experiences, we can learn a lot. Please tell me if you liked it in the comments ;)