How to process a CSV file five times faster in Node.js with Rust and napi-rs

Intro

This tutorial shows how to process a CSV file in Node.js faster using Rust and napi-rs. We will use the Rust programming language to speed up the processing of a large CSV file and expose it to Node.js as a native extension built with the napi-rs library.

CSV file

For this tutorial, I used the following CSV file:

Age and sex by ethnic group (grouped total responses), for census usually resident population counts, 2006, 2013, and 2018 Censuses (RC, TA, SA2, DHB), CSV zipped file (103 MB).

Once you unzip it, you will get a couple of files, the biggest one being ~900 MB.

Node.js processing

I deliberately did not use any special parsing library, so the comparison measures the runtimes themselves rather than a particular library's performance.

For Node.js, I used the readline module, which is part of the Node.js core.

const readline = require("readline");

// we will sum the last column of the CSV file
let sum = 0;
let isHeader = true;

const lineReader = readline.createInterface({
  input: process.stdin,
});

lineReader
  .on("line", (line) => {
    // we need to skip the first line which is the header
    if (isHeader) {
      isHeader = false;
      return;
    }

    // our CSV uses a comma as the delimiter
    const fields = line.trimEnd().split(",");
    // we get the last column and parse the value to integer
    const last = parseInt(fields[fields.length - 1]);

    // a couple of lines have broken values; we should ignore those
    if (!isNaN(last)) {
      sum += last;
    }
  })
  .on("close", () => {
    console.log("sum", sum);
  });

I ran the following command to record the metrics:

 cat ~/Documents/csv_huge_data.csv | pv | node index.js

The result was:

 817MiB 0:00:10 [80.1MiB/s] 

sum 3345553228

Read it as follows: the script processed the file in 10 seconds, with a maximum throughput of 80.1 MiB/s.

pv (Pipe Viewer) shows the throughput and the time it takes to move data through a pipe, which makes it ideal for benchmarking and profiling when you deal with streams.

Rust processing

I used the following code for the Rust processing, which has the same logic as the Node.js version.

use std::io::{self, BufRead};

fn main() {
    let mut sum = 0;
    let io = io::stdin();
    let mut handler = io.lock();

    let mut is_header = true;
    loop {
        let mut line = String::new();

        // we read the lines from stdin until the buffer is empty
        let bytes_read = handler.read_line(&mut line).unwrap();

        if bytes_read == 0 {
            break;
        }

        // same as in Nodejs we need to skip the first line
        if is_header {
            is_header = false;
            continue;
        }

        // we get the last column and parse the value to integer
        let res = line
            .trim_end()
            .split(",")
            .last()
            .unwrap()
            .parse::<f32>() // some values are floats, but we still truncate everything to an integer
            .unwrap_or(0.0) as i64;

        sum += res;
    }

    println!("sum {}", sum);
}

After running the following command:

cat ~/Documents/csv_huge_data.csv | pv | cargo run --release

The crucial detail is to run with the --release flag; otherwise, performance will be much worse, because debug builds are compiled without optimizations.

   Compiling nodejs_vs_rust_stream v0.1.0 (/home/alxolr/Work/rust_land/nodejs_vs_rust_stream)
    Finished release [optimized] target(s) in 0.17s
     Running `target/release/nodejs_vs_rust_stream`
 817MiB 0:00:02 [ 366MiB/s]

sum 3345553228

The Rust version finished in 2 seconds, five times faster than the Node.js version, with a throughput of 366 MiB/s.

Now a logical question appears: what do we do if we already have a giant Node.js codebase? We can't just move everything to Rust!

There is a way to use Rust inside Node.js, and that is by using napi-rs.

napi-rs

napi-rs is a library that allows you to create Node.js modules in Rust. It is a wrapper around Node-API (formerly N-API), the C API that Node.js exposes for building native addons.

The addon code is compiled into a dynamic library that Node.js can load at runtime, so you can use Node-API to create Node.js modules in C/C++ or Rust.
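
To make the loading step concrete, here is a minimal sketch of requiring a compiled addon by hand; the file name async_csv_reader.node is hypothetical, and the index.js generated by napi-rs normally does this for you, including platform and architecture detection.

// sketch: loading a compiled native addon directly
// (async_csv_reader.node is a hypothetical file name;
// the generated index.js handles this loading for you)
const native = require("./async_csv_reader.node");

// exported Rust functions show up as plain JavaScript functions
console.log(typeof native.readCsvAsync); // "function"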

To generate a new napi-rs module, you need to install the napi-rs CLI tool:

npm install -g @napi-rs/cli

Then you can create a new module using the following command:

napi new async_csv_reader

Detailed instructions can be found in the napi-rs documentation.

Once you generate the project, the Rust code is expected to live in the src/lib.rs file.

// /src/lib.rs

#![deny(clippy::all)]

use std::{
  fs::File,
  io::{self, BufRead},
  path::Path,
};

use napi::{bindgen_prelude::AsyncTask, JsNumber, Task};

#[macro_use]
extern crate napi_derive;


// we want our function to return a promise and run asynchronously, so it
// does not block the Node.js event loop; for that we need to create the
// AsyncReadCsv struct and implement the Task trait for it.

#[napi]
pub fn read_csv_async(path: String) -> AsyncTask<AsyncReadCsv> {
  AsyncTask::new(AsyncReadCsv { path })
}

pub struct AsyncReadCsv {
  path: String,
}

impl Task for AsyncReadCsv {
  type Output = i64;

  type JsValue = JsNumber;

  fn compute(&mut self) -> napi::Result<Self::Output> {
    Ok(read_csv(self.path.clone()))
  }

  fn resolve(&mut self, env: napi::Env, output: Self::Output) -> napi::Result<Self::JsValue> {
    env.create_int64(output)
  }
}


// this is the main function: it receives the path to the CSV file
// and processes the data line by line
fn read_csv(path: String) -> i64 {
  let lines = read_lines(Path::new(&path)).unwrap();

  let mut sum = 0;

  for line in lines {
    if let Ok(row) = line {
      let res = row
        .trim_end()
        .split(",")
        .last()
        .unwrap()
        .parse::<f32>()
        .unwrap_or(0.0) as i64;

      sum += res;
    }
  }
  }

  sum
}

// helper that returns a buffered line iterator over a file
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where
  P: AsRef<Path>,
{
  let file = File::open(filename)?;
  Ok(io::BufReader::new(file).lines())
}

After you run npm run build in your project, you will get an index.js and an index.d.ts file that you can require from Node.js.
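
As a rough idea of what the build produces, the generated index.d.ts should contain a typed declaration along these lines (a sketch; the exact generated output may differ):

// index.d.ts (sketch; napi-rs derives this from the #[napi] attributes)
export function readCsvAsync(path: string): Promise<number>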

Here is the test for our exported function read_csv_async, which becomes camelCased in the JavaScript code: readCsvAsync.

import test from "ava";

import { readCsvAsync } from "../index.js";

test("sum from native", async (t) => {
  let path = "~/Documents/csv_huge_data.csv";
  let result = await readCsvAsync(path);

  t.assert(result === 3345553228);
});

The result of running the test was:

 read_csv@0.0.0 test
> ava

  ✔ sum from native (1.7s)
  ─

  1 test passed
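
Outside of the test runner, calling the native function from application code is just as direct. Here is a minimal usage sketch; the CSV path is a placeholder, so point it at your own file:

// usage sketch: calling the native Rust function from plain Node.js
// (the path below is a placeholder)
const { readCsvAsync } = require("./index.js");

readCsvAsync("/path/to/csv_huge_data.csv")
  .then((sum) => console.log("sum", sum))
  .catch((err) => console.error(err));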

We can execute compiled Rust code from Node.js with negligible overhead, which is an incredible superpower.

Conclusion

For the CPU-intensive parts of your Node.js code that need to process a lot of data, it is better to use Rust: create a native extension and call it from Node.js.

This combination of Rust and Node.js is compelling and leverages the best of both worlds.


I hope this article was helpful. If you liked it, please share it with your friends and leave a comment; I will gladly answer any questions.

Related articles

How to test locally AWS SQS queues in node.js

846. Hand of Straights solved in rust

648. Replace words solved in rust
