How to process a CSV file five times faster in Node.js with Rust and napi-rs
Intro
In this tutorial, we will learn how to process a CSV file faster in Node.js with Rust and napi-rs. We will use the Rust programming language to speed up the processing of a CSV file, and create a native Node.js extension using the napi-rs library.
CSV file
For this tutorial, I used the following CSV file:
Age and sex by ethnic group (grouped total responses), for census usually resident population counts, 2006, 2013, and 2018 Censuses (RC, TA, SA2, DHB), zipped CSV file, 103 MB
Once you unzip it, you get a couple of files, the biggest one being ~900 MB.
Nodejs processing
I did not use any special libraries to process the file, so that a library could not be blamed for the performance results.
For Node.js, I used the readline module, which is part of the Node.js core.
const readline = require("readline");

// we will sum the last column of the CSV file
let sum = 0;
let isHeader = true;

const lineReader = readline.createInterface({
  input: process.stdin,
});

lineReader
  .on("line", (line) => {
    // we need to skip the first line, which is the header
    if (isHeader) {
      isHeader = false;
      return;
    }

    // our CSV has a comma as the delimiter
    const fields = line.trimEnd().split(",");

    // we take the last column and parse the value to an integer
    const last = parseInt(fields[fields.length - 1]);

    // there are a couple of lines with broken values; we should ignore those
    if (!isNaN(last)) {
      sum += last;
    }
  })
  .on("close", () => {
    console.log("sum", sum);
  });
I've run the following command to record the metrics:
cat ~/Documents/csv_huge_data.csv | pv | node index.js
The result was:
817MiB 0:00:10 [80.1MiB/s]
sum 3345553228
Read this as: the script processed the file in 10 seconds, with a maximum throughput of 80.1 MiB/s.
pv (Pipe Viewer) shows you the throughput and the time it took to process the data. It is ideal for benchmarking and profiling when you deal with streams.
Rust processing
For the Rust processing I used the following code, which mirrors the logic of the Node.js version.
use std::io::{self, BufRead};

fn main() {
    let mut sum = 0;
    let io = io::stdin();
    let mut handler = io.lock();
    let mut is_header = true;

    loop {
        let mut line = String::new();

        // we read lines from stdin until the buffer is empty
        let bytes_read = handler.read_line(&mut line).unwrap();
        if bytes_read == 0 {
            break;
        }

        // same as in Node.js, we need to skip the first line
        if is_header {
            is_header = false;
            continue;
        }

        // we take the last column and parse the value to an integer
        let res = line
            .trim_end()
            .split(',')
            .last()
            .unwrap()
            .parse::<f32>() // some values are floats, but we still truncate everything to an integer
            .unwrap_or(0.0) as i64;

        sum += res;
    }

    println!("sum {}", sum);
}
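The last two calls in that chain deserve a note: parsing to f32 first lets float values such as "45.0" through, and unwrap_or(0.0) silently maps any broken cell to zero instead of aborting the run. Here is a minimal self-contained sketch of just that step (the sample lines are made up for illustration, not taken from the real dataset):

```rust
// Extracts the last column of a CSV line and truncates it to an integer,
// mirroring the parsing chain used above; broken values become 0.
fn last_column_as_i64(line: &str) -> i64 {
    line.trim_end()
        .split(',')
        .last()
        .unwrap_or("")
        .parse::<f32>() // floats like "45.0" still parse; the cast truncates
        .unwrap_or(0.0) as i64
}

fn main() {
    assert_eq!(last_column_as_i64("2018,Auckland,123\n"), 123);
    assert_eq!(last_column_as_i64("2018,Auckland,45.0\n"), 45);
    assert_eq!(last_column_as_i64("2018,Auckland,not-a-number\n"), 0);
    println!("ok");
}
```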
After running the following command:
cat ~/Documents/csv_huge_data.csv | pv | cargo run --release
The crucial part is to run with the --release flag; otherwise, the performance will be much worse.
Compiling nodejs_vs_rust_stream v0.1.0 (/home/alxolr/Work/rust_land/nodejs_vs_rust_stream)
Finished release [optimized] target(s) in 0.17s
Running `target/release/nodejs_vs_rust_stream`
817MiB 0:00:02 [ 366MiB/s]
sum 3345553228
We can see that the Rust version finished in about 2 seconds, five times faster than the Node.js version, with a throughput of 366 MiB/s.
Now a logical question appears: what do we do if we already have a giant Node.js codebase? We can't just move to Rust!
There is a way to use Rust in Node.js, and that is by using napi-rs.
Napi rs
napi-rs is a library that allows you to create Node.js modules in Rust. It is a wrapper around Node-API, the C API that Node.js exposes for native addons.
The code is compiled into a dynamic library that Node.js can load, so you can use Node-API to create Node.js modules in C/C++ or Rust.
In order to generate a new Napi module, you need to install the Napi rs cli tool:
npm install -g @napi-rs/cli
Then you can create a new module using the following command:
napi new async_csv_reader
Detailed instructions can be found here.
Once you generate the project, the Rust code is expected to be in the src/lib.rs file.
// /src/lib.rs
#![deny(clippy::all)]

use std::{
    fs::File,
    io::{self, BufRead},
    path::Path,
};

use napi::{bindgen_prelude::AsyncTask, JsNumber, Task};

#[macro_use]
extern crate napi_derive;

// we want our function to return a promise and run asynchronously,
// so it does not block the event loop in Node.js; for that we need
// to create the AsyncReadCsv struct and implement the Task trait for it
#[napi]
pub fn read_csv_async(path: String) -> AsyncTask<AsyncReadCsv> {
    AsyncTask::new(AsyncReadCsv { path })
}

pub struct AsyncReadCsv {
    path: String,
}

impl Task for AsyncReadCsv {
    type Output = i64;
    type JsValue = JsNumber;

    fn compute(&mut self) -> napi::Result<Self::Output> {
        Ok(read_csv(self.path.clone()))
    }

    fn resolve(&mut self, env: napi::Env, output: Self::Output) -> napi::Result<Self::JsValue> {
        env.create_int64(output)
    }
}

// this is the main function that receives the path to the CSV file
// and starts processing the data line by line
fn read_csv(path: String) -> i64 {
    let lines = read_lines(Path::new(&path)).unwrap();
    let mut sum = 0;

    for line in lines {
        if let Ok(ip) = line {
            let res = ip
                .trim_end()
                .split(',')
                .last()
                .unwrap()
                .parse::<f32>()
                .unwrap_or(0.0) as i64;

            sum += res;
        }
    }

    sum
}

// useful helper function to read the lines from a file
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where
    P: AsRef<Path>,
{
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}
After you run npm run build in your project, you will get index.js and index.d.ts files that you can call from Node.js.
Here is the test for our exported function read_csv_async, which becomes camelCased in the JavaScript code: readCsvAsync.
import test from "ava";

import { readCsvAsync } from "../index.js";

test("sum from native", async (t) => {
  let path = "~/Documents/csv_huge_data.csv";
  let result = await readCsvAsync(path);

  t.assert(result === 3345553228);
});
The result of running the test was:
read_csv@0.0.0 test
> ava
✔ sum from native (1.7s)
─
1 test passed
We can execute native Rust code in Node.js with almost zero overhead, which is an incredible superpower.
Conclusion
For the parts of your Node.js code that are CPU-intensive and process a lot of data, Rust is the better fit: you can create a native extension and call it from Node.js.
This combination of Rust and Node.js is compelling and leverages the best of both worlds.