Colorize point cloud: Polars vs PDAL

Author

Simone Massaro

Published

March 18, 2025

Here I compare Polars and PDAL for adding color information to a point cloud. Polars reads and writes the point cloud converted to Parquet format; PDAL reads and writes it in LAZ format.

The operations are roughly the same and should be comparable.

For the computation alone, Polars is about 13x faster than PDAL.

However, once writing the point cloud to disk is included, there is extra overhead (my guess is that writing Parquet cannot be fully parallelized), and Polars ends up only about 2x faster.

These are not rigorous benchmarks, just a quick exploration.1

Note: Polars does use multiple cores, so the comparison is not strictly like-for-like, but the fact that it parallelizes transparently is a big plus.

tiny file

import polars as pl
from polars import col as c
from pyprojroot import here
from pdal import Pipeline, Writer, Filter, Reader
from pathlib import Path
cloud = pl.scan_parquet(here("data/test1.parquet"))
start_r, start_g, start_b = 0, 0, 255
end_r, end_g, end_b = 255, 0, 0
cloud_color = cloud.with_columns(
    z_norm = (c.z - c.z.min()) / (c.z.max() - c.z.min())
).with_columns([
    (start_r + (end_r - start_r) * c.z_norm).cast(pl.Int32).alias("red"),
    (start_g + (end_g - start_g) * c.z_norm).cast(pl.Int32).alias("green"),
    (start_b + (end_b - start_b) * c.z_norm).cast(pl.Int32).alias("blue")
])
cloud_color.collect().write_parquet(here("data/test1_color.parquet"))
2.06 ms ± 29.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cloud.select(pl.len()).collect()
shape: (1, 1)
len
u32
1204

PDAL version

pipeline = Pipeline([
    Reader.las(here("data/test1.laz")),
    Filter.colorinterp(),
    Writer.las(here("data/test1_color.laz"))
])
pipeline.execute()
6.21 ms ± 423 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

big data

path_las = Path("/run/media/simone/Extreme SSD/data/forest-inventory-demo/harv/lidar/processed-lidar/0_raw_lidar/NEON_D01_HARV_DP1_731000_4713000_classified_point_cloud_colorized.copc.laz")
path_parquet = Path("/run/media/simone/Extreme SSD/data/forest-inventory-demo/harv/lidar/processed-lidar/0_raw_lidar/NEON_D01_HARV_DP1_731000_4713000_classified_point_cloud_colorized.parquet")
cloud = pl.scan_parquet(path_parquet)
cloud = cloud.with_columns(
    z_norm = (c.z - c.z.min()) / (c.z.max() - c.z.min())
).with_columns([
    (start_r + (end_r - start_r) * c.z_norm).cast(pl.Int32).alias("red"),
    (start_g + (end_g - start_g) * c.z_norm).cast(pl.Int32).alias("green"),
    (start_b + (end_b - start_b) * c.z_norm).cast(pl.Int32).alias("blue")
])
cloud.collect().write_parquet(path_parquet.with_name(path_parquet.stem + "_color.parquet"))
CPU times: user 11.3 s, sys: 3.18 s, total: 14.5 s
Wall time: 10.4 s
cloud.collect().write_parquet(path_parquet.with_name(path_parquet.stem + "_color.parquet"))
10.3 s ± 86.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cloud.select(pl.len()).collect()
shape: (1, 1)
len
u32
20172736
pipeline = Pipeline([
    Reader.las(path_las),
    Filter.colorinterp(),
    Writer.las(path_las.with_name(path_las.stem + "_color.laz"))
])
pipeline.execute()
25.8 s ± 890 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Big file, without writing to disk

cloud = pl.scan_parquet(path_parquet)
cloud = cloud.with_columns(
    z_norm = (c.z - c.z.min()) / (c.z.max() - c.z.min())
).with_columns([
    (start_r + (end_r - start_r) * c.z_norm).cast(pl.Int32).alias("red"),
    (start_g + (end_g - start_g) * c.z_norm).cast(pl.Int32).alias("green"),
    (start_b + (end_b - start_b) * c.z_norm).cast(pl.Int32).alias("blue")
])
cloud.collect();
CPU times: user 1.82 s, sys: 2.7 s, total: 4.52 s
Wall time: 708 ms
cloud.collect()
831 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cloud.select(pl.len()).collect()
shape: (1, 1)
len
u32
20172736
pipeline = Pipeline([
    Reader.las(path_las),
    Filter.colorinterp(),
])
pipeline.execute()
10.9 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.9 / 0.8  # PDAL compute time / Polars compute time
13.625

Footnotes

  1. My machine has an Intel® Core™ i7-10750H CPU @ 2.60GHz × 12 and 64 GB of RAM.↩︎