# Colorize point cloud: Polars vs PDAL

```python
import polars as pl
from polars import col as c
from pyprojroot import here
from pdal import Pipeline, Writer, Filter, Reader
from pathlib import Path
```
Here I compare Polars and PDAL for adding color information to a point cloud. Polars reads and writes a point cloud converted to Parquet format; PDAL reads and writes the point cloud in LAZ format. The operations are roughly the same, so the two should be comparable.

For the computation part, Polars is about 13x faster than PDAL. However, once writing the point cloud to disk is included, there is overhead (my guess is that writing Parquet can't be done in parallel), and Polars ends up only about 2x faster.

These are not rigorous benchmarks, just a quick exploration.¹

Note: Polars does use multiple cores, but the fact that it does so transparently is a big plus.
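The LAZ files were converted to Parquet beforehand. A minimal sketch of such a conversion, using PDAL's numpy interface and lowercasing the dimension names to match the `c.z` expressions below (the file names and the lowercasing are assumptions, not the exact code used here):

```python
import pdal
import polars as pl

# Read the LAZ file with PDAL and dump the points to Parquet so that
# Polars can lazily scan it later.
pipeline = pdal.Reader.las("data/test1.laz").pipeline()
pipeline.execute()
points = pipeline.arrays[0]  # numpy structured array, one field per dimension

# PDAL names dimensions "X", "Y", "Z", ...; lowercase them to match c.z below.
df = pl.DataFrame({name.lower(): points[name] for name in points.dtype.names})
df.write_parquet("data/test1.parquet")
```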
## Tiny file
```python
cloud = pl.scan_parquet(here("data/test1.parquet"))

# Linear color ramp over normalized elevation: blue at the lowest point,
# red at the highest.
start_r, start_g, start_b = 0, 0, 255
end_r, end_g, end_b = 255, 0, 0

cloud_color = cloud.with_columns(
    z_norm=(c.z - c.z.min()) / (c.z.max() - c.z.min())
).with_columns([
    (start_r + (end_r - start_r) * c.z_norm).cast(pl.Int32).alias("red"),
    (start_g + (end_g - start_g) * c.z_norm).cast(pl.Int32).alias("green"),
    (start_b + (end_b - start_b) * c.z_norm).cast(pl.Int32).alias("blue"),
])

cloud_color.collect().write_parquet(here("data/test1_color.parquet"))
```

2.06 ms ± 29.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
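Everything before `collect()` is lazy, so the timing above covers the full scan-compute-write round trip. To see what Polars actually plans to execute, you can print the optimized query plan (a generic check, not part of the original benchmark):

```python
# Inspect the optimized logical plan before triggering execution.
print(cloud_color.explain())
```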
```python
cloud.select(pl.len()).collect()
```

| len (u32) |
|---|
| 1204 |
## PDAL version
```python
# Color the points with PDAL's built-in ramp filter (colorinterp defaults
# to interpolating along the Z dimension).
pipeline = Pipeline([
    Reader.las(here("data/test1.laz")),
    Filter.colorinterp(),
    Writer.las(here("data/test1_color.laz")),
])
pipeline.execute()
```

6.21 ms ± 423 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
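The same pipeline can also be written in PDAL's JSON format and run with the `pdal pipeline` CLI; a minimal sketch via the Python bindings (the filter is left at its defaults, which interpolate colors along the Z dimension):

```python
import json

# Equivalent pipeline as a JSON spec -- the same list could be saved to a
# file and run with `pdal pipeline <file>.json` from the command line.
spec = json.dumps([
    "data/test1.laz",
    {"type": "filters.colorinterp"},  # default: color ramp along Z
    "data/test1_color.laz",
])
Pipeline(spec).execute()
```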
## Big data
```python
path_las = Path("/run/media/simone/Extreme SSD/data/forest-inventory-demo/harv/lidar/processed-lidar/0_raw_lidar/NEON_D01_HARV_DP1_731000_4713000_classified_point_cloud_colorized.copc.laz")
path_parquet = Path("/run/media/simone/Extreme SSD/data/forest-inventory-demo/harv/lidar/processed-lidar/0_raw_lidar/NEON_D01_HARV_DP1_731000_4713000_classified_point_cloud_colorized.parquet")

cloud = pl.scan_parquet(path_parquet)
cloud = cloud.with_columns(
    z_norm=(c.z - c.z.min()) / (c.z.max() - c.z.min())
).with_columns([
    (start_r + (end_r - start_r) * c.z_norm).cast(pl.Int32).alias("red"),
    (start_g + (end_g - start_g) * c.z_norm).cast(pl.Int32).alias("green"),
    (start_b + (end_b - start_b) * c.z_norm).cast(pl.Int32).alias("blue"),
])

# Note: .stem drops the directory, so this writes into the current working directory.
cloud.collect().write_parquet(path_parquet.stem + "_color.parquet")
```

CPU times: user 11.3 s, sys: 3.18 s, total: 14.5 s
Wall time: 10.4 s

```python
cloud.collect().write_parquet(path_parquet.stem + "_color.parquet")
```

10.3 s ± 86.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
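If Parquet writing really is the bottleneck, `sink_parquet` might help: it streams the result to disk without materializing the whole frame first. A sketch, untimed and only applicable if every step of the plan is supported by Polars' streaming engine (the output file name is illustrative):

```python
# Stream the lazy query straight to disk instead of collect() + write_parquet.
cloud.sink_parquet(path_parquet.stem + "_color_sink.parquet")
```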
```python
cloud.select(pl.len()).collect()
```

| len (u32) |
|---|
| 20172736 |
```python
pipeline = Pipeline([
    Reader.las(path_las),
    Filter.colorinterp(),
    Writer.las(path_las.stem + "_color.laz"),
])
pipeline.execute()
```

25.8 s ± 890 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## Big file, no write to disk

`scan_parquet` is lazy, so `collect()` here still includes reading the file from disk; only the write step is skipped.
```python
cloud = pl.scan_parquet(path_parquet)
cloud = cloud.with_columns(
    z_norm=(c.z - c.z.min()) / (c.z.max() - c.z.min())
).with_columns([
    (start_r + (end_r - start_r) * c.z_norm).cast(pl.Int32).alias("red"),
    (start_g + (end_g - start_g) * c.z_norm).cast(pl.Int32).alias("green"),
    (start_b + (end_b - start_b) * c.z_norm).cast(pl.Int32).alias("blue"),
])
cloud.collect();
```

CPU times: user 1.82 s, sys: 2.7 s, total: 4.52 s
Wall time: 708 ms

```python
cloud.collect()
```

831 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```python
cloud.select(pl.len()).collect()
```

| len (u32) |
|---|
| 20172736 |
```python
pipeline = Pipeline([
    Reader.las(path_las),
    Filter.colorinterp(),
])
pipeline.execute()
```

10.9 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The compute-only speedup is roughly the 13x quoted at the top:

```python
10.9 / 0.8
```

13.625
## Footnotes

1. My machine has an Intel® Core™ i7-10750H CPU @ 2.60GHz × 12 and 64 GB of RAM.↩︎