3D video meeting backgrounds

TL;DR

I built a prototype where your iPhone pose controls the camera inside a generated 3D splat scene, composited with a live segmented person, and published as a LiveKit video track to be used as your camera feed.

World Labs is building generative 3D models. At the time of this project, their main product was Marble, which can generate Gaussian splat scenes from text, images, or video.

Online meeting backgrounds peaked during COVID. They are fine while you stay still, but they never look realistic because the image has no depth. I'm not a 3D artist, but with Marble I don't need to be to generate something useful. So I wanted to see whether those generated worlds could become genuinely usable video-call backgrounds, where the background reacts to your motion instead of staying a flat image.

Does camera motion even feel right?

I expected scene scale to be the main showstopper, but in my tests the scenes felt close to real-world scale (e.g. ~1 m of movement mapped to ~1 m in-scene). So I started with the most uncertain part: controlling the scene camera with the iPhone pose.

In principle it shouldn't be too hard: iOS has ARKit, and all we need is a coordinate transform. But I couldn't find a solid native splat renderer for iOS. I initially planned to use MetalSplatter, but in my tests on an iPhone 16 Pro a mere 30 MB scene took over a minute to initialise.

There are no other 3DGS renderers for iOS. Yes, I was surprised too. Gaussian splatting is a relatively new 3D representation, so presumably they will come eventually.

There is a great web-based engine, spark.js, which loads the same file in under a second.

Using spark.js means embedding a web view in the Swift app, and that does not come for free. Web view rendering happens out of process, so the native side and the page cannot share memory: everything that crosses the process boundary has to be serialized.

That is not a big deal for streaming the iPhone's pose, which is just a handful of floats per frame, but it will be a problem for camera frames.

With the pose bridge in place, half the work was done: the iPhone pose now controlled the spark.js camera:
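On the JS side, the pose-to-camera step is mostly a pass-through, since ARKit and three.js (which spark.js builds on) both use a right-handed, y-up coordinate system. A minimal sketch, assuming the Swift side serializes ARKit's simd_float4x4 transform as a 16-element column-major array (the payload shape and function name are my assumptions, not the project's actual code):

```typescript
// Sketch: turn a serialized ARKit camera transform into something a
// three.js-style camera can consume. ARKit's simd_float4x4 is
// column-major, as is three.js's Matrix4.fromArray, so no reshuffle
// is needed; we also pull out the translation for convenience.
function poseToCamera(transform: number[]): {
  position: [number, number, number]; // metres, world space
  matrix: number[];                   // 16 floats, column-major
} {
  if (transform.length !== 16) throw new Error("expected a 4x4 matrix");
  // Column-major layout: translation sits in elements 12..14.
  return {
    position: [transform[12], transform[13], transform[14]],
    matrix: transform,
  };
}
```

With three.js one would then set `camera.matrixAutoUpdate = false` and `camera.matrix.fromArray(pose.matrix)` so the scene camera follows the phone directly.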

Demo: iPhone motion drives the scene camera.

Camera frames

So now we need to tap into the camera frames. I first tried to initialise the camera on the JS side, to avoid pushing camera frames across the process boundary, but that was not possible: the camera cannot be used while an ARKit session is running.

So I had to bite the bullet. Pushing frames as fast as the camera produces them is a no-go: the JS side would quickly be overwhelmed, since its render loop runs at a different rate, and we would have to deal with backpressure, queuing, and frame buildup.

So I flipped the model:

  • Swift is the producer: takes front camera frames, segments the person, stores recent frames in a small ring buffer, and exposes a resourceURL so that consumers can fetch.
  • JS is the consumer: at each render tick, it pulls the latest available frame via resourceURL.
  • Bridging still has cost (decode + upload + texture updates). Pulling prevents runaway queues, but it's not zero-copy.
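The producer-side ring buffer boils down to "overwrite the oldest, hand out the newest". The real one lives in Swift; here is a minimal TypeScript sketch of the same idea (class and method names are mine):

```typescript
// Sketch of the frame ring buffer. T stands in for whatever the
// producer actually stores (e.g. an encoded, person-segmented image).
class FrameRingBuffer<T> {
  private slots: (T | undefined)[];
  private head = -1; // index of the most recently written slot

  constructor(private capacity: number) {
    this.slots = new Array(capacity);
  }

  // Producer: always succeeds, silently overwriting the oldest frame,
  // so a slow consumer can never cause an unbounded queue.
  push(frame: T): void {
    this.head = (this.head + 1) % this.capacity;
    this.slots[this.head] = frame;
  }

  // Consumer: pulls only the newest frame, at its own pace.
  latest(): T | undefined {
    return this.head >= 0 ? this.slots[this.head] : undefined;
  }
}
```

Because `push` overwrites in place, a consumer that stalls simply misses frames instead of creating a backlog.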

Final pipeline

At a high level, the whole system is:

  1. Swift side uses ARKit to track device pose and pushes it to the JS side.
  2. JS applies the coordinate transform and updates the spark.js camera.
  3. In parallel, Swift segments the person from camera frames and makes it available to JS via a resourceURL.
  4. JS/WebGL fetches frames via resourceURL, uploads them as a texture, and composites the final scene.
  5. JS publishes canvas.captureStream() to LiveKit as a video track.