SCARF: Capturing and Animation of Body and Clothing
from Monocular Video

1Max Planck Institute for Intelligent Systems, 2ETH Zurich

SIGGRAPH Asia 2022 (Conference Proceedings)

Given a monocular video (a), our method (SCARF) builds an avatar where the body and clothing are disentangled (b). The body is represented by a traditional mesh, while the clothing is captured by an implicit neural representation. SCARF enables animation with detailed control over the face and hands (c) as well as clothing transfer between subjects (d).

Abstract

We propose SCARF (Segmented Clothed Avatar Radiance Field), a hybrid model combining a mesh-based body with a neural radiance field. Integrating the mesh into the volumetric rendering in combination with a differentiable rasterizer enables us to optimize SCARF directly from monocular videos, without any 3D supervision. The hybrid modeling enables SCARF to (i) animate the clothed body avatar by changing body poses (including hand articulation and facial expressions), (ii) synthesize novel views of the avatar, and (iii) transfer clothing between avatars in virtual try-on applications. We demonstrate that SCARF reconstructs clothing with higher visual quality than existing methods, that the clothing deforms with changing body pose and body shape, and that clothing can be successfully transferred between avatars of different subjects.
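To illustrate how a body mesh can be integrated into volumetric rendering, the sketch below composites a single camera ray: clothing samples in front of the body are accumulated as in standard NeRF rendering, and any remaining transmittance is assigned to the rasterized body color. This is a minimal illustration under our own assumptions; the function and variable names are ours, not SCARF's actual implementation.

    import torch

    def composite_ray(sigmas, radiances, deltas, mesh_rgb, mesh_hit):
        """Composite clothing NeRF samples along one ray, then the body surface.

        sigmas:    (N,) clothing densities at samples in front of the body mesh
        radiances: (N, 3) clothing colors at those samples
        deltas:    (N,) distances between consecutive samples
        mesh_rgb:  (3,) rasterized body color where the ray hits the mesh
        mesh_hit:  bool, True if the ray intersects the body mesh
        """
        alphas = 1.0 - torch.exp(-sigmas * deltas)                 # per-sample opacity
        ones = torch.ones(1, device=sigmas.device)
        trans = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10]), dim=0)  # (N+1,) transmittance
        weights = alphas * trans[:-1]                              # standard volume-rendering weights
        rgb = (weights[:, None] * radiances).sum(dim=0)            # clothing contribution
        if mesh_hit:
            rgb = rgb + trans[-1] * mesh_rgb                       # opaque body closes the ray
        return rgb

Because every step above is differentiable, image losses on the composited color can propagate gradients to both the clothing field and the rasterized body, which is what allows optimization directly from video without 3D supervision.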

Method Overview

SCARF takes a monocular RGB video and per-frame clothing segmentation masks as input, and outputs a human avatar with separate body and clothing layers. Blue letters indicate optimizable modules or parameters.
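As a rough illustration of these inputs and outputs, the sketch below lays out plausible tensor shapes. The dataclass names and fields are hypothetical and not taken from the SCARF codebase.

    from dataclasses import dataclass
    import torch

    @dataclass
    class TrainingInputs:
        frames: torch.Tensor          # (T, H, W, 3) monocular RGB video frames
        clothing_masks: torch.Tensor  # (T, H, W) per-frame clothing segmentation masks

    @dataclass
    class Avatar:
        body_vertices: torch.Tensor     # (V, 3) explicit body mesh vertices
        body_faces: torch.Tensor        # (F, 3) mesh connectivity
        clothing_nerf: torch.nn.Module  # implicit clothing layer (neural radiance field)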

Results

Once the avatar is built, we can animate it with detailed control over the face and hands. We can also alter the body shape, and the clothing adapts accordingly. In addition, we can transfer clothing captured from other trained videos onto the given subject.


Reconstruction

We run Marching Cubes to extract a mesh from the trained clothing NeRF and show it together with the explicitly learned body mesh. Green indicates the part extracted from the NeRF-based clothing.
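A minimal sketch of this extraction step, assuming the trained clothing NeRF exposes a density query; the `clothing_density` callable, grid bounds, and density threshold below are illustrative placeholders that would need tuning for a real model.

    import numpy as np
    from skimage import measure

    def extract_clothing_mesh(clothing_density, resolution=256, bound=1.0, threshold=10.0):
        # Sample the clothing density field on a regular 3D grid.
        xs = np.linspace(-bound, bound, resolution)
        grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)   # (R, R, R, 3)
        densities = clothing_density(grid.reshape(-1, 3)).reshape((resolution,) * 3)

        # Marching Cubes on the density volume yields a triangle mesh of the clothing.
        verts, faces, _normals, _values = measure.marching_cubes(densities, level=threshold)
        verts = verts / (resolution - 1) * 2.0 * bound - bound             # grid indices -> world units
        return verts, faces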


Acknowledgement

We thank Sergey Prokudin, Weiyang Liu, Yuliang Xiu, Songyou Peng, and Qianli Ma for fruitful discussions, and Peter Kulits, Zhen Liu, Yandong Wen, Hongwei Yi, Xu Chen, Soubhik Sanyal, Omri Ben-Dov, and Shashank Tripathi for proofreading. We also thank Betty Mohler, Sarah Danes, Natalia Marciniak, Tsvetelina Alexiadis, Claudia Gallatz, and Andres Camilo Mendoza Patino for their support with data. This work was partially supported by the Max Planck ETH Center for Learning Systems.
Disclosure. MJB has received research gift funds from Adobe, Intel, Nvidia, Meta/Facebook, and Amazon. MJB has financial interests in Amazon, Datagen Technologies, and Meshcapade GmbH. While MJB is a part-time employee of Meshcapade, his research was performed solely at, and funded solely by, the Max Planck Society. While TB is a part-time employee of Amazon, this research was performed solely at, and funded solely by, MPI.

BibTeX

@inproceedings{Feng2022scarf,
    author = {Feng, Yao and Yang, Jinlong and Pollefeys, Marc and Black, Michael J. and Bolkart, Timo},
    title = {Capturing and Animation of Body and Clothing from Monocular Video},
    year = {2022},
    booktitle = {SIGGRAPH Asia 2022 Conference Papers},
    articleno = {45},
    numpages = {9},
    location = {Daegu, Republic of Korea},
    series = {SA '22}
}