
Is there a way to run with 12g VRAM? Is there a way to optimize it #31

Open
libai-lab opened this issue Dec 9, 2024 · 36 comments

@libai-lab

Is there a way to run this with 12 GB of VRAM? Is there a way to optimize it?

@tsjslgy

tsjslgy commented Dec 10, 2024

Same

@erosDiffusion

erosDiffusion commented Dec 10, 2024

10 GB would be great ...

Actually, I have a virtualized Docker Ubuntu image of this, and it worked on my 10 GB card (sharing roughly 10 GB of additional system RAM).
The quality was great on the first run.

@0lento

0lento commented Dec 10, 2024

I made a fork that runs on my 8 GB GPU:
https://github.com/0lento/TRELLIS/

It loads and unloads the different stage models on demand, so it doesn't always require 16 GB. It can occasionally go OOM, but it's pretty random. On Windows you can enable Sysmem Fallback in the Nvidia drivers, and it won't crash in those cases. This may even work on 6 GB GPUs with these changes if you have that fallback enabled, but I can't test this myself.

Note that most of that model-management code is written by AI, so there can still be issues with it, but it does work for me and a few others.
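The on-demand scheme described above can be sketched roughly like this (a minimal illustration, not the fork's actual code; the class and method names are hypothetical):

```python
import torch

class StagedPipeline:
    """Minimal sketch: keep stage models on CPU, move each to GPU only when used."""

    def __init__(self, models: dict):
        # models: name -> torch.nn.Module; everything starts on the CPU
        self.models = {name: m.cpu() for name, m in models.items()}

    def load_models(self, names, device="cuda"):
        for name in names:
            self.models[name] = self.models[name].to(device)

    def unload_models(self, names):
        for name in names:
            self.models[name] = self.models[name].cpu()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand freed blocks back to the driver
```

A stage would then call load_models(['slat_flow_model']) just before sampling and unload_models right after, trading a little PCIe transfer time per stage for a much lower peak VRAM footprint.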

@tsjslgy

tsjslgy commented Dec 11, 2024

I made a fork that runs on my 8GB GPU: https://github.com/0lento/TRELLIS/

It loads and unloads different stage models on demand so it doesn't require 16GB always. It can occasionally go OOM but it's pretty random. On windows you can enable Sysmem Fallback from Nvidia drivers and it'll not crash on those cases. This may even work on 6GB GPU's with these changes if you have that fallback enabled, but I can't test this myself.

Note that most of that model management code is written by AI, so there can still be issues with it but it does work for me and few others.

I pulled your fork and tried to run it, but it still resulted in an OOM error. I started investigating the issue. Since I was using xformers, I switched to flash-attn to see if it would help, but the OOM error persisted. At that time, I hadn't set NVIDIA to system fallback, so I made that adjustment, but it still didn't resolve the issue.

For reference, my system is Windows 10 (22H2), with a 3060 (12GB VRAM) and 64GB of RAM. Perhaps I made some mistakes in my setup, and I’m continuing to troubleshoot. However, so far, it hasn't worked for me.

Thank you for sharing your fork; I really appreciate it!

@0lento

0lento commented Dec 12, 2024

After reviewing the code, it's pretty far from ideal and has a bunch of logic issues, but it lets me run this at a reasonable speed. I've since implemented a more structured way of keeping models on the CPU and loading them to the GPU only when needed, but it doesn't actually make any further impact: the initial quick hack yields similar gains. In fact, I think a much simpler model flush just before the initial mesh generation would have the same effect here. But I won't push new changes until I find something more sophisticated that actually makes a difference.

The issue seems to be that the first OOM happens during initial mesh generation and the next when you actually simplify and bake the texture; those steps use most of the GPU resources at once. The smaller models used in the earlier stages aren't that big, so this naive model-unloading scheme isn't going to fix everything.

I've found that the detail in your input image seems to matter a lot for VRAM usage: my previous test image with a car almost never went OOM, but my new test image of a cartoon character eating spaghetti always does.

At that time, I hadn't set NVIDIA to system fallback, so I made that adjustment, but it still didn't resolve the issue.

Make sure the setting actually got applied; it really shouldn't OOM in normal use once you have it enabled. It should just make everything slower when you go over the limit. Fortunately, TRELLIS doesn't exceed the VRAM amount by a huge margin on initial mesh generation, so this doesn't make it that much slower.

For reference, my system is Windows 10 (22H2), with a 3060 (12GB VRAM) and 64GB of RAM.

I'm on Win 11 Pro (24H2) with 2070 Super (8GB) and 64GB RAM. I'm using xformers.

There are a ton of desktop apps that each take their own small slice of GPU VRAM when GPU acceleration is enabled. I've used Microsoft's Process Explorer to identify these (on my system even the mouse software has some hidden AI prompt system running in the background). After cleaning up most of the unneeded usage on the desktop, I'm at 0.8 GB while writing this. It used to be close to 2 GB when I first started looking into this; every small change counts when you are this close to the limit.

Besides the VRAM issue, you can also make generation a lot faster by omitting the extra video gens (if you use the demo GUI, just keep the Gaussian video if you want to preview results), by setting mesh baking to the fast option, and by using a 256 texture if you just want something to review quickly. These changes need only minor edits to app.py and postprocessing_utils.py.
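As a rough illustration only, those edits boil down to a handful of values; the names below are hypothetical, not the repo's actual variables:

```python
# Hypothetical "quick preview" settings one might wire into app.py and
# postprocessing_utils.py; only the values reflect the suggestions above.
FAST_PREVIEW = {
    "render_normal_video": False,  # keep only the gaussian preview video
    "bake_mode": "fast",           # instead of the default optimization bake
    "texture_size": 256,           # instead of 1024, for quick review
}

def pick(settings: dict, key: str, default):
    """Tiny helper: read a preview setting with a fallback to the default."""
    return settings.get(key, default)
```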

@erosDiffusion

@0lento

Besides the VRAM issue, you can also make the generation a lot faster by omitting extra video gens (if you use the demo gui, just keep the gaussian video if you want to preview results) and by using mesh baking set to fast option and by using 256 texture if you just want something to review quickly. These changes need only minor edits to app.py and postprocessing_utils.py.

Could you show how to disable the video generation? I tried but failed.

@0lento

0lento commented Dec 12, 2024

Could you show how to disable the video generation ? I tried but failed.

I actually have a version of that webui that directly gives the mesh output with some extra parameters to tweak, but I've been stuck trying to optimize the generation itself, so I haven't had time to clean it up.

If you just want to reduce video gens to gaussian alone, that's easy since you just omit these lines:

TRELLIS/app.py, lines 115-116 at ab1b84a:

video_geo = render_utils.render_video(outputs['mesh'][0], num_frames=120)['normal']
video = [np.concatenate([video[i], video_geo[i]], axis=1) for i in range(len(video))]

This alone makes this step a lot faster. To fully bypass the video output, you need some further Gradio edits.
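For reference, the second omitted line just stitches the color and normal frame lists side by side; a toy numpy demonstration of that concatenation (frame sizes are made up):

```python
import numpy as np

# Two fake 3-frame videos of 4x6 RGB frames; concatenating along axis=1
# (the width axis for HxWxC frames) doubles the width, exactly what the
# quoted app.py line does with the color and normal renders.
video = [np.zeros((4, 6, 3), dtype=np.uint8) for _ in range(3)]
video_geo = [np.ones((4, 6, 3), dtype=np.uint8) for _ in range(3)]
combined = [np.concatenate([video[i], video_geo[i]], axis=1)
            for i in range(len(video))]
print(combined[0].shape)  # (4, 12, 3)
```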

@erosDiffusion

erosDiffusion commented Dec 12, 2024

@0lento
the video itself does not take too long (roughly one minute on a 3080 10 GB, but it seems to depend on the input size/order of the run),
but disabling the steps as you suggested made it overall a bit faster (thanks!)

I have run example.py and noticed that for me most of the time is spent in texturing (not the actual mesh generation, which is reasonably fast).

It would be interesting to disable that. Do you know if it's possible?

In general there is performance degradation after the first run (in the main Gradio app):
the first run is quite fast, but subsequent runs get slower and slower. Or maybe it's the model complexity; not sure.

edit:
I've modified this file to skip texture baking. It makes it much, much faster (and I don't need it for my goals). Downloading the GLB works, but the preview is a bit washed out (whitish to the point you can't see the model well):
postprocessing_utils.py.txt

@0lento

0lento commented Dec 12, 2024

edit: i've modified this file to skip texture baking. it makes it much much faster and i don't need it for my goals) downloading the glb works but the preview is a bit washed out (whiteish to the point you can't see the model well) postprocessing_utils.py.txt

Note that if you skip the videos and texturing, you also don't need to create the Gaussian output, only the mesh. This doesn't have a huge impact on speed. It's on my list to implement with the UI overhaul (I already have a texture-disable option there, but it still generates the Gaussian), though I'm currently focused on making things faster first.

@psychobee

edit: i've modified this file to skip texture baking. it makes it much much faster and i don't need it for my goals) downloading the glb works but the preview is a bit washed out (whiteish to the point you can't see the model well) postprocessing_utils.py.txt

Note that if you skip the videos and texturing, you also don't need to create gaussian, only mesh. this doesn't have huge impact on speed. this is on my list to implement with the ui overhaul (I already have texture disable there but it still generates gaussian) but I'm currently looking into making things faster first.
Would disabling this work?
video = render_utils.render_video(outputs['gaussian'][0], num_frames=120)['color']
which is above these:
video_geo = render_utils.render_video(outputs['mesh'][0], num_frames=120)['normal']
video = [np.concatenate([video[i], video_geo[i]], axis=1) for i in range(len(video))]

@iliagrigorevdev

I made some small changes to make it work with 12 GB VRAM:
iliagrigorevdev@f519063

@psychobee

edit: i've modified this file to skip texture baking. it makes it much much faster and i don't need it for my goals) downloading the glb works but the preview is a bit washed out (whiteish to the point you can't see the model well) postprocessing_utils.py.txt

Note that if you skip the videos and texturing, you also don't need to create gaussian, only mesh. this doesn't have huge impact on speed. this is on my list to implement with the ui overhaul (I already have texture disable there but it still generates gaussian) but I'm currently looking into making things faster first.

app_with_noglb.txt
This works, I think.

@0lento

0lento commented Dec 14, 2024

I force-pushed a cleaned-up version to this branch of my repo: https://github.com/0lento/TRELLIS/tree/low-vram
I've measured it using 6-8 GB of CUDA memory, but it depends a lot on the image and the complexity of the generated latent structure. For example, too few sampling steps (something like 2-3) can consume far more memory.

@psychobee

I force-pushed cleaned up version on my repo on this branch now: https://github.com/0lento/TRELLIS/tree/low-vram I've measured it taking 6-8GB cuda memory on use but it depends a lot on image and generated latent structure complexity. For example too low sample step amounts can consume way more memory (talking about something like 2-3 steps).

It worked on my 4 GB card, but took 23 minutes. I also included @erosDiffusion's changes (not baking textures and not rendering the Gaussian and normal videos, as in the text file I uploaded), with 12 steps for both stages.
[Image]
[Image]
Both worked and were detailed.

@0lento

0lento commented Dec 14, 2024

it worked on my 4 gb but took 23 minutes

4 GB, huh? Well, you could make it slightly faster on my fork by changing this line:
https://github.com/0lento/TRELLIS/blob/a4cafa627564f770f86a179e3263acc553a243a8/trellis/pipelines/trellis_image_to_3d.py#L304

into:

        self.unload_models(['sparse_structure_flow_model','sparse_structure_decoder', 'slat_flow_model', 'slat_decoder_mesh', 'slat_decoder_gs', 'slat_decoder_rf'])

These additional models were left out of the initial unload because it's just extra shuffling for GPUs that have 8+ GB (and these latent-structure-related models don't even fill the VRAM on an 8 GB GPU).

This won't make it fast with that amount of VRAM, but I guess every little bit helps.

@PladsElsker

In my case, I have a slow GPU with 24GB VRAM, and a faster GPU with 10GB VRAM. I wanted to document my findings.
As it turns out, if you are using xformers like I am, you can't just switch the GPU mid-run because of this absurd issue that is not yet fixed:
facebookresearch/xformers#1064

And so, to work around that, you have to use torch.cuda.set_device and torch.set_default_device to actually switch the device.

In the run function, the only line that actually requires more than 10GB is this one:
https://github.com/microsoft/TRELLIS/blob/ab1b84a18ecc6610b2656026f78866aa2643631b/trellis/pipelines/trellis_image_to_3d.py#L283C10-L283C47

The rest of the ENTIRE pipeline can be run on my 10GB card.
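The workaround mentioned above can be wrapped in a small helper (a sketch assuming torch >= 2.0 for torch.set_default_device; the function name is made up):

```python
import torch

def set_compute_device(name: str) -> torch.device:
    """Switch both the current CUDA device and torch's default device, so
    tensors created internally (e.g. by xformers) land on the right GPU."""
    device = torch.device(name)
    if device.type == "cuda":
        torch.cuda.set_device(device)   # current device for CUDA kernels
    torch.set_default_device(device)    # default for newly created tensors
    return device
```

For example, call set_compute_device("cuda:1") before moving the pipeline's models to the second GPU.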

@0lento

0lento commented Dec 15, 2024

In the run function, the only line that actually requires more than 10GB is this one: https://github.com/microsoft/TRELLIS/blob/ab1b84a18ecc6610b2656026f78866aa2643631b/trellis/pipelines/trellis_image_to_3d.py#L283C10-L283C47

The rest of the ENTIRE pipeline can be run on my 10GB card.

This is why we've linked solutions here that unload the models from the GPU prior to this line. You can fit this into 10 GB just fine if you unload the models that have already been used by that point (and load them back for new runs).

@PladsElsker

PladsElsker commented Dec 16, 2024

In the run function, the only line that actually requires more than 10GB is this one: https://github.com/microsoft/TRELLIS/blob/ab1b84a18ecc6610b2656026f78866aa2643631b/trellis/pipelines/trellis_image_to_3d.py#L283C10-L283C47
The rest of the ENTIRE pipeline can be run on my 10GB card.

This is why we've linked solutions here to unload the models from gpu prior to this line. You can fit this to 10GB just fine if you unload the models that have already been used at this point (can load them back again on new runs).

Yes.
I guess my main point is that if you have 2 or more GPUs and want to change the pipeline's GPU at runtime while using xformers, you can do so, but you need to set the default device as well because of a bug in xformers lib. Sorry if that was not clear.

Because I have 2 GPUs, I don't have to offload anything to the CPU, which saves some time.

This isn't exactly the same issue. I wanted to document this xformer bug somewhere for anyone else trying to switch the GPU of some pipeline's models, and I did not want to open a new issue for that. I thought this issue was close enough to talk about it.

@MontagueM

For interest, https://github.com/MontagueM/TRELLIS is a fork with some simple data-type changes that makes FlexiCubes use f16; I've found no significant visual difference in quality between f16 and f32.
An average run uses ~4.5 GB of VRAM.
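A toy numpy check of the f16-vs-f32 tradeoff described (illustrative only; the fork changes torch dtypes in FlexiCubes, not numpy arrays):

```python
import numpy as np

# Half precision halves memory; for values in [0, 1) the worst-case
# round-trip error stays under 1e-3, typically invisible in baked geometry.
x = np.random.rand(1000).astype(np.float32)
x16 = x.astype(np.float16)
err = np.abs(x - x16.astype(np.float32)).max()
print(x16.nbytes, x.nbytes)  # 2000 4000
```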

@No1Idle

No1Idle commented Dec 19, 2024

Thanks to all! It finally works on my 12 GB card. However, I sometimes get an OOM in the bake_texture routine. Does anyone have any idea how to manage memory in that routine (other than skipping that part)?

@PladsElsker

I have an OOM in the bake_texture routine. Anyone has any idea on how to manage the memory in that routine?

I found that for a large number of triangles, you need the simplify slider to be very high to avoid OOM, even on 24 GB cards. This happens to me too for large landscapes when simplify = 0, but not for characters. Surely there's a way to split the triangle batch and merge the results later, trading time for memory.
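That splitting idea might look roughly like this hypothetical sketch (the real bake would rasterize per chunk; a per-triangle area computation stands in here):

```python
import numpy as np

def process_in_chunks(tris: np.ndarray, fn, chunk: int = 1024) -> np.ndarray:
    """Apply fn to triangle batches of size `chunk` and merge the results,
    capping peak memory at the cost of extra passes."""
    parts = [fn(tris[i:i + chunk]) for i in range(0, len(tris), chunk)]
    return np.concatenate(parts)

def tri_areas(t: np.ndarray) -> np.ndarray:
    # area = |cross(v1 - v0, v2 - v0)| / 2 for each (3, 3) triangle
    return np.linalg.norm(np.cross(t[:, 1] - t[:, 0], t[:, 2] - t[:, 0]), axis=1) / 2

tris = np.random.rand(5000, 3, 3)
areas = process_in_chunks(tris, tri_areas)
print(areas.shape)  # (5000,)
```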

@cronobjs

cronobjs commented Dec 19, 2024

@MontagueM Your fork is the only one where I was able to get this far, but I'm not sure how to get past this:

```
[SPARSE] Backend: spconv, Attention: flash_attn
Please install kaolin and diso to use the mesh extractor.
Warp 1.5.0 initialized:
CUDA Toolkit 12.6, Driver 12.6
Devices:
"cpu" : "Intel64 Family 6 Model 151 Stepping 5, GenuineIntel"
"cuda:0" : "NVIDIA GeForce RTX 3060" (12 GiB, sm_86, mempool enabled)
"cuda:1" : "NVIDIA GeForce RTX 3050" (8 GiB, sm_86, mempool enabled)
CUDA peer access:
Not supported
Kernel cache:
C:\Users\Crono\AppData\Local\NVIDIA\warp\Cache\1.5.0
D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\gradio_client\utils.py:1097: UserWarning: file() is deprecated and will be removed in a future version. Use handle_file() instead.
warnings.warn(
[SPARSE][CONV] spconv algo: native
[ATTENTION] Using backend: flash_attn
Initializing image conditioning model 'dinov2_vitl14_reg'.
Using cache found in C:\Users\Crono/.cache\torch\hub\facebookresearch_dinov2_main
A matching Triton is not available, some optimizations will not be enabled
Traceback (most recent call last):
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\xformers\__init__.py", line 57, in _is_triton_available
import triton # noqa
ModuleNotFoundError: No module named 'triton'
C:\Users\Crono/.cache\torch\hub\facebookresearch_dinov2_main\dinov2\layers\swiglu_ffn.py:43: UserWarning: xFormers is available (SwiGLU)
warnings.warn("xFormers is available (SwiGLU)")
C:\Users\Crono/.cache\torch\hub\facebookresearch_dinov2_main\dinov2\layers\attention.py:27: UserWarning: xFormers is available (Attention)
warnings.warn("xFormers is available (Attention)")
C:\Users\Crono/.cache\torch\hub\facebookresearch_dinov2_main\dinov2\layers\block.py:33: UserWarning: xFormers is available (Block)
warnings.warn("xFormers is available (Block)")
[VRAM] After initializing image_cond_model 'dinov2_vitl14_reg': Allocated: 0.00 GB, Reserved: 0.00 GB
[VRAM] After loading pretrained pipeline: Allocated: 0.00 GB, Reserved: 0.00 GB
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
[VRAM] After preprocessing image: Allocated: 3.96 GB, Reserved: 3.98 GB
Unloading model 'sparse_structure_flow_model' to CPU.
Unloading model 'sparse_structure_decoder' to CPU.
Unloading model 'slat_flow_model' to CPU.
Unloading model 'slat_decoder_mesh' to CPU.
Unloading model 'slat_decoder_gs' to CPU.
Unloading model 'slat_decoder_rf' to CPU.
[VRAM] After unloading models: ['sparse_structure_flow_model', 'sparse_structure_decoder', 'slat_flow_model', 'slat_decoder_mesh', 'slat_decoder_gs', 'slat_decoder_rf']: Allocated: 1.13 GB, Reserved: 1.19 GB
[VRAM] After unloading decoders at start: Allocated: 1.13 GB, Reserved: 1.19 GB
(2.59s) Getting conditional info of image...
[VRAM] Before encoding image: Allocated: 1.13 GB, Reserved: 1.19 GB
[VRAM] After transforming image: Allocated: 1.14 GB, Reserved: 1.19 GB
Loading model 'image_cond_model' to CUDA.
[VRAM] After loading 'image_cond_model': Allocated: 1.14 GB, Reserved: 1.19 GB
[VRAM] After encoding image: Allocated: 1.16 GB, Reserved: 1.25 GB
Unloading model 'image_cond_model' to CPU.
[VRAM] After unloading models: ['image_cond_model']: Allocated: 0.02 GB, Reserved: 0.04 GB
[VRAM] After getting conditioning information: Allocated: 0.02 GB, Reserved: 0.04 GB
[VRAM] After setting seed and getting conditioning: Allocated: 0.02 GB, Reserved: 0.04 GB
(3.69s) Sampling sparse structure...
[VRAM] Before sampling sparse structure: Allocated: 0.02 GB, Reserved: 0.04 GB
Loading model 'sparse_structure_flow_model' to CUDA.
[VRAM] After loading 'sparse_structure_flow_model': Allocated: 1.07 GB, Reserved: 1.08 GB
Sampling: 100%|████████████████████████████████████████████████████████████████████████| 12/12 [00:07<00:00, 1.53it/s]
[VRAM] After sampling sparse structure: Allocated: 1.07 GB, Reserved: 1.23 GB
Loading model 'sparse_structure_decoder' to CUDA.
[VRAM] After loading 'sparse_structure_decoder': Allocated: 1.21 GB, Reserved: 1.26 GB
Unloading model 'sparse_structure_decoder' to CPU.
[VRAM] After unloading models: ['sparse_structure_decoder']: Allocated: 1.07 GB, Reserved: 1.09 GB
Unloading model 'sparse_structure_flow_model' to CPU.
[VRAM] After unloading models: ['sparse_structure_flow_model']: Allocated: 0.02 GB, Reserved: 0.04 GB
(13.15s) Sampling structured latent...
[VRAM] Before sampling structured latent: Allocated: 0.02 GB, Reserved: 0.04 GB
Loading model 'slat_flow_model' to CUDA.
[VRAM] After loading 'slat_flow_model': Allocated: 1.14 GB, Reserved: 1.15 GB
Sampling: 100%|████████████████████████████████████████████████████████████████████████| 12/12 [00:08<00:00, 1.40it/s]
Unloading model 'slat_flow_model' to CPU.
[VRAM] After unloading models: ['slat_flow_model']: Allocated: 0.02 GB, Reserved: 0.04 GB
[VRAM] After sampling structured latent: Allocated: 0.02 GB, Reserved: 0.04 GB
[VRAM] After normalizing structured latent: Allocated: 0.02 GB, Reserved: 0.04 GB
(22.98s) Decoding structured latent...
SLAT shape: torch.Size([1, 8])
(22.99s) Decoding gaussian...
Loading model 'slat_decoder_gs' to CUDA.
[VRAM] After loading 'slat_decoder_gs': Allocated: 0.19 GB, Reserved: 0.20 GB
Unloading model 'slat_decoder_gs' to CPU.
[VRAM] After unloading models: ['slat_decoder_gs']: Allocated: 0.05 GB, Reserved: 0.18 GB
[VRAM] After decoding gaussian: Allocated: 0.05 GB, Reserved: 0.18 GB
(23.75s) Decoding mesh...
Loading model 'slat_decoder_mesh' to CUDA.
[VRAM] After loading 'slat_decoder_mesh': Allocated: 0.22 GB, Reserved: 0.24 GB
(0.0) SLatMeshDecoder.forward: x.shape=torch.Size([1, 8])
(0.4116373062133789) SLatMeshDecoder.forward: h.shape=torch.Size([1, 768]), upsample: 2
(0.41) SLatMeshDecoder.forward: block 0: SparseSubdivideBlock3d(
(act_layers): Sequential(
(0): SparseGroupNorm32(32, 768, eps=1e-05, affine=True)
(1): SparseSiLU()
)
(sub): SparseSubdivide()
(out_layers): Sequential(
(0): SparseConv3d(
(conv): SubMConv3d(768, 192, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], algo=ConvAlgo.Native)
)
(1): SparseGroupNorm32(32, 192, eps=1e-05, affine=True)
(2): SparseSiLU()
(3): SparseConv3d(
(conv): SubMConv3d(192, 192, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], algo=ConvAlgo.Native)
)
)
(skip_connection): SparseConv3d(
(conv): SubMConv3d(768, 192, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], algo=ConvAlgo.Native)
)
)
(0.73) SLatMeshDecoder.forward: block 0 h.shape=torch.Size([1, 192])
(0.74) SLatMeshDecoder.forward: block 1: SparseSubdivideBlock3d(
(act_layers): Sequential(
(0): SparseGroupNorm32(32, 192, eps=1e-05, affine=True)
(1): SparseSiLU()
)
(sub): SparseSubdivide()
(out_layers): Sequential(
(0): SparseConv3d(
(conv): SubMConv3d(192, 96, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], algo=ConvAlgo.Native)
)
(1): SparseGroupNorm32(32, 96, eps=1e-05, affine=True)
(2): SparseSiLU()
(3): SparseConv3d(
(conv): SubMConv3d(96, 96, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], algo=ConvAlgo.Native)
)
)
(skip_connection): SparseConv3d(
(conv): SubMConv3d(192, 96, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], algo=ConvAlgo.Native)
)
)
(1.35) SLatMeshDecoder.forward: block 1 h.shape=torch.Size([1, 96])
(1.373729944229126) SLatMeshDecoder.forward: h.shape=torch.Size([1, 96]), out_layer: SparseLinear(in_features=96, out_features=101, bias=True)
(1.3959858417510986) SLatMeshDecoder.forward: out h.shape=torch.Size([1, 101])
Traceback (most recent call last):
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\gradio\queueing.py", line 536, in process_events
response = await route_utils.call_process_api(
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\gradio\route_utils.py", line 322, in call_process_api
output = await app.get_blocks().process_api(
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\gradio\blocks.py", line 1935, in process_api
result = await self.call_function(
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\gradio\blocks.py", line 1520, in call_function
prediction = await anyio.to_thread.run_sync( # type: ignore
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2505, in run_sync_in_worker_thread
return await future
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 1005, in run
result = context.run(func, *args)
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\gradio\utils.py", line 826, in wrapper
response = f(*args, **kwargs)
File "D:\Trellis imgto3d\TRELLIS\app.py", line 100, in image_to_3d
outputs = pipeline.run(
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "D:\Trellis imgto3d\TRELLIS\trellis\pipelines\trellis_image_to_3d.py", line 406, in run
decoded = self.decode_slat(slat, start_time, formats)
File "D:\Trellis imgto3d\TRELLIS\trellis\pipelines\trellis_image_to_3d.py", line 316, in decode_slat
ret['mesh'] = mesh_decoder(slat)
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Trellis imgto3d\TRELLIS\trellis\models\structured_latent_vae\decoder_mesh.py", line 203, in forward
reprr = self.to_representation(h)
File "D:\Trellis imgto3d\TRELLIS\trellis\models\structured_latent_vae\decoder_mesh.py", line 167, in to_representation
x.data = x.data.replace_feature(x.feats.to(torch.float16))
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\spconv\pytorch\core.py", line 203, in replace_feature
new_spt = SparseConvTensor(feature, self.indices, self.spatial_shape,
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\torch\fx\_symbolic_trace.py", line 106, in __call__
cls.__init__(instance, *args, **kwargs) # type: ignore[misc]
File "D:\Trellis imgto3d\TRELLIS\venv\lib\site-packages\spconv\pytorch\core.py", line 163, in __init__
assert indices.dtype == torch.int32, "only support int32"
AssertionError: only support int32
```

@MontagueM

MontagueM commented Dec 19, 2024

@cronobjs What PyTorch version do you have? Run python -c "import torch; print(torch.__version__)" in a terminal.

I updated the fork to avoid hitting this anyway; it's a tiny VRAM improvement.

@No1Idle

No1Idle commented Dec 20, 2024

I have an OOM in the bake_texture routine. Anyone has any idea on how to manage the memory in that routine?

I found that for a large amount of triangles, you need the simplify slider to be very high to not OOM, even on 24GB cards. This happens to me too for large landscapes when simplify = 0, but does not happen for characters. Surely, there's a way to split the triangle batch and merge it later to take less memory and more time.

Yes, I also noticed that it depends on the triangle count of the non-simplified model. For now I avoid texture baking for "large" models and use the wax material; it is really fast. I also added a parameter to switch between baking and non-baking. If I really need the texture, I bake it on a low-resolution model and transfer it to the high-resolution one in Blender.

@cjjkoko

cjjkoko commented Dec 20, 2024

@MontagueM Can you adapt it for multiple views?

@0lento

0lento commented Dec 23, 2024

Thanks to all! it finally starts to work on my 12gb card. However sometimes I have an OOM in the bake_texture routine. Anyone has any idea on how to manage the memory in that routine? (Except skipping that part)

Change this line:
https://github.com/microsoft/TRELLIS/blob/main/trellis/utils/postprocessing_utils.py#L450
to:
texture_size=texture_size, mode='fast',
It will consume far less VRAM and is many times faster, at the expense of some quality, I suppose. Also, if you run this from a CLI script rather than the UI, just do del pipeline after pipeline.run and before postprocessing_utils.to_glb; it frees a lot of VRAM for the texturing stage.
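As a toy illustration of why del pipeline helps (hypothetical stand-in class; in the real script the collected objects are the torch models, and a torch.cuda.empty_cache() afterwards returns the freed blocks to the driver):

```python
import gc

class FakePipeline:
    """Stand-in for the pipeline: counts live instances."""
    alive = 0
    def __init__(self):
        FakePipeline.alive += 1
    def __del__(self):
        FakePipeline.alive -= 1

pipeline = FakePipeline()
# ... pipeline.run(...) would happen here ...
del pipeline          # drop the last reference before texturing starts
gc.collect()          # make collection deterministic for the demo
print(FakePipeline.alive)  # 0
```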

@gmf1982

gmf1982 commented Dec 26, 2024

I made a fork that runs on my 8GB GPU: https://github.com/0lento/TRELLIS/

It loads and unloads different stage models on demand so it doesn't require 16GB always. It can occasionally go OOM but it's pretty random. On windows you can enable Sysmem Fallback from Nvidia drivers and it'll not crash on those cases. This may even work on 6GB GPU's with these changes if you have that fallback enabled, but I can't test this myself.

Note that most of that model management code is written by AI, so there can still be issues with it but it does work for me and few others.

I used your fork (my video card is a 2060 Super, 8 GB), but the following error occurs. Is there any solution? Thanks!

[SPARSE] Backend: spconv, Attention: xformers
Warp 1.5.0 initialized:
CUDA Toolkit 12.6, Driver 12.6
Devices:
"cpu" : "AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD"
"cuda:0" : "NVIDIA GeForce RTX 2060 SUPER" (8 GiB, sm_75, mempool enabled)
Kernel cache:
C:\Users\fshifi\AppData\Local\NVIDIA\warp\Cache\1.5.0
D:\AI\TRELLIS\venv\lib\site-packages\gradio_client\utils.py:1097: UserWarning: file() is deprecated and will be removed in a future version. Use handle_file() instead.
warnings.warn(
[SPARSE][CONV] spconv algo: auto
[ATTENTION] Using backend: xformers
Using cache found in C:\Users\fshifi/.cache\torch\hub\facebookresearch_dinov2_main
C:\Users\fshifi/.cache\torch\hub\facebookresearch_dinov2_main\dinov2\layers\swiglu_ffn.py:43: UserWarning: xFormers is available (SwiGLU)
warnings.warn("xFormers is available (SwiGLU)")
C:\Users\fshifi/.cache\torch\hub\facebookresearch_dinov2_main\dinov2\layers\attention.py:27: UserWarning: xFormers is available (Attention)
warnings.warn("xFormers is available (Attention)")
C:\Users\fshifi/.cache\torch\hub\facebookresearch_dinov2_main\dinov2\layers\block.py:33: UserWarning: xFormers is available (Block)
warnings.warn("xFormers is available (Block)")
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Sampling: 100%|████████████████████████████████████████████████████████████████████████| 12/12 [00:06<00:00, 1.78it/s]
Sampling: 100%|████████████████████████████████████████████████████████████████████████| 12/12 [00:28<00:00, 2.38s/it]
Traceback (most recent call last):
File "D:\AI\TRELLIS\venv\lib\site-packages\gradio\queueing.py", line 536, in process_events
response = await route_utils.call_process_api(
File "D:\AI\TRELLIS\venv\lib\site-packages\gradio\route_utils.py", line 322, in call_process_api
output = await app.get_blocks().process_api(
File "D:\AI\TRELLIS\venv\lib\site-packages\gradio\blocks.py", line 1935, in process_api
result = await self.call_function(
File "D:\AI\TRELLIS\venv\lib\site-packages\gradio\blocks.py", line 1520, in call_function
prediction = await anyio.to_thread.run_sync( # type: ignore
File "D:\AI\TRELLIS\venv\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "D:\AI\TRELLIS\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2505, in run_sync_in_worker_thread
return await future
File "D:\AI\TRELLIS\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 1005, in run
result = context.run(func, *args)
File "D:\AI\TRELLIS\venv\lib\site-packages\gradio\utils.py", line 826, in wrapper
response = f(*args, **kwargs)
File "D:\AI\TRELLIS\app.py", line 170, in image_to_3d
video = render_utils.render_video(outputs['gaussian'][0], num_frames=120)['color']
File "D:\AI\TRELLIS\trellis\utils\render_utils.py", line 95, in render_video
extrinsics, intrinsics = yaw_pitch_r_fov_to_extrinsics_intrinsics(yaws, pitch, r, fov)
File "D:\AI\TRELLIS\trellis\utils\render_utils.py", line 33, in yaw_pitch_r_fov_to_extrinsics_intrinsics
extr = utils3d.torch.extrinsics_look_at(orig, torch.tensor([0, 0, 0]).float().cuda(), torch.tensor([0, 0, 1]).float().cuda())
AttributeError: module 'utils3d' has no attribute 'torch'

@0lento

0lento commented Dec 26, 2024

I used your fork,my videocard is 2060 super 8GB, but the following error occurs. Is there any solution? Thanks! :

AttributeError: module 'utils3d' has no attribute 'torch'

This error isn't specific to my fork's current branch, as I add no further dependencies there at the moment.

I do have a faint memory of having had to solve this in the past too. I have torch 2.5.1+cu124 and utils3d 0.0.2 installed on my end to run this. To match that, you could run this in your virtual env:

pip install utils3d==0.0.2 torch==2.5.1 torchvision --index-url=https://download.pytorch.org/whl/cu124
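If you want to check whether your installed utils3d actually exposes a torch submodule before reinstalling, here's a quick generic check (the helper is my own sketch, not part of TRELLIS or utils3d):

```python
import importlib.util

def has_submodule(pkg: str, sub: str) -> bool:
    """Return True if pkg.sub resolves to an importable module.

    importlib.util.find_spec() locates the submodule without running the
    parent package's optional imports, so it's a cheap diagnostic.
    """
    try:
        return importlib.util.find_spec(f"{pkg}.{sub}") is not None
    except ModuleNotFoundError:
        # Raised when the parent package itself isn't installed.
        return False
```

On a broken install, `has_submodule("utils3d", "torch")` comes back False; the same helper works against the stdlib, e.g. `has_submodule("json", "decoder")` is True.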

@gmf1982

gmf1982 commented Dec 26, 2024

I used your fork; my video card is a 2060 Super 8GB, but the following error occurs. Is there any solution? Thanks!

AttributeError: module 'utils3d' has no attribute 'torch'

This error isn't specific to my fork's current branch, as I don't add any extra dependencies there at the moment.

I have a faint memory of having to solve this in the past too. I have torch 2.5.1+cu124 and utils3d 0.0.2 installed on my end to run this. To match that, you could run this in your virtual env:

pip install utils3d==0.0.2 torch==2.5.1 torchvision --index-url=https://download.pytorch.org/whl/cu124

Thank you for your reply. I solved this problem by:

git clone https://github.com/EasternJournalist/utils3d.git
pip install -e ./utils3d

But there is another problem : ModuleNotFoundError: No module named 'diff_gaussian_rasterization'

@0lento

0lento commented Dec 26, 2024

But there is another problem : ModuleNotFoundError: No module named 'diff_gaussian_rasterization'

Do look at what others did here to make it run on windows: #3

@gmf1982

gmf1982 commented Dec 26, 2024

But there is another problem : ModuleNotFoundError: No module named 'diff_gaussian_rasterization'

Do look at what others did here to make it run on windows: #3

thanks!

@cronobjs

But there is another problem : ModuleNotFoundError: No module named 'diff_gaussian_rasterization'

Do look at what others did here to make it run on windows: #3

thanks!

I'm not sure if you got it figured out, but here is how I managed to get it to work on my 3060: I cloned 0lento's 8GB VRAM fork of TRELLIS with --recurse-submodules, then copied the Windows PowerShell scripts (*.ps1) and requirements-uv.txt from https://github.com/sdbds/TRELLIS-for-windows/, and ran install-uv-qinglong.ps1 in PowerShell. Once finished, enjoy TRELLIS!

@0lento

0lento commented Dec 26, 2024

I pushed an update to https://github.com/0lento/TRELLIS/tree/low-vram/ with Jonathan's Marching Cubes PR #89, since it seems to be notably less VRAM-hungry than FlexiCubes.

I've also defaulted this to use the fast texture baking path, because it's easier on VRAM.

If you don't want these changes, they are contained in individual commits, which you can revert if needed.

Do note that I want to keep this branch's changes minimal and only keep things that lower VRAM usage. I plan to expose all of these in a better, more optional way on another branch in the near future.

This low-vram branch is now up-to-date with TRELLIS main branch, including multi-image and gaussian exports.
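For anyone curious how the load/unload-on-demand idea works in principle, here's a minimal framework-agnostic sketch (the class and names are hypothetical; the actual fork moves real torch modules in and out of GPU memory):

```python
from contextlib import contextmanager

class StageManager:
    """Keep at most one pipeline stage resident at a time.

    Hypothetical sketch of load/unload on demand: each stage is built
    lazily from a loader callable and dropped as soon as a different
    stage is requested, so peak memory is roughly one stage's worth
    instead of all stages at once.
    """

    def __init__(self, loaders):
        self.loaders = loaders   # stage name -> zero-arg callable that builds it
        self.resident = None     # name of the currently loaded stage
        self.model = None

    @contextmanager
    def stage(self, name):
        if self.resident != name:
            # Drop the previous stage first; with torch you'd move it to
            # CPU (or delete it) and free the cached GPU memory here.
            self.model = None
            self.model = self.loaders[name]()
            self.resident = name
        yield self.model

# Example with dummy "models" standing in for the real pipeline stages:
mgr = StageManager({
    "stage_a": lambda: "model-a",
    "stage_b": lambda: "model-b",
})
with mgr.stage("stage_a") as m:
    print(m)   # model-a
with mgr.stage("stage_b") as m:
    print(m)   # model-b
```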

@cjjkoko

cjjkoko commented Dec 26, 2024

I pushed an update to https://github.com/0lento/TRELLIS/tree/low-vram/ with Jonathan's Marching Cubes PR #89, since it seems to be notably less VRAM-hungry than FlexiCubes.

I've also defaulted this to use the fast texture baking path, because it's easier on VRAM.

If you don't want these changes, they are contained in individual commits, which you can revert if needed.

Do note that I want to keep this branch's changes minimal and only keep things that lower VRAM usage. I plan to expose all of these in a better, more optional way on another branch in the near future.

This low-vram branch is now up-to-date with TRELLIS main branch, including multi-image and gaussian exports.

Wow, you're one step closer to being commercially viable

@0lento

0lento commented Dec 26, 2024

Wow, you're one step closer to being commercially viable

All credit for that goes to jclarkk. His gsplat PR also works for the texture baking stage, but it doesn't change the VRAM usage and it's slightly slower to run, so I haven't included it here yet. I don't want to tweak this branch's Gradio UI etc. to expose these options properly; I'm just trying to keep everything as stock as possible here and do bigger changes elsewhere.

@0lento

0lento commented Dec 26, 2024

I also pushed another branch that includes gsplat here: https://github.com/0lento/TRELLIS/tree/low-vram-gsplat. I wanted to put this on a separate branch because it introduces an additional dependency and, in my brief testing, is a bit slower, but it also has a more permissive license, so many may want to use it. If you see more artifacts with this branch, try removing the fast bake option before drawing too many conclusions; fast bake is a compromise.

If you already have TRELLIS set up and switch to this branch, you need to install this within your TRELLIS Python env:
pip install git+https://github.com/nerfstudio-project/gsplat
