OpenSceneGraph Forum
Official forum which mirrors the existing OSG mailing lists. Messages posted here are forwarded to the mailing list and vice versa.
 

osgPPU CUDA Example - slower than expected?


 
Thorsten Roth
Guest
Posted: Thu Dec 16, 2010 11:25 am

Hi,

As I explained in another mail to this list, I am currently working
on a graph-based image processing framework using CUDA. The framework
itself is independent of OSG, but I am using OSG for my example application :-)

For my first postprocessing algorithm I need color and depth data.
Since I want the depth linearized between 0 and 1, I use a shader for
that and render it in a separate pass from the color. Both results are
then read back from the GPU to the CPU by attaching osg::Images directly
to the cameras. This works perfectly but is rather slow, as you might
already have suspected, because the data is processed in CUDA kernels
again later, which means quite a lot of back and forth ;-)
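
A minimal sketch of the kind of depth linearization described above, written as plain C++ instead of a shader so that it can be checked on the CPU. It assumes a standard OpenGL perspective projection and a depth buffer value in [0, 1]; the helper name `linearizeDepth` and its parameters are hypothetical and not taken from the original application.

```cpp
#include <cassert>
#include <cmath>

// Convert a window-space depth value d in [0, 1] (as stored in an OpenGL
// depth buffer for a standard perspective projection) into a value that is
// linear in eye-space distance: 0 at the near plane, 1 at the far plane.
// zNear and zFar are the positive distances of the clipping planes.
// Hypothetical helper; the original post does this step in a shader.
inline double linearizeDepth(double d, double zNear, double zFar)
{
    assert(zNear > 0.0 && zFar > zNear);
    const double zNdc = 2.0 * d - 1.0;  // map [0, 1] to [-1, 1]
    // Invert the perspective depth mapping to get the eye-space distance.
    const double zEye = 2.0 * zNear * zFar /
                        (zFar + zNear - zNdc * (zFar - zNear));
    return (zEye - zNear) / (zFar - zNear);  // rescale to [0, 1]
}
```

With zNear = 0.1 and zFar = 100, a stored depth of 0.5 maps to roughly 0.001, which shows how strongly the raw depth buffer is skewed toward the near plane.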

In fact, my application with three CUDA-based filter kernels (one
Gaussian blur with radius 21, one image subtract and one image "pseudo-add",
which is about as elaborate as a simple add ;-)) yields about 15 fps at a
resolution of 1024 x 1024 (images for normal and absolute position
information are also rendered and transferred from GPU to CPU here).
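
To get a feeling for the amount of data involved in this readback, the following sketch estimates the transfer volume. The post does not state the pixel formats, so the 4 bytes per pixel used in the usage note are an assumption (floating-point formats would be larger), as are the helper names `imageBytes` and `readbackBytesPerSecond`.

```cpp
#include <cstddef>

// Bytes needed for one image of the given size.
constexpr std::size_t imageBytes(std::size_t width, std::size_t height,
                                 std::size_t bytesPerPixel)
{
    return width * height * bytesPerPixel;
}

// Bytes read back per second when imagesPerFrame images of bytesPerImage
// bytes each are transferred every frame at the given frame rate.
constexpr double readbackBytesPerSecond(std::size_t bytesPerImage,
                                        std::size_t imagesPerFrame,
                                        double fps)
{
    return static_cast<double>(bytesPerImage * imagesPerFrame) * fps;
}
```

For four 1024 x 1024 images (color, depth, normal, position) at an assumed 4 bytes per pixel, this gives 16 MiB per frame, or 240 MiB per second at 15 fps.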

With these 15 fps, I thought it should perform FAR better once that
GPU <-> CPU copying is avoided. That's when I came across the
osgPPU CUDA example. As far as I am aware, it maps PixelBufferObjects
directly into CUDA memory space. This should be fast! At least
that's what I thought, but running it at a resolution of 1024 x 1024
with a StatsHandler attached shows that it runs at just ~21 fps, and
it does not get much better when the CUDA kernel execution is
disabled completely.
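
The two frame rates are easier to compare as time per frame. The sketch below does that conversion; the helper names `frameTimeMs` and `savedMsPerFrame` are hypothetical. Note that the two figures come from different applications (my own three-kernel pipeline versus the osgPPU example), so the comparison is only a rough one.

```cpp
#include <cassert>

// Time budget of a single frame in milliseconds for a given frame rate.
inline double frameTimeMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Milliseconds saved per frame when going from slowFps to fastFps.
inline double savedMsPerFrame(double slowFps, double fastFps)
{
    return frameTimeMs(slowFps) - frameTimeMs(fastFps);
}
```

At 15 fps a frame takes about 66.7 ms and at 21 fps about 47.6 ms, so the mapped-buffer example is only about 19 ms per frame faster than the version with full readback.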

Now my questions: Is this a general (known) problem that cannot be
avoided? Does it have anything to do with the memory mapping functions?
How can it be optimized? I know that osgPPU uses the older CUDA
memory mapping functions and that there are new ones as of CUDA 3. Is
there a difference in performance?

Any information on this is appreciated, because it will really help me
decide whether I should integrate buffer mapping or just keep the
copying approach :-)

Best Regards
-Thorsten


------------------
Post generated by Mail2Forum
Thorsten Roth
Guest
Posted: Thu Dec 16, 2010 11:31 am

By the way: there are two CUDA-capable devices in the computer, but I
have tried using the rendering device as well as the "CUDA-only" device,
and it made no difference!

-Thorsten

------------------
Thorsten Roth
Guest
Posted: Thu Dec 16, 2010 11:33 am

OK, I stand corrected: there is a difference of ~1 frame ;-) ...now I will
stop replying to my own messages :D

------------------
J.P. Delport
Guest
Posted: Mon Jan 03, 2011 10:00 am

Hi,

I don't have any suggestions other than to use a GL debugger to make
sure nothing is going to the CPU, or to try the new CUDA functions in
osgPPU or in your own code. I remember something in the GL-to-CUDA path
bugging me, but I cannot recall the details. AFAIR something was
converting from texture to PBO and then to CUDA memory.

jp

------------------
art (Art Tevs)
Site Admin


Joined: 20 Dec 2008
Posts: 414
Location: Saarbrücken, Germany

Posted: Fri Jan 07, 2011 8:34 pm

Hi Thorsten,

The problem you are experiencing is caused by the lack of direct memory mapping between OpenGL and CUDA memory. Even though CUDA supports GPU<->GPU memory mapping (at least it did in version 2), a full memory copy is performed whenever you access OpenGL textures.

I am not aware whether this was solved in CUDA 3; maybe you should check it out. CUDA 2 definitely does not perform direct mapping between GL textures and CUDA textures/arrays.

regards,
art


------------------
Jason Daly
Guest
Posted: Fri Jan 07, 2011 9:23 pm

I know that OpenCL 1.1 added a bunch of OpenGL interoperability features
(clCreateFromGLBuffer(), clCreateFromGLTexture2D(), etc.), and I thought
I heard that the newer versions of CUDA supported similar features.
OpenGL 4.1 added some CL interop features, too.

--"J"



------------------
Thorsten Roth
Guest
Posted: Fri Jan 07, 2011 9:29 pm

Thanks for the answers. I also know that there are new
interoperability features in CUDA 3, but I haven't had the time to check
them out yet. If I find the time for it, I will let you know
about the results :)

Regards
-Thorsten

------------------
Harash Sharma
Guest
Posted: Sun Jan 09, 2011 6:12 pm

Dear Mr. Art,


I too am noticing a problem similar to what Mr. Thorsten pointed out. Just curious whether OpenGL and CUDA work well together, I downloaded OSG 2.9.10 and the osgCompute nodekit. I have CUDA 3.2 installed on my machine, a Core 2 Duo with a GeForce. The osgGeometryDemo sample code, which warps cow.osg, gives a reasonably high frame rate. I thought I should share this in case it is of any help.


Regards


Harash


------------------
All times are GMT

 