  Slightly cold news on page 10! Phlamey breaks the server.
 
Adrian Lees Message #81718, posted by adrianl at 08:17, 25/10/2006, in reply to message #81677
Member
Posts: 1637
Yay!
I, of course, have done nothing :|
Oi, :flamethrower:

This and this :)


Still very early days, prototype code outside of Geminus, and the card/NVidia driver seems to have some issues with both displays used together beyond a certain resolution and/or with my funny hacked MDFs (hence the strange aspect ratio; it's 1024x1024 on each screen so that I can use a 2048x1024 for the OS.)

[Edited by adrianl at 21:07, 5/11/2006]
 
Jeffrey Lee Message #81719, posted by Phlamethrower at 08:21, 25/10/2006, in reply to message #81718
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Yay!
 
Phil Mellor Message #81720, posted by monkeyson2 at 08:25, 25/10/2006, in reply to message #81675
monkeyson2Please don't let them make me be a monkey butler

Posts: 12380
As soon as http://www.riscosopen.co.uk/ goes live, I'll get cracking ;)
how are you getting on? ;)
I drew an icon in !Paint. :|
 
Jeffrey Lee Message #95852, posted by Phlamethrower at 10:25, 15/12/2006, in reply to message #81720
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
I will be working on something tonight!

Note: I also reserve the right to lie through my teeth. Or in this case, my keyboard.
 
Jeffrey Lee Message #95899, posted by Phlamethrower at 19:19, 16/12/2006, in reply to message #95852
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Well, that's a start. Although all the content for that page was written last week :P
 
Jeffrey Lee Message #97052, posted by Phlamethrower at 22:35, 12/1/2007, in reply to message #95899
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Woo! Free textures.

Although they limit each IP to 20 downloads per day, that doesn't really matter since their thumbnails are all larger than the 64x64 tiles DeathDawn uses. So I can just do a full save of the index pages in netsurf and use the thumbnails as my source :)
 
Jeffrey Lee Message #97108, posted by Phlamethrower at 20:21, 13/1/2007, in reply to message #97052
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
I've been doing some profiling!

There are three main functions eating up most of the CPU time:
* The horizontal tile plotter
* The vertical tile plotter
* The R/B colour swap function

The colour swapping function only affects the users 'lucky' enough to have the new GeForce cards. Currently it eats 40% of the CPU time. However a quick test suggests that this can be sped up by a factor of 4, by writing the colourswapped image over the original and then using DMA to copy the data across. Assuming the DMA doesn't cause the CPU to stall too much on other memory accesses, of course.

The other two - the tile plotters - take roughly the same amount of time if colour swapping is disabled (around 33% each). But the horizontal plotter is called on average 5 times more often than the vertical one - meaning that the vertical plotter is 5 times slower! The only real difference between the two is how they access the screen memory. The horizontal one draws horizontal lines, while the vertical one draws vertical lines. The vertical one obviously makes something very upset :(

If the vertical one suddenly became 5 times faster (thus matching the speed of the horizontal one), this would make the game run around 33% faster :)
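
As a rough check of the sums above, here is a minimal Python sketch. It is not DeathDawn code and the function names are made up for illustration. It takes the rounded figures from the profile (about a third of the frame each, and a 5:1 call ratio) and works out the per-call cost ratio and the overall gain if the vertical plotter became 5 times faster.

```python
def per_call_cost_ratio(share_a, calls_a, share_b, calls_b):
    """How many times more expensive one call of A is than one call of B."""
    return (share_a / calls_a) / (share_b / calls_b)


def overall_speedup(share, factor):
    """Amdahl's law: a part taking `share` of the time gets `factor` times faster."""
    return 1.0 / ((1.0 - share) + share / factor)


# Rounded figures from the profile: both plotters take about a third of the
# frame, but the horizontal one is called about 5 times as often.
vertical_vs_horizontal = per_call_cost_ratio(1 / 3, 1, 1 / 3, 5)
gain = overall_speedup(1 / 3, 5)

print(f"one vertical call costs {vertical_vs_horizontal:.1f}x a horizontal call")
print(f"game runs about {100 * (gain - 1):.0f}% faster")
```

With these rounded inputs the gain lands in the mid-30s percent, which is in the same ballpark as the estimate above.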

The only question is *how* to make it 5 times faster, without breaking too much other code.

Answers on a postcard!

Also, it may be possible to use DMA to speed up rendering when *not* using colourswapping. The horizontal plotter seems to be about twice as fast when writing to main RAM instead of PCI RAM.

Of course, using DMA introduces other problems. In particular the IntelDMA module doesn't seem to provide any feedback as to when a transfer completes. *pokes Adrian to add said feature to his magical DMA module*

(gah, stupid NetSurf eating %'s)

[Edited by Phlamethrower at 20:26, 13/1/2007]
 
Jeffrey Lee Message #97119, posted by Phlamethrower at 21:48, 13/1/2007, in reply to message #97108
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
DMA support is in, and was remarkably easy to add :) (via IntelDMA)

The colour swapping code does run 4 times faster, but this is offset by the DMA code taking about 60% of the time the colour swapping code does - so in effect it's only been sped up by 2.4 times. This results in a net 33% speed gain for the whole game.

As suspected, DMA has also had a positive effect when not using colour swapping - resulting in around a 13% speed improvement.

Note that none of these measurements are guaranteed, since I wasn't doing exactly the same thing for each test ;)
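
The speed-up figures above can be roughly reproduced with a few lines of Python. This is only a sketch under one assumption: the "60%" is read as the DMA copy costing 60% of the *new* (4 times faster) swap time. The function names are hypothetical.

```python
def combined_factor(swap_speedup, dma_cost_relative_to_swap):
    """Net speed-up of swap + DMA versus the old swap on its own.

    The old swap costs 1 unit. The new swap costs 1/swap_speedup, and the
    DMA copy adds a further fraction of that new swap time on top.
    """
    new_swap = 1.0 / swap_speedup
    new_total = new_swap * (1.0 + dma_cost_relative_to_swap)
    return 1.0 / new_total


def whole_game_gain(share, factor):
    """Fractional gain for the frame when `share` of it gets `factor` times faster."""
    new_time = (1.0 - share) + share / factor
    return 1.0 / new_time - 1.0


factor = combined_factor(4.0, 0.6)    # 4x faster swap, DMA ~60% of swap time
gain = whole_game_gain(0.40, factor)  # swap was ~40% of CPU time

print(f"swap + DMA is {factor:.1f}x faster than the old swap")
print(f"whole game gains about {100 * gain:.0f}%")
```

The rounded inputs give a factor of about 2.5 and a whole-game gain in the low 30s percent, close to the 2.4 times and 33% quoted, which is as good as can be expected from approximate measurements.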

[edit]

Done a few more tests, this time from the same point in the map (and with the profiling code disabled, so it doesn't distort the results). This was at 480x352, with VSync disabled:

R/B swap  Use main RAM  Use DMA  FPS
Yes       Yes           Yes      33
Yes       Yes           No       25
Yes       No            -        9
No        Yes           Yes      41
No        Yes           No       25
No        No            -        30


From roughly the same spot, my RiscPC gets only 17fps (Without using main RAM or colour swapping)

[edit #2]

It looks like DMA isn't always a good thing on the Iyonix. At 640x480, it runs slower with DMA enabled than with plotting straight to PCI RAM. Tricky!

[Edited by Phlamethrower at 22:30, 13/1/2007]
 
Michael Drake Message #97126, posted by tlsa at 22:28, 13/1/2007, in reply to message #97108

Posts: 1097
Have you looked at the red/blue colour swapping in PicoDrive 0.11? It might be faster.

[Edited by tlsa at 22:30, 13/1/2007]
 
Jeffrey Lee Message #97127, posted by Phlamethrower at 22:42, 13/1/2007, in reply to message #97126
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Have you looked at the red/blue colour swapping in PicoDrive 0.11? It might be faster.
In PicoDrive enabling colour swapping will take no extra time at all - because it's done in the pre-existing function that converts from 12bpp to 16bpp/32bpp.

If I wanted colour swapping to take no time at all in DeathDawn, I could just colourswap all the images as they are loaded. But then I'd have to go through the code and find the instances where sprites are recoloured (e.g. custom car colours) and make sure those adhere to the same rules, etc. So for now I'm just using the quick fix of colourswapping each frame after it's finished rendering ;)
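
For illustration, a per-frame red/blue swap like the quick fix described above might look something like this in Python. The pixel layout is an assumption, not something confirmed here: bit 15 is treated as a flag bit, with three 5-bit channels below it. The function names are made up.

```python
def swap_red_blue_555(pixel):
    """Swap the two outer 5-bit channels of a 16-bit pixel.

    Assumed layout (not confirmed by the thread): bit 15 is a flag/mask bit,
    bits 10-14 and bits 0-4 hold red and blue, bits 5-9 hold green.
    """
    keep = pixel & 0x83E0            # flag bit and green stay where they are
    low = pixel & 0x001F             # channel in bits 0-4
    high = (pixel >> 10) & 0x001F    # channel in bits 10-14
    return keep | (low << 10) | high


def swap_frame(pixels):
    """Post-process every pixel of a finished frame."""
    return [swap_red_blue_555(p) for p in pixels]
```

Doing the swap once at load time instead would mean running the same function over each image as it is loaded, plus over any recoloured sprites, which is the extra bookkeeping mentioned above.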
 
Michael Drake Message #97129, posted by tlsa at 22:45, 13/1/2007, in reply to message #97127

Posts: 1097
Have you looked at the red/blue colour swapping in PicoDrive 0.11? It might be faster.
In PicoDrive enabling colour swapping will take no extra time at all - because it's done in the pre-existing function that converts from 12bpp to 16bpp/32bpp.
Ah, OK.
 
Jeffrey Lee Message #97148, posted by Phlamethrower at 13:52, 14/1/2007, in reply to message #97129
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
I'm playing with the textures from Mayang. They're great! I've got no idea what I'll use some of them for (e.g. the bark textures), but almost every image seems to have at least one bit which tiles nicely. And these are the 200x150 thumbnails!

FACT: Closeups of bark make for great organic-style alien scenery :)

[Edited by Phlamethrower at 14:15, 14/1/2007]
 
Jeffrey Lee Message #97467, posted by Phlamethrower at 22:24, 19/1/2007, in reply to message #78256
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
One word now you've got an Iyonix, :flamethrower:...

PLD
I tried using that. It made one function 10% slower :P

But, when placed in another function, it made it around 40-50% faster. Hurrah!

The functions were the horizontal and vertical tile row plotters, respectively. Which is good, because it was the vertical plotter which was taking loads of time. Unfortunately that 'optimisation' won't help the RiscPC users, or those writing straight to PCI/VRAM - so I'll have to do some real work instead ;)

*goes off to benchmark his RiscPC's VRAM*
 
Jeffrey Lee Message #97472, posted by Phlamethrower at 23:50, 19/1/2007, in reply to message #97467
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Hmm. If my simple benchmark thing is right, then RiscPCs don't care at all how you write to VRAM. Writing with bytes is 4 times slower than with words, using STM doesn't give any speed boost over words, and the order you write doesn't make any difference either. Which is as I'd expect, since it isn't cached.

The Iyonix, on the other hand, is completely different. Bytes, words, and STMs all give the same throughput (Just under 50MB/s for my benchmark). But if you only store every 16th byte, throughput drops to just 3.2MB/s! *snip nonsense* This is because of the time spent setting up each PCI transfer. Once a transfer is set up, it can easily transport at least 32 bytes of data, as Adrian explains below.

This means that the performance of the vertical plotter is slow because:
* On the RiscPC, it is writing using bytes, instead of words (as used by the horizontal plotter)
* On the Iyonix, it is writing in a PCI-unfriendly manner. Every 2nd write will be to a non-sequential location, which will require another PCI transfer to be set up to accommodate it.

One way of improving the situation could be to mess around with the memory map and cache controls to enable caching, but that isn't a very future proof way of doing things, and can leave things a bit messy if the game crashes. For example I have no idea how the A9 or ViewFinder's VRAM would respond to such a technique.

The solution to these problems is to try and make sure screen memory is only used sequentially, and where possible write in words or larger. There are several ways I could alter the algorithm to do this, so I think some more experimentation is in order.
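
To make the "sequential only" point concrete, here is a small Python sketch that counts how many separate runs of consecutive addresses two loop orders produce when filling the same rectangle. The run count is only a stand-in for the number of bus transfers that need setting up; the block size and screen pitch are hypothetical.

```python
def column_order_offsets(width, height, pitch):
    """Byte offsets written when a rectangle is filled one column at a time."""
    return [y * pitch + x for x in range(width) for y in range(height)]


def row_order_offsets(width, height, pitch):
    """Byte offsets written when the same rectangle is filled row by row."""
    return [y * pitch + x for y in range(height) for x in range(width)]


def count_runs(offsets):
    """Number of runs of consecutive addresses (a stand-in for bus transfers)."""
    runs = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + 1:
            runs += 1
    return runs


# Hypothetical 8x8 block of 8bpp pixels on a 640-byte-wide screen.
column_runs = count_runs(column_order_offsets(8, 8, 640))  # one run per pixel
row_runs = count_runs(row_order_offsets(8, 8, 640))        # one run per row
print(column_runs, row_runs)
```

Both orders touch exactly the same pixels; only the order changes, which is why reworking the algorithm rather than the memory map looks like the safer route.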

[edit - moose!]

[Edited by Phlamethrower at 00:18, 20/1/2007]
 
Adrian Lees Message #97473, posted by adrianl at 23:51, 19/1/2007, in reply to message #97467
Member
Posts: 1637
I tried using that. It made one function 10% slower :P
Well, it's not a magic instruction that will make any code faster. You should aim to prefetch the next chunk of data that you'll be processing in about 120 cycles. Sometimes that's just not possible, and sometimes you have less than 120 cycles' worth of work to do on the 32 bytes that it fetches, and that's that.

Also note that PLD on memory that can't be cached does nothing, which on RO really means video RAM and any device memory mapped by the PCI module in practice.
 
Adrian Lees Message #97474, posted by adrianl at 23:59, 19/1/2007, in reply to message #97472
Member
Posts: 1637
Screen memory is not cached (synchronising with hardware operations on the frame buffer would be a mare if it was). Throughput is 1/16th when writing only every 16th byte because most of the write time is set up time and it is then able to burst, say, 1 byte or 16 bytes in essentially the same time; it's the difference between 1 cycle and 4 cycles after a delay of circa 20-25 cycles at 133MHz (if my quick maths is correct).
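
A back-of-envelope model of that set-up cost, sketched in Python. The 22-cycle set-up figure is an assumption picked from the middle of the "circa 20-25 cycles" range above, and the function names are hypothetical; the 1-cycle and 4-cycle data phases are the ones quoted.

```python
def cycles_for_burst(setup_cycles, data_cycles):
    """Total bus cycles for one transfer: fixed set-up plus the data phase."""
    return setup_cycles + data_cycles


def bytes_per_cycle(useful_bytes, setup_cycles, data_cycles):
    """Useful bytes delivered per bus cycle for one transfer."""
    return useful_bytes / cycles_for_burst(setup_cycles, data_cycles)


SETUP = 22  # assumed: middle of the "circa 20-25 cycles" range

sequential = bytes_per_cycle(16, SETUP, 4)  # 16 bytes in a 4-cycle burst
strided = bytes_per_cycle(1, SETUP, 1)      # 1 byte in a 1-cycle burst

print(f"sequential is about {sequential / strided:.0f}x the strided throughput")
```

With these inputs the ratio comes out at roughly 14, which is in line with the 1/16th throughput measured for every-16th-byte writes.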

You're better off reading with an unfriendly stride from SDRAM and writing sequentially to video RAM than the other way around because SDRAM is cacheable.

Lastly, you have a graphics chipset on the other side of the pipe which is much more capable of drawing things vertically ;)
 
Jeffrey Lee Message #97475, posted by Phlamethrower at 23:59, 19/1/2007, in reply to message #97473
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Also note that PLD on memory that can't be cached does nothing, which on RO really means video RAM and any device memory mapped by the PCI module in practice.
*pciinfo on my nVidia reveals that the memory is prefetchable, and it seems to perform like it's cacheable. Is this some new feature for OS 5.12/new cards, or has it always been like it?
 
Jeffrey Lee Message #97476, posted by Phlamethrower at 00:06, 20/1/2007, in reply to message #97474
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Screen memory is not cached
Hmm, you seem to be right. I must have been doing different things when getting the PLD and non-PLD times - a second test shows that it does have no effect on the video RAM.

Lastly, you have a graphics chipset on the other side of the pipe which is much more capable of drawing things vertically ;)
Well, I don't know how to use it, do I? :P
 
Adrian Lees Message #97477, posted by adrianl at 00:07, 20/1/2007, in reply to message #97475
Member
Posts: 1637
Also note that PLD on memory that can't be cached does nothing, which on RO really means video RAM and any device memory mapped by the PCI module in practice.
*pciinfo on my nVidia reveals that the memory is prefetchable, and it seems to perform like it's cacheable. Is this some new feature for OS 5.12/new cards, or has it always been like it?
That's just the info reported in the PCI config registers, and without checking I'm not sure whether you're looking at the BAR for the video memory or the registers/ROM. In any case I think prefetchable may cause the PCI module to map it as Bufferable but not Cacheable. The cost of cache clean operations would almost certainly offset any advantage gained from cacheing unless you have brain-damaged screen-reading code (primary offender in a Geminus-accelerated box certainly, is the FontManager. Without Geminus there may be others too). Reading from screen really is deprecated nowadays.

Why do you think that prefetching and cacheing screen contents is beneficial in plotting code at >=8bpp anyway? Are you blending into the existing image rather than overwriting? :-s
 
Adrian Lees Message #97478, posted by adrianl at 00:10, 20/1/2007, in reply to message #97476
Member
Posts: 1637
Well, I don't know how to use it, do I? :P
Strangely enough, the best way to use it is to invoke the OS plotting routine since that provides a mechanism for device-specific acceleration code to step in and perform the plotting faster ;)
 
Jeffrey Lee Message #97479, posted by Phlamethrower at 00:26, 20/1/2007, in reply to message #97477
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
That's just the info reported in the PCI config registers, and without checking I'm not sure whether you're looking at the BAR for the video memory or the registers/ROM.
I was looking at the second 'memory' one. But I'm not even sure if that's the right one - OS_ReadVduVariables suggests the screen memory is at &DC900000, and none of the ranges displayed by *pciinfo cover that.

Why do you think that prefetching and cacheing screen contents is beneficial in plotting code at >=8bpp anyway? Are you blending into the existing image rather than overwriting? :-s
At the moment, it only gets as complex as rendering images with a 1bpp mask. But I may have a go at blending/lighting if I can get this code running fast enough :o
 
Jeffrey Lee Message #97480, posted by Phlamethrower at 00:33, 20/1/2007, in reply to message #97478
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Well, I don't know how to use it, do I? :P
Strangely enough, the best way to use it is to invoke the OS plotting routine since that provides a mechanism for device-specific acceleration code to step in and perform the plotting faster ;)
Ah, but how much of that comes with RISC OS 5, and how much is added by Geminus? ;)

And does any of it support rendering an arbitrary textured quad/triangle?
 
Adrian Lees Message #97481, posted by adrianl at 00:40, 20/1/2007, in reply to message #97479
Member
Posts: 1637
I was looking at the second 'memory' one. But I'm not even sure if that's the right one - OS_ReadVduVariables suggests the screen memory is at &DC900000, and none of the ranges displayed by *pciinfo cover that.
That's coz they're the physical addresses of the mapped memory ranges, whereas &DC900000 is the logical address of the screen memory. The easy way to spot the right one is to look at the size: the range will be &8000000 IIRC.
Base address is typically &78000000 or &70000000 in practice (PCI address space on the IOP321 ends at &7FFFFFFF inclusive if memory serves).

At the moment, it only gets as complex as rendering images with a 1bpp mask. But I may have a go at blending/lighting if I can get this code running fast enough :o
Then you're honestly better off testing the mask and writing out the appropriate pixels than reading, merging and writing back. A handy trick is to MOVS Rt,mask,LSL #n: STRCSB ,[Rd,#]:STRMIB ,[Rd,#] ie. test two bits and conditionally write the corresponding pixels.

A slightly more twisted approach (though often not faster than above) is to use MSR CPSR_f,Rn to set all 4 condition flags NZCV according to the values of 4 bits in Rn and then use conditional stores. You then incur the cost of a separate shift instruction for every 4 bits tested, though. If you don't have a Rt spare, you can use ROR in place of LSL.
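
For anyone who doesn't read ARM flags fluently, the MOVS trick can be emulated in Python as below. This is only an illustration of the flag behaviour, not plotting code anyone would ship: the helper names are made up, and which pixel pairs with which flag is an arbitrary choice.

```python
def movs_lsl(value, n):
    """Emulate ARM `MOVS Rt, value, LSL #n` for 1 <= n <= 31.

    Returns (result, carry, negative): carry is the last bit shifted out,
    negative is bit 31 of the 32-bit result.
    """
    carry = (value >> (32 - n)) & 1
    result = (value << n) & 0xFFFFFFFF
    negative = (result >> 31) & 1
    return result, carry, negative


def plot_two_pixels(dest, offset, mask, n, pixel_a, pixel_b):
    """Write-only masked plot of two pixels, mirroring STRCSB / STRMIB.

    The destination is never read: each byte is either stored or skipped.
    Which pixel pairs with which flag is an illustrative choice.
    """
    _, carry, negative = movs_lsl(mask, n)
    if carry:                      # STRCSB: store byte if carry set
        dest[offset] = pixel_a
    if negative:                   # STRMIB: store byte if negative
        dest[offset + 1] = pixel_b
```

One shift therefore tests two adjacent mask bits (bit 32-n via carry, bit 31-n via the sign flag), and the screen is only ever written, never read back.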
 
Adrian Lees Message #97482, posted by adrianl at 00:50, 20/1/2007, in reply to message #97480
Member
Posts: 1637
Ah, but how much of that comes with RISC OS 5, and how much is added by Geminus? ;)
Line drawing isn't accelerated by RISC OS 5, only Geminus, but my point is simply that if you're not about to write your own hw acceleration code then that's your best hope of getting good performance. If you're plotting large sprites then that can be done faster than the current OS code simply because SDRAM access latency is so high and nobody bothered to sprinkle PLD instructions throughout the plotting routines. If, however, you are generating the data from within the CPU rather than copying it then you're unlikely to outrun the OS appreciably in software even given the age of the OS code. It simply takes much longer to stuff the data through the PCI bus than it does to generate it.

Ideally we'd have a more modern graphics rendering API and an appropriate acceleration driver for each machine. In this day and age writing your own software-only plotting routines is an anachronistic approach that gives worse performance.
 
Dave Brown Message #97483, posted by daveb at 00:52, 20/1/2007, in reply to message #97481
Member
Posts: 41
Have you considered using the approach taken by doom instead? Rather than using a 1bpp mask you could try runlength encoding the images. ie you store runs of solid pixels and runs of transparent ones so you can skip over the transparent ones quickly. This does, of course, not work so well for images using dithered solid and transparent pixels but I suspect you're not using them as they don't deal with scaling very well to say the least.
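
A minimal sketch of the run-length idea in Python. This is a generic span encoding for illustration, not Doom's actual data format; the `TRANSPARENT` marker and function names are hypothetical.

```python
TRANSPARENT = None  # hypothetical marker for a see-through pixel


def encode_row(pixels):
    """Turn a row into (skip, run) spans: pixels to skip, then pixels to draw."""
    spans = []
    i = 0
    while i < len(pixels):
        skip = 0
        while i < len(pixels) and pixels[i] is TRANSPARENT:
            skip += 1
            i += 1
        run = []
        while i < len(pixels) and pixels[i] is not TRANSPARENT:
            run.append(pixels[i])
            i += 1
        if run:  # trailing transparent pixels need no span at all
            spans.append((skip, run))
    return spans


def plot_row(dest, start, spans):
    """Draw an encoded row into dest; each transparent run costs one addition."""
    pos = start
    for skip, run in spans:
        pos += skip
        dest[pos:pos + len(run)] = run
        pos += len(run)
```

The plotter never tests individual pixels, and each solid run is one contiguous block of writes.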
 
Jeffrey Lee Message #97484, posted by Phlamethrower at 00:58, 20/1/2007, in reply to message #97481
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Then you're honestly better off testing the mask and writing out the appropriate pixels than reading, merging and writing back.
I seem to be using a mix of both techniques at the moment (Including one where it only loads if it decides it needs to merge). Probably because most of the code was optimised for the RiscPC's VRAM. The code does need revising for the Iyonix - which shouldn't be too hard, since the main offender is my runtime assembled sprite plotters. The only custom plotters DeathDawn is using are the two tile plotters.

Also, I don't think your examples will help much, since I'm interleaving the mask data with the colour data ;) 1 bit mask, 15 bit colour.
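
Roughly what an interleaved 1-bit mask plus 15-bit colour format looks like, sketched in Python. Which bit holds the mask, and whether a set bit means solid or transparent, are assumptions made here for illustration; the names are hypothetical.

```python
MASK_BIT = 0x8000  # assumption: the mask lives in the top bit, set = solid


def pack(colour15, solid):
    """Combine a 15-bit colour and a 1-bit mask into one 16-bit value."""
    return (colour15 & 0x7FFF) | (MASK_BIT if solid else 0)


def plot_span(dest, offset, source):
    """Copy only the solid pixels; the mask test needs no separate bitmap."""
    for i, pixel in enumerate(source):
        if pixel & MASK_BIT:
            dest[offset + i] = pixel & 0x7FFF
```

Because the mask travels with each pixel, there is no packed mask word to shift through the flags, which is why the two-bits-per-shift trick doesn't map onto it directly.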
 
Jeffrey Lee Message #97485, posted by Phlamethrower at 01:18, 20/1/2007, in reply to message #97483
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
I think it's worth reminding people that I'm trying to make an engine to draw stuff like this:



but my point is simply that if you're not about to write your own hw acceleration code then that's your best hope of getting good performance.
Assuming that what I'm trying to draw can be done using OS_SpriteOp, that is. And I doubt OS_SpriteOp is fast enough for the RiscPC version :(

If you're plotting large sprites
Map tiles are 64x64, 16bpp. So they take 8K of RAM each.

Have you considered using the approach taken by doom instead? Rather than using a 1bpp mask you could try runlength encoding the images. ie you store runs of solid pixels and runs of transparent ones so you can skip over the transparent ones quickly. This does, of course, not work so well for images using dithered solid and transparent pixels but I suspect you're not using them as they don't deal with scaling very well to say the least.
Some sprites would benefit from runlength encoding, yes. It's just a case of writing the reams of code needed to process them ;)
 
Adrian Lees Message #97486, posted by adrianl at 02:42, 20/1/2007, in reply to message #97485
Member
Posts: 1637
but my point is simply that if you're not about to write your own hw acceleration code then that's your best hope of getting good performance.
Assuming that what I'm trying to draw can be done using OS_SpriteOp, that is. And I doubt OS_SpriteOp is fast enough for the RiscPC version :(
Well now, that probably depends upon whether you're using a VIDC20 with DRAM, VIDC20 with VRAM or a ViewFinder, OS_SpriteOp being much faster than direct rendering with the latter as I understand it.

If you're plotting large sprites
Map tiles are 64x64, 16bpp.
Then you should probably be prefetching a scanline ahead of what you're reading now.
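
To put numbers on "a scanline ahead" for a 64x64, 16bpp tile, here is a small Python sketch. It assumes tile rows are stored contiguously and that the tile base is aligned to a 32-byte line; the function names are hypothetical.

```python
CACHE_LINE_BYTES = 32  # the amount one PLD fetches, per the earlier post


def scanline_bytes(width_pixels, bits_per_pixel):
    """Bytes occupied by one row of the tile."""
    return width_pixels * bits_per_pixel // 8


def prefetch_addresses(tile_base, row, width_pixels, bits_per_pixel):
    """Addresses to prefetch while processing `row`: every cache line of row + 1.

    Assumes rows are stored contiguously and the tile base is line-aligned.
    """
    pitch = scanline_bytes(width_pixels, bits_per_pixel)
    next_row = tile_base + (row + 1) * pitch
    return list(range(next_row, next_row + pitch, CACHE_LINE_BYTES))
```

For these tiles a row is 128 bytes, so staying one scanline ahead means issuing four prefetches per row.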
 
Jeffrey Lee Message #97493, posted by Phlamethrower at 13:00, 20/1/2007, in reply to message #97486
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
After a bit of poking, I got OS_SpriteOp working for the flat map tiles. On my Iyonix, it runs about twice as slow as my own plotter for sprites with approx 1:1 scaling. If the sprites were scaled up then it may start to run faster, but most of the time the sprites will be scaled down because of the way the camera zooms out.

I may leave the code in for viewfinder users (Or Geminus, if it's able to automatically cache the sprites?), but I don't think I'm going to spend much time developing it further.
 
Jeffrey Lee Message #97511, posted by Phlamethrower at 19:45, 21/1/2007, in reply to message #97493
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
I've written (half of) a new vertical tile plotter (i.e. the bit for the F and C tiles on the image above). It looks like it runs about 4 times faster than the old one when writing to PCI memory on the Iyonix, making it only 50% slower than the horizontal plotter. There's still some scope for making it faster (especially for the RiscPC, which may actually run a bit slower now), but before I do that I need to fix some texture swimming and write the other half of the plotter (which handles tiles on the west side of blocks).

[edit]

Woo! Worked out how to fix the texture swimming.

[edit #2]

Well, kind of.

[Edited by Phlamethrower at 21:13, 21/1/2007]

[Edited by Phlamethrower at 21:24, 21/1/2007]