New version. The main additions are color rasters, saving SRAM to VMU, an in-game pause menu, better controller support (6 button and 2 player), better line scroll support, and partial shadow/highlight support.
First, some replies:
Ian Micheal wrote:On my update port of Gensplus SDL which i ported in 2003 it's using dma zero frameskip or limiter as you can see parts run well overspeed
So it does VDP rendering on the CPU, and it's too fast? I tried coming up with a fast software renderer, but drawing and combining the layers seemed too much for the SH4 to do while leaving time for CPU and sound emulation. I tried some hybrid software and hardware renderers, but they didn't turn out well. What is yours doing that makes it fast? What kinds of raster effects does it support? How consistent is its speed? Is there any point to working on Gens4All?
Ian Micheal wrote:All the benchmarks DMA was the same or faster then SQ and i have found this to be the case when things get demanding, DMA does not bounce around or stall compared to SQ the change of using DMA on dreamneo cd and not SQ's is pretty vast up to 10fps slower using SQ
Those are just raw bandwidth figures in a vacuum. What matters is overall system speed in real world conditions. Theoretically, DMA is slower than SQs. If you DMA something that the CPU generates, the process looks like this:
- CPU reads source data
- CPU writes results to main RAM
- DMA reads results from main RAM
- DMA writes to hardware
With SQs, you just have
- CPU reads source data
- CPU writes results to hardware
...which seems faster and more efficient, but that's not always the whole story. For example, writing to hardware can be slow, so using DMA can allow the DMA controller to be slowed down instead of the CPU. I was using DMA for transferring the texture before because I didn't know how to do fast texture SQs.
What is the time saved by trying to avoid one of the consoles features ?
SQs are a feature, too. It's about picking the right one.
With my version of Gens4All, rendering is done with the PVR. The Genesis's VRAM is converted into a texture for the PVR. The texture is a specially prepared VQ compressed texture to allow for faster conversion of 4-bit data from the Genesis's VRAM than is possible with the PVR's normal 4-bit texture format. Creating the texture took four steps originally:
- Generate palette codebooks
- Reorder tile data into texture
- Flush cache so the results are in main RAM
- DMA texture from main RAM to texture RAM
It's not possible to do texture DMA while sending commands to the TA. The TA gets confused and the system hangs, so you have to sit and wait for the DMA to complete. (Another option would be to send commands to the TA, then do DMA, and hope that the DMA completes before KOS decides to trigger rendering. It would probably work most of the time, but I didn't try it.)
I profiled the original version of the texture generation that was used. The different parts took this long:
CB 0.10 ms
VRAM 0.97 ms
Flush 0.04 ms
DMA 0.28 ms
Total 1.39 ms
Getting rid of DMA allows skipping the Flush and DMA sections. With the SQ version, it now takes this long:
CB 0.10 ms
VRAM 0.49 ms
Flush 0.00 ms
DMA 0.00 ms
Total 0.59 ms
The need to flush the cache and do DMA has been eliminated, and converting Genesis VRAM into a texture is faster. Why did the texture conversion get faster? It's possible to have the SQ version write to main RAM/cache instead, not using the SQs at all. When it does, it runs at about the same speed as the original version, so it's possible to experiment with it to get an idea of where the speed increase comes from.
The original version went through each tile and wrote it out to 8 lines of the texture. This doesn't work well for SQs, which write 32 byte blocks instead of 4 bytes. The new version takes a row from 8 different tiles and writes one block to the SQ.
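As a rough illustration of that access pattern, here is a minimal C sketch. It assumes standard Genesis tiles (8x8 pixels at 4 bits each, so 4 bytes per row and 32 bytes per tile) and that the eight tiles are consecutive in VRAM. The function name and the output ordering are made up for the example; the real texture layout and the VQ index conversion are not shown. On hardware, `block` would be the store queue itself, flushed once all 32 bytes are filled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    TILE_BYTES      = 32, /* 8 rows x 4 bytes (8 pixels at 4 bpp) */
    ROW_BYTES       = 4,
    TILES_PER_BLOCK = 8,
    SQ_BLOCK_BYTES  = 32  /* one store queue burst */
};

/* Hypothetical helper: gather row `row` (0..7) of eight consecutive tiles
 * into a single 32-byte block, so each burst is filled completely. */
void gather_tile_rows(const uint8_t *vram, unsigned first_tile,
                      unsigned row, uint8_t block[SQ_BLOCK_BYTES])
{
    assert(row < 8);
    for (unsigned t = 0; t < TILES_PER_BLOCK; t++) {
        const uint8_t *src =
            vram + (first_tile + t) * TILE_BYTES + row * ROW_BYTES;
        memcpy(block + t * ROW_BYTES, src, ROW_BYTES);
    }
}
```

Calling it for rows 0 through 7 of the same eight tiles produces eight full 32-byte writes, instead of 64 scattered 4-byte writes.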
One advantage of the SQ is that there are no cache misses on writes. When writing to a cache line not already in cache, the SH4 has to load what's already in RAM at that location before performing the write. The MOVCA.L instruction exists to avoid this delay, but with the old version scattering its writes around, it was hard to use it correctly. Since the SQ version always writes 32-byte, cache-line-sized blocks, it's easy to modify it to use MOVCA.L to avoid the cache miss penalty. This change also allows the separate cache flush step to be eliminated by using an OCBP or OCBWB instruction on each line as it is completed.
Another problem with writing to cache is cache thrashing. Reading from the Genesis's VRAM can force out the cache line we are writing the texture to, and writes to the texture can evict data we are about to read, if the addresses happen to line up wrong. The SH4's cache has the "operand cache index" feature which can prevent some of this. It basically splits the 16 KB direct-mapped data cache into two software controlled 8 KB data caches that can't cause each other to thrash. I modified KOS to enable this.
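To make the thrashing condition concrete, here is a small C model of how a cache entry is picked. It assumes the behaviour described in the SH7750 hardware manual as I understand it: with 32-byte lines the 16 KB cache has 512 entries selected by address bits 13..5, and with OCINDEX enabled (CCR.OIX) address bit 25 takes the place of bit 13. The function names are invented for the example, and this only models the index calculation, not the cache itself.

```c
#include <stdint.h>

/* Normal mode: entry chosen by address bits [13:5] (512 entries). */
unsigned oc_index_normal(uint32_t addr)
{
    return (addr >> 5) & 0x1FFu;
}

/* OCINDEX mode: bit 25 selects the 8 KB half, bits [12:5] the entry. */
unsigned oc_index_oix(uint32_t addr)
{
    return (((addr >> 25) & 1u) << 8) | ((addr >> 5) & 0xFFu);
}
```

In normal mode, two buffers whose addresses differ by a multiple of 16 KB compete for the same entries. In OCINDEX mode, a buffer accessed with bit 25 clear and one accessed with bit 25 set can never evict each other. This assumes the second buffer is reached through an address with bit 25 set, for example via a mirror of main RAM.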
With these changes on a DMA based version, the conversion time is only 0.58 ms, with a total time of about 0.93 ms after codebook and DMA overhead. It's still slower than using the SQs. I'm not sure where the 0.09 ms difference in the VRAM processing comes from. Maybe DRAM page misses caused by going between Genesis VRAM and the texture buffer?
I also tried pointing the SQ version to main RAM, then DMAing the results. It runs exactly as fast as the optimized DMA version.
As for using DMA for rendering… DMA basically slows down the CPU a bit while it's running. Whether or not it's worth it depends on the rendering code. Using DMA for sending polygons to the TA requires large main RAM buffers and increases input latency.
When working on my PVR driver, I did some (non-rigorous) testing of DMA vs SQs. On a test designed to be a worst case for SQs (very large polygons), switching to DMA made the CPU's T&L two to three times faster, since it wasn't waiting on the TA. But the gain would change depending on the size of the polygons, with the advantage getting smaller as the polygons got smaller. In another test, with my rendering code, drawing a game-like scene, I saw a 5% CPU slowdown using DMA over SQs. I didn't try a SQ best case situation. The gap probably wouldn't be as big as in the DMA best case, but I'd still expect it to reach double-digit percentages.
So DMA TA submission seems to have better performance than SQs under worst case conditions, but under the right conditions (which may require more work to achieve) SQ can be faster, or at least the same speed, without the extra latency and main RAM buffers.
For Gens4All, drawing 8x8 quads, I highly doubt DMA would be faster than SQs. Each quad is so tiny (TA doesn't have to do much work writing pointers to the tile matrix) and there's a bit of delay in between each tile while it processes things like palettes and flipping and priority (TA has time to finish writing the vertex data and pointer), so I think it's unlikely to help here. The tile rendering function does some table look ups to figure out what texture and UVs to use, and DMA would probably slow this down.
MastaG wrote:TapamN, if you create a Patreon account I'd be happy to support you as well.
I don't think I could produce enough results on a constant enough interval for it to be fair for someone to give me money. I doubt there are enough people out there who would be interested anyways, so it probably wouldn't be worth the effort.
MastaG wrote:Why not spend some time forking the official kos repository to your own GitHub account and simply commit all of your changes and improvements so everyone can benefit.
The only notable change to my current KOS setup at the moment is OCINDEX support. Aside from my old modified version of KOS's PVR driver (which had problems), all the changes I made seem to be in KOS in some form now.
As for the new release...
The color rasters are enough to get water in Sonic games working correctly. The water in CV Bloodlines doesn't seem to work? There's still no scrolling rasters.
The SRAM saving is kind of in a "barely working" territory at the moment. The filename for the save is generated by hashing the product id of the ROM. At the moment, there's nothing in place to avoid hash collisions. The hash is pretty weak, too. Saves always go to and are read from the first VMU found. Saves don't have an icon.
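For illustration only, here is one way a product ID could be turned into a save filename. This is not the hash the emulator uses (that one isn't described here); it swaps in 32-bit FNV-1a, and the "G4A_" prefix and function names are invented. It assumes a 12 character limit on VMU filenames. Like any 32-bit hash, it can still collide, so it does not solve the collision problem mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* 32-bit FNV-1a over a NUL-terminated string. */
uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u;        /* FNV offset basis */
    assert(s != NULL);
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;              /* FNV prime */
    }
    return h;
}

/* Hypothetical: build a 12-character name, 4-char prefix + 8 hex digits. */
void save_name_from_product_id(const char *product_id, char out[13])
{
    snprintf(out, 13, "G4A_%08lX", (unsigned long)fnv1a32(product_id));
}
```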
It's possible for the VMU file to grow between saves. Gens calculates the size of the save file by looking at how much of the SRAM array is zeroes, and completely ignores the ROM header. If the game doesn't initialize the entire SRAM, Gens will think the save is smaller than it really is. For example, on Phantasy Star 4, saving only in the first slot results in a smaller file than saving in the second or third slots. So it's a good idea to have extra free space on the VMU when saving, just to be safe.
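A sketch of why the file can grow, under the assumption that the size is found by trimming trailing zero bytes from the SRAM array. That is one plausible reading of the behaviour described above, not Gens's actual code, and the function name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size rule: everything up to the last non-zero byte counts. */
size_t sram_used_size(const uint8_t *sram, size_t capacity)
{
    size_t n = capacity;
    assert(sram != NULL || capacity == 0);
    while (n > 0 && sram[n - 1] == 0)
        n--;
    return n;
}
```

If a game only writes its first save slot near the start of SRAM, the detected size ends there. Writing a later slot moves the last non-zero byte further out, so the next save needs more space on the VMU.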
When saving in the in-game pause menu, if you don't have space, you can swap out or erase files on the VMU and try again. The save menu that comes up when exiting the game only gives you one try, so it's probably better to try to save from the menu.
Like vanilla Gens, EEPROM saves aren't supported at the moment (Wily Wars, MW4, Micro Machines).
Pressing up on the analog stick pauses the game and opens a menu. If you're using a controller without an analog stick (like an arcade stick) you can also open the menu with A+B+X+Y+Start.
You can change the controller settings here. "DC 4B to Gen 3B" is the same as previous releases. It now correctly configures the Genesis to treat it as a 3 button controller. "DC 4B to Gen 6B" is a generic 6 button controller mapping. The controls look like this:
DC A -> Gen C
DC X -> Gen B
DC B -> Gen A
DC Y -> Gen Y
DC L -> Gen X
DC R -> Gen Z
DC Analog Down -> Gen Mode
There are two extra variants for The Lost Vikings (Maps A,B,C to X,A,B, also good for Ranger-X) and Street Fighter II (SNES style layout). If you have an arcade stick or third party 6 button controller, you can pick "DC 6B to Gen 6B" to directly map the face buttons between each system (Mode is on R button).
The same mode is used for both player 1 and player 2. There's no multitap support at the moment.
There are also options in the menu to save SRAM to VMU and reset the game. There's a second menu that allows changing rendering options. The first one controls how scrolling is simulated.
Cell Scroll does not try to simulate line scroll at all (like previous releases) and Line Tilt tries to simulate line scroll by tilting the tiles (like Smash Pack). The auto split options look through the line scroll table to try to figure out the best way to render it. If the line scroll is constant through the tile, it draws it normally. If there's one place in the tile that changes, it cuts the tile in two pieces and draws line scroll perfectly. If there's more than one change in line scroll, it uses a fallback approximation, controlled by the "div" parameter. "1 div" falls back to drawing the tile how Cell Scroll or Line Tilt would normally draw it. "2 div" divides the tile into two 8x4 tiles and does higher resolution cell scroll or tilting with them. The "4 div" option divides the tile into four 8x2 tiles, for even higher resolution line scroll. With 4 div, only one layer (A/B) can be divided, since the CPU and GPU usage is high.
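The auto split decision described above can be sketched roughly like this. It is a simplified model, not the emulator's code: it assumes the eight scroll values covering one tile row are already gathered into an array, and the enum and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum split_kind {
    SPLIT_NONE,     /* scroll constant: draw the tile normally        */
    SPLIT_TWO,      /* exactly one change: cut the tile in two pieces */
    SPLIT_FALLBACK  /* several changes: use the "div" approximation   */
};

/* Classify one 8-line tile row from its per-line scroll values.
 * On SPLIT_TWO, *split_line is the first line of the lower piece. */
enum split_kind classify_tile_scroll(const int16_t scroll[8], int *split_line)
{
    int changes = 0;
    assert(scroll != NULL && split_line != NULL);
    *split_line = 0;
    for (int y = 1; y < 8; y++) {
        if (scroll[y] != scroll[y - 1]) {
            if (changes == 0)
                *split_line = y;
            changes++;
        }
    }
    if (changes == 0) return SPLIT_NONE;
    if (changes == 1) return SPLIT_TWO;
    return SPLIT_FALLBACK;
}
```

Only the SPLIT_FALLBACK case loses accuracy, which is where the 1/2/4 div setting decides how finely the tile gets cut up.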
The fastest options are Line Tilt and Cell Scroll without auto split. If the game has trouble maintaining 60 FPS, try switching to those. For Sonic 2's special stages, switch to "Auto Split (4 div A)", or else the pipe will be low resolution.
It's possible to disable the shadow/highlight emulation. Currently, only tilemap shadows work. Sprites cannot shade or highlight a pixel, although sprites are affected by tilemap shadows.
Two options in the pause menu enable or disable the performance graph and a raster display. The graph shows the timing of the SH4 CPU, CPU render time, GPU render time, and frame length. The raster display shows short lines on the left edge of the screen when certain types of raster effects are detected.
The option "Render at VBlank Start/End" controls when the screen is rendered relative to the Genesis's VDP timing. Some games display better/worse depending on the setting. Previous versions rendered at VBlank Start, but this version defaults to VBlank End because it has better synchronization with color rasters. If things aren't showing up, try changing this setting.
Due to how the emulator is set up, the game is always rendered at VBlank Start while the pause menu is displayed. Toggling the setting doesn't update the screen, and if it's set to VBlank End, the paused game might look different from how it looks while playing.
There's currently no way to save controller and render settings to the VMU, but they are preserved when switching between games.
A prebuilt ELF executable is included with the source. I left out the .o files this time.
Edit: Attached file removed because of bug. Use the attachment from this post.