Vertical bars over HDMI / HyperRAM controller lock-up #145
@amb5l FYI
@AnttiLukats: Do you have any advice? For example, does it make sense to tweak the IDELAY constant? You changed this from 20 to 12 in your fix. We are not sure whether we should tweak it up or down, though, if at all: ce927cb#diff-be2f3034c39a2edc9f50cff2e2bdbddb942305f2f8016eb3d7add013748a8ad8L58
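For readers following along, here is a minimal sketch of what such a fixed-tap input delay looks like on a Xilinx 7-series device. The instance and signal names are illustrative assumptions, not the actual code from the commit; only the IDELAY_VALUE generic reflects the 20 -> 12 change being discussed (at a 200 MHz reference clock, one tap is roughly 78 ps):

```vhdl
-- Sketch of a fixed-tap IDELAYE2 on the RWDS input (Xilinx 7-series).
-- Instance/signal names are hypothetical; an IDELAYCTRL with a 200 MHz
-- reference clock must also be instantiated somewhere in the design.
library unisim;
use unisim.vcomponents.all;

i_delay_rwds : IDELAYE2
  generic map (
    IDELAY_TYPE      => "FIXED",
    IDELAY_VALUE     => 12,       -- the constant in question; was 20
    DELAY_SRC        => "IDATAIN",
    REFCLK_FREQUENCY => 200.0,    -- one tap is ~78 ps at this reference
    SIGNAL_PATTERN   => "DATA"
  )
  port map (
    IDATAIN     => rwds_in,           -- from the pad
    DATAOUT     => rwds_in_delayed,   -- to the sampling logic
    DATAIN      => '0',
    C           => '0',
    CE          => '0',
    INC         => '0',
    LD          => '0',
    LDPIPEEN    => '0',
    REGRST      => '0',
    CINVCTRL    => '0',
    CNTVALUEIN  => (others => '0'),
    CNTVALUEOUT => open
  );
```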
Interesting idea from @amb5l via Skype:

Hi, the "BUFR + IODELAY12 fix" made a design that I am testing on a CR00107 Spartan-7 very stable. Before the fix it was failing on one single type of RAM check: the increment test passed, walking one passed, walking zero passed, constant 0xAAAA5555 passed, and incrementing address failed. After changing IDELAY to 12, all tests passed, and they have been passing for 7+ days of continuous testing. When the error occurred, it was always a single bit flip, read out wrong once; the next read to the same location already returned the correct value. If my fix makes the R3 show the stripe pattern, then this may already lead to the solution. The BUFR + IODELAY12 fix, I would say, can only introduce a "read wrong bit once" type of error. The delay of 12 should be pretty good; I do not think that IDELAY tuning can fix the stripe error. The stripe error must be somewhere else, possibly not in the HyperRAM IP core at all. Some stupid timing issue that happens.
@AnttiLukats Thank you. The problem is that the fix is also not stable on all R6 boards; see above. We have at least two R6 users who have the original barcode problem. We thought we had solved it because none of us, including @paich64 (who did multi-day tests) and including Deft, could reproduce it any more. Scroll up and read "Summary of the current situation". This "barcode" problem occurs because the ASCAL component in our HDMI rendering pipeline fails due to a locked-up HyperRAM. So we are sure that this is a HyperRAM issue, but we are not sure why.
You say "locked up HyperRAM": what do you mean by this? That the RWDS is not detected and the transaction on the memory bus is not completed at all? This would yield a very constant pattern, but probably on one transaction only, as the IP core would recover on the next read. A stopped clock should not be a problem. Dynamic latency should also not be the problem. A non-ideal IDELAY should not be the problem. To me it seems that an UNRELATED change in the code makes the bug appear more or less. This is an indicator that we have a timing issue somewhere. Can you increase the main clock timing constraint, say by 10%, and retry? Does it make a difference?
BARCODE ERROR:
The thing is that the HyperRAM IP core cannot return barcode data as data read from the HyperRAM. The fact that we see a barcode means the bug is somewhere else.
The barcode error happens because the HyperRAM IP core goes into a lock-up. What happens is that the Avalon Memory Map interface uses burst mode, and if the IP core does not return the correct number of words as requested, then the interface locks up (the client connecting to the IP core waits indefinitely). So the question, as I see it, is: what could cause the HyperRAM IP core to not return the correct number of words? One plausible cause I can see is the sampling of the RWDS signal to determine the latency of the transaction.
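To make the lock-up mechanism concrete, here is a hedged sketch of a client-side Avalon-MM burst read. The avm_* names follow Avalon-MM conventions, but the process, the C_BURST constant, and the state names are illustrative assumptions, not the actual rescaler code:

```vhdl
-- Illustrative Avalon-MM burst-read client. If the controller returns
-- fewer read-data words than burstcount requested, this process waits
-- in WAIT_DATA forever: exactly the lock-up described above.
process (clk_i)
begin
  if rising_edge(clk_i) then
    case state is
      when IDLE =>
        avm_read_o       <= '1';
        avm_burstcount_o <= std_logic_vector(to_unsigned(C_BURST, avm_burstcount_o'length));
        words_left       <= C_BURST;
        state            <= WAIT_DATA;

      when WAIT_DATA =>
        if avm_waitrequest_i = '0' then
          avm_read_o <= '0';                 -- command accepted
        end if;
        if avm_readdatavalid_i = '1' then    -- one word of the burst arrived
          words_left <= words_left - 1;
          if words_left = 1 then             -- last expected word
            state <= IDLE;
          end if;
        end if;
    end case;
  end if;
end process;
```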
I do not get this: does it lock up forever? Where does the barcode data come from then? The RWDS sampling for the latency check is hard to get wrong; I do not think that it is the problem. There is some time CSN must be deactivated between requests; is that OK with your core? But this is also not a likely issue. I must go; I will think about this some more.
The barcode is generated by the HDMI rescaler, which uses the HyperRAM as a frame buffer. The rescaler continuously reads frame data from the HyperRAM into a small RAM, and generates the output display from this RAM. So if the data in the RAM does not get updated (due to the HyperRAM IP core lockup), then we get these barcodes. Yes, the Avalon Memory Map interface protocol does not specify a timeout. So if the rescaler requests 64 words in a read request, and only gets 63 words, then the rescaler will just wait forever for the last word.
The problem is in the hyperram_rx.vhd file, at the very end (last line). The code as it is causes the state machine in the controller module to freeze the first time the FIFO is empty. So the code as-is only works if ALL data from the FIFO comes out as one chunk without any gaps. One gap in the FIFO stream, and STALL. Fix: -- This skips the first clock cycle of data from the FIFO. This is maybe not the best way, but the code still works and the bug should be gone.
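Antti's actual patch is the one referenced above. As an illustration of the failure mode only, here is a hedged sketch, with hypothetical signal names, of the difference between popping the receive FIFO unconditionally and popping it only when a word is actually present:

```vhdl
-- Sketch only: identifiers are illustrative, not the real hyperram_rx.vhd code.
-- Buggy pattern: popping every cycle assumes the RWDS-clocked write side
-- refills the FIFO before each clk_i edge; one gap in the stream and the
-- read side consumes a word that is not there, so the burst never completes.
-- Guarded pattern: consume only when the FIFO actually holds a word.
fifo_rd_en <= '1' when fifo_empty = '0' and words_remaining /= 0 else '0';

process (clk_i)
begin
  if rising_edge(clk_i) then
    if fifo_rd_en = '1' then
      rd_data_o       <= fifo_rd_data;        -- registered FIFO output
      rd_valid_o      <= '1';
      words_remaining <= words_remaining - 1;
    else
      rd_valid_o <= '0';
    end if;
  end if;
end process;
```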
@AnttiLukats Thank you for continuing to support us. @MJoergen is indeed currently investigating the FIFO, but he is not yet 100% convinced that your proposed solution will work. For your convenience, and for tapping into your brain as well, I am copying a group conversation between @MJoergen and @amb5l that happened on Skype. Thoughts and feedback welcome:

MJoergen: Status report:
Point 2 above means it is imperative that the controller correctly reads the RWDS. There is no possibility of work-arounds (e.g. adding extra clock cycles) to mitigate incorrect sampling. Here is the relevant page of the datasheet. This ILA trace shows a read transaction with long latency. It is captured using a 200 MHz clock, so that we can see the transitions on the RWDS signal. Some signals cannot be captured by the ILA, because the output register is placed in the I/O pad and cannot be connected to the ILA. In this build I've made use of the "falling edge + rising edge" two-stage synchronizer, so you see there is a latency of 10 ns from "rwds_in_delay_idelay" to "hb_rwds_in_i". The HyperRAM clock is generated by an ODDR fed from "hb_ck_ddr_o". There is a 10 ns latency from the "0 -> 2" transition of the "hb_ck_ddr_o" signal to the first rising edge of the HyperRAM clock; this is not seen in the ILA trace. The yellow marker in the ILA trace is the sampling time of the "hb_rwds_in_i" signal. Looking at the ILA trace, it seems close / marginal: I'm comparing the sampling time at time 2048 with the falling edge at 2050, a difference of just 10 ns, i.e. one clock cycle. In other words, it would seem "more" stable to delay the hb_rwds_in_i signal by one clock cycle, but that is exactly what we're already doing in the released version. As mentioned above, this trace is with the "faster" synchronizer. I'm beginning to suspect that we (I) have not fully understood the problem (yet).

amb5l: OK, I am thinking about this. You've seen Antti's suggested fix? He suggests that the read data FIFO is going empty, and that stalls the Avalon burst, which causes a crash.

MJoergen: I couldn't quite follow his reasoning. Nor do I see how the read data FIFO could suddenly go empty. But as mentioned, I may be missing something.

amb5l: Does the Avalon side of the FIFO wait for a "not empty" flag, or does it just start reading at a fixed point in time (perhaps determined by RWDS/latency)?

MJoergen: It waits (indefinitely). In the HyperRAM example design (separate repo) I've added some error signals (FIFO overrun/underrun, etc.). But that's only helpful if we can reproduce on our own board.

amb5l: Of course. I hope (sort of) that my MEGA65 will exhibit this issue. So let me understand: your controller causes clocks and address/command bytes to be fired into the HyperRAM for a read cycle, and for the first 3 clocks (6 edges) the RWDS signal is driven by the memory to signal latency. You sample it, presumably as you get to the end of driving the address/command, to decide what to do next. Why, given that RWDS will have been driven for a while, do you pass it through a synchroniser?

MJoergen: Not for any specific reason. More a case of trying to follow "best practices". But I get your point.

amb5l: You may need a timing exception in your constraints if you go asynchronous (I think you should try this). Otherwise the tools work to close timing on an assumed synchronous RWDS input. Perhaps a FALSE PATH to the registers in the state machine that sample it.

MJoergen: But will it make any difference? My reasoning is as follows: in the released version there is a two-stage synchronizer. If I remove that and simultaneously sample the signal two clock cycles earlier, the net result should be the same, right?

amb5l: The problem is that I am not very clear about when the controller samples RWDS, and what the latency from the controller to the HyperRAM pins (through IOB registers etc.) is.
If the controller is sampling RWDS "soon", then the synchroniser will introduce a risk of reading the wrong value. Could you run a quick simulation and screenshot the HyperRAM signals next to the controller state machine state and the synchroniser output? Sorry if this is a lot to ask.

MJoergen: That should take just a few minutes.

amb5l: Will really help me understand. Away from my desk for 20 minutes now.

MJoergen: This shows a read transaction with short latency. The yellow marker is the time of sampling the signal "hb_rwds_in_i". You'll notice that the sampling takes place roughly two clock cycles later than required (relative to the HyperRAM signals), in order to account for the input synchronizer. The HyperRAM signals are the five indented rows near the top. This is based on a simulation model I downloaded from the Internet. Note: this is supposedly a 16-bit read (i.e. one word), but the controller sends an additional clock cycle to the HyperRAM, and therefore receives two pulses on the "hr_rwds_io" signal. The first pulse contains the word 0x1698, which is what is returned to the client. The second word is discarded. This little trick is needed because the clock driving the CDC FIFO is not free-running, so in order to "push" the data out of the FIFO, I have to add an "extra" clock cycle. Here is another waveform where I've added the Receive FIFO signals. I'm actually a bit worried that the "src_clk_i" signal (which is the IDELAY'ed version of the "hr_rwds_io" signal) is red sometimes. This reflects that sometimes neither the controller nor the HyperRAM is driving this signal, and yet we're feeding it into the clock of a flip-flop. This would make it susceptible to noise. I do have the "src_valid_i" signal, which essentially is a clock-enable of said flip-flop. I can't quite see what effects it would have if the receive FIFO receives too many clock cycles. That could lead to overrun (rather than underrun). Not good, for sure. But can this explain the controller lock-up? Here is a trace with a "long" latency. I'm logging out for now, but feel free to write your thoughts, and I'll look into it tomorrow.

amb5l:
I will keep thinking.

MJoergen: OK, I have a new theory. I'm looking at the last waveform, in particular the Receive FIFO. It seems to me that there is a rather long delay from "src_gray_wr" to "dst_gray_rd". Specifically, I'm concerned about a FIFO overrun. The FIFO contains two words and uses gray codes for the counters. The input and output clocks are the same frequency (nominally; more about that below). I have a two-stage synchronizer on the FIFO data signals, and a three-stage synchronizer on the control signals. This difference is to prevent the control signals from updating before the data signals. Another thing I haven't considered before is the variability of the Clock -> RWDS delay. I've so far only considered this variability between devices, but it could also manifest as jitter; I see nothing in the datasheet stating otherwise. A large jitter, plus differences in the path lengths of the data and control signals in the FIFO, could lead to overflow, because the input frequency is not constant. I'm trying to demonstrate this theory in simulation. Right now I'm surprised this works at all, i.e. on any boards. It seems to me (based on the simulation waveform) that it should fail all the time, at least for burst transactions. Regarding a possible solution, I'm considering two options: either reduce the synchronizer length on the FIFO data and control signals (to one and two stages respectively), or increase the FIFO depth from two to four words.

amb5l: Interesting theory. I like the idea of a four word FIFO, personally.
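For readers who want to picture the FIFO internals being discussed, here is a minimal, self-contained sketch of a gray-coded pointer crossing with the synchronizer depth and FIFO depth as generics. All names are hypothetical, and the real controller's FIFO is certainly organized differently:

```vhdl
-- Minimal gray-coded pointer CDC, as a sketch of the scheme described above.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity gray_ptr_cdc is
  generic (
    G_ADDR_BITS : natural := 2;  -- 2**2 = 4 words (the proposed depth)
    G_STAGES    : natural := 3   -- synchronizer flops on the control path
  );
  port (
    src_clk_i  : in  std_logic;                        -- RWDS-derived write clock
    src_ptr_i  : in  unsigned(G_ADDR_BITS downto 0);   -- binary write pointer
    dst_clk_i  : in  std_logic;                        -- controller clock
    dst_gray_o : out std_logic_vector(G_ADDR_BITS downto 0)
  );
end entity gray_ptr_cdc;

architecture rtl of gray_ptr_cdc is
  signal src_gray : std_logic_vector(G_ADDR_BITS downto 0) := (others => '0');
  type sync_t is array (0 to G_STAGES - 1) of std_logic_vector(G_ADDR_BITS downto 0);
  signal sync : sync_t := (others => (others => '0'));
begin
  -- Binary-to-gray in the source domain: only one bit changes per increment,
  -- so a metastable sample is off by at most one count, never garbage.
  process (src_clk_i)
  begin
    if rising_edge(src_clk_i) then
      src_gray <= std_logic_vector(src_ptr_i xor ('0' & src_ptr_i(G_ADDR_BITS downto 1)));
    end if;
  end process;

  -- Multi-stage synchronizer in the destination domain (3 stages for
  -- control, per the description above; data paths would use 2).
  process (dst_clk_i)
  begin
    if rising_edge(dst_clk_i) then
      sync <= src_gray & sync(0 to G_STAGES - 2);
    end if;
  end process;

  dst_gray_o <= sync(G_STAGES - 1);
end architecture rtl;
```

Raising G_ADDR_BITS from 1 to 2 is the "two to four words" change; shortening G_STAGES trades crossing latency against metastability margin, which is exactly the trade-off weighed above.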
Long story. But there really is a bug that locks up the controller once the FIFO is EMPTY, and my fix fixes it, so why not test it out? What I reported is a REAL BUG even if it is not our current problem, so a FIX should be applied!
@AnttiLukats Yes, your proposed fix is independent from @MJoergen's odyssey. Also the four word FIFO idea from @MJoergen makes sense and might improve the situation. So what we will do is: create a test build and work with the two guys on Forum64.de who can reproduce the problem to test it. This will take a few days. Please stand by for results. Thank you again for your persistence :-) and for your support.

===========

Well, there is a BUG in the existing code that causes a stall the first time the FIFO goes EMPTY. My fix fixes this issue; the code with the FIX still works here. The FIFO is at most 1 full, so there is always 0 or 1 word to read out from the FIFO. So the current code only works when the delayed RWDS always writes one word before the next clk_i rising edge comes, i.e. the FIFO is always 1 full when the clk_i edge comes. This works well as long as the RWDS falling edge is sufficiently far away from the clk_i rising edge. But this is not always the case: if RWDS gets close to the clk_i rising edge, then the FIFO may run empty, which again causes the HyperRAM controller to stall. This aligns with all the symptoms we have so far. Changing IDELAY may have some impact on the behaviour; I think this was also observed. I think we have good chances that the "EMPTY FIX" really fixes the barcode error case. But even if it does not, it should still be applied, as it fixes a bug in the code. A sketch of the two-clock race follows.
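A hedged sketch of the race described above, with hypothetical names: the write side of the receive FIFO runs on the delayed RWDS, the read side on clk_i, and when the last RWDS edge lands too close to a clk_i edge, the freshly written word is not yet visible to the reader:

```vhdl
-- Illustrative two-clock FIFO halves; not the actual controller code.
write_proc : process (rwds_delayed)
begin
  if rising_edge(rwds_delayed) then           -- RWDS-derived write clock
    fifo_ram(to_integer(wr_ptr)) <= dq_word;  -- word assembled from both DDR edges
    wr_ptr <= wr_ptr + 1;
  end if;
end process write_proc;

read_proc : process (clk_i)
begin
  if rising_edge(clk_i) then
    -- If the RWDS edge above falls just before this edge, wr_ptr_synced
    -- still shows the old value and the FIFO looks empty for one cycle.
    if wr_ptr_synced /= rd_ptr then           -- "not empty"
      rd_data  <= fifo_ram(to_integer(rd_ptr));
      rd_valid <= '1';
      rd_ptr   <= rd_ptr + 1;
    else
      rd_valid <= '0';
    end if;
  end if;
end process read_proc;
```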
Been thinking some more. My fix fixes the problem only when the FIFO goes empty exactly one time during a burst. If that happens more than once, then the FIFO would overrun. So the PROBLEM is there; my fix may help the issue, but it is not a full fix without increasing the depth of the FIFO, to say 32, to be on the fully safe side. I tried to replace the HyperRAM FIFO with xpm_fifo_async but failed; my code did not work. :(
Been thinking some more. A FIFO depth of 32 is for sure not required. I guess a depth of 2 is sufficient as well: depth 3 is 100% sufficient, and depth 2 is possibly also OK.
This is the fix suggested by Antti Lukats in the comment #145 (comment)
Good! I was thinking and drawing and doing Excel, and I think that in reality a 2-deep FIFO would also work. There is a theoretical possibility that the FIFO occupancy would reach 3 in the case that the RWDS jitter during a single burst is 5 ns; this does not happen on real hardware, so I guess 2 levels would also be OK. But if we have 4 levels, we are for sure on the safe side! The patch is running on the CR00107 + MicroBlaze demo design (I had to apply the BUFR_IODELAY12 patch also). So the 4-level FIFO works!
@AnttiLukats As you have seen above: Michael did an Alpha release that includes your "Empty FIFO" bugfix and expands the Receive FIFO to 4 words: https://github.com/MJoergen/C64MEGA65/commits/V5.2A1/ What is next: We will test the core internally and then share it with the folks on Forum64.de who can reproduce the barcode problem on their machines.
Testing has just started :)
@paich64 Please share this file with TheOldMan and mgreima on Forum64, and either (a) invite/motivate them to document their results directly here in the GitHub issue (creating a GitHub account is easy) or (b) please capture their feedback in the Forum and document their results here: C64MEGA65-V5.2A1-experimental-R6-boards-only.zip I guess you cannot upload ZIPs to Forum64, so for your convenience here is a short link to this very file which directly starts the download: https://t.ly/ujcev
Since the experimental core is on a branch called "V5.2A1", which might go away if it is not successful, I decided to document the corresponding inofficial.md entry in the "develop" branch, as this will be persistent information.
I did so. No feedback yet; I will keep checking and report here. I have also asked TheOldMan and mgreima on Forum64 to report here if they can.
On Forum64 I am TheOldMan. I have now tested all affected games for the "barcode" bug with the new core 5.2. None of the errors occurred! There were no problems even with different resolutions. No barcodes were visible in any game!
@Homezecke Thank you VERY MUCH for reporting this bug and for testing the workaround. Please continue to use this Alpha 1 release of version 5.2 of the C64 core, which is supposed to be released in 2025 (or late in 2024). Please do not proactively encourage people to use it if they are not encountering the "barcode" bug. Currently we assume that less than 1% of the R6 population is affected by this glitch in our HyperRAM controller. We will need quite some additional time to find/fix the root cause and run through our release test protocol; therefore this Alpha release is only meant as a "painkiller" for people like you, and not for the broad public. I wish you a lot of fun with the C64 core :-)
@MJoergen and @AnttiLukats The fix now works for two users who experienced issues: @Homezecke from Forum64 and "Sidde" from Discord; see screenshot:
We just got confirmation that the issue is also fixed for @mgreima!
Awesome! Thank you @paich64 for supporting Forum64. I documented the next steps on Discord; see also this link: https://discord.com/channels/719326990221574164/801767398675316756/1261683009451462709
And with that I am closing this issue. @MJoergen, @AnttiLukats, and @amb5l: THANK YOU for your outstanding work.
Reopened to track the progress of adding this to the 5.2 maintenance release. |
@MJoergen @paich64 @AnttiLukats Presumably our "favorite HyperRAM issue" is back:
User TheOldMan on Forum64 reported these vertical bars over HDMI, and @sho3string reported vertical bars in Arcade cores, which I assume are similar to these. @sho3string, please confirm whether it looks like this and tell us: is this on an R6 machine or on an R3/R3A?
User TheOldMan confirmed: he works on an R6 machine
@MJoergen @AnttiLukats: