
0.4% of Acquisitions Contain Inverted Points #280

Open
scott-guthridge opened this issue Dec 20, 2024 · 10 comments


@scott-guthridge
Contributor

LibreVNA Version

LibreVNA Version (64 bit): 1.6.1-472594272
OS: Fedora Linux 38 (Workstation Edition)
CPU Arch: x86_64

Steps to reproduce

Example 1

DUT is an SMA tee with 25 ohm shunt resistor on the tap connected to both VNA ports through 12 inch lengths of RG316. Ignoring delays and imperfections within the tee, the S-parameters of the DUT are:

-0.5 0.5
0.5 -0.5
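
As a quick check of the matrix above: for an ideal shunt resistor R between the through line and ground, with y = Z0/R, the S-parameters are S11 = S22 = -y/(2+y) and S21 = S12 = 2/(2+y). A minimal sketch, assuming a 50 ohm reference impedance and an ideal, delay-free tee (the function name is illustrative only):

```python
def shunt_resistor_s_matrix(r_shunt, z0=50.0):
    """S-matrix of an ideal two-port with a resistor from the through line to ground."""
    y = z0 / r_shunt                # shunt admittance normalized to 1/z0
    reflection = -y / (2.0 + y)     # S11 = S22
    transmission = 2.0 / (2.0 + y)  # S21 = S12
    return [[reflection, transmission],
            [transmission, reflection]]


print(shunt_resistor_s_matrix(25.0))  # [[-0.5, 0.5], [0.5, -0.5]]
```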

The following steps were run on Linux. They should be portable to any POSIX environment; the signal handling, redirection and sort steps would likely need changes on native Windows. Note that the measure-time-series.py script creates 501 output files: 000.out, 001.out, 002.out, ... 500.out. Each file corresponds to one frequency point. The frequency range is 1 MHz to 6 GHz with 501 points. No calibration is loaded.

  • nohup python3 measure-time-series.py > errs 2>&1
  • wait at least 2 hours
  • interrupt (^C) the script (or kill -2 pid)
  • pip install libvna
  • python3 find-inverted-results.py > bad-results
  • sort -k1n -k3n -o bad-results bad-results

The find-inverted-results.py script works by averaging each frequency point across all acquisitions (saved as mean.s2p and mean.npd). Then it makes a second pass over all the files and flags outliers.
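
The two-pass idea can be sketched as follows. This is not the code from scripts.zip: the data layout, the function name and the 45 degree tolerance are assumptions made for illustration. Because inverted points are rare, they shift the per-frequency mean only slightly, so the mean still serves as the expected value.

```python
import cmath
import math


def find_inverted(acquisitions, max_angle_error_deg=45.0):
    """acquisitions: list of sweeps, each a list of complex values (one per frequency point).

    Returns (acquisition index, frequency index, magnitude ratio, angle in degrees)
    for every point whose phase is within the tolerance of 180 degrees from the mean.
    Assumes no per-frequency mean is exactly zero.
    """
    n_acq = len(acquisitions)
    n_freq = len(acquisitions[0])
    # Pass 1: mean of each frequency point across all acquisitions.
    means = [sum(acq[k] for acq in acquisitions) / n_acq for k in range(n_freq)]
    # Pass 2: compare each measurement with the mean of its frequency point.
    flagged = []
    for i, acq in enumerate(acquisitions):
        for k, value in enumerate(acq):
            quotient = value / means[k]
            angle = math.degrees(cmath.phase(quotient))
            if abs(angle) > 180.0 - max_angle_error_deg:
                flagged.append((i, k, abs(quotient), angle))
    return flagged
```

The magnitude and angle of measured / expected correspond to columns 10 and 11 of the output described below.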

Output columns in bad-results are:

  1. acquisition number (starting with 0)
  2. time in seconds since beginning of script
  3. frequency index (starting with 0)
  4. frequency in Hz
  5. name of bad parameter: one of "S11:", "S12:", "S21:", "S22:"
  6. real part of measured parameter
  7. imaginary part of measured parameter
  8. real part of expected (average) parameter
  9. imaginary part of expected (average) parameter
  10. magnitude of measured / expected
  11. angle in degrees of measured / expected

Scripts are here:
scripts.zip

Some Example Bad Entries

Acquisition #5921   Frequency 4.740 GHz
Parameter Actual Expected Quotient Check
S11 -0.118082+0.051205j -0.118408+0.051172j +0.998 ∠ -0.1°
S12 +0.265770+0.097342j +0.265888+0.097771j +0.999 ∠ -0.1°
S21 -0.147953-0.259713j +0.145529+0.260590j +1.001 ∠ 179.5°
S22 +0.227269+0.050660j -0.226978-0.052566j +0.999 ∠ 179.5°

Acquisition #6133   Frequency 3.588 GHz
Parameter Actual Expected Quotient Check
S11 +0.265695-0.006666j -0.266811+0.004451j +0.996 ∠ 179.5°
S12 -0.273429+0.048305j +0.274837-0.045948j +0.996 ∠ 179.5°
S21 +0.120791-0.298966j +0.115721-0.301938j +0.997 ∠ 1.0°
S22 -0.208098+0.273133j -0.204339+0.277424j +0.997 ∠ 0.9°

Acquisition #6715   Frequency 5.928 GHz
Parameter Actual Expected Quotient Check
S11 +0.164147+0.064187j -0.163781-0.066038j +0.998 ∠ 179.4°
S12 -0.184360+0.084862j +0.185596-0.083079j +0.998 ∠ 179.4°
S21 -0.163825-0.107174j +0.165197+0.105820j +0.998 ∠ -179.4°
S22 +0.115633+0.226445j -0.118344-0.225502j +0.998 ∠ -179.4°

Example 2

This example is the same as above except that the DUT is now a short standard on VNA port 1 and an open standard on VNA port 2, both attached via 12 inches of RG316. No calibration is loaded.

Here, S12 and S21 are theoretically zero, yet we can still see the same pattern of errors. I used the ?-mark on entries close to zero that nevertheless appear to be inverted.

Acquisition #128331   Frequency 4.740 GHz
Parameter Actual Expected Quotient Check
S11 +0.322637+0.174074j -0.322078-0.175699j +0.999 ∠ 179.7°
S12 +0.000040+0.000163j -0.000051-0.000181j +0.896 ∠ -178.0°
S21 -0.000039-0.000094j +0.000024+0.000090j +1.093 ∠ 172.1°
S22 -0.052470-0.297759j +0.060354+0.296529j +0.999 ∠ -178.5°

Acquisition #128724   Frequency 3.588 GHz
Parameter Actual Expected Quotient Check
S11 -0.559049-0.026885j -0.558348-0.025120j +1.001 ∠ 0.2°
S12 -0.000018-0.000069j +0.000001-0.000035j +2.036 ∠ -15.5°
S21 -0.000074+0.000105j +0.000085-0.000082j +1.089 ∠ 168.9°
S22 -0.132437+0.312732j +0.132146-0.312567j +1.001 ∠ -180.0°

Acquisition #128734   Frequency 4.752 GHz
Parameter Actual Expected Quotient Check
S11 +0.357803+0.035947j -0.357439-0.038066j +1.000 ∠ 179.7°
S12 +0.000060+0.000185j -0.000063-0.000165j +1.100 ∠ -177.0°
S21 +0.000037+0.000113j +0.000034+0.000088j +1.263 ∠ 2.9°
S22 +0.122057+0.262602j +0.128775+0.260084j +0.998 ∠ 1.4°

In both examples, I chose three adjacent bad entries that together cover all three cases: S11 bad, S22 bad, both bad. Interestingly, the three frequency points also happen to be the same between the two examples.

The inverted frequency points within an acquisition sometimes appear alone, but more often appear in runs.
The most popular run length was 1 followed in order by 98, 95, 6, and 96. Only these five different run lengths appear in this 170,000-acquisition long time series. Here are some example runs:

Acq # Count fMin fMax Parameters
3765 1 3.588 3.588 S22
3858 95 3.600 4.728 S11
4199 95 3.600 4.728 S11,S22
4308 1 3.588 3.588 S11
4337 1 4.740 4.740 S11
4411 1 4.740 4.740 S11
4437 1 3.588 3.588 S11
4530 6 5.940 6.000 S22
4838 1 3.588 3.588 S11
4898 95 3.600 4.728 S11
5494 95 3.600 4.728 S22
5783 95 3.600 4.728 S11
5863 1 4.740 4.740 S11
5984 1 5.928 5.928 S11,S22
6023 95 3.600 4.728 S11,S22
6167 1 5.928 5.928 S11,S22
6236 6 5.940 6.000 S11
6357 1 4.740 4.740 S11
6555 6 5.940 6.000 S11
6621 1 3.588 3.588 S22
6885 1 4.740 4.740 S11
6941 98 4.752 5.916 S22
6969 1 5.928 5.928 S11,S22
7130 1 3.588 3.588 S22
7452 1 3.588 3.588 S11,S22
7963 1 5.928 5.928 S11,S22
8058 98 4.752 5.916 S11
8264 98 4.752 5.916 S22
8734 1 5.928 5.928 S22
8742 1 3.588 3.588 S11,S22
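
A table like the one above can be derived from the flagged points by merging consecutive frequency indices within each acquisition. A minimal sketch, assuming the bad points are available as (acquisition number, frequency index) pairs; the function names are illustrative and not taken from the attached scripts:

```python
from collections import Counter


def runs_per_acquisition(bad_points):
    """bad_points: iterable of (acquisition number, frequency index) pairs.

    Returns a list of (acquisition, first index, last index, run length) tuples.
    """
    runs = []
    for acq, idx in sorted(set(bad_points)):
        if runs and runs[-1][0] == acq and runs[-1][2] == idx - 1:
            # Extend the current run by one frequency point.
            runs[-1] = (acq, runs[-1][1], idx, runs[-1][3] + 1)
        else:
            runs.append((acq, idx, idx, 1))
    return runs


def run_length_histogram(runs):
    """Count how often each run length occurs."""
    return Counter(length for _, _, _, length in runs)
```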

Note that frequency span was 1 MHz to 6 GHz, but ALL of the bad entries appear above 3.5 GHz.

The spacing in time of bad acquisitions appears random:

image

But it is quite uniform when viewed cumulatively:

image

A KS test done on the inter-arrival times could not reject the null hypothesis that the intervals are distributed according to an exponential distribution (e.g. from a Poisson process).
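
For reference, the KS statistic against an exponential distribution can be computed by hand as sketched below. This is an illustration, not the code used for the test above; the function names are made up. Note that the usual large-sample 5% threshold of about 1.36/sqrt(n) applies only when the mean interval is specified independently of the data. If the mean is estimated from the same samples, a corrected (Lilliefors-type) threshold is needed.

```python
import math


def inter_arrival_times(event_times):
    """Differences between consecutive event times (seconds)."""
    ts = sorted(event_times)
    return [b - a for a, b in zip(ts, ts[1:])]


def ks_statistic_exponential(intervals, mean_interval):
    """Kolmogorov-Smirnov distance between the samples and an exponential CDF."""
    xs = sorted(intervals)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = 1.0 - math.exp(-x / mean_interval)
        # Largest gap between the empirical step function and the model CDF,
        # checked just after and just before each step.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d
```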

Expected behavior

Expected result is that repeated measurements of the same DUT should yield similar amplitudes and phases.

Extra information & Setup and Calibration files

No calibration is loaded.

@jankae
Owner

jankae commented Dec 20, 2024

Thank you very much for this excellent and thorough report! I have to admit that this is not something I have noticed before. But I mostly do short tests of new features/bugfixes with my LibreVNA and do not actually use it for longer measurements most of the time. 0.4% might just be infrequent enough to slip through my tests while obviously still much too frequent to be acceptable.

At first I was very confused by this, but looking at your results a bit more, I have a theory now: The 2.LO outputs may "sometimes" have a phase difference of 180° between them. If that is indeed the case, it puts the following constraints on the inverted points:

  • S11/S12 and S22/S21 form mutually inverted pairs for each frequency point: For example, if S11 is inverted, S12 must be inverted as well
  • Inverted points at specific frequencies always show up with the same run length. For example, if the point at 3.6 GHz is inverted, this patch of inverted points will always have a run length of 95 (at least for a sweep with your frequency span and number of points)

Based on the example data you provided, this seems to be true.
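
The first constraint can be illustrated with a toy model in which a 180° error on one receiver's 2.LO simply negates that receiver's complex value before it is divided by the reference receiver. The data layout and names below are assumptions for illustration and do not reflect the actual GUI code:

```python
def s_parameters(port1_rx, port2_rx, ref_rx):
    """Ratio of each port receiver to the reference receiver for one stimulus direction."""
    return port1_rx / ref_rx, port2_rx / ref_rx


def measure(raw, invert_port1=False, invert_port2=False):
    """raw: {'fwd': (p1, p2, ref), 'rev': (p1, p2, ref)} complex receiver values.

    A 180 degree error on a receiver's 2.LO is modelled as a sign flip of that receiver.
    """
    sign1 = -1 if invert_port1 else 1
    sign2 = -1 if invert_port2 else 1
    p1, p2, ref = raw['fwd']   # stimulus on port 1
    s11, s21 = s_parameters(sign1 * p1, sign2 * p2, ref)
    p1, p2, ref = raw['rev']   # stimulus on port 2
    s12, s22 = s_parameters(sign1 * p1, sign2 * p2, ref)
    return {'S11': s11, 'S21': s21, 'S12': s12, 'S22': s22}
```

In this model, inverting the port 1 receiver flips S11 and S12 together, and inverting the port 2 receiver flips S21 and S22 together. All four flip if both port receivers are inverted or if only the reference receiver is, so the model cannot distinguish those two cases.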

I'll try to come up with a hardware experiment to verify this theory. I very much hope that it is solvable in software and not just a hardware limitation of the Si5351C.

With Christmas and New Year's Eve coming up, it might take me a bit longer than usual, but I'll let you know as soon as I have more info on this.

@scott-guthridge
Contributor Author

A few questions:

(1) Does the reference channel share an LO output with one of the main channels? And is the reference already divided out of these measurements? If so, seems like it would hide the problem on one channel.

(2) What is the exact measuring sequence? For example, another VNA I have does a sweep driving stimulus on port 1 to measure S11 and S21 for all points, then it does a second sweep driving stimulus on port 2 to pick up S12 and S22. Within each frequency, it leaves RF at a constant phase while varying the phase of LO to each of 0, 90, 180 and 270 degrees. It combines these four sub-measurements to subtract out any DC offsets. I connected this VNA to the scope and noticed that it switches the stimulus back and forth between the ports as it sweeps. Does it also measure at four phase offsets?

(3) I noticed that *RST sets the start frequency to 0. (3a) What does this VNA do with a frequency of zero -- does that actually work? The version of EagleCAD I have is a paid version, but too old to be able to open the schematics, so I can't check for blocking capacitors. (3b) Could you make pdf versions of the schematics available? (3c) Is the default start frequency 0 or 1MHz? Is *RST consistent with power-on state?

Another observation: if I start the sweep at 0 instead of 1 MHz (which also shifts all but the last frequency point down a bit), the inversion problem seems to happen much less frequently, more like 0.005%, but it also affects lower frequencies when it does happen. This suggests something specific to certain frequency values.

@jankae
Owner

jankae commented Dec 20, 2024

I'll change the order of your questions because I think this will help me with some later answers:

(3b) Could you make pdf versions of the schematics available?

They are already available here: https://github.com/jankae/LibreVNA/blob/master/Hardware/Schematic.pdf

(3c) Is the default start frequency 0 or 1MHz?

The "default" default start frequency is 1 MHz. Take this with a grain of salt, because there is not one default power-on state. Depending on the preferences, several options are available:

  • recall the last used settings
  • set configurable default settings
  • load a setup file (which could configure whatever it wants)

I say the default is 1 MHz because a LibreVNA-GUI that is running for the first time on a computer will have its preferences set to load some default settings, and these default settings have the start frequency set to 1 MHz.

Is *RST consistent with power-on state?

Since there is no fixed power-on state, it can be consistent with that but does not have to be. *RST always sets the maximum span as advertised by the VNA. The LibreVNA-GUI is intended to be used with different devices (although there really is only the one LibreVNA at the moment) and does not impose any limits on the frequency range itself. Instead, the USB protocol includes a "DeviceInfo" packet, which tells the GUI what the connected device is capable of. The LibreVNA has the start frequency in that packet set to 0, so that will be set after sending the *RST command.

(3a) What does this VNA do with a frequency of zero -- does that actually work?

No, it does not. You will actually see an error message in the device log if you start all the way from 0 Hz:
09827 [SI5351,ERR]: Unable to reach requested frequency
But I am not a big fan of restricting hardware more than absolutely necessary. The official lower limit for the LibreVNA is 100 kHz. At that frequency, it still performs reasonably well.

But what if someone wants to measure something just below that? The performance will be worse but maybe still useful and I do not want to prevent anyone from going a bit lower if that helps. So I did not set a hard limit in the software. You can go as low as you want and the VNA will do its best to still measure something. Of course this will work less and less the lower you go. The DC blocking caps reduce the output signal level, the mixers do not work that well anymore, and so on. At a certain point (way above 0 Hz) you will not get anything useful at all.

(2) What is the exact measuring sequence? For example, another VNA I have does a sweep driving stimulus on port 1 to measure S11 and S21 for all points, then it does a second sweep driving stimulus on port 2 to pick up S12 and S22.

This is also what I have observed on a different VNA. But I never really understood the reasoning behind it. At faster sweep speeds, the settling time of the PLLs actually takes up a significant amount of the overall sweep time. And if you drive the stimulus at one port first for the whole sweep and then at the other port again, you are sweeping the whole frequency range twice. If you switch between the ports for each frequency point, the stimulus PLL only has to settle to each point once. I am sure there is a reason why a lot of (all?) VNAs do it differently than the LibreVNA, but I am not aware of any disadvantages with my approach so far.

Within each frequency, it leaves RF at a constant phase while varying the phase of LO to each of 0, 90, 180 and 270 degrees. It combines these four sub-measurements to subtract out any DC offsets.

I am not sure what the hardware architecture of that VNA looks like but this sounds unlike anything I am familiar with. Is it doing a direct conversion all the way to DC? If so, how are different IF bandwidths implemented?

I'll combine my answer to this with the next question:

(1) Does the reference channel share an LO output with one of the main channels? And is the reference already divided out of these measurements? If so, seems like it would hide the problem on one channel.

This will be a long one. The LibreVNA uses 3 down-conversions per channel, although only two are obvious on the schematic.

Down-conversion 1: The incoming signal (or internal signal for the reference receiver) is mixed with the 1st LO (1.LO). This LO moves with the stimulus frequency and always sits 62 MHz above it. The 1.LO is generated by another MAX2871 PLL which has two differential outputs. These differential outputs are used as single-ended outputs instead. RFOUTA_P goes to port 1, RFOUTA_N to the reference, RFOUTB_P to port 2. RFOUTB_N is not used and terminated directly at the PLL.

Down-conversion 2: The incoming signal is sitting at a fixed frequency (62 MHz), which means that the 2nd LO (2.LO) can also stay at a constant frequency. Three outputs of the Si5351C IC are used for that and set to generate 61.75 MHz. I need this second down-conversion in this hardware architecture because 62 MHz is still too high to sample it directly with the ADCs and the 1.LO PLL can not go below 23.5 MHz (meaning it can not generate an LO low enough to directly convert the input signal to something the ADC can use, at least not for stimulus frequencies <23 MHz).

After the 2nd down-conversion the IF signal sits at a constant 250 kHz and this is low enough to be sampled by the ADC. The FPGA samples the ADC at 800 kHz. But we do not need some 250 kHz signal, we need the phase and amplitude of that signal. Everything from this point on happens in the digital domain.
The ADC samples are multiplied with a sine and cosine signal whose frequencies are also 250 kHz. This is a digital down-conversion to DC and gives us the real and imaginary signal parts we are interested in. And yes, you can think of these sine/cosine signals as a 3rd LO with 0° and 90° outputs. It is just a lot easier to do all that in the digital domain because you can easily remove DC offsets and add digital filters to change the IF bandwidth.
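
The digital down-conversion described above can be sketched in a few lines. This is a simplified model and not the FPGA implementation: it uses a plain average instead of the real filters, and the synthetic test tone and function names are assumptions. The 250 kHz IF and 800 kHz sample rate are taken from the text.

```python
import cmath
import math

F_IF = 250_000    # final IF in Hz (from the text)
F_ADC = 800_000   # ADC sample rate in Hz (from the text)


def digital_downconvert(samples, f_if=F_IF, f_s=F_ADC):
    """Mix real ADC samples with cosine and -sine at f_if and average down to DC."""
    acc = 0j
    for n, x in enumerate(samples):
        acc += x * cmath.exp(-2j * math.pi * f_if * n / f_s)
    # Factor 2 restores the amplitude of a real sinusoid.
    return 2.0 * acc / len(samples)


def if_signal(amplitude, phase_deg, n_samples=1600, f_if=F_IF, f_s=F_ADC):
    """Synthetic IF tone, only used to exercise the function above."""
    phase = math.radians(phase_deg)
    return [amplitude * math.cos(2 * math.pi * f_if * n / f_s + phase)
            for n in range(n_samples)]


normal = digital_downconvert(if_signal(0.3, 40.0))
flipped = digital_downconvert(if_signal(0.3, 40.0 + 180.0))
print(abs(normal), math.degrees(cmath.phase(normal)))
print(abs(flipped / normal), math.degrees(cmath.phase(flipped / normal)))
```

Since 250 kHz / 800 kHz = 5/16, any multiple of 16 samples contains a whole number of IF periods, so the plain average recovers amplitude and phase exactly in this idealized case. A 180° phase shift anywhere in the analog chain shows up as a negated complex result.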

The LibreVNA returns raw receiver values to the GUI (real and imaginary parts for each receiver). The GUI assembles these to complex receiver values and then of course divides the receiver values from port 1 and port 2 by the value from the reference receiver. This is how you end up with S-parameters.

At least that is the simplified view of things. In particular, the PLLs for the stimulus signal and 1.LO (both MAX2871s) are of interest. Every PLL has a limited frequency resolution determined by the bits in the fractional divider. The MAX2871 only has a 12 bit fractional divider. This means it simply can not generate certain frequencies. It can get close but not necessarily close enough. This results in a shift of the IF frequency at certain frequencies (when either the stimulus or 1.LO PLL deviates from the desired frequency). If this shift in frequency is large compared to the IF bandwidth, the VNA actually filters it out - it does not see anything anymore. This is obviously bad.

There is a solution to mitigate this though: the firmware can calculate the actual output frequency of the PLLs and check if the deviation is too much. It can then shift the 2.LO (which has much better frequency resolution) to compensate accordingly and bring the sampled IF back to 250 kHz. This feature is enabled by the "suppress invalid peaks" checkbox in the preferences.
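
The compensation can be illustrated with a simplified fractional-N model. Everything hardware-specific here is an assumption: the relation f = f_pfd * (N + F/M) with a modulus M of at most 4095 stands in for the 12 bit fractional divider, the PFD frequency is a placeholder, and output dividers and VCO range are ignored. The firmware's actual search and its deviation threshold are not shown in this thread.

```python
from fractions import Fraction

F_PFD = 26_000_000   # placeholder PFD frequency in Hz (assumption, not from the firmware)
IF1 = 62_000_000     # first IF in Hz (from the text)
IF2 = 250_000        # final IF in Hz (from the text)


def closest_fractional_n(target_hz, f_pfd=F_PFD, max_mod=4095):
    """Achievable frequency closest to target_hz for f = f_pfd * (N + F/M), 2 <= M <= max_mod."""
    ratio = Fraction(target_hz, f_pfd)
    n = ratio.numerator // ratio.denominator
    frac = ratio - n
    best = None
    for m in range(2, max_mod + 1):
        f = round(frac * m)
        candidate = f_pfd * (n + Fraction(f, m))
        if best is None or abs(candidate - target_hz) < abs(best - target_hz):
            best = candidate
    return best


def second_lo_for(stimulus_hz):
    """Return (actual stimulus, actual 1.LO, 2.LO that brings the final IF back to IF2)."""
    stim = closest_fractional_n(stimulus_hz)
    lo1 = closest_fractional_n(stimulus_hz + IF1)
    actual_if1 = lo1 - stim          # first IF including both PLL errors
    lo2 = actual_if1 - IF2           # shifted 2.LO
    return stim, lo1, lo2


stim, lo1, lo2 = second_lo_for(3_588_402_000)
print(float(stim - 3_588_402_000), float(lo2 - (IF1 - IF2)))
```

The second printed value is the amount by which the 2.LO would have to move away from its nominal 61.75 MHz at this (arbitrarily chosen) frequency point.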

And here is what I suspect causes this issue: when the 2.LO is switched to a different frequency, it is very important that all 3 outputs are still in phase. Any phase difference here will show up at the output as well. To fully understand it, you would need to read the Si5351C datasheet, but I am basically setting the internal output dividers for the new frequency and lose phase synchronization during that process because I can only configure them one after the other. To regain phase synchronization, I perform a PLL reset (of the internal Si5351C PLL) afterwards. I figured out experimentally that this works to align the phases again; I do not think the datasheet is very clear about what actually happens during the reset.

Another observation: if I start the sweep at 0 instead of 1 MHz (which also shifts all but the last frequency point down a bit), the inversion problem seems to happen much less frequently, more like 0.005%, but it also affects lower frequencies when it does happen. This suggests something specific to certain frequency values.

This very much matches my theory: If you start from 0 instead of 1 MHz, all the points in the sweep will have slightly different frequencies as well. This means the MAX2871 PLLs will have problems with different frequency points and thus the 2.LO frequency is also switched at different frequencies.

I think it boils down to two questions:

  1. Can two outputs of the Si5351C actually have a 180° phase difference after performing a PLL reset? If this is not the case, my whole theory is wrong and I have to start looking somewhere else.
  2. Can I do something in the I2C commands to the Si5351C to prevent such a phase difference? If not, this will be very difficult or impossible to fix.

@miek

miek commented Dec 20, 2024

If I'm following correctly, if the "suppress peaks" option is disabled then 2.LO would be left alone during the sweep and not reconfigured? It sounds like it'd be worth re-trying the original experiment with that option disabled to help test the theory.

@jankae
Owner

jankae commented Dec 20, 2024

Yep, that should get rid of inverted S parameters if my theory is correct. It would of course also result in pretty much random spikes in all S parameters at the frequencies where the IF is then outside of the IF bandwidth.

@jankae
Owner

jankae commented Dec 20, 2024

  1. Can two outputs of the Si5351C actually have a 180° phase difference after performing a PLL reset? If this is not the case, my whole theory is wrong and I have to start looking somewhere else.

Question 1 answer: Looks like that can indeed happen.

Setup: Scope attached to all three 2.LO outputs. Si5351C is just configured in a loop to ramp from 61 MHz to 62 MHz in 1 Hz steps. After each frequency change, the PLL is reset to align the phase. The scope triggers on the rising edge of the reference LO (yellow):
ScreenImg
Ignore the intervals where the signal is constantly low; the outputs are automatically muted during the PLL reset. I had to turn on persistence, but you can clearly see that the other two outputs (red and cyan) are sometimes 180° out of phase.

Let's hope I can find a good answer to question 2, otherwise this will turn out to be quite the nightmare issue.

@jankae
Owner

jankae commented Dec 20, 2024

Updates with some more thoughts and discoveries: (partially for transparency, partially as just a good spot to write it down for myself)

This is the block diagram of the Si5351C:
image
From left to right:
There are two inputs from which all clocks are derived: XA/XB and CLKIN. XA/XB is driven by a TCXO in the LibreVNA (yes, it can also be driven directly by a clock, doesn't have to be a crystal). CLKIN is directly connected to the external reference input.
Following the inputs, there are two PLLs: PLL A and PLL B. Each PLL can generate a frequency between 600 and 900 MHz from either one of the clock inputs. In my firmware, both PLLs are either driven by XA/XB (when using the internal reference) or CLKIN (when using the external reference). These PLLs have a reasonably high-resolution fractional divider; they can generate frequencies with $\frac{1}{1048575} \cdot f_{in}$ spacing ($f_{in}$ being the frequency of the corresponding PLL input).

The 8 outputs can be driven by either PLL (or directly from the clock input, but that is not relevant here). Between the PLLs and the outputs are again some "multisynth" dividers. These can be thought of as frequency dividers, capable of dividing down the PLL frequency by a factor of 8 to 2048. These dividers have a resolution of $\frac{1}{1048575}$ again, so they can pretty much reach any desired frequency with enough accuracy.
Outputs 6 and 7 are restricted to integer division ratios between 6 and 254.

In the LibreVNA firmware, PLL A is set to a fixed 800 MHz and PLL B to 832 MHz. The output assignment (clock outputs numbered from 0) is this:

  0. Lowband source, generates the stimulus signal below 25 MHz
  1. 2.LO reference receiver, driven by PLL B (832 MHz)
  2. 2.LO port 1, driven by PLL B (832 MHz)
  3. Source PLL reference (104 MHz), driven by PLL B (832 MHz)
  4. 2.LO port 2, driven by PLL B (832 MHz)
  5. 1.LO reference (104 MHz), driven by PLL B (832 MHz)
  6. External reference out, driven by PLL A (800 MHz)
  7. FPGA clock (16 MHz), driven by PLL A (800 MHz)

The goal here is to be able to generate an arbitrary frequency at clock outputs 1, 2 and 4 (the three 2.LO outputs) with identical phases. I do this by changing the multisynth dividers for these outputs (this gets me the correct frequency but with arbitrary phase between them) followed by a PLL B reset (this aligns the phases).
The PLL reset is not described in more detail in the (somewhat confusing) datasheet and although it does align the phases, it sometimes aligns some of the outputs at a 180° phase.

Experiments showed that the likelihood of this phase reversal depends on the output frequency, or rather the divider setting: for integer division ratios, it never appears. For settings that are almost an exact integer, it is less likely and can take several minutes to happen (with frequency changes every millisecond). For divider settings that are exactly between two integers, it happens a lot more.

So, new idea: if it never happens with integer dividers, keep the divider between the PLL and the 2.LO outputs strictly at integer values. Obviously we need better frequency resolution at the output but we can change the PLL frequency instead. Since there are other outputs driven by PLL B as well (and they should not change their frequencies), they have to be moved to PLL A instead. This reserves PLL B completely for the 2.LO.
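
The arithmetic behind this idea can be sketched as follows: pick an integer output divider and move the PLL frequency so that divider times output frequency lands inside the 600 to 900 MHz PLL range mentioned above. The function names are illustrative, and which divider the firmware actually uses is not stated here.

```python
from fractions import Fraction

PLL_MIN_HZ = 600_000_000   # PLL range quoted in the thread
PLL_MAX_HZ = 900_000_000


def integer_divider_options(f_out_hz, div_min=8, div_max=2048):
    """List (divider, PLL frequency) pairs that produce f_out_hz with an integer divider."""
    options = []
    for divider in range(div_min, div_max + 1):
        f_pll = f_out_hz * divider
        if PLL_MIN_HZ <= f_pll <= PLL_MAX_HZ:
            options.append((divider, f_pll))
    return options


def pll_multiplier(f_pll_hz, f_ref_hz):
    """Feedback ratio the PLL needs, as an exact fraction of its input frequency."""
    return Fraction(f_pll_hz, f_ref_hz)


for divider, f_pll in integer_divider_options(61_750_000):
    print(divider, f_pll)
```

For the nominal 61.75 MHz 2.LO this lists the integer dividers 10 to 14. A small shift of the 2.LO then becomes a small shift of the PLL frequency, which the fractional PLL divider has to resolve instead of the multisynth.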

I'll try to adjust this in the firmware.

@jankae
Owner

jankae commented Dec 20, 2024

Some progress:
I switched the PLLs around as mentioned in the last comment. First the scope test again. The PLL frequency is ramped from 61 to 62 MHz in 1 Hz steps. This has been going on for quite some time and the ramp repeated a couple of times.
ScreenImg(1)
This is with infinite persistence again and you can see the blurred edges on the right side of the traces because the frequency is shifting slowly. There have been millions of frequency changes and not a single phase inversion, so that is a good sign.

And then the test against the scripts provided here. I ran them first with the standard 1.6.1 firmware and did get several bad points:
bad-results.txt
Without having looked too closely at the results, this looks very similar to the already reported problem: occasional inverted points in the higher frequencies. Although I also have several points with other phase shifts in there, not sure what is going on with them.

And the same test again with the firmware modifications. Well, there still are some bad points (5 compared to the >14000 before). But a lot less than before, none at 180° anymore and with wildly different magnitudes instead:
bad-results.txt
I am pretty sure that this is a different problem.

The code changes in the firmware were actually rather small but I am hesitant to push that to the repo just yet. Changing something so deep in the frequency generation could have unintended side effects I have not considered so far and I would like to test this better (which I will likely not get done in the remainder of this year).
Here is the binary if you want to experiment already:
combined.zip

@scott-guthridge
Contributor Author

scott-guthridge commented Dec 20, 2024

Although I also have several points with other phase shifts in there [...]

I also saw some phases all over the place in just one experiment. This was one of the cases where the start frequency was set to zero. Edit: note that the frequency points with strange phases weren't at the low frequencies though: they ranged from 312MHz to 6GHz.

@jankae
Owner

jankae commented Jan 6, 2025

It took a bit more work than expected but I tested the firmware change a bit more and fixed some unexpected FPGA issues (which potentially could result in the occasional wrong value but I could not reproduce this reliably). The latest commit includes the reworked 2.LO generation and should not exhibit phase reversal anymore.
