
Can't boot vm using upstream autotest when nics number gets larger #1058

Open
ypu opened this issue Nov 12, 2013 · 25 comments

@ypu
Member

ypu commented Nov 12, 2013

Currently autotest always gets the vm object from env, and its virtnet is not updated. We will hit a problem when we run tests like this:

  1. run boot (which will start a guest with one nic)
  2. keep the env file and run boot again, but change the cfg file to use two nics (add `nics += " nic2"` in the cfg file)

In the second test, the nic preparation work gets the virtnet object from the stale vm object left over from the last case:

cmdline for the second case:

```
-device virtio-net-pci,mac=9a:e3:e4:e5:e6:e7,id=idfKDrOm,netdev=idhlb9is,bus=pci.0,addr=06
-netdev tap,id=idhlb9is,vhost=on,vhostfd=21,fd=20
-device virtio-net-pci,mac=9a:cb:cc:cd:ce:cf,id=idXVv625,netdev=idBRKPax,bus=pci.0,addr=07
-netdev tap,id=idBRKPax,vhost=on,ifname='t1-PGkoOH',downscript='no' \
```

and qemu-kvm fails to start the guest:
```
20:09:21 INFO | [qemu output] /etc/qemu-ifup: could not launch network script
20:09:22 INFO | [qemu output] qemu: -netdev tap,id=idBRKPax,vhost=on,ifname=t1-PGkoOH,downscript=no: Device 'tap' could not be initialized
20:09:22 INFO | [qemu output] (Process terminated with status 1)
```

But what we expect is the guest starting with two nics, both using tap fds.

@ypu
Member Author

ypu commented Nov 12, 2013

@cevich this is the problem that #978 tries to resolve.
We also hit another nic problem in guest_s4; @humanux will post the details later.

@ghost ghost assigned cevich Nov 12, 2013
@cevich
Member

cevich commented Nov 12, 2013

Just to make sure I'm not missing anything: you're starting out clean, no env file and no /tmp/address_pool file, correct? All config files are default, except you changed base.cfg from `nettype=user` to `nettype=bridge`, correct?

Then you run the boot test: a VM instance is created, /tmp/address_pool is created, and env is created. The test finishes as PASS.

Next, you change base.cfg: `nics += " nic2"`, leaving env and /tmp/address_pool in place.

Finally, you run boot test, and it breaks as described above.

Can you confirm the MAC address of the first nic in the first test is exactly the same as the first nic in the second test?
After the first boot test finishes, is the VM left running or does it shut down? Does the second test start or restart the VM? Any idea why the second nic seems to have different parameters than the first? Are those being set in config somewhere?

@humanux
Contributor

humanux commented Nov 13, 2013

@cevich
Your steps are correct.

Can you confirm the MAC address of the first nic in the first test is exactly the same as the first nic in the second test?

Yes, the first nic's MAC is the same.

After the first boot test finishes, is the VM left running or does it shutdown?

No, it doesn't shut down; the vm is left running.

Does the second test start or restart the VM?

Yes, the second test will restart the vm.

Any idea why the second nic seems to have different parameters than the first, are those being set in config somewhere?

In our env_process, preprocess_vm uses vm.virtnet directly when creating the new vm, and the virtnet of the new vm doesn't do any init.
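
A simplified sketch of that flow (shape only, not the literal env_process code; `create_vm` is a hypothetical stand-in):

```python
def preprocess_vm(test, params, env, name):
    vm = env.get_vm(name)                # pickled VM from the last test
    if vm is None:
        vm = create_vm(name, params)     # fresh VM gets a fresh virtnet
    # When vm came back from env, vm.virtnet still reflects the previous
    # test's params; the new "nics" list is never re-initialized here,
    # so any extra nic is built from stale/default data.
    vm.create(params=params)
```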

@humanux
Contributor

humanux commented Nov 13, 2013

Two more issues:
@cevich

issue1:

  1. Add `queues = 4` in the cfg file, then run the 'boot' test.
  2. After the test finishes, modify the cfg file to delete the `queues = 4` line, then run the boot test again. You will find the vm still boots with queues=4 in the second test.

issue2:

  1. Run the guest_s4 test with `queues = 4`. An error like "tap can not init" is raised when resuming the guest.

@FengYang
Member

On 11/12/2013 08:14 PM, Yiqiao Pu wrote:

> [quoted issue description snipped]

I remember we fixed a similar issue before: the VM always tried to use old parameters.

Now it is happening in the network part.



@ypu
Member Author

ypu commented Nov 13, 2013

@FengYang The old fix only makes the command-line generation function work in the right way; the preparation work for tapfds is not covered.

@cevich
Member

cevich commented Nov 13, 2013

On 11/12/2013 09:36 PM, Yunping Zheng wrote:

> [quoted answers snipped]

Excellent, I see how this problem is happening. It is caused by a few things:

  • The local persistent cache /tmp/address_pool is updated by incoming params. Any items removed from params but still present in the cache will not be noticed; instead, both will be mixed together, which is not correct behavior.
  • Much of the virtnet setup depends on code in vm.__init__(), but un-pickling does not automatically call it, so we have an invariant problem: something that should be changing but is not (see the sketch below).
  • This also affects libvirt testing, but worse: the vm's XML definition is never checked, and will additionally clash with address_pool and params.
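
A minimal illustration of that un-pickling point (toy class, not the real VM code; attribute names are stand-ins). Python's pickle restores an object's attributes without re-running __init__():

```python
import pickle

class VM(object):
    def __init__(self, params):
        # Normal construction: virtnet is rebuilt from the incoming params.
        self.virtnet = params.get("nics", "").split()

vm = VM({"nics": "nic1"})
frozen = pickle.dumps(vm)      # roughly what the env file stores

# Un-pickling restores the instance __dict__ directly and never calls
# __init__(), so virtnet stays whatever the previous test built:
thawed = pickle.loads(frozen)
print(thawed.virtnet)          # ['nic1']

# Even if the new test's params now say nics = "nic1 nic2", nothing
# re-runs __init__() with them, hence the stale one-nic state.
```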

The bad news is that the problem is embedded deep inside the virtnet code and bound up with incorrect responsibility delegation (i.e., the virtnet code tries to be "smart" in some areas). The good news is that I believe I have a good understanding of what is causing these problems, and I am far along in fixing it at the root:

  1. Un-tangle responsibility delegation between virtnet and vm classes (done)
  2. Update virtnet unittests to verify state is not mixed up (working on this now)
  3. Update vm classes & env_process to account for changes (this is where I'll need virt.virtnet : update vm.virtnet when guest reboot #978)

You can follow my changes here: https://github.com/cevich/virt-test/compare/autotest:next...virtnet_modernize

@cevich
Member

cevich commented Nov 13, 2013

@ypu Status: fixed a bug from the second unittest class, and finished writing the third class. Found a few more bugs and fixed all except one. Also remembered I'll need a fourth unittest class for libvirt, but hopefully that one will be easier after having written the first three. Hopefully I can finish all four by the end of tomorrow; then it's on to making them all work in the VM classes (which hopefully means deleting a lot of crusty code).

High-level view of what I'm doing: rewriting all the virtnet stuff to be more "dumb" and not try to make decisions. Instead, the logic of which networking data to use will live ONLY in the vm class. This will make virtnet much simpler to use and easier for vm classes to take the correct actions.
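
Roughly the shape I'm aiming for (hypothetical names, not the actual branch code): virtnet only parses params into nic records, and the vm class alone decides when to rebuild them:

```python
class VirtNet(list):
    """Dumb container: parse nic definitions out of params; no decisions."""
    def __init__(self, params):
        super(VirtNet, self).__init__(
            {"nic_name": name}
            for name in params.get("nics", "").split())

class VM(object):
    def __init__(self, name, params):
        self.name = name
        self.virtnet = VirtNet(params)   # always rebuilt on construction

    def apply_params(self, params):
        # Hypothetical hook: the vm, not virtnet, owns the decision of
        # which networking data wins; fresh params replace pickled state.
        self.virtnet = VirtNet(params)
```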

@ypu
Member Author

ypu commented Nov 14, 2013

@cevich You are so kind to let me know the status of this problem. I will follow the virtnet_modernize branch and do some tests with it. Thanks a lot.

@cevich
Member

cevich commented Nov 14, 2013

@ypu You said this was a critical problem, no? Today I fixed many bugs and finished the fourth unittest class. However, I ran tools/run_unittests.py to check and found I have broken some other things by accident. I will investigate those; then tomorrow, as I hoped, I can start getting all these virtnet changes into the VM classes, w00t!

@cevich
Member

cevich commented Nov 15, 2013

Okay, I just figured out that the unittest failures are a timing issue uncovered by my unrelated changes; opened issue #1074. Now I can start integrating my changes into virt_vm, libvirt_vm, and elsewhere :D

@cevich
Member

cevich commented Nov 19, 2013

@ypu Sorry, I forgot to update this yesterday. I've taken a bunch of bug fixes and small enhancements out of my patchset and separated them into their own PR, #1075, and opened an issue for the unittest concurrency problem I found as #1074. Assuming those can be handled in parallel, I can now focus on integration with a smaller set of changes on my branch.

@cevich
Member

cevich commented Nov 19, 2013

@ypu Oh! I just found the cause of all this! A long while ago, when I wrote virtnet, I made vm.__init__() always call super(VM, self).__init__(...). That was required for the needs_restart() mechanism to work properly. I see now that, at some point, somebody removed/reverted that. Anyway, this code hasn't received much love recently, so I've got out the hatchet and am clearing all the cobwebs I can find :) Then I'll see what breaks and fix it more sensibly. If that approach fails, I've got a backup plan.

@cevich
Member

cevich commented Nov 20, 2013

@ypu Today I learned of a tricky circular dependency that was causing unrelated unittests to fail, but only when run in parallel. It turns out it was related to some changes I made in my branch. I had to revert those and re-implement a small section plus its unittest. Still, it's good to catch such a nasty problem early; yay unittests!

My work from yesterday was committed upstream, simplifying my branch. I started working through the virt_vm class, cleaning and getting it ready for the new virtnet. Tomorrow I will continue that work.

@cevich
Member

cevich commented Nov 21, 2013

@ldoktor I'm looking at updating virtnet in qemu_vm and libvirt_vm to fix a state-mismatch problem @ypu is running into. However, I've come across a few uses of your qemu_devices, and it reads like it is solving a very similar problem.

That got me thinking, reading through your code, and pondering: why should qemu_vm have two completely different object models for devices? I don't see a network-device implementation, but IIRC you said qemu_devices is fully pickle-able, no? What's your time/feeling/outlook on writing an interface device, then replacing virtnet with it in qemu_kvm and env_process?

N.B.: On the libvirt side, we don't really even need virtnet, since our virsh and libvirt_xml modules are mature enough now to take over.

I guess I'm trying to look at the big picture: if both qemu_vm and libvirt_vm could have their own specialized and consolidated models, instead of me trying to make virtnet into a "does-everything" for all vm types. I also think this could mean deleting a lot of old code and replacing it with a smaller amount of fresh code. If I'm right, this may also be a faster way to get @ypu a working fix. What do you think?

@lmr you remember how tricky this networking stuff is; what's your opinion on dumping virtnet and using qemu_devices for this?

(I'll keep plugging away at what I'm doing in case the above won't work or isn't a good idea).

@ldoktor
Member

ldoktor commented Nov 25, 2013

Hi @cevich, yes, qemu_devices is fully picklable and I'd like to see all related code inside of it. The problem is that I still have a lot to cover in the disk/devices area before I can move on to other/networking stuff. Anyway, if someone else here would like to pick up this subject, I'd support him/her. If this is to work, we have to prepare some basics of what is required of autotest and the hypervisor-related networking code. You said that libvirt is almost prepared to take over the control, so it could be a good starting point (please excuse my ignorance in case it's already described somewhere; I'm avoiding networking as much as possible).

@cevich
Member

cevich commented Dec 2, 2013

@ldoktor hahahaha, but will networking avoid you! It is very tricky with many corner cases, so I understand. Though by my reading of what you already have in vm.create and elsewhere, it seems "almost there". The problems we have are:

  1. Networking-related params must be parsed in a uniform way (for all of virt-test).
  2. Multiple VMs must have different & multiple networking setups, including hardware and configs.
  3. There must not be accidental clashes of allocated resources (FDs, MAC addresses, etc.).
  4. When a VM instance is recovered from Env, it must use the previous networking setup from address_pool (db).
  5. Detect when a test's networking params are the same as or different from the existing qemu process / VM instance.

The problem @ypu is having comes from the current solutions to Nos. 2 and 5 (however, these have dependencies on the other items). As I see it currently, qemu_devices already has the ability to cover almost all of these. The only parts that seem to be missing in qemu_devices are No. 4 and then using it in vm.needs_restart().

In other words: save the qemu_devices networking info (outside of Env), then use it again in vm.needs_restart() to check whether the incoming params match the saved representation, using your more qemu-specific code instead of my more generic code to do the saving and comparing. In the end, this is the failing part for @ypu (virtnet fails to spot a difference between params and the saved representation).

Do you think you could manage stripping out VirtNetDB, moving that functionality into qemu_devices, and updating vm.needs_restart()?

@cevich
Member

cevich commented Dec 2, 2013

@ldoktor yeah, looking at this more, it really seems to me that qemu_vm doesn't need VirtNetDB at all. It can just parse params into qdevice, and then needs_restart can just check self.devices == other.devices. The main job of VirtNetDB was this, plus making sure generated mac addresses aren't handed out twice to different VM instances.
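
Something like this minimal shape (illustrative only; devices_from_params is a hypothetical helper, not a real qemu_devices call):

```python
def needs_restart(self, name, params, basedir):
    # Rebuild the device model the incoming params would produce and
    # compare it with the model pickled alongside this VM instance.
    incoming = self.devices_from_params(name, params, basedir)  # hypothetical
    return incoming != self.devices   # restart only on a real difference
```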

@lmr
Member

lmr commented Dec 4, 2013

@ldoktor and @cevich, I've decided to push the interim solution by @humanux while you guys debate what the best solution is. After looking at the timeline of this issue, I find it inadmissible that virt-test has been broken for so long over something as trivial as running tests in sequence that change the network configuration.

Feel free to revert the change when you have a proper solution.

@ypu
Member Author

ypu commented Dec 4, 2013

@lmr thanks a lot for your efficiency in helping with this problem. @cevich and @ldoktor, thanks for your discussion and the time spent resolving this problem. Still looking forward to your solution.

@cevich
Member

cevich commented Dec 4, 2013

@ypu Okay, after discussing much with @lmr, he is on board with having qemu-devices take over the address_pool part of virtnet in qemu_vm (formerly VirtNetDB). However, I just finished a first pass at migrating qemu_vm to use the new virtnet (plus a minimal address_pool db stub), and I noticed a lot of side-stepping around virtnet. This is worrying, because the more things happen "on the side", the more we risk libvirt params-interoperability problems (i.e., the whole point behind having a common virtnet interface). So next time, just ask me for help; I'm nearly always willing.

Basically, what I've done is simplify virtnet into just a params parser and a collection of "helpers" for the vm and qemu-devices to consume. Once this reduction is working, @ldoktor can set about migrating the minimal virtnet address_cache db use over into qemu-devices (which should be a relatively simple task).

Here's my latest work (likely broken in some spots): https://github.com/cevich/virt-test/compare/virtnet_reduction

I'll keep hammering away at this effort to account for just a few small virtnet interface changes. There's bound to be some breakage, but not a lot that I haven't dealt with in the past. I'll keep y'all posted on my progress.

@cevich
Member

cevich commented Dec 5, 2013

@ypu I've got most of the basic qemu-kvm tests "working", including multi-nic. However, as I know from experience, there are likely some corner cases and odd situations I need to test for. Then I'll need to go over libvirt and get it updated as well (that should be much quicker than qemu_vm, though). Finally, I'll start going through and polishing up my commits so we can test, test, test, test, test and maybe merge them soonish :)

@cevich
Member

cevich commented Dec 9, 2013

@ypu Amazingly, after I got past a few unrelated problems, I am running the qemu-kvm boot, ping.multi_nics, reboot, and shutdown tests with no problems under the new networking code. I decided to polish up the commit messages today and get the PR ready so more testing can be done by other people. While that testing happens, I will work on getting the libvirt tests working again with the new code.

@cevich
Member

cevich commented Dec 10, 2013

@ypu @lmr @ldoktor Yay! First draft: #1217. Please check it out when you have time. I know it's a HUGE set of changes, so I'm opening it up early so y'all can pick away at it in more manageable chunks over a longer period of time. I have no expectations of merging this quickly; I've spent a lot of time trying to get this "right" and I want the merge to happen that way too. Thanks!

@cevich
Member

cevich commented Dec 10, 2013

@ldoktor As far as qemu_devices goes, let me know what you think about using it to replace BaseVM's use of VirtNetDB. It's really qemu-kvm specific and probably doesn't need to be in the base module. Its only purpose is to make sure the same vm.instance gets the same mac addresses for its corresponding nics over a test job. I'm also open to just leaving it this way if you see it as a huge burden to incorporate into qemu_devices, or if you don't think it makes sense to do so.
