
ipinfusion OcNOS 6.4.1-37 not working #177

Open
tommiyau opened this issue Mar 16, 2024 · 7 comments

Comments

@tommiyau

tommiyau commented Mar 16, 2024

I had already changed the Makefile pattern match. The image appears to start up and run, but it never progresses past login. From the docker logs it does appear to log in, but then it never gets past the 'waiting for >' stage. I'm still trying to debug the launcher to get some more information. If I spin the qcow up in KVM, the post-login prompt is OcNOS>, so if it is pattern matching on > that should work, yet it appears to hang. Additionally, there is no SSH accessibility even after 10 minutes; under KVM it takes less than a minute to become accessible. Just thought I'd log the issue in case I never solve it.

2024-03-16 00:05:02,643: launch TRACE OUTPUT: Starting agent_daemon service...
[ OK ] Started agent_daemon service.

2024-03-16 00:05:07,974: launch DEBUG matched login prompt
2024-03-16 00:05:07,974: launch DEBUG trying to log in with 'ocnos'
2024-03-16 00:05:07,974: vrnetlab DEBUG writing to serial console: 'ocnos'
2024-03-16 00:05:07,974: vrnetlab TRACE waiting for 'Password:' on serial console
2024-03-16 00:05:08,023: vrnetlab TRACE read from serial console: ' ocnos
Password:'
2024-03-16 00:05:08,023: vrnetlab DEBUG writing to serial console: 'ocnos'
2024-03-16 00:05:08,023: launch INFO applying bootstrap configuration
2024-03-16 00:05:08,023: vrnetlab DEBUG writing to serial console: ''
2024-03-16 00:05:08,023: vrnetlab TRACE waiting for '>' on serial console

@tommiyau
Author

Well, I tried putting further debug statements into launch.py, thinking maybe the startup had actually got past waiting for the > prompt. However, it never passes that state. The prompt is correct if I run the qcow under KVM, which places the issue back into vrnetlab.py. I have also bumped up the cores and memory, which are hard-coded for this platform, but unfortunately that makes no difference. I have had the VM boot once in containerlab; a subsequent restart of the lab failed again at the same place, so this appears to be some potential race condition in vrnetlab. I've rebooted under KVM about 30 times now and the VM has never failed to start, so it is something specific to vrnetlab for this VM. When the VM did boot it ran successfully and SSH worked, and the docker container itself is running and functioning, so it definitely looks like something in vrnetlab and the serial prompt handling. Running out of time to try and resolve this, unfortunately.
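
For reference, the extra debug statements were nothing more than logger calls around the step that waits for the '>' prompt. A rough sketch is below; the exact wait call is my assumption about what launch.py does at that point, based on the trace above, and it reuses the same self.logger and self.wait_write helpers as the snippets quoted later in this issue.

# hypothetical extra tracing around the post-login wait in launch.py
self.logger.debug("login sent, now expecting the '>' prompt")
self.wait_write("", wait=">")  # this is the wait that never returns on first boot
self.logger.debug("got the '>' prompt, continuing with bootstrap config")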

@tommiyau
Author

I'm assuming the VM is dying after the login attempt for some reason. I have tried leaving the container for over an hour, but no luck. From inside the container I can't SSH locally to the forwarded VM port, so I'm assuming the VM is not running properly inside the container.

@tommiyau
Author

Well, it is interesting: the problem is associated with the first boot of the VM in the container. I can monitor the docker logs, and the first boot just sits at the 'waiting for >' stage. If I exec into the container and kill the qemu instance, it automatically restarts and runs properly from then on. So now I'm a bit lost as to how to get this going.

@tommiyau
Author

Looking at dmesg inside the ocnos launcher container, it seems it's doing a stack dump. This only occurs with the launcher container by the looks of it. I don't think it's to do with my system, because I can run xrv9k adjacent to this without a problem.

This is starting to get past me. Maybe there is some extra qemu config that would make the image reliable inside a container; it runs under qemu but appears to be unstable. If the container does not dump, the image appears to run fine, but that is unfortunately a rare occurrence.

[ 527.653384] ------------[ cut here ]------------
[ 527.653394] WARNING: CPU: 5 PID: 197 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:673 kvm_mmu_notifier_change_pte+0xb0/0x2b0 [kvm]
[ 527.653532] Modules linked in: act_mirred cls_u32 sch_ingress xt_nat xt_tcpudp nf_conntrack_netlink veth xt_comment rfcomm xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink bnep vsock_loopback vmw_vsock_virtio_transport_common intel_rapl_msr vmw_vsock_vmci_transport vsock intel_rapl_common binfmt_misc vmw_balloon kvm_intel kvm rapl joydev input_leds serio_raw snd_ens1371 btusb btrtl snd_ac97_codec gameport btbcm btintel snd_rawmidi snd_seq_device bluetooth ac97_bus snd_pcm ecdh_generic ecc snd_timer snd soundcore vmw_vmci mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua overlay iptable_filter ip6table_filter ip6_tables br_netfilter bridge stp llc arp_tables msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
[ 527.653697] linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic aesni_intel vmwgfx crypto_simd cryptd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops usbhid cec mptspi ahci rc_core mptscsih psmouse hid libahci mptbase e1000 drm scsi_transport_spi i2c_piix4 pata_acpi
[ 527.653755] CPU: 5 PID: 197 Comm: uksmd Not tainted 5.15.126-uksm #1
[ 527.653767] RIP: 0010:kvm_mmu_notifier_change_pte+0xb0/0x2b0 [kvm]
[ 527.653875] Code: 73 aa 48 8b 05 21 6f 0a 00 48 85 c0 74 0c 48 8b 78 08 48 89 d6 e8 e0 53 ff ff 48 8b 45 a0 48 8b 80 a8 b5 fe ff 48 85 c0 75 92 <0f> 0b 48 8b 5d a0 48 8b 43 40 48 85 c0 74 90 48 8d 83 80 00 00 00
[ 527.653881] RSP: 0018:ffffb135c07d7c50 EFLAGS: 00010246
[ 527.653888] RAX: 0000000000000000 RBX: ffffb135c56b7ac0 RCX: 800000024ff71867
[ 527.653892] RDX: 00007fea98196000 RSI: ffff9ba746e0b300 RDI: ffffb135c56b7ac0
[ 527.653896] RBP: ffffb135c07d7cd8 R08: 0000000000000000 R09: ffff9ba7d8c00000
[ 527.653900] R10: 000000000000000b R11: ffff9ba7d8c10010 R12: 00007fea98196000
[ 527.653903] R13: 800000024ff71867 R14: 00007fea98196000 R15: 0000000000000000
[ 527.653908] FS: 0000000000000000(0000) GS:ffff9bb63df40000(0000) knlGS:0000000000000000
[ 527.653913] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 527.653918] CR2: 0000565516d291b0 CR3: 0000000119d76006 CR4: 00000000007726e0
[ 527.654008] PKRU: 55555554
[ 527.654017] Call Trace:
[ 527.654021]
[ 527.654027] ? show_regs.cold+0x1a/0x1f
[ 527.654047] ? kvm_mmu_notifier_change_pte+0xb0/0x2b0 [kvm]
[ 527.654217] ? __warn+0x8c/0x100
[ 527.654231] ? kvm_mmu_notifier_change_pte+0xb0/0x2b0 [kvm]
[ 527.654332] ? report_bug+0xa4/0xd0
[ 527.654345] ? handle_bug+0x39/0x90
[ 527.654355] ? exc_invalid_op+0x19/0x70
[ 527.654362] ? asm_exc_invalid_op+0x1b/0x20
[ 527.654371] ? kvm_mmu_notifier_change_pte+0xb0/0x2b0 [kvm]
[ 527.654470] ? tlbflush_read_file+0x80/0x80
[ 527.654479] ? kvm_arch_mmu_notifier_invalidate_range+0x21/0x50 [kvm]
[ 527.654612] __mmu_notifier_change_pte+0x55/0x90
[ 527.654624] restore_uksm_page_pte+0x262/0x280
[ 527.654633] scan_vma_one_page+0x1120/0x2940
[ 527.654644] uksm_do_scan+0x15b/0x3060
[ 527.654652] ? del_timer_sync+0x6c/0xb0
[ 527.654665] ? __bpf_trace_tick_stop+0x20/0x20
[ 527.654675] uksm_scan_thread+0x161/0x1a0
[ 527.654684] ? __kthread_parkme+0x4b/0x70
[ 527.654694] ? uksm_do_scan+0x3060/0x3060
[ 527.654702] kthread+0x127/0x150
[ 527.654712] ? set_kthread_struct+0x50/0x50
[ 527.654721] ret_from_fork+0x1f/0x30
[ 527.654735]
[ 527.654738] ---[ end trace c592975c1cd3946d ]---

@tommiyau
Author

Even without the dump the OcNOS boot does fail, and everything in dmesg looks OK. Basically, to get it to run you need to exec into the container and kill the qemu process; it will restart automatically and then run fine. So it's something with hellt/vrnetlab and maybe upstream...

@tommiyau
Author

The stack dump comes from having a second NIC in the system that was not connected to anything. I removed that and now there are no more dumps, but the launcher still always stops at 'waiting for >' until you kill the qemu process; it works after that.

@tommiyau
Author

It appears that there is a race condition between launch.py and the VM itself. The VM presents the login prompt, launch.py responds to it before the VM is fully ready, and the VM then appears to hang. The answer here is to patch launch.py and add a delay after detecting the login prompt. Currently I just made this 15 seconds and haven't tried to tune it to a minimum; I'd rather the VM just start than save a couple of seconds.

https://github.com/hellt/vrnetlab/blob/master/ocnos/docker/launch.py

Change

if match:  # got a match!
    if ridx == 0:  # login
        self.logger.debug("matched login prompt")
        self.logger.debug("trying to log in with 'ocnos'")
        self.wait_write("ocnos", wait=None)
        self.wait_write("ocnos", wait="Password:")

to the following, after importing time:

if match:  # got a match!
    if ridx == 0:  # login
        self.logger.debug("matched login prompt")
        time.sleep(15)
        self.logger.debug("trying to log in with 'ocnos'")
        self.wait_write("ocnos", wait=None)
        self.wait_write("ocnos", wait="Password:")
