chef-run behavior when automate is configured but unreachable #80

Open
ericcalabretta opened this issue Nov 20, 2018 · 3 comments
Labels
Triage: Confirmed (indicates an issue has been confirmed as described). Type: Bug (does not work as expected).

Comments

@ericcalabretta

ericcalabretta commented Nov 20, 2018

Description

chef-run fails when data_collector is configured in .chef-workstation/config.toml but the configured Automate endpoint is unreachable from the target.

Resources are not converged, and the error message does not identify the data_collector as the root cause. This will be a common scenario for an ad-hoc tool: some targets will be able to report to Automate and others won't, depending on networks, firewalls, proxies, etc. Also, config.toml may be set by an organization, and individual users may not be intimately aware of each setting.

Example config.toml:

[data_collector]
url="https://automate-bad/data-collector/v0/"
token="12345678910"

Example chef-run command:

chef-run ssh://vagrant:vagrant@centos package ntp
[✔] Packaging cookbook... done!
[✔] Generating local policyfile... exporting... done!
[✖] Applying package[ntp] from resource to target.
└── [✖] [centos] Failed to converge package[ntp].

The converge of the remote host failed.

Please examine the log file for a detailed cause of failure.

InSpec test showing NTP was not installed: 
 inspec exec test.rb -t ssh://vagrant:vagrant@centos

Profile: tests from test.rb (tests from test.rb)
Version: (not specified)
Target:  ssh://vagrant@centos:22

  System Package ntp
     ×  should be installed
     expected that `System Package ntp` is installed

Test Summary: 0 successful, 1 failure, 0 skipped

The default.log file in chef-workstation/logs does provide meaningful information, but a user would need to know where to look.

Example default.log:

Running handlers complete
[2018-11-20T17:15:53+00:00] ERROR: Exception handlers complete
Chef Client failed. 0 resources updated in 26 seconds
[2018-11-20T17:15:53+00:00] ERROR: Error connecting to https://automate-bad/data-collector/v0/, retry 1/5
[2018-11-20T17:15:58+00:00] ERROR: Error connecting to https://automate-bad/data-collector/v0/, retry 2/5
[2018-11-20T17:16:03+00:00] ERROR: Error connecting to https://automate-bad/data-collector/v0/, retry 3/5
[2018-11-20T17:16:08+00:00] ERROR: Error connecting to https://automate-bad/data-collector/v0/, retry 4/5
[2018-11-20T17:16:13+00:00] ERROR: Error connecting to https://automate-bad/data-collector/v0/, retry 5/5
[2018-11-20T17:16:18+00:00] FATAL: SocketError: Error connecting to https://automate-bad/data-collector/v0/ - Failed to open TCP connection to automate-bad:443 (getaddrinfo: Name or service not known)

[2018-11-20T11:16:19-06:00] ERROR: stderr: 
[2018-11-20T11:16:19-06:00] ERROR: Remote chef-client error follows:
[2018-11-20T11:16:19-06:00] ERROR: SocketError: Error connecting to https://automate-bad/data-collector/v0/ - Failed to open TCP connection to automate-bad:443 (getaddrinfo: Name or service not known)
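Until the CLI output names the cause itself, the log excerpt above can be checked mechanically. The following is a minimal sketch, not part of chef-run: the function name `find_data_collector_failure` and the regular expression are assumptions based only on the FATAL line format shown in this issue.

```python
import re

# Matches the FATAL line format seen in the default.log excerpt in this issue.
# The pattern is an assumption derived from that single example.
FATAL_RE = re.compile(
    r"FATAL: (?P<error>\w+): Error connecting to (?P<url>\S+) - (?P<reason>.+)$"
)


def find_data_collector_failure(log_lines):
    """Return (error, url, reason) for the first FATAL data collector
    connection error in log_lines, or None if there is none."""
    for line in log_lines:
        match = FATAL_RE.search(line)
        if match and "data-collector" in match.group("url"):
            return match.group("error"), match.group("url"), match.group("reason")
    return None


# Two lines copied from the log excerpt in this issue.
sample = [
    "[2018-11-20T17:16:13+00:00] ERROR: Error connecting to "
    "https://automate-bad/data-collector/v0/, retry 5/5",
    "[2018-11-20T17:16:18+00:00] FATAL: SocketError: Error connecting to "
    "https://automate-bad/data-collector/v0/ - Failed to open TCP connection "
    "to automate-bad:443 (getaddrinfo: Name or service not known)",
]
result = find_data_collector_failure(sample)
```

For the sample lines, `result` holds the error class, the data collector URL, and the DNS failure reason, which is the information the chef-run output currently omits.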

Chef Workstation Version

Chef Workstation: 0.2.29

Platform Version

Chef-workstation running on macOS 10.14.1

@robbkidd
Contributor

Thanks for reporting, @ericcalabretta. I'm going to move this issue over to chef-run's repo.

@robbkidd transferred this issue from chef/chef-workstation May 15, 2019
@tyler-ball added the Aspect: Correctness, Triage: Confirmed, and Type: Bug labels Sep 9, 2019
@tyler-ball
Contributor

Thanks @ericcalabretta - you helped highlight a bunch of issues around this. I think an unreachable Automate server should not fail the chef-client run, and we need better logging around this issue.

One question for us to answer around implementation: should users be able to configure an unreachable data collector to cause chef run failures?

@ericcalabretta
Author

@tyler-ball I'd expect/want the default behavior to allow the run to complete, but the output should clearly state that it was unable to reach Automate. I'd expect an error message similar to the one we have today when the chef-client can't log to Automate.

I do think allowing users to configure the failure behavior would be helpful. Some users may want to "fail fast" and know Automate is unavailable as soon as possible. I think this is a secondary use case though, and most users would want the run to succeed and notify the user that it was unable to log to Automate.

In most cases Chef performing its actions is the primary need, and reporting or observability is a secondary concern. You may still want to do your config management tasks while you wait for the firewall team to open that port to Automate, for example.
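To make the proposal concrete, here is a minimal sketch of that behavior in Python. It is not chef-run or chef-client code: `report_run`, `fail_on_unreachable`, and `DataCollectorUnreachable` are hypothetical names, and the retry count of 5 only mirrors the retries seen in the log in this issue. DNS and connection failures are treated as `OSError`, which `socket.gaierror` subclasses.

```python
class DataCollectorUnreachable(Exception):
    """Raised only when the hypothetical fail-fast option is enabled."""


def report_run(send, retries=5, fail_on_unreachable=False, warn=print):
    """Try to send run data to the data collector.

    Returns True if reporting succeeded and False if it was skipped.
    `send` is any callable that raises OSError on connection problems.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            send()
            return True
        except OSError as err:
            last_error = err
            warn(f"Error connecting to data collector, retry {attempt}/{retries}")
    if fail_on_unreachable:
        # Opt-in "fail fast": surface the reporting failure as a run failure.
        raise DataCollectorUnreachable(str(last_error)) from last_error
    # Default: the converge result stands, and the user is told why
    # nothing was reported.
    warn(f"WARNING: run completed but was not reported to Automate: {last_error}")
    return False


def unreachable():
    # Stand-in for a real HTTP call that cannot resolve the Automate host.
    raise OSError("getaddrinfo: Name or service not known")


messages = []
reported = report_run(unreachable, retries=2, warn=messages.append)
```

With the default settings the caller gets `False` plus a warning and can still treat the converge as successful. Passing `fail_on_unreachable=True` covers the secondary "fail fast" case by raising instead.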
