Alpha 17 B5eek: Something weird about fan-speeds #164

Open
Freihut opened this issue Sep 22, 2024 · 19 comments
Labels
bug Something isn't working

Comments

@Freihut

Freihut commented Sep 22, 2024

Laptop model

Alpha 17 B5eek

EC firmware version

17LLEMS1.106

Description

Tl;dr
cpu-fan-speed: seems incorrect
gpu-fan-speed: plausible, but somehow not in "turbine mode"


I've got some weird readings here:

Situation 1:
Created some cpu-load while running:
watch --interval 1 cat /sys/devices/platform/msi-ec/cpu/realtime_fan_speed

43
76
96
cat: /sys/devices/platform/msi-ec/cpu/realtime_fan_speed: Invalid argument

(combined output of several seconds)

Pluma (a text editor) also throws the "Invalid argument" at the same time, so likely not a cat issue.
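
For anyone who wants to log this without `watch`, here is a minimal Python sketch that reads the same sysfs file and treats the "Invalid argument" error (EINVAL) as a missing reading instead of crashing. The function name `read_fan_speed` and the constant `CPU_FAN_SPEED` are illustrative only; the path is the one from the command above.

```python
import errno
from pathlib import Path
from typing import Optional

# Path taken from the command above.
CPU_FAN_SPEED = Path("/sys/devices/platform/msi-ec/cpu/realtime_fan_speed")


def read_fan_speed(path: Path) -> Optional[int]:
    """Return the integer stored in a sysfs attribute.

    Returns None when the read fails with EINVAL ("Invalid argument");
    any other error is re-raised unchanged.
    """
    try:
        return int(path.read_text().strip())
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            return None
        raise
```

Calling `read_fan_speed(CPU_FAN_SPEED)` once per second should reproduce the sequence above, with `None` in place of the error line.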

Situation 2:
Idle, then FN + Arrow up (which puts the fans into "turbine mode"): msi-ec/cpu/realtime_fan_speed still reports "43", while msi-ec/gpu/realtime_fan_speed reports "0".


Meanwhile, I get the attached output when reading the EC (/sys/kernel/debug/ec/ec0/io) with a small Pascal program I have used before.

Line 1 = the dump of the whole ec-line
Line 2 = the gpu-rpm-speed
Line 3 = the cpu-rpm-speed
Interval is 1000ms.
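
A rough Python sketch of such a reader, for people without Lazarus: it loads the EC dump and picks out single bytes at the given offsets. Which offsets hold the fan values and how the raw bytes convert to RPM is device-specific and not spelled out here, so the offsets are plain parameters. `read_ec` and `pick_bytes` are hypothetical names chosen for this example.

```python
from pathlib import Path
from typing import Dict, Iterable

# Needs root and the ec_sys module, as noted in this post.
EC_IO = Path("/sys/kernel/debug/ec/ec0/io")


def read_ec(path: Path = EC_IO) -> bytes:
    """Return one raw snapshot of the EC register space."""
    return path.read_bytes()


def pick_bytes(dump: bytes, offsets: Iterable[int]) -> Dict[int, int]:
    """Map each requested EC offset to the byte stored there."""
    return {off: dump[off] for off in offsets}
```

For example, `pick_bytes(read_ec(), [0xcb, 0xcd])` would return the two bytes at the offsets discussed further down in this thread.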

output1.txt: idling laptop that goes into "turbine mode" and returns to normal after a few seconds. msi-ec reports "43" for the CPU and "0" for the GPU the whole time.

output2.txt: laptop under full CPU load. The CPU fan is around 3900 rpm, while the GPU fan starts at 0 and turns on once the GPU reaches 55°C (as the case warms up, I guess).
msi-ec reports "Invalid argument" for the CPU the whole time and "0" for the GPU at the beginning; later it went up to 43, which is kind of plausible.

I have been using the Pascal program constantly for around a year, so I'm fairly sure the readings are correct; at least they're plausible.

I'm using the latest BIOS E17LLAMS.10B from 2023-06-15 with Arch Linux on Kernel 6.11.0

(the Pascal program source can be compiled with Lazarus; it needs to be run as root (to read /sys/kernel/debug/ec/ec0/io) while the ec_sys module is loaded)

output1.txt
output2.txt
read_ec.tar.gz

@Freihut Freihut added the bug Something isn't working label Sep 22, 2024
@glpnk
Contributor

glpnk commented Sep 22, 2024

For many devices, the RPM may be read from the wrong address and scaled incorrectly. Actually, the EC shows not RPM but a percentage of RPM in the range 0-150. Someone in the past tried to "normalize" the CPU %RPM to a 0-100% range, and now it returns wrong values. Fans are turned on and off according to the curve, with some hysteresis. A fan mode like silent/auto/advanced just limits the maximum available %RPM to some value without any scaling.

IDK what turbine mode is.

@Freihut
Author

Freihut commented Sep 22, 2024

RPM readings vs. percent readings isn't my point (I'm aware of that).

(1881/5558)×100=33%, msi_ec shows 43(%).
(3900/5558)×100=70%, msi_ec shows broken stuff ('invalid argument')
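
The same comparison in runnable form, as a small sketch. It uses integer division (rounding down), which gives the 33 and 70 quoted above. `rpm_to_percent` is an illustrative name, and 5558 is the maximum RPM taken as 100% in this comparison.

```python
def rpm_to_percent(rpm: int, max_rpm: int) -> int:
    """Express an RPM reading as a whole-number percentage of max_rpm."""
    if max_rpm <= 0:
        raise ValueError("max_rpm must be positive")
    # Integer division truncates, so 33.8% becomes 33.
    return rpm * 100 // max_rpm


print(rpm_to_percent(1881, 5558))  # 33
print(rpm_to_percent(3900, 5558))  # 70
```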

"turbine mode"=fans@maximum, done by FN+Arrow up

@glpnk
Contributor

glpnk commented Sep 22, 2024

Not all devices have turbine mode, but many have cooler boost, which might be the same thing.

Where did you get the number 5558 from?

MSI EC doesn't calculate percentages (except for the broken CPU %RPM meter, which needs to be removed).

Don't look at the CPU RPM reported by the driver.

@Freihut
Author

Freihut commented Sep 22, 2024

Yes, it is the same thing.

5558 = the maximum cpu-fan-rpm (on "turbine mode") for my device, so 100%
1881 = idle, 3900 = cpu-rpm at maximum cpu-load.

The rpm values I've got from my own prog's readings as described in the initial post.

@glpnk
Contributor

glpnk commented Sep 22, 2024

You can assume that boost speed isn't 100% but 150% or 200%, and the correlation may be non-linear.

@Freihut
Author

Freihut commented Sep 23, 2024

No, I won't.
msi_ec is reading wrong values for this device (I guess they're target-fan-speeds) and doing wrong math with the wrongly taken cpu-fan-speed (by subtracting and dividing addresses) which can result in undefined behavior (for all devices).

@glpnk
Copy link
Contributor

glpnk commented Sep 23, 2024

msi-ec does not control the fan curve, but I want to fix the realtime %RPM readings soon.

@Freihut
Author

Freihut commented Sep 25, 2024

In the meantime, affected people can use my forked repo for this device. It reads the RPM values from the correct addresses.

@mutchiko
Contributor

yeah well, it's only logical that these addresses are messed up. i was so concentrated on getting shift_mode to work that i completely forgot about testing the cpu/gpu fan speed addresses.

now that i remember correctly, i used ec_sys module readings for fans speeds, and not the actual driver itself.

by the way @Freihut your repo works kinda well, the realtime_fan_speed file in /sys/devices/platform/msi-ec/cpu/ is broken (impossible to open); same with the gpu file except it shows 0 all the time so you might want to check with that too.

@Freihut
Author

Freihut commented Sep 26, 2024

now that i remember correctly, i used ec_sys module readings for fans speeds, and not the actual driver itself.

That's fine, as they both should read from the same source.
Letting the device idle and using
watch --interval 1 sudo xxd -g 1 /sys/kernel/debug/ec/ec0/io
(or maybe a smaller interval) while playing around with the turbine mode ("cooler boost") is IMO the best way to find the fan addresses.
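
The same idea (snapshot the EC, toggle cooler boost, see which bytes moved) can be sketched in Python. This is only an illustration: `changed_offsets` is a made-up helper, and the two snapshots are assumed to be raw reads of /sys/kernel/debug/ec/ec0/io taken as root with ec_sys loaded.

```python
from typing import List, Tuple


def changed_offsets(before: bytes, after: bytes) -> List[Tuple[int, int, int]]:
    """List (offset, old_byte, new_byte) for every byte that differs."""
    if len(before) != len(after):
        raise ValueError("snapshots must have the same length")
    return [
        (offset, old, new)
        for offset, (old, new) in enumerate(zip(before, after))
        if old != new
    ]


def format_changes(changes: List[Tuple[int, int, int]]) -> str:
    """Render the changes as hex, one line per offset."""
    return "\n".join(
        f"0x{offset:02x}: 0x{old:02x} -> 0x{new:02x}"
        for offset, old, new in changes
    )
```

Bytes that change on every read regardless of the fans (timers, temperatures) will show up too, so it helps to compare several idle snapshots first and ignore the offsets that move on their own.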

repo works kinda well, the realtime_fan_speed file in /sys/devices/platform/msi-ec/cpu/ is broken (impossible to open);

That's where my changes are, so it's not "well" at all. :c
The code in my fork only works for the Alpha 17 B5EEK (CONF22), because it needs .rt_fan_speed_fallback in .cpu = {} and .gpu = {} to be set. I haven't done this for the other devices, because I can't test it and it might be a device-specific workaround.
If you're using the same hardware as me, 0xcd and 0xcb in your EC do not match the fan speeds.

same with the gpu file except it shows 0 all the time so you might want to check with that too

If /sys/devices/platform/msi-ec/gpu/realtime_fan_speed reports 0 and you're 100% sure the GPU fan is running (GPU temp > 55°C or cooler boost is on), then it is also reading from a wrong address (0xcb) and therefore displays the fallback.

@mutchiko
Contributor

@Freihut before i continue testing the fans speed readings with you, i'd like to confirm a few things in advance:

  1. output of sudo dmesg | grep error
  2. both iGPU and dGPU usage under load (notice anything wrong?)
  3. idle cpu temperature (after booting and logging in from a cold start)
  4. max power limit reported by nvtop or amdgpu top for the rx6600m
  5. any bios settings that you changed

please do all of these under linux, thanks.

P.S: what you call turbine mode is actually turbo boost.

@Freihut
Author

Freihut commented Sep 28, 2024

1. output of `sudo dmesg | grep error`

just a bunch (less than 10) of ACPI Error: Aborting method \_SB.PCI0.SBRG.EC._Q9A due to previous error (AE_NOT_EXIST) (20240322/psparse-529) .

2. both iGPU and dGPU usage under load (notice anything wrong?)

What is that question for? That's reported by amdgpu (which is just passing on firmware readouts) and is more or less reasonable. ("More or less" because the values reported by the firmware are "meh".)

3. idle cpu temperature (after booting and logging in from a cold start)

Around 50°C, depending on room-temp.

4. max power limit reported by nvtop or amdgpu top for the rx6600m

According to amdgpu it is 65w. With Furmark and smartshift enabled I can push the dGPU to around 68w, but /sys/class/drm/card[X]/device/hwmon/hwmon7/power1_cap_max still reports 65w.

5. any bios settings that you changed

My device reports fan-rpm-speeds on 0xcb and 0xcd even for BIOS defaults.

Settings I've changed and can remember: Smartshift, secure boot and modern standby off, UMA for iGPU to 512 MiB. But like I wrote: I have used these addresses for about 1~2 years now; they have never changed and always report plausible speeds. At least for my device.

P.S: what you call turbine mode is actually turbo boost.

Ya, I know, but turbine mode sounds better. :)

BTW, I just made a gui-tool to live view the ec. It highlights changes and does some math to help find fan-speed-addresses. But it's pretty alpha right now.

@mutchiko
Contributor

the reason i asked you these questions is that i'm trying to see if the driver is functioning properly before re-checking other addresses. for example, disabling smartshift in the bios will prevent the ec from making any actual performance changes when you change shift mode in the driver or in the msi dragon center, but it will still change the fan curves.

disabling modern standby will reset all the power/performance changes after waking up from sleep, so you'll have to re-apply them by re-selecting the performance mode (shift mode) that you want. if it's enabled, you should see an mp2 acpi error that is related to modern standby. that's why i asked you for acpi errors.

i asked you for gpu usage because the vbios has an issue that makes it report 99% on almost any load.

According to amdgpu it is 65w

seems like smartshift doesn't work on linux for some reason.

users of the alpha 15 reported that it works fine. after further searching i found out that its RX6600M vbios is different from the one found on the alpha 17; i assume that flashing the alpha 15 vbios might fix the issue, but it might brick your laptop.

I just made a gui-tool to live view the ec

just tried it out and its really cool, hopefully it will make it easier for people to test if the driver is working correctly on their laptops or not, thanks for your work.

@Freihut
Author

Freihut commented Sep 29, 2024

the reason i asked you these questions [...]

Thanks for explaining.

i asked you for gpu usage because the vbios has an issue that makes it report 99% on almost any load.

I can remember that this occurred to me some days ago after standby. But I just tried to reproduce it and both GPUs keep reporting sane utilization values. Weird. (No updates happened between these situations.)

seems like smartshift doesn't work on linux for some reason.

It kinda does, but in a weird way, and it keeps changing as the kernel progresses. 2 years ago smartshift shifted a lot to the GPU (if I remember correctly it ran at about 85 W and the CPU dropped to 2.5 GHz). With the current kernel it shifts about 3 W, but very slowly (you can see the GPU's power draw increase over several minutes of load). Writing any value to the somethingbiassomething-file had no effect.

Smartshift also has some side effects on ryzenadj, but I couldn't figure out what exactly happens there.

just tried it out and its really cool, hopefully it will make it easier for people to test if the driver is working correctly on their laptops or not, thanks for your work.

Thanks for the feedback, I'm glad to help.

@mutchiko
Contributor

I did my testing and @Freihut is right:

  • .rt_fan_speed_address = 0xcd for CPU target fan speed address
  • .rt_fan_speed_address = 0xcb for GPU target fan speed address

Values contained in these 2 addresses are percentages for the target speed, not actual speed in rpms;
the file /sys/devices/platform/msi-ec/cpu/realtime_fan_speed is unreadable if the target percentage is below 25% or above 55%.

There seems to be a mismatch between the values reported by ec_sys and msi-ec:
when the target percentage is 25%, msi-ec reports 0%, and when the target is 55%, msi-ec reports 100%.

so it's only possible to read the file if the target is between 25% and 55%.
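
One mapping that would be consistent with these observations is a linear rescale of the raw value from the window 25..55 onto 0..100, with anything outside the window rejected. The sketch below is an assumption made to illustrate that behavior, not a copy of the driver code; `normalize`, `BASE_MIN` and `BASE_MAX` are hypothetical names.

```python
BASE_MIN = 25  # assumed lower bound: a raw 25 is reported as 0
BASE_MAX = 55  # assumed upper bound: a raw 55 is reported as 100


def normalize(raw: int, base_min: int = BASE_MIN, base_max: int = BASE_MAX) -> int:
    """Rescale raw from [base_min, base_max] to 0..100 with integer math.

    Raises ValueError outside the window, standing in for the
    "Invalid argument" error seen when reading the sysfs file.
    """
    if raw < base_min or raw > base_max:
        raise ValueError("raw value outside the accepted window")
    return 100 * (raw - base_min) // (base_max - base_min)
```

Under this assumed mapping, raw values of 38, 48 and 54 would come out as 43, 76 and 96, the same numbers as in the first post, and a raw value above 55 would fail the way the CPU file does under load. Whether the driver really computes it this way is not confirmed in this thread.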

@mutchiko
Contributor

let's fix things one at a time; correct addresses take priority. @Freihut do you want me to fix it, or do you want to make a merge request yourself?

@Freihut
Author

Freihut commented Oct 26, 2024

Wait a minute, you can't just fix the addresses, because this needs a rather big overhaul in calculating the fan speeds.

Look at the way I calculate the rpm in my forked code.

But this works only for the Alpha 17 b5eek (and of course devices using the same fans). To fix this for all users you'll need to add the Fallback-rpm for each device currently supported or find the addresses to make msi-ec read that out by itself.

@mutchiko
Contributor

@Freihut i think the only way to verify speeds is to use apps like HWMon on windows and compare the reading to the ones we have in linux.

you'll need to add the Fallback-rpm for each device currently supported

sounds like an unnecessarily complicated solution to me.

i've seen how you calculate rpm and honestly i don't have much to say, @glpnk has dealt with this more than me, so i'll leave him to decide.

right now, @glpnk has a draft pull request, #172. but until that's ready, let's make sure the files for real time temperatures are at the very least readable and don't cause text editors to lock up and crash (that's what happens to me actually).

No matter how you calculate the speed, you need the correct addresses to get the right data to work with.

so, would you like to open a pull request fixing the realtime fanspeed for both cpu and gpu?

@glpnk
Contributor

glpnk commented Dec 22, 2024

That will come once I finish the cleanup and have made a sysfs API for fan tuning and RPM readout (if a division operation is safe to do in kernel modules).
