Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode URL properly #531

Merged
merged 2 commits into from
Oct 15, 2024
Merged

Encode URL properly #531

merged 2 commits into from
Oct 15, 2024

Conversation

dr0i
Copy link
Member

@dr0i dr0i commented Oct 11, 2024

Question marks weren't encoded. It's better to use a library.

Resolves hbz/lobid-gnd#408.

@dr0i dr0i changed the title Encode URL propery (lobid-gnd#408) Encode URL properly Oct 11, 2024
Question marks weren't encoded. It's better to use a library.
@dr0i dr0i force-pushed the lobid-gnd-408-urlencodingImagesproxy branch from e638e59 to 98459aa Compare October 11, 2024 15:47
@dr0i
Copy link
Member Author

dr0i commented Oct 11, 2024

@Phu2 npm seems to be missing at the server so I cannot build gatsby and thus no stage deployment.
An # apt install npm wants to download 630 MB, a bit much IMO.
Have a look, I certainly miss something.

@dr0i dr0i requested a review from Phu2 October 11, 2024 15:56
@Phu2
Copy link
Contributor

Phu2 commented Oct 14, 2024

npm seems to be missing at the server so I cannot build gatsby and thus no stage deployment.

Did you issued nvm use 18 before the npm command as stated in the README? That's how i use it.

However, i am not familiar with this nvm/npm stuff. You can probably do some configurations to have the right node version and the npm command available instantly, see eg. https://stackoverflow.com/questions/33874049/npm-not-found-when-using-nvm

@Phu2 Phu2 assigned dr0i and unassigned Phu2 Oct 14, 2024
@dr0i
Copy link
Member Author

dr0i commented Oct 14, 2024

Thx for nvm reminder @Phu2! I totally overread that in the README:

@dr0i
Copy link
Member Author

dr0i commented Oct 14, 2024

Deployed, but this doesn't fix the problem:

I've also noted that the depiction.thumbnail URL from https://lobid.org/gnd/13731518X.json is already propery Percent-Esaped ...

Maybe the problem lies in the redirection.

The PHP parse-url function cannot be used to separate "URL path" and "URL
query" because if an "?" is part of the path it would be nontheless
identified as a query parameter. Explicitly stick to the last "?" seems
to be safe in the context of wikimedia commmons at least.
@dr0i
Copy link
Member Author

dr0i commented Oct 14, 2024

At least the problem is fixed :)

It was neccessary to preserve the query parameter.
The PHP parse-url function cannot be used to separate URL path and URL query because if an ? is part of the path it would be nontheless identified as a query parameter. Explicitly sticking to the last ? seems to be safe in the context of wikimedia commmons at least.

If this workaround here doesn't generates other problems we could go with it.
If somebody is (someday) interested in a proper fix:
I have the impression that the underlying problem is the passing of the depiction.thumbnail data to imageproxy.php. Because: the data is already perfectly escaped in the data! If it's not a passing-the-data problem something on our web server configuration is causing the trouble (AllowEncodedSlashes On or something along the line).

@dr0i
Copy link
Member Author

dr0i commented Oct 14, 2024

@Phu2 can you have another look?

@dr0i dr0i assigned Phu2 and unassigned dr0i Oct 14, 2024
@acka47 acka47 requested a review from Phu2 October 14, 2024 12:48
@Phu2
Copy link
Contributor

Phu2 commented Oct 14, 2024

It works 👍

I still don't understand why for example this work:
https://lobid.org/imagesproxy?url=https://commons.wikimedia.org/wiki/Special:FilePath/Pfarrkirche%20Sankt%20Stefan%20im%20Gailtal.JPG?width=270

and this do not work:
https://lobid.org/imagesproxy?url=https://commons.wikimedia.org/wiki/Special:FilePath/Wissenschaft%2C%20%C3%B6ffne%20dich%21%20Welche%20Infrastruktur%20braucht%20gute%20und%20offene%20Wissenschaft%3F%20%28Podiumsdiskussion%2C%2012.06.2019%29%20023.jpg?width=270

This is the orginal url passed to imagesproxy.php:
https://commons.wikimedia.org/wiki/Special:FilePath/Wissenschaft%2C%20%C3%B6ffne%20dich%21%20Welche%20Infrastruktur%20braucht%20gute%20und%20offene%20Wissenschaft%3F%20%28Podiumsdiskussion%2C%2012.06.2019%29%20023.jpg?width=270

This is the resulting url modified in imagesproxy.php by your commits:
https://commons.wikimedia.org/wiki/Special%3AFilePath/Wissenschaft%252C%2520%25C3%25B6ffne%2520dich%2521%2520Welche%2520Infrastruktur%2520braucht%2520gute%2520und%2520offene%2520Wissenschaft%253F%2520%2528Podiumsdiskussion%252C%252012.06.2019%2529%2520023.jpg?width=270

Apart from the encoding of the colon in Special:FilePath and and the identical query parameter ?width=270 the resulting filename is double encoded.

I have the impression that the underlying problem is the passing of the depiction.thumbnail data to imageproxy.php. Because: the data is already perfectly escaped in the data! If it's not a passing-the-data problem something on our web server configuration is causing the trouble (AllowEncodedSlashes On or something along the line).

I agree with this. The script works on the commandline and fails when is executed via web server.

Already tried in vhosts.d/stage.lobid.org.conf:

-        ProxyPass  /imagesproxy http://stage.lobid.org/imagesproxy.php
-        ProxyPassReverse  /imagesproxy http://stage.lobid.org/imagesproxy.php
+        ProxyPass  /imagesproxy http://stage.lobid.org/imagesproxy.php nocanon
+        ProxyPassReverse  /imagesproxy http://stage.lobid.org/imagesproxy.php nocanon

but this does not fix the problem.

@Phu2 Phu2 assigned dr0i and unassigned Phu2 Oct 14, 2024
@acka47
Copy link
Contributor

acka47 commented Oct 15, 2024

This image proxy keeps making problems which take some time to resolve. Maybe we did not choose the best solution for the problem after all?

In the original issue #421, I linked to that PHP script. After @dr0i had tried to implement the image proxy with Apache conf for some time he switched to the script solution. Maybe there is a better solution available now, I could not find something with a cursory web search, though, but might not have used the best search terms...

@dr0i dr0i merged commit c3d7e95 into master Oct 15, 2024
1 check passed
@dr0i
Copy link
Member Author

dr0i commented Oct 15, 2024

Deployed to poduction, closed.

@dr0i dr0i deleted the lobid-gnd-408-urlencodingImagesproxy branch October 15, 2024 12:17
@dr0i
Copy link
Member Author

dr0i commented Oct 15, 2024

@Phu2 :

This is the resulting url modified in imagesproxy.php by your commits:
https://commons.wikimedia.org/wiki/Special%3AFilePath/Wissenschaft%252C%2520%25C3%25B6ffne%2520dich%2521%2520Welche%2520Infrastruktur%2520braucht%2520gute%2520und%2520offene%2520Wissenschaft%253F%2520%2528Podiumsdiskussion%252C%252012.06.2019%2529%2520023.jpg?width=270

You got that from php var_dump, right? Interestingly, looking at the webservers log file (tail -f access_log_lobid | grep imagesproxy |grep Wissenschaft), I see not what php's var_dump prints, but the exact clicked URL (
"https://lobid.org/imagesproxy?url=https://commons.wikimedia.org/wiki/Special:FilePath/Wissenschaft%2C%20%C3%B6ffne%20dich%21%20Welche%20Infrastruktur%20braucht%20gute%20und%20offene%20Wissenschaft%3F%20%28Podiumsdiskussion%2C%2012.06.2019%29%20023.jpg?width=270")
Which is just fine.

So something is happening at the webserver or in the browser, it seems.
However, as it works now with this urlencoding, I am happy enough.

@Phu2
Copy link
Contributor

Phu2 commented Oct 16, 2024

You got that from php var_dump, right?

No, i'm just echo'ing in my test script, see @emphytos:/tmp/imagesproxy-test-local.php. Try @emphytos:/tmp$ php imagesproxy-test-local.php

So something is happening at the webserver or in the browser, it seems.

Yeah, i suspect mod_proxy decodes the url before passing it on. That's why it works with the double encoded file name. And that's why you see the single encoded string in the log. Btw, it's probably the comma (encoded as %2C) in the file name which causes trouble.

But i wonder why

+        ProxyPass  /imagesproxy http://stage.lobid.org/imagesproxy.php nocanon
+        ProxyPassReverse  /imagesproxy http://stage.lobid.org/imagesproxy.php nocanon

didn't work in my testing.

See:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Image won't load from Wikimedia
3 participants