Support for Cookies #31
It would be good if there was a way to set cookies on requests, to allow for crawling sites that require authentication. Is there currently a way to do this, or is this feature planned?
Many thanks for the suggestion @motherlymuppet. So far, I have not planned to support crawling sites that require form-based log-ins. However, it would very likely be reasonably straightforward to add an option for this. One thing to bear in mind is that crawling a site behind this type of log-in may have unintended side-effects: if there are links that perform actions like "delete this page", or similar, then SEO Macroscope will merrily follow those links too. This is also one of the reasons why GoogleBot et al. will not crawl sites as a particular logged-in user.
Hi @motherlymuppet, following up: I took a look at how Screaming Frog handles this situation. They too include a dire warning about data loss when using forms-based log-ins. Cookie support itself may be fine, though. Do you happen to have an example site that absolutely requires cookies to be set in order to crawl it properly? Many thanks.
The program should only be sending GET requests, surely? In that case there shouldn't be any effect on the site, provided it's configured properly and doesn't change state based on GET requests. I can see how that would be an issue for misconfigured sites, though. It would be fine for this to be a very hidden option; it just seemed crazy that it wasn't there, when it seems like a fairly fundamental part of accessing and navigating a website. The site I wanted to use it on was my own, and authentication was enabled because the site, which was like a knowledge base, held large amounts of sensitive information. I was attempting to crawl it to reduce the amount of duplicated information and reorganize the site to be more natural to navigate. I don't have an example to hand that you could use for testing, sorry.
Thanks @motherlymuppet, that feedback helps a lot. This is one of those cases where things in the real world don't always match the specs: some websites will have regular links that, when clicked, have side-effects that are potentially damaging to the user. Generally, that's because these pages always expect a human to be logged in, not a robot that will "click" everything it can get to on the page. For example, SEO Macroscope would not know not to click a hypothetical link like this:
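```html
<!-- Hypothetical example: a plain link with a destructive side-effect
     when followed; the URL is made up for illustration -->
<a href="/admin/delete-page?id=42">Delete this page</a>
```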
Under the hood, things are a little convoluted. The only HTTP methods used by the application are HEAD and GET. In as many cases as possible, HEAD is used to probe a URL, with a subsequent GET where necessary; you can see the rough flow that occurs for each fetched document in the source. In fact, I just recently added an option to force GETs on web servers that don't service HEAD requests properly. The whole web is hack piled upon hack ;-) So far, HTTP Basic Authentication should work in most cases; but as I don't get as much time as I'd like to work on this, forms-based authentication has so far not been on my TODO list. Hmm, I don't actually have a forms-based authentication website to test with at the moment either... You make some great points though, and this will be something that I'll be taking a look at soon. Many thanks!
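In rough terms, the HEAD-then-GET probe works something like this (a minimal sketch in Python, not the actual C# implementation):

```python
# Minimal sketch of a HEAD-then-GET probe, with a fallback for servers
# that mishandle HEAD requests. Hypothetical; not SEO Macroscope's code.
import requests

def probe(url: str, force_get: bool = False) -> requests.Response:
    if not force_get:
        try:
            head = requests.head(url, allow_redirects=True, timeout=10)
            # Some servers answer HEAD with 405/501 even though GET works,
            # so fall through to GET in those cases.
            if head.status_code not in (405, 501):
                return head
        except requests.RequestException:
            pass  # network hiccup or unsupported method: fall back to GET
    return requests.get(url, allow_redirects=True, timeout=10)
```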
Hi again @motherlymuppet. At a quick glance, it appears that cookie support itself is reasonably trivial, so the next detail would be the log-in process itself. Does your login form use a GET, like this hypothetical example:
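```html
<!-- Hypothetical login form: credentials end up in the query string -->
<form action="/login" method="get">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
```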
or a POST to an endpoint somewhat like this hypothetical one:
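```html
<!-- Hypothetical login form: credentials are sent in the request body -->
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
```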
with the credentials in the body? If so, then this type of process would normally require the login page's URL and the credentials to be entered before the crawl takes place. Alternatively, a form-field pattern would be required, with the credentials being prompted for during the crawl. Either way, the login page would be requested first, in order for the resultant session cookie to be captured. Thanks!
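For illustration, that capture step might look something like this (a sketch in Python against a hypothetical /login endpoint and made-up field names, not the planned implementation):

```python
# Sketch of a forms-based login flow. The /login endpoint and the
# field names are hypothetical.
import requests

session = requests.Session()
session.get("https://example.com/login")  # fetch the form; any pre-login cookie is captured
session.post(
    "https://example.com/login",
    data={"username": "user", "password": "secret"},
)  # submit credentials; the session cookie is now stored
page = session.get("https://example.com/private/")  # crawl requests carry the cookie
```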
It's a POST endpoint, but that shouldn't matter. What I had in mind was a simple text field in the options where you could paste the cookie. I don't expect SEO Macroscope to navigate me to the login page or guide me through it or anything like that, and I'd prefer it didn't, for security reasons. I can use the login form myself in a web browser, then take the cookie from the developer tools. All you need to do then is provide the box to put the cookie into, and attach that cookie to all outgoing requests. That would provide complete flexibility across all login methods, and anyone trying to solve this problem is probably advanced enough to go to the developer tools and grab a cookie. I don't mean to be patronising if this is already obvious to you, but I thought I'd give an example of what I mean:
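Something along these lines (a hypothetical sketch; the cookie string and URL are made up):

```python
# Hypothetical: a cookie string pasted straight from the browser's
# developer tools, attached verbatim to every outgoing request.
import requests

raw_cookie = "session=abc123; remember_me=1"  # pasted into the options box

response = requests.get(
    "https://example.com/knowledge-base/",
    headers={"Cookie": raw_cookie},
)
```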
I have several websites that I own that require the acceptance of cookies; this consent banner is the "form", but agreeing to it gives the user no extra rights beyond access to the website. This is now a very common use case in the EU, and now in the US too. I've just noticed that SEOMacroscope fails on these websites.