Site config file not working #261

Open
frankhubrepo opened this issue May 6, 2021 · 2 comments

Comments

@frankhubrepo

I am trying to fetch the content from this article:
https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report

However, as that doesn't work, I tried adding a site config file as described here: https://doc.wallabag.org/en/user/errors_during_fetching.html

This is the code within the config file:

title://body//h1[@class="headline"]

body://body//div[contains(@class, "field-type-text-with-summary")]

test_url: https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report

The issue is that even then I don't get the content, and I know the queries are right because I can see them match in the browser console (two screenshots were attached here).

Also here is the log:

[2021-05-06 19:40:53] graby.INFO: Graby is ready to fetch [] []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"businesstimes.com.sg"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"businesstimes.com.sg.txt"} []
[2021-05-06 19:40:53] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"businesstimes.com.sg"} []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 19:40:53] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"global"} []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"businesstimes.com.sg.merged"} []
[2021-05-06 19:40:53] graby.INFO: Fetching url: {url} {"url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:41:02] graby.INFO: Data fetched: {data} {"data":{"effective_url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report","body":"(only length for debug): 152622","headers":{"alt-svc":"clear","cache-control":"no-cache, no-store, must-revalidate","content-type":"text/html; charset=UTF-8","date":"Thu, 06 May 2021 17:40:53 GMT","expires":"0","istl-response":"1","pragma":"no-cache","referrer-policy":"no-referrer-when-downgrade, no-referrer-when-downgrade","server":"ECD (sgb/C7A3)","via":"1.1 google","x-ion-hop":"true","x-vmg-version":"v2.3.21","content-length":"152622"},"status":200}} []
[2021-05-06 19:41:02] graby.INFO: Treating as UTF-8 {"encoding":"utf-8"} []
[2021-05-06 19:41:03] graby.INFO: Looking for site config files to see if single page link exists [] []
[2021-05-06 19:41:03] graby.INFO: Returning cached and merged site config for {host} {"host":"businesstimes.com.sg"} []
[2021-05-06 19:41:03] graby.INFO: No "single_page_link" config found [] []
[2021-05-06 19:41:03] graby.INFO: Attempting to extract content [] []
[2021-05-06 19:41:03] graby.INFO: Returning cached and merged site config for {host} {"host":"businesstimes.com.sg"} []
[2021-05-06 19:41:03] graby.INFO: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2021-05-06 19:41:03] graby.INFO: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2021-05-06 19:41:03] graby.INFO: Body size after Readability: {length} {"length":96} []
[2021-05-06 19:41:03] graby.INFO: Opengraph "og:" data: {ogData} {"ogData":[]} []
[2021-05-06 19:41:03] graby.INFO: Opengraph "article:" data: {ogData} {"ogData":[]} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for title {"pattern":"//body//h1[@class=\"headline\"]"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for body (content length: {content_length}) {"pattern":"//body//div[contains(@class, \"field-type-text-with-summary\")]","content_length":96} []
[2021-05-06 19:41:03] graby.INFO: Using Readability [] []
[2021-05-06 19:41:03] graby.INFO: Date is bad (strtotime failed): {date} {"date":null} []
[2021-05-06 19:41:03] graby.INFO: Success ? {is_success} {"is_success":false} []
[2021-05-06 19:41:03] graby.INFO: Extract failed [] []

Any insight on what could be happening here or something I'm missing?

@hwiorn

hwiorn commented May 30, 2021

Recently I tried to write site configs for a wallabag server and ran into an XPath problem like the one in this issue. You should check log/html.log. Graby uses php-readability to process the HTML, and it strips and flattens many tags for readability. This means the XPaths for a site config won't necessarily be the same as the ones that work in a browser, so you can't just copy them into the site config.

In my case, I wanted to extract the "real" author and the "real" title from an article on some website, but I got nothing after processing, even though I used XPaths that work correctly in Chrome and Firefox. I couldn't use https://siteconfig.fivefilters.org/ because it didn't show the CSS and XPath bar at the bottom when I tested those websites.

Put these debug settings in your test script (for example some-graby-test.php) and run it:

$graby = new Graby([
    // Turn on debugging and use the more verbose 'debug' log level.
    'debug' => true,
    'log_level' => 'debug',
]);

Then you can look at the log/html.log file.
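
Once you have the HTML that Graby actually worked on, one way to check a site-config XPath against it is PHP's built-in DOMDocument and DOMXPath. This is only a minimal sketch and not part of Graby: the helper name `countXPathMatches` and the inline sample HTML are made up for illustration, and Graby's own parsing may differ from a plain `loadHTML()` call.

```php
<?php
// Hypothetical helper (not part of Graby): count the nodes that an XPath
// expression matches in a given HTML string.
function countXPathMatches(string $html, string $xpath): int
{
    if ($html === '') {
        return 0; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);

    // query() returns false when the expression itself is malformed.
    return $nodes === false ? 0 : $nodes->length;
}

// Made-up sample standing in for the HTML taken from the log or a saved response.
$html = '<html><body><h1 class="headline">Title</h1>'
    . '<div class="field field-type-text-with-summary"><p>Text</p></div>'
    . '</body></html>';

echo countXPathMatches($html, '//body//h1[@class="headline"]'), "\n";
echo countXPathMatches($html, '//body//div[contains(@class, "field-type-text-with-summary")]'), "\n";
```

To test real data, replace the sample string with something like `file_get_contents('saved-response.html')` (the file name is an assumption). A count of 0 for a pattern that matches in the browser suggests the HTML Graby sees is not the HTML the browser renders.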

@j0k3r
Owner

j0k3r commented Oct 5, 2021

The problem is that Graby is retrieving this HTML: response.html.txt, which is definitely not the HTML you are querying from your browser console.

Maybe we need to add a cookie to the request. I've tried a few without success.
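
For anyone who wants to experiment with request headers outside Graby, here is a rough sketch using PHP's HTTP stream context. It is not how Graby performs requests, the helper name `buildHttpOptions` is made up, and the cookie name and value are placeholders that are not known to work for this site.

```php
<?php
// Hypothetical helper: build stream-context options for an HTTP GET
// request with extra request headers (for example a Cookie header).
function buildHttpOptions(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }

    return [
        'http' => [
            'method' => 'GET',
            'header' => implode("\r\n", $lines),
            'ignore_errors' => true, // still return the body on 4xx/5xx responses
            'timeout' => 15,
        ],
    ];
}

$options = buildHttpOptions([
    // User-agent and referer copied from the Graby log in the first post.
    'User-Agent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2',
    'Referer' => 'http://www.google.co.uk/url?sa=t&source=web&cd=1',
    // Placeholder only: the real cookie (if any is needed) is unknown.
    'Cookie' => 'example_cookie=1',
]);

// Uncomment to perform the request and save the response for inspection:
// $url = 'https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report';
// $html = file_get_contents($url, false, stream_context_create($options));
// file_put_contents('response-with-cookie.html', $html);
```

Comparing the saved response with the page source in the browser should show whether a given header changes what the server returns.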
