Site config file not working #261

frankhubrepo · 2021-05-06T17:46:22Z

I am trying to fetch the content from this article:
https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report

However, as that doesn't work, I tried adding a site config file as described here: https://doc.wallabag.org/en/user/errors_during_fetching.html

This is the code within the config file:

title://body//h1[@class="headline"]

body://body//div[contains(@class, "field-type-text-with-summary")]

test_url: https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report

The issue is that even then I don't get the content, and I know the query is right because I can see it match in the browser console.

Also here is the log:

[2021-05-06 19:40:53] graby.INFO: Graby is ready to fetch [] []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"businesstimes.com.sg"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"businesstimes.com.sg.txt"} []
[2021-05-06 19:40:53] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"businesstimes.com.sg"} []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 19:40:53] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"global"} []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"businesstimes.com.sg.merged"} []
[2021-05-06 19:40:53] graby.INFO: Fetching url: {url} {"url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:41:02] graby.INFO: Data fetched: {data} {"data":{"effective_url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report","body":"(only length for debug): 152622","headers":{"alt-svc":"clear","cache-control":"no-cache, no-store, must-revalidate","content-type":"text/html; charset=UTF-8","date":"Thu, 06 May 2021 17:40:53 GMT","expires":"0","istl-response":"1","pragma":"no-cache","referrer-policy":"no-referrer-when-downgrade, no-referrer-when-downgrade","server":"ECD (sgb/C7A3)","via":"1.1 google","x-ion-hop":"true","x-vmg-version":"v2.3.21","content-length":"152622"},"status":200}} []
[2021-05-06 19:41:02] graby.INFO: Treating as UTF-8 {"encoding":"utf-8"} []
[2021-05-06 19:41:03] graby.INFO: Looking for site config files to see if single page link exists [] []
[2021-05-06 19:41:03] graby.INFO: Returning cached and merged site config for {host} {"host":"businesstimes.com.sg"} []
[2021-05-06 19:41:03] graby.INFO: No "single_page_link" config found [] []
[2021-05-06 19:41:03] graby.INFO: Attempting to extract content [] []
[2021-05-06 19:41:03] graby.INFO: Returning cached and merged site config for {host} {"host":"businesstimes.com.sg"} []
[2021-05-06 19:41:03] graby.INFO: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2021-05-06 19:41:03] graby.INFO: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2021-05-06 19:41:03] graby.INFO: Body size after Readability: {length} {"length":96} []
[2021-05-06 19:41:03] graby.INFO: Opengraph "og:" data: {ogData} {"ogData":[]} []
[2021-05-06 19:41:03] graby.INFO: Opengraph "article:" data: {ogData} {"ogData":[]} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for title {"pattern":"//body//h1[@class=\"headline\"]"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for body (content length: {content_length}) {"pattern":"//body//div[contains(@class, \"field-type-text-with-summary\")]","content_length":96} []
[2021-05-06 19:41:03] graby.INFO: Using Readability [] []
[2021-05-06 19:41:03] graby.INFO: Date is bad (strtotime failed): {date} {"date":null} []
[2021-05-06 19:41:03] graby.INFO: Success ? {is_success} {"is_success":false} []
[2021-05-06 19:41:03] graby.INFO: Extract failed [] []

Any insight on what could be happening here or something I'm missing?


hwiorn · 2021-05-30T14:38:22Z

I recently tried to write site configs for a wallabag server and ran into an XPath problem like this one. You should check log/html.log. Graby uses php-readability to process the HTML, and it strips and flattens many tags for readability. This means the XPaths in a site config won't necessarily match the ones that work in a browser, so you can't always use browser XPaths directly in the site config.

In my case, I wanted to extract the "real" author and the "real" title from an article on a website, but I got nothing after processing, even though I used XPaths that work correctly in Chrome and Firefox. I couldn't use https://siteconfig.fivefilters.org/ because it didn't show the CSS and XPath bar at the bottom when I tested those websites.

Put the debug settings in your some-graby-test.php file and run it.

use Graby\Graby;

// Enable debug mode with verbose logging so that log/html.log gets written
$graby = new Graby([
    'debug' => true,
    'log_level' => 'debug',
]);

Then, you can see the log/html.log file.
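
To check whether a site-config XPath matches the HTML Graby actually works with, rather than the DOM a browser shows, one option is to run the expression against a saved copy of that HTML. This is only a minimal sketch using PHP's built-in DOM extension. The helper name `countMatches` and the file name `response.html` are hypothetical, and the sketch assumes you have copied the HTML from log/html.log (or saved the raw response) into that file yourself.

```php
<?php
// Hypothetical helper: count the nodes an XPath expression matches in raw HTML.
function countMatches(string $html, string $xpath): int
{
    if ($html === '') {
        return 0; // DOMDocument::loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);

    // query() returns false when the expression itself is malformed.
    return $nodes === false ? 0 : $nodes->length;
}

// Assumption: response.html holds the HTML Graby received, not the browser's DOM.
$file = 'response.html';
if (is_file($file)) {
    $html = file_get_contents($file);
    $patterns = [
        'title' => '//body//h1[@class="headline"]',
        'body'  => '//body//div[contains(@class, "field-type-text-with-summary")]',
    ];
    foreach ($patterns as $name => $xpath) {
        printf("%s: %d match(es)\n", $name, countMatches($html, $xpath));
    }
}
```

If a pattern reports 0 matches here but matches in the browser console, the HTML Graby sees differs from what the browser renders, and the XPath has to be written against the former.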

j0k3r · 2021-10-05T12:50:51Z

The problem is that Graby is retrieving this HTML: response.html.txt, which is definitely not the HTML you are querying from your browser console.

Maybe we need to add a cookie to the request. I've tried a few without success.
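
For anyone who wants to experiment further, a cookie can be set from the site config itself with the `http_header(Cookie)` directive. The fragment below is only a sketch: the cookie name and value are placeholders, since nothing in this thread identifies a cookie that actually makes the site return the full article.

```
title: //body//h1[@class="headline"]
body: //body//div[contains(@class, "field-type-text-with-summary")]

# Placeholder cookie, NOT a known-working value for this site
http_header(Cookie): example_consent=1

test_url: https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report
```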
