Subdomain scraping is inaccurate #8
Comments
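After reading through the author's source code, I found that subdomain handling misses a case:
The code splits the hostname on "." and takes the last two labels as the top-level domain, so it goes wrong on names like xxx.gov.cn. I suggest using the tld library to handle this:
Here are my test results:
www.indaa.com.cn is not actually a subdomain of sgcc.com.cn; it is a false positive.
The subdomain handling in find_by_url also has one spot that needs to be changed. In addition, some of the code could be optimized. Thanks for sharing.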
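To make the failure mode concrete, here is a minimal sketch of the difference between the "last two labels" approach and a suffix-aware one. It is not the project's code: the function names are hypothetical, and the tiny `MULTI_LABEL_SUFFIXES` set is only a stand-in I made up for the full public suffix data that a library such as tld ships.

```python
# Toy stand-in for real public-suffix data (an assumption for this example only).
MULTI_LABEL_SUFFIXES = {"com.cn", "gov.cn", "co.uk"}


def naive_root(host: str) -> str:
    """Last two dot-separated labels: the behaviour reported as buggy."""
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def registered_domain(host: str) -> str:
    """Public suffix plus one more label, based on the toy suffix set above."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def same_site(host: str, target: str) -> bool:
    """True when both hosts share the same registered domain."""
    return registered_domain(host) == registered_domain(target)


if __name__ == "__main__":
    for host in ("www.indaa.com.cn", "www.sgcc.com.cn", "www.baidu.com"):
        print(host, naive_root(host), registered_domain(host))
    print(same_site("www.indaa.com.cn", "sgcc.com.cn"))
```

With the naive version, both www.indaa.com.cn and www.sgcc.com.cn reduce to com.cn, which would explain the false match. In real code the hard-coded set would be replaced by a proper lookup, for example the `get_fld` helper from the tld package.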
Hello, could you open a PR for this?
PR submitted.
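Scraping is very accurate for sites like baidu.com and jd.com, but not for government websites. Under the government naming convention, 123.xxx.gov.cn is the site of a particular department, while xxx.gov.cn is the site of the province. So when scraping the subdomains of a department, the tool treats xxx.gov.cn as the root domain and collects the sites of every department in the province, rather than the subdomains of that one department.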
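One possible way to get the behaviour described here is to scope results to the exact host the user supplied instead of to its registered domain. The sketch below is only an illustration under that assumption: `is_subdomain_of` and `filter_to_target` are hypothetical names, not functions from this project, and the hostnames are the placeholders used above.

```python
def is_subdomain_of(host: str, target: str) -> bool:
    """True when host equals target or ends with '.' + target (label boundary)."""
    host = host.lower().rstrip(".")
    target = target.lower().rstrip(".")
    return host == target or host.endswith("." + target)


def filter_to_target(hosts, target):
    """Keep only hosts that sit at or below the target host."""
    return [h for h in hosts if is_subdomain_of(h, target)]


if __name__ == "__main__":
    found = [
        "123.xxx.gov.cn",
        "oa.123.xxx.gov.cn",
        "456.xxx.gov.cn",      # a different department of the same province
        "www.xxx.gov.cn",      # the province's own site
        "evil123.xxx.gov.cn",  # shares a string suffix but not a label boundary
    ]
    print(filter_to_target(found, "123.xxx.gov.cn"))
```

Matching on "." + target rather than on a plain string suffix keeps hosts like evil123.xxx.gov.cn from slipping through.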