This repository has been archived by the owner on Nov 17, 2022. It is now read-only.

Disconnect error after crawling a certain number of pages #75

Open
mybu opened this issue Dec 16, 2020 · 0 comments


mybu commented Dec 16, 2020

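Python version: 3.6.5. Command used:
python gitbook.py https://wizardforcel.gitbooks.io/python-quant-uqer/content/
Going by the crawl log, I tracked down the relevant code and changed one thing: I added a sleep before each request.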
async def gettext(self, index, url, level, title):
    '''
    return path's html
    '''

    # Random pause of 2-7 seconds before each request
    secRnd = random.randint(2, 7)
    time.sleep(secRnd)
    print("throttling to avoid overloading the server, pausing {}s, crawling : {}".format(secRnd, url))
    try:
        metatext = await request(url, self.headers, timeout=10)
    except Exception as e:
        # On any failure, pause again and retry once (this retry has no timeout)
        time.sleep(secRnd)
        print("throttling to avoid overloading the server, pausing {}s, recrawling : {}".format(secRnd, url))
        metatext = await request(url, self.headers)
    try:
        text = ChapterParser(metatext, title, level).parser()
        print("done : ", url)
        self.content_list[index] = text
    except IndexError:
        print('failed at : ', url, ' maybe content is empty?')
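One thing that may be worth checking in the change above: time.sleep() blocks the whole event loop, so while one coroutine is pausing, every other task started by asyncio.gather (including requests already in flight) is stalled as well. A minimal sketch of a non-blocking pause is below. The helper name polite_pause and its parameters are hypothetical and not part of gitbook2pdf.

```python
import asyncio
import random


async def polite_pause(url, low=2.0, high=7.0):
    """Pause for a random interval without blocking the event loop.

    `polite_pause`, `low` and `high` are illustrative names, not part of
    gitbook2pdf.
    """
    delay = random.uniform(low, high)
    print("pausing {:.1f}s before crawling : {}".format(delay, url))
    # asyncio.sleep suspends only this coroutine; other tasks keep running.
    await asyncio.sleep(delay)
    return delay
```

Inside gettext this would be used as `await polite_pause(url)` in place of the time.sleep(secRnd) call. Whether it is enough to stop the disconnects is not established here.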

However, once a certain number of pages have been crawled, the disconnect error still occurs.
done : https://wizardforcel.gitbooks.io/python-quant-uqer/content/81.html
Traceback (most recent call last):
  File "gitbook.py", line 5, in <module>
    Gitbook2PDF(url).run()
  File "E:\code\pythonCode\thirdparty\gitbook2pdf-master\gitbook2pdf\gitbook2pdf.py", line 202, in run
    loop.run_until_complete(self.crawl_main_content(content_urls))
  File "d:\ProgramData\Anaconda3\envs\python36\lib\asyncio\base_events.py", line 468, in run_until_complete
    return future.result()
  File "E:\code\pythonCode\thirdparty\gitbook2pdf-master\gitbook2pdf\gitbook2pdf.py", line 224, in crawl_main_content
    await asyncio.gather(*tasks)
  File "E:\code\pythonCode\thirdparty\gitbook2pdf-master\gitbook2pdf\gitbook2pdf.py", line 246, in gettext
    metatext = await request(url, self.headers)
  File "E:\code\pythonCode\thirdparty\gitbook2pdf-master\gitbook2pdf\gitbook2pdf.py", line 21, in request
    async with session.get(url, headers=headers, timeout=timeout) as resp:
  File "d:\ProgramData\Anaconda3\envs\python36\lib\site-packages\aiohttp\client.py", line 1005, in __aenter__
    self._resp = await self._coro
  File "d:\ProgramData\Anaconda3\envs\python36\lib\site-packages\aiohttp\client.py", line 497, in _request
    await resp.start(conn)
  File "d:\ProgramData\Anaconda3\envs\python36\lib\site-packages\aiohttp\client_reqrep.py", line 844, in start
    message, payload = await self._protocol.read()  # type: ignore  # noqa
  File "d:\ProgramData\Anaconda3\envs\python36\lib\site-packages\aiohttp\streams.py", line 588, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: None
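The traceback shows the exception escaping from the second request() call, the one inside the except branch of gettext. Nothing catches a failure there, so a single disconnect on the retry propagates through asyncio.gather and ends the run. One possible mitigation, sketched under assumptions, is to cap the number of requests in flight with asyncio.Semaphore and retry several times with exponential backoff. The names fetch_with_retry, fetch, retry_on and base_delay are hypothetical. The sketch uses only the standard library and takes the request coroutine as a parameter instead of calling aiohttp directly.

```python
import asyncio


async def fetch_with_retry(fetch, url, semaphore, retries=3, base_delay=1.0,
                           retry_on=(Exception,)):
    """Await `fetch(url)` with bounded concurrency and retries.

    `fetch` is any coroutine function taking a URL (hypothetical stand-in
    for the project's request helper). `retry_on` lists the exception
    types that should trigger another attempt.
    """
    for attempt in range(retries + 1):
        try:
            # Only as many requests run at once as the semaphore allows.
            async with semaphore:
                return await fetch(url)
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            await asyncio.sleep(base_delay * (2 ** attempt))
```

With aiohttp, retry_on could plausibly be (aiohttp.ServerDisconnectedError, asyncio.TimeoutError), and the semaphore would be created once, for example asyncio.Semaphore(5), and shared by all tasks. The limit of 5 is an arbitrary illustration, not a value known to suit gitbooks.io.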
