Scraping a practice question bank hosted on Wenjuanxing (问卷星)

December 10, 2021

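# Background

I wrote this scraper to help a classmate. A pre-exam practice test was published on Wenjuanxing, and every visit returned a different set of questions, so I guessed that the questionnaire link draws from an exam question bank. Wenjuanxing pages also block selecting and copying text, so why not simply scrape the whole bank?
[![Exam questionnaire](https://s1.ax1x.com/2021/12/10/oIkPL8.png)](https://imgtu.com/i/oIkPL8)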

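## Process

Python is the natural first choice for a scraper. Wenjuanxing is a well-known survey platform, so someone has surely already written code for scraping its questionnaires.

Here is the example I used as a reference: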

[Python3 爬虫--- 问卷星内容爬取 (Python 3 scraper: scraping Wenjuanxing content), from wozaiyizhideng's CSDN blog](https://blog.csdn.net/wozaiyizhideng/article/details/106485259)

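There are some differences. In the reference example, the questionnaire page is static: its content does not change between visits. Inspecting the page source also showed that the HTML elements are marked up differently from those on my page, so the code cannot be reused as-is. I adapted it to my case so that it scrapes the question bank automatically.

[![Static questionnaire](https://s1.ax1x.com/2021/12/10/oIkkdg.png)](https://imgtu.com/i/oIkkdg)

[![Dynamic questionnaire](https://s1.ax1x.com/2021/12/10/oIkFeS.png)](https://imgtu.com/i/oIkFeS)

The final code is as follows: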
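One way to compare two page layouts before adapting a scraper is to count which CSS classes actually occur in each page's source. The sketch below does this with the standard library only. The `ClassCounter` and `count_classes` names and the `SAMPLE` fragment are hypothetical: the fragment is shaped only after the selectors used in this post (`fieldset`, `.field`, `.field-label`, `.ui-radio`) and is not real Wenjuanxing markup.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count how often each CSS class appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value:
                self.counts.update(value.split())


def count_classes(html_text):
    """Return a Counter mapping class name -> number of occurrences."""
    parser = ClassCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical fragment for illustration; not copied from a real page
SAMPLE = """
<fieldset>
  <div class="field">
    <div class="field-label">1. Sample question?</div>
    <div class="ui-radio">A. First</div>
    <div class="ui-radio">B. Second</div>
  </div>
</fieldset>
"""

counts = count_classes(SAMPLE)
for selector in ("field", "field-label", "ui-radio"):
    print(selector, counts[selector])
```

Running the same count on the saved source of both pages shows quickly whether the selectors from a reference scraper exist on the target page at all.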

```
# coding=utf-8
import time
from requests_html import HTMLSession

# wenjuanxing_ID = 55123312
# wenjuanxing_URL = "https://ks.wjx.top/jq/{}.aspx".format(wenjuanxing_ID)
wenjuanxing_URL = "https://ks.wjx.top/vm/PpUtjIg.aspx"

SEPARATOR = '***************************************************\n'

def parse_post_data(resp, s):
    '''
    Parse the questions and their choices from the response,
    print them, and append them to the file named s.
    '''
    questions = resp.html.find('fieldset', first=True).find('.field')
    # Open the output file once instead of reopening it for every line
    with open(s, 'a', encoding='utf-8') as f:
        for q in questions:
            title = q.find('.field-label', first=True).text
            choices = [t.text for t in q.find('.ui-radio')]
            print(title)
            f.write(title + "\n")
            for choice in choices:
                print(choice)
                f.write(choice + "\n")
            print(SEPARATOR)
            f.write(SEPARATOR)
            time.sleep(0.2)

def main():
    print('Start scraping the test papers')
    print('URL: %s' % wenjuanxing_URL)
    r = int(input("Number of papers to scrape: "))
    for i in range(1, r + 1):
        # One output file per scraped paper
        s = "test_paper_" + str(i) + ".txt"
        with open(s, 'a+', encoding='utf-8') as f:
            f.write('\nPaper ' + str(i) + "\n")
        # A fresh session per paper, so no cookies carry over between visits
        session = HTMLSession()
        resp = session.get(wenjuanxing_URL)
        parse_post_data(resp, s)

if __name__ == '__main__':
    main()
```
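The script above saves each randomly drawn paper to its own file, so the same question can show up in several files. A possible follow-up step, not part of the original script, is to merge the papers into one de-duplicated bank. The sketch below works on in-memory `(title, choices)` pairs. The helper names `normalize_title` and `merge_into_bank` and the sample papers are hypothetical, and the sketch assumes each title starts with a question number such as `3.` or `3、` that changes from paper to paper.

```python
import re


def normalize_title(title):
    """Drop a leading question number such as '3.' or '3、' (assumed format)."""
    return re.sub(r"^\s*\d+\s*[.、．]\s*", "", title).strip()


def merge_into_bank(bank, paper):
    """Add each (title, choices) pair of one paper to the bank.

    Questions whose normalized title is already in the bank are skipped.
    Returns the number of questions that were new.
    """
    added = 0
    for title, choices in paper:
        key = normalize_title(title)
        if key not in bank:
            bank[key] = list(choices)
            added += 1
    return added


# Made-up sample papers for illustration only
paper_1 = [
    ("1. What is 2 + 2?", ["A. 3", "B. 4"]),
    ("2. Capital of France?", ["A. Paris", "B. Rome"]),
]
paper_2 = [
    ("1. Capital of France?", ["A. Paris", "B. Rome"]),
    ("2. Largest planet?", ["A. Mars", "B. Jupiter"]),
]

bank = {}
for paper in (paper_1, paper_2):
    new = merge_into_bank(bank, paper)
    print(f"{new} new questions, {len(bank)} in the bank")
```

If several consecutive papers add no new questions, the bank is probably close to complete, although random sampling can never prove that.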

## Result

[![Test paper](https://s1.ax1x.com/2021/12/10/oIkCsf.png)](https://imgtu.com/i/oIkCsf)

Last modified: February 26, 2022