Scraping Boss直聘 with a Python Crawler

Crawler features:

  • This project is implemented in Python. It scrapes job titles, salaries, education requirements and job descriptions from BOSS直聘 and stores the results in a CSV file.
  • You enter the job keyword and the number of pages to scrape; one page takes roughly 23 seconds.
  • The fake_useragent library is used to build a random request header for every request, which gives the crawler some basic ability to evade anti-scraping checks.

Environment and architecture:

Language: Python 3.6

OS: 64-bit Windows 10

IDE: PyCharm

Python libraries:

  • csv
  • requests
  • BeautifulSoup
  • fake_useragent
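
All of the code below assumes the following imports at the top of the script. csv is part of the standard library; the other three can be installed with pip (pip install requests beautifulsoup4 fake-useragent). The updated proxy version further down additionally uses re, time and mysql.connector (pip package mysql-connector-python).

import csv

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent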

Results

Code

Fetching the page source

def GetHtmlText(url):
    try:
        ua = UserAgent()
        headers = {'User-Agent': ua.random}  # random User-Agent header, helps evade basic anti-scraping checks
        r = requests.get(url, timeout=5, headers=headers)
        r.raise_for_status()  # raise requests.HTTPError if the status code is not 200
        r.encoding = r.apparent_encoding  # guess the encoding from the response content
        return r.text  # the response body as a string, i.e. the page source
    except:
        return None

This mainly relies on the requests library's requests.get() method to download the page source; the function returns None on any failure so that callers can retry or skip.
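
A minimal usage sketch (the search URL follows the same pattern as in main() further down; the query value here is just an example):

url = 'https://www.zhipin.com/c101210100/h_101210100/?query=python&page=1'
html = GetHtmlText(url)
if html is None:
    print('request failed: timeout, non-200 status, or blocked')
else:
    print(html[:200])  # first 200 characters of the page source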

Filling a list of dicts with each page's content

def fillJobList(lst, html):
    soup = BeautifulSoup(html, 'html.parser')  # parse the page source
    dic = {}
    for job in soup.findAll('div', {'class': 'info-primary'}):  # one block per job card
        try:
            position = job.find('div', {'class': 'job-title'}).text
            pay = job.find('span', {'class': 'red'}).text
            edu = job.find('p').text
            dic['position'] = position
            dic['pay'] = pay
            dic['edu'] = edu
            next_url = job.find('a').attrs['href']
            next_url = 'https://www.zhipin.com' + next_url  # link to the job's detail page
            next_html = GetHtmlText(next_url)
            next_soup = BeautifulSoup(next_html, 'html.parser')
            detail = next_soup.find('div', {'class': 'text'}).text  # full job description
            dic['detail'] = detail
            lst.append(dic.copy())
        except:
            continue

The line that deserves the most attention here is lst.append(dic.copy()). If you change it to **lst.append(dic)**, every element eventually appended to lst ends up identical to the last record, because the list then holds many references to one and the same dict. For the full explanation, see the first answer and its comments at this link; a short demonstration follows.
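
Here is a small standalone demonstration of that aliasing problem (not from the original post, purely illustrative):

records = []
dic = {}
for name in ['job A', 'job B']:
    dic['position'] = name
    records.append(dic)          # appends a reference to the same dict every time
print(records)                   # [{'position': 'job B'}, {'position': 'job B'}]

records = []
dic = {}
for name in ['job A', 'job B']:
    dic['position'] = name
    records.append(dic.copy())   # appends an independent snapshot
print(records)                   # [{'position': 'job A'}, {'position': 'job B'}]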

Storing the scraped data in a CSV file

def storeJobInfo(lst, fpath):
    keys = lst[0].keys()  # column names come from the first record's keys
    with open(fpath, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, keys)
        writer.writeheader()
        writer.writerows(lst)

This writes the list of dicts built by the **fillJobList()** function into a CSV file.
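
A quick way to sanity-check the output afterwards, using the standard library's csv.DictReader (the path is shortened here; the real one is set in main() below):

import csv

with open('Job.csv', encoding='utf-8', newline='') as f:
    for row in csv.DictReader(f):  # one dict per CSV row, keyed by the header
        print(row['position'], row['pay'], row['edu'])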

Main function

def main():
    keywords = input('输入职位:')
    pages = int(input('获取页数:'))
    output_file = 'C://Users/35175/Desktop/Job.csv'
    job_list = []
    for i in range(1,pages+1):
        url ='https://www.zhipin.com/c101210100/h_101210100/?query='+keywords+'&page='+str(i)
        html = GetHtmlText(url)
        fillJobList(job_list, html)
    storeJobInfo(job_list, output_file)

Enter the job keyword and the number of pages to scrape; the results are saved as a CSV file on the desktop.
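
The script then needs an entry point to run; the updated version at the end of this post simply calls main() at module level, and the conventional guard works just as well:

if __name__ == '__main__':
    main()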

Tips

  • If you open the CSV directly in Excel, the Chinese text shows up garbled, because Excel assumes the ANSI encoding by default. Open the CSV in Notepad, use Save As and pick the ANSI encoding, and Excel will then display it correctly. (An alternative fix is sketched right after this list.)
  • Even then the sheet may not look great in Excel. Try this: drag the job-description column to a comfortable width, turn on Wrap Text, select the whole sheet (click the corner cell where the row and column headers meet), and finally double-click the boundary between two row numbers when the cursor turns into the resize pointer, so Excel auto-fits every row height.
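
An alternative that skips the Notepad round-trip is to write the file with the utf-8-sig encoding, which puts a UTF-8 BOM at the start of the file so Excel detects the encoding correctly. This is a small change to storeJobInfo(), not part of the original code:

def storeJobInfo(lst, fpath):
    keys = lst[0].keys()
    # 'utf-8-sig' prepends a BOM, so Excel opens the CSV without mojibake
    with open(fpath, 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.DictWriter(f, keys)
        writer.writeheader()
        writer.writerows(lst)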

Source code

The project is open source; you can grab it from my GitHub, in the JOB_GET folder: link.

Code update

Updated 2018.6.2

Requesting too frequently got my IP banned by Boss直聘's anti-crawler measures, so I now use proxy IPs to get around the ban.

Building a proxy IP pool

I wrote a crawler that scrapes free proxy IPs from 西刺代理 (xicidaili.com) and stores them in MySQL, building a proxy IP pool.

import re

import mysql.connector
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


def GetHtmlText(url):
    '''Fetch the page source'''
    try:
        ua = UserAgent()
        headers = {'User-Agent': ua.random}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return None

def FillIpList(lst, html):
    '''Keep only the reasonably fast proxies in a list'''
    soup = BeautifulSoup(html, 'html.parser')
    for tr in soup.find('table', {'id': 'ip_list'}).tr.next_siblings:
        try:
            speed = tr.find('div', {'class': 'bar'}).attrs['title']  # e.g. '0.123秒'
            match = re.search(r'\d\.\d{3}', speed)
            if float(match.group(0)) < 1:  # keep proxies faster than 1 second
                ip = tr.find('td', {'class': 'country'}).next_sibling.next_sibling
                port = ip.next_sibling.next_sibling
                ip_port = ip.text + ':' + port.text
                lst.append(ip_port)
        except:
            continue

def PutInMysql(lst):
    '''Store the proxies in the MySQL database'''
    db = mysql.connector.connect(host='localhost', user='root', password='123456', port=3306, db='spiders')
    cursor = db.cursor()
    score = 10
    sql = 'REPLACE INTO proxy(ip_port, score) values(%s, %s)'
    for i in lst:
        try:
            cursor.execute(sql, (i, score))
            print(i)
            db.commit()
        except:
            db.rollback()
    db.close()

def main():
    '''Main function'''
    pages = int(input('输入爬取ip页数:'))
    ip_list = []
    for i in range(1, pages + 1):
        url = 'http://www.xicidaili.com/nn/' + str(i)
        html = GetHtmlText(url)
        FillIpList(ip_list, html)
    PutInMysql(ip_list)

main()
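
The proxy table itself is not shown in the post. REPLACE INTO only deduplicates when ip_port is a primary or unique key, so a schema along these lines would work; the exact column types here are my assumption:

import mysql.connector

db = mysql.connector.connect(host='localhost', user='root', password='123456', port=3306, db='spiders')
cursor = db.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS proxy (
        ip_port VARCHAR(32) NOT NULL PRIMARY KEY,
        score   INT
    )
''')
db.close()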

Updates to the main crawler

  • Fetching the page source (proxy mode)
  • Filling the list of dicts with each page's content (proxy mode)
  • Storing the scraped job info in MySQL
  • Pulling proxy IPs out of MySQL
  • Main function

Fetching the page source (proxy mode)

def GetHtmlText_ip(url, proxy):
    '''Fetch the page source (proxy mode)'''
    try:
        proxies = {
            'http': 'http://' + proxy,
            'https': 'https://' + proxy,
        }
        ua = UserAgent()
        headers = {'User-Agent': ua.random}  # random request header, helps evade basic anti-scraping checks
        r = requests.get(url, timeout=2, headers=headers, proxies=proxies)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return None
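
Because GetHtmlText_ip() returns None on any failure (timeout, non-200 status, connection error), a proxy from the pool can be sanity-checked with a single call; the ip:port value below is hypothetical:

proxy = '123.57.76.102:80'  # hypothetical proxy taken from the pool
html = GetHtmlText_ip('https://www.zhipin.com', proxy)
print('proxy usable' if html is not None else 'proxy dead or blocked')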

Filling the list of dicts with each page's content (proxy mode)

def fillJobList_ip(lst, html, proxy):
    '''Store the data as a list of dicts (proxy mode)'''
    soup = BeautifulSoup(html, 'html.parser')
    dic = {}
    for job in soup.findAll('div', {'class': 'info-primary'}):
        try:
            position = job.find('div', {'class': 'job-title'}).text
            pay = job.find('span', {'class': 'red'}).text
            edu = job.find('p').text
            dic['position'] = position
            dic['pay'] = pay
            dic['edu'] = edu
            next_url = job.find('a').attrs['href']
            next_url = 'https://www.zhipin.com' + next_url
            while True:  # keep retrying the detail page through this proxy until it loads
                next_html = GetHtmlText_ip(next_url, proxy)
                if next_html is not None:
                    break
            next_soup = BeautifulSoup(next_html, 'html.parser')
            detail = next_soup.find('div', {'class': 'text'}).text
            dic['detail'] = detail
            lst.append(dic.copy())
        except:
            continue

Storing the scraped job info in MySQL

def PutInMysql(lst):
    '''Store the scraped data in MySQL'''
    db = mysql.connector.connect(host='localhost', user='root', password='123456', db='spiders', port=3306)
    cursor = db.cursor()
    sql = 'INSERT INTO bossjobinfo(position, pay, edu, detail) values(%s, %s, %s, %s)'
    i = 0
    for dic in lst:
        i += 1
        try:
            cursor.execute(sql, tuple(dic.values()))  # values are in insertion order: position, pay, edu, detail
            print('success' + str(i))
            db.commit()
        except:
            print('Failed' + str(i))
            db.rollback()
    db.close()
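
As with the proxy table, the bossjobinfo table is not shown. A schema matching the INSERT above could look like the sketch below; the column types are my assumption (detail holds the full job description, so it uses TEXT):

import mysql.connector

db = mysql.connector.connect(host='localhost', user='root', password='123456', db='spiders', port=3306)
cursor = db.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS bossjobinfo (
        id       INT AUTO_INCREMENT PRIMARY KEY,
        position VARCHAR(64),
        pay      VARCHAR(32),
        edu      VARCHAR(255),
        detail   TEXT
    )
''')
db.close()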

Pulling proxy IPs out of MySQL

def PopIp():
    '''Fetch the proxy IPs from MySQL'''
    db = mysql.connector.connect(host='localhost', user='root', password='123456', port=3306, db='spiders')
    cursor = db.cursor()
    sql = 'SELECT ip_port FROM proxy'
    cursor.execute(sql)
    rows = cursor.fetchall()
    db.close()
    return rows
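
cursor.fetchall() returns a list of one-element tuples, which is why main() below joins each row back into a plain 'ip:port' string (the values shown are illustrative):

rows = PopIp()            # e.g. [('123.57.76.102:80',), ('61.135.217.7:80',)]
proxy = ''.join(rows[0])  # -> '123.57.76.102:80'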

Main function

def main():
    '''Main function'''
    start = time.clock()
    keywords = input('输入职位:')
    pages = int(input('获取页数:'))
    mode = input('代理模式(y/n):')
    job_list = []

    if mode.lower() == 'n':
        for i in range(1, pages + 1):
            url = 'https://www.zhipin.com/c101210100/h_101210100/?query=' + keywords + '&page=' + str(i)
            html = GetHtmlText(url)
            if html is None:
                print('无法获取网页源代码,爬虫失败')
                break
            else:
                fillJobList(job_list, html)

    elif mode.lower() == 'y':
        rows = PopIp()
        j = 0
        for i in range(1, pages + 1):
            url = 'https://www.zhipin.com/c101210100/h_101210100/?query=' + keywords + '&page=' + str(i)
            while True:
                print(j)
                proxy = ''.join(rows[j])
                html = GetHtmlText_ip(url, proxy)
                if html is not None:
                    print(proxy)
                    fillJobList_ip(job_list, html, proxy)
                    break
                elif j >= len(rows) - 1:  # every proxy has failed, start over from the first one
                    j = 0
                else:
                    j += 1

    PutInMysql(job_list)  # save the results to the database

Scraping speed in proxy mode is largely a matter of luck: with free proxies, maybe 1 request in 5 succeeds, because everyone else is hammering the same free IPs. If speed matters to you, buy paid proxy IPs instead.

Scraping results

The source code on GitHub has been updated accordingly.
