CentOS tutorial: building a simple web crawler with Python

Posted 2015-1-14 20:47:22


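Overview:
This is a simple crawler with a simple job: given a URL, it fetches that page, extracts the URLs on it that meet the criteria, and puts them in a queue. Once the given page has been fetched, the queued URLs become the input and the program fetches those pages in turn. It stops when it reaches a certain depth (set by a parameter). The program saves the fetched page data locally; I use a MySQL database. Let's get started.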
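The fetch, extract, queue loop described above can be sketched without a network or a database. This is only an illustration in Python 3 (the program in this post is Python 2), and it swaps the post's regex scanning for the standard library's html.parser. The names LinkParser, crawl, fetch and SITE, and the example.test URLs, are made up for the sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start, max_depth, fetch):
    """Breadth-first crawl from `start`, following links `max_depth` levels deep.

    `fetch` is a hypothetical callable that returns the HTML for a URL, or
    None when the page cannot be opened. Returns {url: html}.
    """
    pages = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth == max_depth:
            continue  # deep enough: keep the page but do not follow its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(url, href)  # resolve relative links against the page
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# A tiny in-memory "site" stands in for the network (assumption for the demo).
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="b">B</a>',
    "http://example.test/a": '<a href="/c">C</a>',
    "http://example.test/b": "<p>no links</p>",
    "http://example.test/c": '<a href="/d">D</a>',
    "http://example.test/d": "<p>too deep</p>",
}

result = crawl("http://example.test/", 2, SITE.get)
print(sorted(result))
```

With a depth of 2 the start page and two further levels of links are fetched, so the page at /d, three links away, is never requested.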
Setting up the database:
Start mysql and create a database:
create database sp character set utf8;

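Then create a table with three columns: one stores the URL, one stores the raw HTML, and one stores the data with the HTML tags stripped out. The third column is there so that searching later on is more efficient.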

use sp;
create table webdata (url longtext, html longtext, puredata longtext);
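To show how rows go into this three-column table, here is a minimal Python 3 sketch. It uses the standard library's sqlite3 as a stand-in so it runs without a MySQL server; that substitution, the text column types, the save helper and the example.test URL are all assumptions of the sketch. With MySQLdb the placeholder is %s rather than ?.

```python
import sqlite3

# sqlite3 is used only so the sketch runs without a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("create table webdata (url text, html text, puredata text)")


def save(conn, url, html, puredata):
    """Insert one page; the placeholders let the driver handle the quoting."""
    conn.execute("insert into webdata values (?, ?, ?)", (url, html, puredata))
    conn.commit()


# Quotes inside the page do not break the statement.
save(conn, "http://example.test/", '<p class="x">It\'s here</p>', "It's here")
rows = conn.execute("select url, puredata from webdata").fetchall()
print(rows)
```

Passing the values separately from the SQL text matters for a crawler, because fetched HTML is full of quote characters.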

With the database ready, we can start writing code.
The Python program:
I won't explain the program at length; the key parts are commented. The parameters are as follows:

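-u  the URL to crawl
-d  crawl depth: how many levels of pages to follow links through. The number of pages grows geometrically with each extra level. Default is 2
-t  number of concurrent threads. Default is 10
-o  timeout value for urlopen. Default is 20 seconds
-l  path and name of the log file. Defaults to logSpider.log in the current directory
-v  how detailed the log is; takes one of three values, default is normal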

    simple  log error messages only
    normal  error messages plus some status information from the running program
    all     everything, including every URL that was crawled
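The options listed above could be declared like this with argparse in Python 3. This is a sketch only: the program in this post uses optparse, and build_parser and the dest names are made up here. The mapping from -v values to logging levels follows the program's own LEVELS table.

```python
import argparse
import logging

# Same mapping as the LEVELS table in the program.
LEVELS = {"simple": logging.WARNING, "normal": logging.INFO, "all": logging.DEBUG}


def build_parser():
    parser = argparse.ArgumentParser(prog="spider")
    parser.add_argument("-u", dest="url", required=True, help="URL to crawl")
    parser.add_argument("-d", dest="depth", type=int, default=2, help="crawl depth")
    parser.add_argument("-t", dest="threads", type=int, default=10, help="thread count")
    parser.add_argument("-o", dest="timeout", type=int, default=20, help="timeout in seconds")
    parser.add_argument("-l", dest="logfile", default="logSpider.log", help="log file")
    parser.add_argument("-v", dest="verbosity", choices=sorted(LEVELS), default="normal")
    return parser


args = build_parser().parse_args(
    ["-u", "http://www.chinaunix.net", "-d", "3", "-t", "15", "-o", "10"]
)
print(args.depth, args.threads, args.timeout, args.logfile, LEVELS[args.verbosity])
```

Using choices for -v means an unknown level is rejected by the parser itself, so no separate check is needed.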
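A note on timeout. The default timeout on various systems:
    BSD         75 seconds
    Linux       189 seconds
    Solaris     225 seconds
    Windows XP  21 seconds

Changing the MySQL configuration file:
While testing I found that the program runs smoothly at a crawl depth of 2, but at a depth of 3 it fails with the error "2006 - MySQL server has gone away". Following advice I found online, I changed the MySQL configuration file and that solved the problem. The change is to replace max_allowed_packet = 1M with max_allowed_packet = 16M in the configuration file.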
I ran the program with these parameters:
python spider.py -u http://www.chinaunix.net -d 3 -t 15 -o 10

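The program ran for 26 minutes, successfully fetched 4346 pages, and produced 35 KB of log output. On average it fetched 2.8 pages per second. The most common log entry was a URL that could not be opened.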
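The average rate follows directly from the two figures quoted above; a quick check in Python 3:

```python
pages = 4346
minutes = 26

# pages per second = pages / elapsed seconds
rate = pages / (minutes * 60)
print(round(rate, 1))
```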
OK, here is the code:

# -*- coding: utf-8 -*-
from re import search
import urllib2
import MySQLdb
from BeautifulSoup import BeautifulSoup
import threading
from datetime import datetime
from optparse import OptionParser
import sys
import logging
import socket
from urlparse import urlparse
import httplib

# Maps a crawl level to the list of URLs found at that level
URLS = {}
lock = threading.Lock()

class newThread(threading.Thread):
    def __init__(self, level, url, db):
        threading.Thread.__init__(self)
        self.level = level
        self.url = url
        self.db = db

    def run(self):
        global lock
        global log
        for i in self.url:
            log.debug("%s:%s" % (datetime.now(), i))
            print i
            temp, html, data = getURL(i)
            # The URL could not be opened, timed out, or returned a
            # status code other than 200: drop it and go on to the next one
            if not temp:
                continue
            # Hold the lock so this thread can update shared data safely
            with lock:
                self.db.save(i, html, data)
                # Every thread appends the URLs it collected to URLS;
                # the main thread removes the duplicates afterwards
                URLS[self.level].extend(temp)

class saveData():
    def __init__(self):
        self.db = MySQLdb.connect(user="root", db="sp", unix_socket="/tmp/mysql.sock")
        self.cur = self.db.cursor()
        # Start from an empty table on every run
        self.cur.execute("delete from webdata")
        self.commit()
        log.info("%s:Connect database success" % datetime.now())

    def save(self, url, html, pureData):
        global log
        # Pass the values separately so the driver quotes them,
        # instead of formatting them into the SQL string
        SQL = "insert into webdata values(%s,%s,%s)"
        try:
            self.cur.execute(SQL, (url, html, pureData))
        except (MySQLdb.ProgrammingError, MySQLdb.OperationalError), e:
            log.error("%s:%s" % (datetime.now(), e))
            return
        self.commit()

    def commit(self):
        self.db.commit()

    def close(self):
        self.db.close()

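def getURL(url):
    # URLs found on this page
    links = []
    global log
    global source
    global domainName
    try:
        page = urllib2.urlopen(url)
    except (urllib2.URLError, httplib.BadStatusLine):
        log.error("%s:URL CANNOT OPEN----%s" % (datetime.now(), url))
        return ("", "", "")
    else:
        if page.code == 200:
            try:
                html = page.read().decode("gbk", "ignore").encode("utf-8")
            except Exception:
                log.error("%s:TIMEOUT----%s" % (datetime.now(), url))
                print "TIMEOUT"
                return ("", "", "")
        else:
            log.error("%s:RESPONSE CODE IS NOT 200----%s" % (datetime.now(), url))
            return ("", "", "")
    # Turn single quotes into double quotes so that one pattern
    # matches both forms of link below
    html = html.replace("'", '"')
    # Get the data with the HTML elements stripped out
    try:
        pureData = "".join(BeautifulSoup(html).findAll(text=True)).encode("utf-8")
    except UnicodeEncodeError:
        pureData = html
    # The code below looks for qualifying URLs in the page
    rawHtml = html.split("\n")
    for i in rawHtml:
        times = i.count("</a>")
        if times:
            for y in range(times):
                pos = i.find("</a>")
                if pos == -1:
                    break
                # Take the text up to this </a> and move past it right away,
                # so the "continue" statements below cannot skip the advance
                segment = i[:pos]
                i = i[pos + 4:]
                # Find the a tag and extract its link. Links come in two
                # forms, double-quoted and single-quoted
                newURL = search('<a href=".+"', segment)
                if newURL is None:
                    continue
                newURL = newURL.group().split(" ")[1][6:-1]
                if '">' in newURL:
                    newURL = search('.+">', newURL)
                    if newURL is None:
                        continue
                    newURL = newURL.group()[:-2]
                # Skip empty addresses
                if not newURL:
                    continue
                # Relative addresses must be turned into absolute ones
                if not newURL.startswith("http"):
                    if newURL[0] == "/":
                        newURL = source + newURL
                    else:
                        newURL = source + "/" + newURL
                if domainName not in newURL or newURL in links or newURL == url or newURL == url + "/":
                    continue
                links.append(newURL)
    return (links, html, pureData)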

if __name__ == "__main__":
    USAGE = """
    spider -u [url] -d [num] -t [num] -o [secs] -l [filename] -v [level]
        -u: url of a website
        -d: the depth the spider will go to. default is 2
        -t: how many threads work at the same time. default is 10
        -o: url request timeout. default is 20 secs
        -l: assign the log file name and location. default name is logSpider.log
        -v: values are simple normal all. default is normal
            simple----only log the error messages
            normal----error messages and some additional messages
            all----not only messages, but also urls will be logged.
    Examples:
        spider -u http://www.chinaunix.net -t 16 -v normal
    """
    LEVELS = {"simple": logging.WARNING,
              "normal": logging.INFO,
              "all": logging.DEBUG}
    opt = OptionParser(USAGE)
    opt.add_option("-u", type="string", dest="url")
    opt.add_option("-d", type="int", dest="level", default=2)
    opt.add_option("-t", type="int", dest="nums", default=10)
    opt.add_option("-o", type="int", dest="out", default=20)
    opt.add_option("-l", type="string", dest="name", default="logSpider.log")
    opt.add_option("-v", type="string", dest="logType", default="normal")
    options, args = opt.parse_args(sys.argv)
    source = options.url
    level = options.level
    threadNums = options.nums
    timeout = options.out
    logfile = options.name
    logType = options.logType
    if not source or level < 0 or threadNums < 1 or timeout < 1 or logType not in LEVELS.keys():
        opt.print_help()
        sys.exit(1)
    if not source.startswith("http://"):
        source = "http://" + source
    if source.endswith("/"):
        source = source[:-1]
    # Pick the site name out of the host, e.g. chinaunix in www.chinaunix.net
    domainName = urlparse(source)[1].split(".")[-2]
    if domainName in ["com", "edu", "net", "org", "gov", "info", "cn"]:
        domainName = urlparse(source)[1].split(".")[-3]
    socket.setdefaulttimeout(timeout)
    log = logging.getLogger()
    handler = logging.FileHandler(logfile)
    log.addHandler(handler)
    log.setLevel(LEVELS[logType])

    startTime = datetime.now()
    log.info("Started at %s" % startTime)
    subURLS = {}
    threads = []
    for i in range(level + 1):
        URLS[i] = []
    # Initialize: connect to the database
    db = saveData()
    # Get the URLs on the home page
    URLS[0], html, pureData = getURL(source)
    if not URLS[0]:
        log.error("cannot open %s" % source)
        print "cannot open " + source
        sys.exit(1)
    db.save(source, html, pureData)
    for le in range(level):
        nowL = "-------------level %d------------" % (le + 1)
        print nowL
        log.info(nowL)
        # Split the current big URLS list into one smaller list per thread.
        # Slicing by index leaves URLS[le] intact for the duplicate check below
        preNums = len(URLS[le]) / threadNums
        for i in range(threadNums):
            if i == threadNums - 1:
                # The last thread also takes the remainder
                subURLS[i] = URLS[le][i * preNums:]
            else:
                subURLS[i] = URLS[le][i * preNums:(i + 1) * preNums]
        # Empty the thread pool, then add the threads and start them
        threads = []
        for i in range(threadNums):
            t = newThread(le + 1, subURLS[i], db)
            t.setDaemon(True)
            threads.append(t)
        for i in threads:
            i.start()
        # Wait for all threads to finish
        for i in threads:
            i.join()
        nowLevel = le + 1
        # Remove duplicate URLs from the list
        URLS[nowLevel] = list(set(URLS[nowLevel]))
        for i in range(nowLevel):
            for url in URLS[i]:
                if url in URLS[nowLevel]:
                    URLS[nowLevel].remove(url)
    # Write to the database
    # db.commit()
    db.close()
    endTime = datetime.now()
    log.info("Ended at %s" % endTime)
    log.info("Takes %s" % (endTime - startTime))
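The step that hands each thread its own slice of the URL list is easy to get wrong, so here it is in isolation as a small Python 3 sketch. The names split_evenly, run_level and handle are made up for the illustration, and handle stands in for a function like getURL that returns the new URLs found on one page.

```python
import threading


def split_evenly(items, parts):
    """Split `items` into `parts` consecutive chunks; the last takes the remainder."""
    size = len(items) // parts
    chunks = [items[i * size:(i + 1) * size] for i in range(parts - 1)]
    chunks.append(items[(parts - 1) * size:])
    return chunks


def run_level(urls, thread_count, handle):
    """Process `urls` on `thread_count` threads and return the merged results.

    `handle` is a hypothetical callable that takes one URL and returns a
    list of newly found URLs.
    """
    collected = []
    lock = threading.Lock()

    def worker(chunk):
        for url in chunk:
            found = handle(url)
            with lock:  # only one thread at a time touches the shared list
                collected.extend(found)

    threads = [
        threading.Thread(target=worker, args=(chunk,))
        for chunk in split_evenly(urls, thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Duplicates are removed once, after every thread has finished.
    return sorted(set(collected))


urls = ["u%d" % n for n in range(7)]
print(split_evenly(urls, 3))
print(run_level(urls, 3, lambda url: [url + "/x", "shared"]))
```

Each thread gets a different chunk, and the original list is left untouched so it can still be used to filter out URLs seen at earlier levels.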

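Search
With the data stored locally, you can search it. How real search engines find things on the internet from a keyword, I honestly don't know; this is only a demonstration. If you remember the three columns of the database table described earlier, this program needs no explanation. It looks for the word you enter in puredata and, when it finds a match, prints the corresponding URL.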

import MySQLdb
db = MySQLdb.connect(user="root", db="sp", unix_socket="/tmp/mysql.sock")
cur = db.cursor()
# execute() returns the number of rows selected
nums = cur.execute("select * from webdata")
print "%d items" % nums
x = cur.fetchall()
print 'input something to search, "exit" to exit'
while True:
    key = raw_input(">")
    if key == "exit":
        break
    for i in range(nums):
        # Column 2 is puredata, column 0 is url
        if key in x[i][2]:
            print x[i][0]
    print "search finished"
db.close()
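The program above loads every row into memory and filters in Python. A possible alternative is to let the database do the substring test. The Python 3 sketch below again uses sqlite3 as a stand-in for MySQL, and the sample rows, the search helper and the example.test URLs are made up. MySQL also has an INSTR function, but this has not been tried against the crawler's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table webdata (url text, html text, puredata text)")
conn.executemany(
    "insert into webdata values (?, ?, ?)",
    [
        ("http://example.test/a", "<p>linux kernel news</p>", "linux kernel news"),
        ("http://example.test/b", "<p>python crawler</p>", "python crawler"),
    ],
)


def search(conn, key):
    """Return the URLs whose puredata contains `key` as a substring."""
    # instr() is a plain substring test, so % and _ in `key` are not wildcards
    cur = conn.execute(
        "select url from webdata where instr(puredata, ?) > 0 order by url",
        (key,),
    )
    return [row[0] for row in cur]


print(search(conn, "crawler"))
```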

Finally, here is a screenshot of the search results:

[Image: 135055F15-0.jpg, uploaded 2015-1-14 20:47]


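Posted 2015-1-25 17:51:19

Keep a study journal. It is a record of your learning, and I firmly believe it is also the best way to strengthen your confidence in learning.

Posted 2015-2-3 13:13:15

Windows has a graphical interface, while Linux resembles the old DOS with a text interface. Once you run the graphical program X-WINDOWS, Linux can also show a graphical interface, with a start menu, desktop, icons, and so on.

Posted 2015-2-9 04:11:52

Be sure to learn the commands well. Shell is a collective term for a command language, a command interpreter, and a programming language; the shell also handles communication between the user and the operating system.

Posted 2015-2-26 23:54:42

Of course you don't need to set up every service; you can take it slowly. Get hands-on yourself, and don't just wait for others to solve your problems.

Posted 2015-3-8 19:56:18

Today's Linux operating systems, such as redhat, 难点, Red Flag and others, are all built from this one kernel plus other applications (including X).

Posted 2015-3-16 19:12:10

Get familiar with installing Linux and master it; installation is the prerequisite for learning. There are currently two common installation methods:

Posted 2015-3-23 04:10:01

Many applications today are even based on it. But no system is completely perfect.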
