Yesterday I wrote some Python code that automatically scraped my own Renren albums; its only apparent use was backing them up. Today I reworked the crawler specifically for Renren: it now fetches all of my friends automatically, visits their pages, and downloads their albums (well suited to classmates with plenty of good-looking friends...). Yesterday's post ended up with so many tags it got too long, and, very tragically, Tencent refused to accept my edits... (ten thousand words omitted).

Renren is a thing very similar to Facebook. Why so similar? Chinese characteristics... Back to the topic: I am writing this down so I do not forget it later.

First, terminology: what is a crawler? According to Baidu Baike: "A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches World Wide Web information according to certain rules. ... A traditional crawler starts from the URLs of one or several seed pages, obtains the URLs on those pages, and, while crawling, keeps extracting new URLs from the current page into a queue until some stop condition of the system is met."

I specialized mine for Renren (in other words, it is useless on any other site). To browse Renren you need an account, which means logging in first; after that the server can recognize you on other pages via a session or a cookie, and for Renren a cookie is enough. Of course, logging in with one browser usually still requires logging in again in another, because browsers generally do not share cookies unless you deliberately go read another browser's cookie store. So for the crawler to crawl Renren, it must first log in... and then keep the cookie.

Browsers and servers talk mostly over HTTP, chiefly with the GET and POST methods. (According to Computer Systems: A Programmer's Perspective, GET accounts for 99% of HTTP requests.) GET mainly sends short data to the server, encoded into the URL itself, while POST can carry longer payloads; publishing this article, for instance, uses POST. You can run `telnet www.google.com 80` and type `GET /` to receive the same thing a browser shows for `http://www.google.com/`. A crawler is no different: it just issues GETs and POSTs over and over...

To grab every visible album of every friend there are two methods: do it by hand, friend by friend and album by album, or hand the job to the computer and let it crawl. Because I am lazy, I chose the second method.

Now comes a string of "to do X, first do Y" sentences:

To get all friends, visit http://friend.renren.com/myfriendlistx.do while logged in. If you open it in a browser, the friend list is split across many pages by JavaScript, but one script block on the page defines a variable named friends that holds every friend's information as tuples like {"id":254905709,"vip":false,"selected":true,"mo":true,"name":"\u5b89\u8feaAndy","head":"http://hdn.xnimg.cn/photos/hdn321/20110612/1600/h_tiny_zFLc_715e000281932f76.jpg","groups":["\u534e\u5357\u7406\u5de5\u5927\u5b66"]}. From this we can extract every friend's id.

To get someone's albums, visit http://www.renren.com/profile.do?id=(their id)&v=photo_ajax&undefined. How was this URL found? When we open someone's homepage and click the albums tab, the page does not reload; AJAX simply GETs that path, receives a fragment of page code in return, and swaps it into the current page. So we can GET the same path ourselves and obtain a fragment containing every album id.

To get all the photos inside an album we rely on a Renren bug that I stumbled on entirely by accident: you can open the photo-reordering page of someone else's album, and on that reordering page every photo of the album is listed. The page is http://photo.renren.com/photo/(user id)/album-(album id)/reorder, and a regular expression pulls out every photo id.

After three rounds of "to do X, first do Y" we now hold every friend's id, every album id of every friend, and every photo id of every album of every friend. Why ids everywhere? Personally I think an integer primary key performs better in a database tuple, and a 32-bit integer occupies only 4 bytes yet can identify 4294967296 things; ids are also convenient to pass between client and server.

What can we do with these ids? On their own, nothing yet. But if we visit http://photo.renren.com/photo/(user id)/photo-(photo id), inside the AJAX-returned code embedded in the page there is a line such as "largeurl":"http:\/\/fmn.rrimg.com\/fmn049\/20110621\/1520\/p_large_S5jA_37eb000165dc5c3f.jpg". That is the photo's real address; strip the `\` characters out of it and it can be downloaded. Related downloads:
Free download address: http://linux.linuxidc.com/
The username and password are both www.linuxidc.com
The files are under /pub/2011/08/25/Python自动下载人人所有好友的相册/

Good, so now we can write an incomplete crawler like the one below. As for Renren's news feed: one could scrape the URLs out of a page, filter them into a priority queue, pop the best URL from the queue, visit it, and repeat until the queue is empty (or some other condition holds)... er, the legendary Chinese pseudocode. The program was tested and works under Ubuntu 11.04; under Windows you may see garbled characters.
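The news-feed idea in the paragraph above (scrape, filter, push into a priority queue, pop the best, repeat) can be sketched in a few lines. Everything here is hypothetical: the link graph stands in for real pages, and the scoring rule is just an illustration of "best first".

```python
import heapq

# Hypothetical link graph standing in for real pages; a real crawler
# would fetch each URL and extract its links with a regex or parser.
FAKE_PAGES = {
    "http://www.renren.com/home": [
        "http://friend.renren.com/myfriendlistx.do",
        "http://photo.renren.com/photo/1/album-2/reorder"],
    "http://friend.renren.com/myfriendlistx.do": [
        "http://www.renren.com/profile.do?id=3"],
    "http://photo.renren.com/photo/1/album-2/reorder": [],
    "http://www.renren.com/profile.do?id=3": [],
}

def score(url):
    # Hypothetical priority: photo pages first, then friend pages.
    if "photo" in url: return 0
    if "friend" in url: return 1
    return 2

def crawl(seed):
    heap = [(score(seed), seed)]       # the priority queue
    seen = {seed}
    order = []
    while heap:                        # until the queue is empty
        _, url = heapq.heappop(heap)   # pop the "best" URL
        order.append(url)
        for link in FAKE_PAGES.get(url, []):  # scrape and filter new URLs
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (score(link), link))
    return order

print(crawl("http://www.renren.com/home"))
```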
# -*- coding:utf-8 -*-
# Filename:main.py
# Author: 华亮
#

from Renren import SuperRenren
import time

def main():
    renren = SuperRenren()
    if renren.Create("Renren account", "Renren password"):
        #renren.PostMsg(time.asctime())
        #renren.PostGroupMsg("387635422", "%s" % time.asctime())
        #renren.DownloadAlbum("333982368", "sss")
        renren.DownloadAllFriendsAlbums()

if __name__ == "__main__":
    main()
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# -*- coding:utf-8 -*-
# Filename:Renren.py
# Author: 华亮
#

from HTMLParser import HTMLParser
from Queue import Empty
from Queue import Queue
from re import match
from sys import exit
from urllib import urlencode
import os
import re
import socket
import threading
import time
import urllib
import urllib2
import shelve


# Mutex guarding console output
GlobalPrintMutex = threading.Lock()
# Mutex guarding writes to config.cfg
GlobalWriteConfigMutex = threading.Lock()
# Mutex guarding the shelve of per-user last-update times
GlobalShelveMutex = threading.Lock()


# Pick the path separator for the current platform
Delimiter = "/" if os.name == "posix" else "\\"

ConfigFilename = "config.cfg"            # ids of photos already downloaded, per album
LastUpdatedFileName = "lastupdated.cfg"  # last update time of every user
UpdateThreashold = 10 * 60               # update interval
# Printing guarded by a lock so threads do not interleave output
def MutexPrint(content):
    GlobalPrintMutex.acquire()
    print content
    GlobalPrintMutex.release()

def MutexWriteFile(file, content):
    GlobalWriteConfigMutex.acquire()
    file.write(content)
    file.flush()
    GlobalWriteConfigMutex.release()


# Turn literal "\uXXXX" escape text into real unicode characters
def Str2Uni(str):
    import re
    pat = re.compile(r"\\u(\w{4})")
    lst = pat.findall(str)
    lst.insert(0, u"")
    return reduce(lambda x, y: x + unichr(int(y, 16)), lst)
#------------------------------------------------------------------------------
# Worker thread that downloads queued files
class Downloader(threading.Thread):
    def __init__(self, urlQueue, failedQueue, file=None):
        threading.Thread.__init__(self)
        self.queue = urlQueue
        self.failedQueue = failedQueue
        self.file = file

    def run(self):
        try:
            while not self.queue.empty():
                pid, url, filename = self.queue.get()
                isfile = os.path.isfile(filename)
                MutexPrint(("  Downloading %s" if not isfile else "  Exists %s") % filename.decode("utf-8"))
                if not isfile: urllib.urlretrieve(url, filename.decode("utf-8"))
                MutexWriteFile(self.file, pid + "\n")
        except Empty:
            pass
        except Exception, e:
            self.failedQueue.put(pid)
            MutexPrint("  Error occured when downloading photo which id = %s" % pid)
            MutexPrint(e)



#------------------------------------------------------------------------------
# HTML parser for a Renren album-list page
class RenrenAlbums(HTMLParser):
    in_key_div = False
    in_ul = False
    in_li = False
    in_a = False
    albumsUrl = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "class" in attrs and attrs["class"] == "big-album album-list clearfix":
            self.in_key_div = True
        elif self.in_key_div:
            if tag == "ul":
                self.in_ul = True
            elif self.in_ul and tag == "li":
                self.in_li = True
            if self.in_li and tag == "a" and "href" in attrs:
                self.in_a = True
                self.albumsUrl.append(attrs["href"])

    def handle_data(self, data):
        pass

    def handle_endtag(self, tag):
        if self.in_key_div and tag == "div":
            self.in_key_div = False
        elif self.in_ul and tag == "ul":
            self.in_ul = False
        elif self.in_li and tag == "li":
            self.in_li = False
        elif self.in_a and tag == "a":
            self.in_a = False


class RenrenRequester:
    """
    Renren requester: logs in and issues authenticated requests
    """
    LoginUrl = "http://www.renren.com/PLogin.do"

    # Takes the username and password
    def Create(self, username, password):
        loginData = {"email": username,
                     "password": password,
                     "origURL": "",
                     "formName": "",
                     "method": "",
                     "isplogin": "true",
                     "submit": "登录"}
        postData = urlencode(loginData)
        cookieFile = urllib2.HTTPCookieProcessor()
        self.opener = urllib2.build_opener(cookieFile)
        req = urllib2.Request(self.LoginUrl, postData)
        result = self.opener.open(req)
        if result.geturl() not in ("http://www.renren.com/home",
                                   "http://guide.renren.com/guide"):
            return False

        rawHtml = result.read()
        # Extract the user id
        useridPattern = re.compile(r'user : {"id" : (\d+?)}')
        self.userid = useridPattern.search(rawHtml).group(1)

        # Find the requestToken
        pos = rawHtml.find('get_check:"')
        if pos == -1: return False
        rawHtml = rawHtml[pos + 11:]
        token = match(r"-\d+", rawHtml)
        if token is None:
            token = match(r"\d+", rawHtml)
            if token is None: return False
        self.requestToken = token.group()
        self.__isLogin = True
        return self.__isLogin

    def GetRequestToken(self):
        return self.requestToken

    def GetUserId(self):
        return self.userid

    def Request(self, url, data=None):
        if self.__isLogin:
            if data:
                encodeData = urlencode(data)
                request = urllib2.Request(url, encodeData)
            else:
                request = urllib2.Request(url)
            result = self.opener.open(request)
            return result
        else:
            return None


class RenrenPostMsg:
    """
    RenrenPostMsg
    Posts a Renren status update
    """
    newStatusUrl = "http://status.renren.com/doing/updateNew.do"

    def Handle(self, requester, param):
        requestToken, msg = param

        statusData = {"content": msg,
                      "isAtHome": "1",
                      "requestToken": requestToken}

        # Request() urlencodes the dict itself
        requester.Request(self.newStatusUrl, statusData)

        return True


class RenrenPostGroupMsg:
    """
    RenrenPostGroupMsg
    Posts a status update to a Renren group
    """
    newGroupStatusUrl = "http://qun.renren.com/qun/ugc/create/status"

    def Handle(self, requester, param):
        requestToken, groupId, msg = param
        statusData = {"minigroupId": groupId,
                      "content": msg,
                      "requestToken": requestToken}
        requester.Request(self.newGroupStatusUrl, statusData)


class RenrenFriendList:
    """
    RenrenFriendList
    Fetches the list of all friends
    """
    def Handler(self, requester, param):
        friendUrl = "http://friend.renren.com/myfriendlistx.do"
        rawHtml = requester.Request(friendUrl).read()

        friendInfoPack = re.search(r"var friends=\[(.*?)\];", rawHtml).group(1)
        friendIdPattern = re.compile(r'"id":(\d+).*?"name":"(.*?)"')
        friendIdList = []
        for id, name in friendIdPattern.findall(friendInfoPack):
            friendIdList.append((id, Str2Uni(name)))

        return friendIdList


class RenrenAlbumDownloader:
    """
    AlbumDownloader
    Downloads albums; the photo ids already downloaded are recorded in
    config.cfg so they are not downloaded again
    """
    threadNumber = 10  # number of download threads

    def Handler(self, requester, param):
        self.requester = requester
        userid, path = param
        self.__DownloadOneAlbum(userid, path)


    # Parse the owner's name out of the html
    def __GetPeopleNameFromHtml(self, rawHtml):
        peopleNamePattern = re.compile(r"<h2>(.*?)<span>")
        # Extract the name
        peopleName = peopleNamePattern.search(rawHtml).group(1).strip()
        return peopleName

    def __GetAlbumsNameFromHtml(self, rawHtml):
        albumUrlPattern = re.compile(r'<a href="(.*?)" stats="album_album"><img.*?/>(.*?)</a>')
        albums = []
        # Redirect each album url to its reorder page, which lists the
        # ids of every photo in that album
        for album_url, album_name in albumUrlPattern.findall(rawHtml):
            albums.append((album_name.strip(), album_url + "/reorder"))
        return albums

    def __GetAlbumPhotos(self, userid, albumUrl):
        # Pattern matching a photo id
        pidPattern = re.compile(r'<li pid="(\d+)".*?>.*?</li>', re.S)
        # Fetch the page listing every photo of the album
        result = self.requester.Request(albumUrl)
        rawHtml = result.read()
        photohtmlurl = []  # page url of each photo
        for pid in pidPattern.findall(rawHtml):
            photohtmlurl.append((pid, "http://photo.renren.com/photo/%s/photo-%s" % (userid, pid)))

        return photohtmlurl


    def __GetRealPhotoUrls(self, photohtmlurl):
        # Visit each photo page and fix up the photo's real url
        # Pattern matching the photo address
        imgPattern = re.compile(r'"largeurl":"(.*?)"')
        imgUrl = []  # (id, real photo url)
        for pid, url in photohtmlurl:
            result = self.requester.Request(url)
            rawHtml = result.read()
            for img in imgPattern.findall(rawHtml):
                imgUrl.append((pid, img.replace("\\", "")))

        return imgUrl

    def __DownloadAlbum(self, savepath, album_name, imgUrl, file):
        # Download every picture of the album
        # Push every file onto the queue
        queue = Queue()
        failedQueue = Queue()
        for pid, url in imgUrl:
            imgname = url.split("/")[-1]
            queue.put((pid, url, savepath + Delimiter + imgname))
        # Start the download threads
        threads = []
        for i in range(self.threadNumber):
            downloader = Downloader(queue, failedQueue, file)
            threads.append(downloader)
            downloader.start()
        # Wait for every thread to finish
        for t in threads:
            t.join()
        # Return the queue of failed photos
        return failedQueue


    # Download one user's albums
    def __DownloadOneAlbum(self, userid, path="albums"):
        #if not self.__isLogin: return
        if not os.path.exists(path.decode("utf-8")): os.mkdir(path.decode("utf-8"))

        albumsUrl = "http://www.renren.com/profile.do?id=%s&v=photo_ajax&undefined" % userid

        try:
            # Fetch the album list
            result = self.requester.Request(albumsUrl)
            rawHtml = result.read()
            # Extract the owner's name
            peopleName = self.__GetPeopleNameFromHtml(rawHtml).strip()
            albums = self.__GetAlbumsNameFromHtml(rawHtml)

            # Create a folder named after the owner
            path += Delimiter + peopleName
            if not os.path.exists(path.decode("utf-8")): os.mkdir(path.decode("utf-8"))

            # Start downloading the albums
            MutexPrint("Enter %s" % peopleName.decode("utf-8"))
            for album_name, albumUrl in albums:
                MutexPrint("Downloading Album: %s" % album_name.decode("utf-8"))
                # Map photo ids to photo-page urls for this album
                photohtmlurl = self.__GetAlbumPhotos(userid, albumUrl)

                # Create a folder named after the album
                album_name = album_name.replace("\\", "")
                album_name = album_name.replace("/", "")
                savepath = path + Delimiter + album_name
                if not os.path.exists(savepath.decode("utf-8")): os.mkdir(savepath.decode("utf-8"))

                #
                newDownloadIdSet = set()
                finishedIdSet = set()
                totalIdSet = set()
                for pid, url in photohtmlurl:
                    totalIdSet.add(pid)

                configFile = savepath + Delimiter + ConfigFilename
                if os.path.isfile(configFile):
                    # Read the ids already finished so their photo pages
                    # need not be fetched again for the large-image urls
                    file = open(configFile.decode("utf-8"), "r")
                    photoIdMap = []
                    for line in file.readlines():
                        pid = line.strip()
                        photoIdMap.append(pid)
                    file.close()
                    finishedIdSet = set(photoIdMap)

                newDownloadIdSet = totalIdSet - finishedIdSet

                newDownloadPhotoHtmlUrl = ((pid, url) for pid, url in photohtmlurl if pid in newDownloadIdSet)

                imgUrl = self.__GetRealPhotoUrls(newDownloadPhotoHtmlUrl)


                # Download the photos
                try:
                    file = open(configFile.decode("utf-8"), "w")
                    for id in finishedIdSet:
                        file.write(id + "\n")
                    file.flush()

                    failedQueue = self.__DownloadAlbum(savepath, album_name, imgUrl, file)
