Our group's work on the source-code analysis and on the statistical link analysis is proceeding in parallel; progress reports on both will be posted later. This post describes how the link relationships are parsed out for the statistical analysis. In one sentence: life is short, I use Python. The basic approach is to walk the pages under the mirror directory, extract the link targets with a regular expression, and write out the link relationships. The resulting file can then serve as input to the next program, which computes page out-degree, in-degree, and PageRank values (a rough sketch of that step follows the listing). Here is the source code:

# coding: utf-8
#

import os, re

rootdir = "/home/xxx/workspace/heritrix/jobs/ccer-20100930010817713/mirror/www.ccer.pku.edu.cn"

# one "source_index target_index" pair per line
dotfile = open("links.data", "w", 4096000)

count = 0
urllist = []  # a URL's index is its position in this list

def append2list(url):
    if url not in urllist:
        urllist.append(url)
    return urllist.index(url)

# static resources that should not be counted as link targets
skip_suffixes = (".css", ".jpg", ".bmp", ".jpeg", ".ico", ".gif", ".pdf",
                 ".ppt", ".doc", ".xls", ".pptx", ".docx", ".xlsx", ".zip", ".png")

def extract(dirr, name):
    global count
    #print "extracting:", dirr, name
    f = open(dirr + "/" + name, "r")
    cururl = "http://" + dirr[dirr.find("www.ccer.pku.edu.cn"):] + "/" + name
    curindex = append2list(cururl)

    # group 1: optional opening quote, group 2: link target, group 3: matching closing quote
    hrefs = re.findall(r"""href=("|')?([^\s'"><()]+)(\1?)""", f.read())
    for href in hrefs:
        # skip mismatched quotes, fragments, self links, mailto/javascript and static resources
        if (href[0] != href[2]
                or href[1] == "#"
                or href[1] == "./"
                or href[1].startswith("mailto:")
                or href[1].startswith("javascript")
                or href[1].endswith(skip_suffixes)):
            continue
        realref = href[1]
        if not realref.startswith("http"):  # relative links
            if ".asp?" in realref:
                realref = realref.replace(".asp?", "", 1) + ".asp"  # file name on disk
            realref = "http://" + dirr[dirr.find("www.ccer.pku.edu.cn"):] + "/" + realref
        #print realref
        refindex = append2list(realref)
        dotfile.write("%d %d\n" % (curindex, refindex))
        count += 1
        if count % 10000 == 0:
            print count
    f.close()

def filter(dummy, dirr, filess):
    for name in filess:
        if os.path.splitext(name)[1] in [".asp", ".htm", ".html"] and os.path.isfile(dirr + "/" + name):
            extract(dirr, name)

os.path.walk(rootdir, filter, None)

dotfile.close()

# dump the URL list so that line i of linkindex.txt is the URL with index i
urlfile = open("linkindex.txt", "w", 4096000)
for url in urllist:
    urlfile.write(url + "\n")
urlfile.close()
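The follow-up program that consumes these two files is not shown in this post, so the following is only a minimal sketch of that step. It assumes links.data contains one "source_index target_index" pair per line and linkindex.txt one URL per line (line i holding the URL with index i), exactly as the script above writes them; the damping factor of 0.85, the fixed 30 iterations, and the simplified treatment of dangling pages are illustrative choices, not what our statistics program actually does.

# sketch only: compute out-degree, in-degree and a basic PageRank from links.data
urls = [line.strip() for line in open("linkindex.txt")]
n = len(urls)

edges = []
outdeg = [0] * n
indeg = [0] * n
for line in open("links.data"):
    src, dst = map(int, line.split())
    edges.append((src, dst))
    outdeg[src] += 1
    indeg[dst] += 1

# plain power iteration; 0.85 and 30 iterations are arbitrary example values,
# and rank mass from dangling pages (out-degree 0) is simply dropped
d = 0.85
pr = [1.0 / n] * n
for _ in range(30):
    newpr = [(1.0 - d) / n] * n
    for src, dst in edges:
        newpr[dst] += d * pr[src] / outdeg[src]
    pr = newpr

for i in sorted(range(n), key=lambda i: -pr[i])[:10]:
    print pr[i], indeg[i], outdeg[i], urls[i]

Out-degree and in-degree fall out of a single pass over the edge list; note that duplicate links between the same pair of pages are counted every time they occur, which matches how the extractor above writes them.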