Our group's work on the source-code analysis and on the statistical link analysis is proceeding in parallel; progress reports on both will be posted later. This post describes how the link relationships are parsed out for the statistical analysis. In one sentence: life is short, I use Python. The basic approach is to walk the pages under the mirror directory, extract the link targets with a regular expression, and write out the link relationships. The resulting file can then serve as input to the next program, which computes page out-degree, in-degree, and PageRank values (a rough sketch of that step follows the listing). Here is the source code:

# coding: utf-8
#

import os, re

rootdir = "/home/xxx/workspace/heritrix/jobs/ccer-20100930010817713/mirror/www.ccer.pku.edu.cn"

# one "source_index target_index" pair per line
dotfile = open("links.data", "w", 4096000)

count = 0
urllist = []  # a URL's index is its position in this list

def append2list(url):
    if url not in urllist:
        urllist.append(url)
    return urllist.index(url)

# static resources that should not be counted as link targets
skip_suffixes = (".css", ".jpg", ".bmp", ".jpeg", ".ico", ".gif", ".pdf",
                 ".ppt", ".doc", ".xls", ".pptx", ".docx", ".xlsx", ".zip", ".png")

def extract(dirr, name):
    global count
    #print "extracting:", dirr, name
    f = open(dirr + "/" + name, "r")
    cururl = "http://" + dirr[dirr.find("www.ccer.pku.edu.cn"):] + "/" + name
    curindex = append2list(cururl)

    # group 1: optional opening quote, group 2: link target, group 3: matching closing quote
    hrefs = re.findall(r"""href=("|')?([^\s'"><()]+)(\1?)""", f.read())
    for href in hrefs:
        # skip mismatched quotes, fragments, self links, mailto/javascript and static resources
        if (href[0] != href[2]
                or href[1] == "#"
                or href[1] == "./"
                or href[1].startswith("mailto:")
                or href[1].startswith("javascript")
                or href[1].endswith(skip_suffixes)):
            continue
        realref = href[1]
        if not realref.startswith("http"):  # relative links
            if ".asp?" in realref:
                realref = realref.replace(".asp?", "", 1) + ".asp"  # file name on disk
            realref = "http://" + dirr[dirr.find("www.ccer.pku.edu.cn"):] + "/" + realref
        #print realref
        refindex = append2list(realref)
        dotfile.write("%d %d\n" % (curindex, refindex))
        count += 1
        if count % 10000 == 0:
            print count
    f.close()

def filter(dummy, dirr, filess):
    for name in filess:
        if os.path.splitext(name)[1] in [".asp", ".htm", ".html"] and os.path.isfile(dirr + "/" + name):
            extract(dirr, name)

os.path.walk(rootdir, filter, None)

dotfile.close()

# dump the URL list so that line i of linkindex.txt is the URL with index i
urlfile = open("linkindex.txt", "w", 4096000)
for url in urllist:
    urlfile.write(url + "\n")
urlfile.close()
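The follow-up program that consumes these two files is not shown in this post, so the following is only a minimal sketch of that step. It assumes links.data contains one "source_index target_index" pair per line and linkindex.txt one URL per line (line i holding the URL with index i), exactly as the script above writes them; the damping factor of 0.85, the fixed 30 iterations, and the simplified treatment of dangling pages are illustrative choices, not what our statistics program actually does.

# sketch only: compute out-degree, in-degree and a basic PageRank from links.data
urls = [line.strip() for line in open("linkindex.txt")]
n = len(urls)

edges = []
outdeg = [0] * n
indeg = [0] * n
for line in open("links.data"):
    src, dst = map(int, line.split())
    edges.append((src, dst))
    outdeg[src] += 1
    indeg[dst] += 1

# plain power iteration; 0.85 and 30 iterations are arbitrary example values,
# and rank mass from dangling pages (out-degree 0) is simply dropped
d = 0.85
pr = [1.0 / n] * n
for _ in range(30):
    newpr = [(1.0 - d) / n] * n
    for src, dst in edges:
        newpr[dst] += d * pr[src] / outdeg[src]
    pr = newpr

for i in sorted(range(n), key=lambda i: -pr[i])[:10]:
    print pr[i], indeg[i], outdeg[i], urls[i]

Out-degree and in-degree fall out of a single pass over the edge list; note that duplicate links between the same pair of pages are counted every time they occur, which matches how the extractor above writes them.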