
Project 1-3: Link Statistics for Link Analysis

Our group's source-code analysis and link-statistics work is proceeding in parallel; progress reports on both will be posted later. This post covers how to parse the link relationships that the statistics stage consumes. In one sentence: life is short, I use Python. The basic approach: walk the pages under the mirror directory, extract link targets with a regular expression, and write out the link pairs.
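Stripped to its essence, that extraction idea can be sketched in isolation (a minimal illustration in Python 3, not the project's code; the pattern mirrors the one used in the script below, and the sample HTML is invented):

```python
import re

# Capture an optional opening quote, the link target, and the
# (possibly empty) closing quote, so callers can check quote balance.
HREF_RE = re.compile(r"""href=("|')?([^\s'"><()]+)(\1?)""")

def extract_hrefs(html):
    """Return the raw href targets found in an HTML string."""
    return [m.group(2) for m in HREF_RE.finditer(html)]

html = """<a href="a.htm">A</a> <a href='b.asp?id=1'>B</a> <a href=c.html>C</a>"""
print(extract_hrefs(html))  # prints ['a.htm', 'b.asp?id=1', 'c.html']
```

The backreference `(\1?)` repeats whichever quote opened the attribute (or matches empty if there was none), which is what lets the full script reject targets whose quotes don't balance.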
The resulting file then serves as input to the next program, which computes page in-/out-degrees and PageRank values. The source (Python 2):

# coding: utf-8

import os, re

rootdir = "/home/xxx/workspace/heritrix/jobs/ccer-20100930010817713/mirror/www.ccer.pku.edu.cn"

dotfile = open("links.data", "w", 4096000)

count = 0
urllist = []

# Extensions whose targets we ignore (stylesheets, images, documents, archives).
SKIP_EXTS = (".css", ".jpg", ".bmp", ".jpeg", ".ico", ".gif", ".pdf", ".ppt",
             ".doc", ".xls", ".pptx", ".docx", ".xlsx", ".zip", ".png")

def append2list(url):
    # Assign each distinct URL a stable integer index.
    if url not in urllist:
        urllist.append(url)
    return urllist.index(url)

def extract(dirr, name):
    global count
    f = open(dirr + "/" + name, "r")
    cururl = "http://" + dirr[dirr.find("www.ccer.pku.edu.cn"):] + "/" + name
    curindex = append2list(cururl)

    # Capture the optional opening quote, the target, and the closing quote;
    # a target counts only if the two quotes match.
    hrefs = re.findall(r"""href=("|')?([^\s'"><()]+)(\1?)""", f.read())
    for href in hrefs:
        if (href[0] != href[2]  # unbalanced quotes: not a real link
                or href[1] == "#"
                or href[1] == "./"
                or href[1].startswith("mailto:")
                or href[1].startswith("javascript")
                or href[1].endswith(SKIP_EXTS)):
            continue
        realref = href[1]
        if not realref.startswith("http"):  # relative link
            if ".asp?" in realref:
                # "page.asp?args" is mirrored on disk as "pageargs.asp"
                realref = realref.replace(".asp?", "", 1) + ".asp"
            realref = "http://" + dirr[dirr.find("www.ccer.pku.edu.cn"):] + "/" + realref
        refindex = append2list(realref)
        dotfile.write("%d %d " % (curindex, refindex))
        count += 1
        if count % 10000 == 0:
            print count
    f.close()

def filter(dummy, dirr, filess):
    for name in filess:
        if os.path.splitext(name)[1] in (".asp", ".htm", ".html") and os.path.isfile(dirr + "/" + name):
            extract(dirr, name)

os.path.walk(rootdir, filter, None)  # Python 2's recursive directory walk

dotfile.close()

urlfile = open("linkindex.txt", "w", 4096000)
for url in urllist:
    urlfile.write(url + " ")
urlfile.close()
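The follow-up degree/PageRank program is not part of this post. As a hedged sketch of what it might do (Python 3; the damping factor 0.85, the iteration count, and the parsing of the whitespace-separated index pairs are all assumptions, not the project's code):

```python
from collections import defaultdict

def pagerank(edges, d=0.85, iters=50):
    """Iterate PageRank over a list of (src, dst) index pairs."""
    nodes = sorted({n for e in edges for n in e})
    out = defaultdict(list)  # src index -> list of dst indices
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - d) / n for v in nodes}
        for src in nodes:
            targets = out[src]
            if targets:
                share = d * pr[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Dangling node: spread its rank uniformly.
                for v in nodes:
                    nxt[v] += d * pr[src] / n
        pr = nxt
    return pr

# Parse a "links.data"-style stream of whitespace-separated index pairs.
tokens = "0 1 0 2 1 2 2 0".split()
pairs = list(zip(map(int, tokens[0::2]), map(int, tokens[1::2])))
ranks = pagerank(pairs)
print(sorted(ranks, key=ranks.get, reverse=True))  # prints [2, 0, 1]
```

The out-degree of each page is simply `len(out[src])`, and the in-degree can be counted the same way from the `dst` column; `linkindex.txt` maps each index back to its URL by position.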