Using Python to Access Web Data: Notes

I am not taking notes on the first two sub-courses because they are very basic.

Regular Expressions

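I never studied regular expressions properly before, so this is a good chance to learn them.

First, regular expressions in Python are basically the same as in Java. You could say that the core syntax is much the same in every programming language, or that regular expressions are an independent feature that each of these languages has to support. So it does not really matter which language you learn them in.

The course material lists: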

Character	Meaning
^	Matches the beginning of a line
$	Matches the end of the line
.	Matches any character
\s	Matches whitespace
\S	Matches any non-whitespace character
*	Repeats a character zero or more times
*?	Repeats a character zero or more times (non-greedy)
+	Repeats a character one or more times
+?	Repeats a character one or more times (non-greedy)
[aeiou]	Matches a single character in the listed set
[^XYZ]	Matches a single character not in the listed set
[a-z0-9]	The set of characters can include a range
(	Indicates where string extraction is to start
)	Indicates where string extraction is to end

Using regular expressions requires importing the module:

import re

For the functions and syntax, just read the official documentation: Regular expression operations
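A few rows of the table in action may help. This is only a small sketch; the sample strings are made up for illustration and are not from the course material.

```python
import re

# Made-up sample line, for illustration only
line = "From: alice@example.com and bob@example.org :"

# Greedy: '+' takes as much as it can, so the match runs to the last ':'
greedy = re.findall('^F.+:', line)

# Non-greedy: '+?' stops at the first ':' that lets the pattern succeed
lazy = re.findall('^F.+?:', line)

# Parentheses: findall() returns only the part inside ( )
hosts = re.findall('@([^ ]+)', line)

# A set with a range, and a negated set
digits = re.findall('[0-9]+', 'room 12, floor 3')
non_vowels = re.findall('[^aeiou ]', 'a bee')

print(greedy)
print(lazy)
print(hosts)
print(digits)
print(non_vowels)
```

The difference between `greedy` and `lazy` is the one the table describes for `+` and `+?`.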

Straight to the exercise:

Problem: extract all the numbers in the file and compute the sum of the numbers

Code:

import re

# Read the whole test file into a single string
with open('test.txt') as file:
    one_line = file.read()

# Extract every integer; findall() returns only the parenthesised group
nums = re.findall('[^0-9]*([0-9]+)[^0-9]*', one_line)

print(len(nums))
total = 0
for num in nums:
    total += int(num)
    print('$', num)

print(total)

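The regular expression here is [^0-9]*([0-9]+)[^0-9]*: a run of digits in the middle, with zero or more non-digit characters before and after it.

The logic above is simple and correct, but it is not what I wrote at first. My first idea was [^0-9]([0-9]+)[^0-9], on the assumption that every number has exactly one non-digit character before it and one after it. But for this example: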

Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative
7 and rewarding activity.  You can write programs for
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
someone else solve a problem.  This book assumes that
everyone needs to know how to program ...

the output was:

5
$ 7746
$ 1929
$ 7
$ 8837
$ 128
18647

The correct output should be:

7
$ 7746
$ 12
$ 1929
$ 8827
$ 7
$ 8837
$ 128
27486

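So 12 and 8827 are missing. The reason is that the match for 7746 was actually " 7746\n", so by the time the scan reached 12 the \n in front of it had already been consumed and could not be used again. Likewise, the match for 1929 was " 1929 " including the trailing space, which left no character in front of 8827 to match [^0-9].

That is how findall() works: it returns non-overlapping matches, scanning left to right, so a character consumed by one match is not available to the next.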
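The behaviour can be reproduced on a shortened input. The sketch below uses a made-up string, and the lookaround pattern is my own suggestion rather than something from the course: lookarounds test the neighbouring characters without consuming them.

```python
import re

# Shortened, made-up input with the same shape as the example above
text = "x 7746\n12 1929 8827\ny"

# First attempt: each match consumes one non-digit on both sides
strict = re.findall('[^0-9]([0-9]+)[^0-9]', text)

# Lookarounds check the neighbours but do not consume them
lookaround = re.findall('(?<![0-9])[0-9]+(?![0-9])', text)

# '+' is greedy, so the bare pattern already takes each full run of digits
simple = re.findall('[0-9]+', text)

print(strict)
print(lookaround)
print(simple)
```

For this exercise the bare `[0-9]+` is enough; the surrounding `[^0-9]` parts are not needed at all.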

To be continued…

Networks and Sockets

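I feel like a fake networking student: my understanding of networks is still too narrow and one-sided, and I will have to catch up on it later.

Using sockets in Python is very simple.

The code below requests a page provided by the course and prints the response: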

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode(), end='')  # end='' avoids an extra newline after every 512-byte chunk

mysock.close()

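Here the connection is made directly with a domain name (when I studied sockets before I always used IP addresses, since that was on a LAN, and I have since forgotten most of it). Port 80 is the standard HTTP port, and the file to fetch is named in the GET request. I do not yet understand the trailing HTTP/1.0\r\n\r\n, nor what encode() and decode() actually do.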
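On the trailing part of the request: HTTP/1.0 is the protocol version, each \r\n ends one line, and the empty line (the second \r\n) tells the server that the headers are finished. The response uses the same blank line to separate its headers from its body. The sketch below shows this without touching the network; `build_request` and `split_response` are hypothetical helper names, the sample response is made up, and unlike the course code it sends the path plus a Host header rather than the full URL.

```python
def build_request(host, path):
    """Build a minimal HTTP/1.0 GET request as bytes (hypothetical helper)."""
    lines = [
        'GET ' + path + ' HTTP/1.0',  # request line: method, target, version
        'Host: ' + host,              # a single header line
        '',                           # empty line marks the end of the headers
        '',
    ]
    return '\r\n'.join(lines).encode()


def split_response(raw):
    """Split a raw HTTP response into (header text, body bytes)."""
    head, _, body = raw.partition(b'\r\n\r\n')
    return head.decode(), body


request = build_request('data.pr4e.org', '/intro-short.txt')
print(request)

# Made-up response bytes, standing in for what recv() would return
sample = b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello'
headers, body = split_response(sample)
print(headers.splitlines()[0])
print(body)
```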

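Also note that the protocol here is http. https is different: it wraps the same HTTP exchange in TLS encryption and uses port 443 by default, so a plain socket like this one is not enough for it.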

Programs that Surf the Web

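This section mainly covers character encodings in Python and on the network.

It first explains why, in the socket code above, the request has to go through encode() before it is sent and the received data through decode() before it is used. First, the vast majority of text on the web today is UTF-8, so it is a reasonable default; UTF-8 can represent every Unicode character. Second, inside Python 3 every string is a sequence of Unicode characters, whereas a socket only carries bytes. A Unicode string and its UTF-8 bytes are not interchangeable, so encode() and decode() are needed whenever text goes out or comes in.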
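A minimal round trip shows the two types involved. The sample text is arbitrary; both methods default to UTF-8 in Python 3.

```python
text = 'caf\u00e9 \u4e2d\u6587'   # a str: 7 Unicode characters

wire = text.encode()              # str -> bytes, UTF-8 by default
back = wire.decode()              # bytes -> str, UTF-8 by default

print(type(text).__name__, len(text))
print(type(wire).__name__, len(wire))
print(back == text)
```

The byte count is larger than the character count because the non-ASCII characters take more than one byte each.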

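On UTF-8 and Unicode: for a quick reference, see the Zhihu question "Unicode 和 UTF-8 有何区别?" (What is the difference between Unicode and UTF-8?).

Unicode appeared to solve the problem that earlier encodings only considered English characters. ASCII is a 7-bit code, so it defines only 128 characters, and even a full 8-bit byte allows at most 256. That is enough for English but not for the rest of the world's scripts. Many separate encodings appeared as a result; China, for example, produced GBK. These are ill-suited to exchanging text between countries, so the Unicode standard was created. It is maintained by the Unicode Consortium and kept in sync with ISO/IEC 10646.

Unicode is best seen as a character set: it assigns every character a unique ID (a code point), so that everyone shares one character table and the old conversions between national encodings are no longer needed. Code points are often written as two bytes, such as U+4E2D, but the range extends to U+10FFFF, so two bytes are not always enough.

The new problem is that storing code points directly at a fixed width wastes space. For characters near the start of the table, such as English letters, the first byte of a two-byte form is just 00 and only the second byte carries the actual number. Network bandwidth is limited, so this waste is unwelcome. That is the motivation for UTF, the Unicode Transformation Format, whose full name explains what it does. UTF-8 in particular was designed in 1992 by Ken Thompson together with Rob Pike.

UTF is a transfer standard for Unicode; in communications terms, it is a way of encoding Unicode. It comes in several forms, including UTF-7, UTF-8, UTF-16 and UTF-32, and by far the most popular today is UTF-8. It is a variable-length encoding that uses one to four bytes per character, which reduces the amount of data sent over the network, so that is what most network traffic uses.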
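The variable-length property is easy to see by encoding a few characters from different parts of the table. The characters below are arbitrary picks; the "-be" codec names are used so that no byte-order mark is added to the counts.

```python
# 'A', e-acute, a CJK character, and an emoji outside the two-byte range
samples = ['A', '\u00e9', '\u4e2d', '\U0001F600']

for ch in samples:
    utf8 = ch.encode('utf-8')
    utf16 = ch.encode('utf-16-be')
    utf32 = ch.encode('utf-32-be')
    # code point, then byte counts in UTF-8, UTF-16, UTF-32
    print('U+%04X' % ord(ch), len(utf8), len(utf16), len(utf32))

utf8_sizes = [len(ch.encode('utf-8')) for ch in samples]
utf16_sizes = [len(ch.encode('utf-16-be')) for ch in samples]
utf32_sizes = [len(ch.encode('utf-32-be')) for ch in samples]
```

UTF-32 is fixed at four bytes, UTF-16 needs two or four, and UTF-8 spends one byte on ASCII and more only where needed.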

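Now the assignment code:

The goal of the assignment is to sum the integers inside the span tags of a web page. This could of course be done by reading the page over a socket and then pulling out the integers with a regular expression.

But the assignment requires BeautifulSoup for this. It can parse the page directly, though parsing as such is not unique to BeautifulSoup: Python already ships with an HTML parser (html.parser). What BeautifulSoup mainly does is step on the landmines of malformed markup for you. It has collected the many weird, non-standard ways pages get written and smooths them over, so that the Soup goes down easier.

Also, BeautifulSoup is a third-party module; install it with pip install beautifulsoup4 and import it as bs4.

Note: the main reason is that HTML is so forgiving; many bizarre ways of writing it raise no error at all…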

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()

# html.parser is the HTML parser included in the standard Python 3 library.
# information on other HTML parsers is here:
# http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the span tags
tags = soup('span')
count = 0
total = 0  # renamed from sum to avoid shadowing the built-in
for tag in tags:
    # The first child of each span is its text, e.g. '90'
    total += int(tag.contents[0])
    count += 1

print("Count", count)
print("Sum", total)

The code simply shows how the library is called. The page used is http://py4e-data.dr-chuck.net/comments_29102.html.

Each row has the form <tr><td>Modu</td><td><span class="comments">90</span></td></tr>, and the task is to pull out the 90.
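For comparison, the same extraction can be sketched with only the standard library's html.parser, which makes it clearer what BeautifulSoup is doing for you. This is not the assignment's method; `SpanSummer` is a hypothetical class name, and the second table row is made up.

```python
from html.parser import HTMLParser


class SpanSummer(HTMLParser):
    """Collect the integer text of every <span> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False

    def handle_data(self, data):
        # Only text that sits inside a span is treated as a number
        if self.in_span and data.strip():
            self.values.append(int(data))


# First row is the format quoted above; the second row is invented
page = ('<table>'
        '<tr><td>Modu</td><td><span class="comments">90</span></td></tr>'
        '<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>'
        '</table>')

parser = SpanSummer()
parser.feed(page)
parser.close()

print('Count', len(parser.values))
print('Sum', sum(parser.values))
```

The event-style callbacks take far more code than soup('span'), which is a fair argument for using BeautifulSoup here.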

Code 2:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL: ')
c = input('Enter count: ')
p = input('Enter position: ')

count = int(c)
posi = int(p)

# count+1 passes: the extra pass prints the final URL reached after `count` jumps
for i in range(count+1):
    print("Retrieving: ", url)
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Retrieve all of the anchor tags
    tags = soup('a')
    url = tags[posi-1].get('href', None)

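Enter a URL, the number of jumps, and the position of the link to follow on each page; the program then behaves like a simple web crawler.

I do not understand the details at all yet; for now I only know the surface behaviour.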
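The link-following logic can be tried without a network by replacing the download with a dictionary lookup. Everything here is a stand-in: `PAGES` is a made-up three-page site, and `LinkCollector` and `follow` are hypothetical names, using the standard library parser instead of BeautifulSoup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


# Made-up in-memory "site": URL -> HTML
PAGES = {
    'a.html': '<a href="b.html">B</a> <a href="c.html">C</a>',
    'b.html': '<a href="c.html">C</a> <a href="a.html">A</a>',
    'c.html': '<a href="a.html">A</a> <a href="b.html">B</a>',
}


def follow(start, count, position, fetch=PAGES.get):
    """Follow the link at 1-based `position`, `count` times; return every URL visited."""
    url = start
    visited = [url]
    for _ in range(count):
        collector = LinkCollector()
        collector.feed(fetch(url))
        url = collector.hrefs[position - 1]
        visited.append(url)
    return visited


path = follow('a.html', 3, 2)
print(path)
```

Passing a different `fetch` function, one that downloads the page, would turn the sketch into the real crawler.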

Web Services and XML

This section is mainly about XML and how Python parses it; the content is simple.

Straight to the code:

import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET

address = input('Enter location: ')
if len(address) < 1: 
    print('invalid location! Enter a location like \"http://py4e-data.dr-chuck.net/comments_42.xml\"')
    exit()

url = address
print('Retrieving', url, '...')
uh = urllib.request.urlopen(url)
data = uh.read()
print('Retrieved', len(data), 'characters')
# print(data.decode())
tree = ET.fromstring(data)

items = tree.findall('.//count')
print("Count", len(items))

total = 0
for item in items:
    total += int(item.text)

print("Sum:", total)

Enter an address, connect to it with urllib, then call xml.etree.ElementTree to parse the data. The result is an XML tree, and all that remains is to add up the values of every count node.
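The parsing step can be tried offline on a small document. The XML below is a made-up sample; I am assuming it resembles the course file only in that it has count elements nested somewhere in the tree.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, for illustration only
sample = '''<commentinfo>
  <comments>
    <comment><name>Ann</name><count>97</count></comment>
    <comment><name>Bob</name><count>90</count></comment>
    <comment><name>Cat</name><count>13</count></comment>
  </comments>
</commentinfo>'''

tree = ET.fromstring(sample)

# './/count' finds count elements at any depth below the root
counts = tree.findall('.//count')
total = sum(int(node.text) for node in counts)

# A path without '//' only follows direct children step by step
names = [c.find('name').text for c in tree.findall('comments/comment')]

print('Count', len(counts))
print('Sum:', total)
print(names)
```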

JSON and the REST Architecture

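This section mainly covers parsing JSON. JSON is similar to XML but more lightweight and simpler, and Python parses it into ordinary data structures such as dictionaries and lists.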

import urllib.request, urllib.parse, urllib.error
import json

# Note that Google is increasingly requiring keys
# for this API
serviceurl = 'http://py4e-data.dr-chuck.net/geojson?'

while True:
    address = input('Enter location: ')
    if len(address) < 1: break

    url = serviceurl + urllib.parse.urlencode(
        {'address': address})

    print('Retrieving', url)
    uh = urllib.request.urlopen(url)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')

    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None

    if not js or 'status' not in js or js['status'] != 'OK':
        print('==== Failure To Retrieve ====')
        print(data)
        continue

    print("Place_id", js["results"][0]["place_id"])

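This uses an API provided by Google to look up location information for the address entered. In practice the code calls the py4e endpoint instead: the location data could change over time, so for grading purposes the course serves a fixed copy through py4e.

As you can see, parsing JSON is very simple, which is one of its big advantages.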
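The parsing step on its own looks like this. The JSON text is a made-up sample that only mimics the fields the code above reads (status, results, place_id); the values are placeholders, not real API output.

```python
import json

# Made-up sample response, for illustration only
sample = '''{
  "status": "OK",
  "results": [
    {"place_id": "example-place-id", "formatted_address": "Example City"}
  ]
}'''

js = json.loads(sample)            # JSON object -> dict, JSON array -> list
place_id = js['results'][0]['place_id']

print(type(js).__name__, type(js['results']).__name__)
print('Place_id', place_id)

# Malformed input raises json.JSONDecodeError
try:
    json.loads('not json')
    parsed_bad_input = True
except json.JSONDecodeError:
    parsed_bad_input = False
```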