BeautifulSoup

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")
print soup.prettify()
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

The document contains several <a> </a> tags. Accessing soup.a directly returns only the first one;
to get all of them, use the findAll() function.

print soup.a
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
c = soup.findAll("a")
print c[0].string
 Elsie 
print soup.title.string
The Dormouse's story
# .string already returns the tag's text node, so n itself is the string
n = soup.find(id="link3").string
n
u'Tillie'
for i in soup.findAll("a"):
    print i.string
 Elsie 
Lacie
Tillie
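
The difference between attribute access and findAll() described above can be sketched in Python 3 as follows. This is only an illustration under some assumptions: it uses a trimmed copy of the sample HTML, the built-in "html.parser" backend instead of lxml, and find_all(), which is the current spelling of findAll() in bs4. The variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document (assumption: only the links matter here)
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""

# "html.parser" ships with Python, so lxml is not required for this sketch
soup = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first matching tag
first = soup.a
first_id = first["id"]

# find_all() returns every matching tag as a list-like ResultSet
all_links = soup.find_all("a")
link_ids = [a["id"] for a in all_links]

# .string of the first link is an HTML comment, hence the surrounding spaces;
# strip() removes them
link_texts = [a.string.strip() for a in all_links]

print(first_id)
print(link_ids)
print(link_texts)
```

Note that the first link's .string is a Comment object rather than ordinary text, which is why the output above shows " Elsie " with spaces around it.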
#coding=utf-8

import urllib2

from bs4 import BeautifulSoup

def getHtml(url):
    # Fetch the page and return its raw HTML
    page = urllib2.urlopen(url)
    html = page.read()
    return html

html_test = getHtml("https://movie.douban.com/subject/26270502/?from=showing")
movie = BeautifulSoup(html_test, "lxml")
# movie=movie.prettify()

print movie.find(property="v:votes").string
62113
print movie.body.find(property="v:itemreviewed")
<span property="v:itemreviewed">绣春刀II:修罗战场</span>
print movie.body.find(rel="v:directedBy").string
路阳
print movie.body.find(rel="v:starring")
<a href="/celebrity/1077991/" rel="v:starring">张震</a>

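Although the fetched page is UTF-8, printing the result of findAll() directly shows escaped code points (\uXXXX) instead of readable characters, as shown below. findAll() returns a list, and in Python 2 printing a list shows the escaped representation of each element. The fix is to loop over the list and print each item.
Reference: http://blog.csdn.net/fk103/article/details/52972131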

print movie.body.findAll(rel="v:starring")
[<a href="/celebrity/1077991/" rel="v:starring">\u5f20\u9707</a>, <a href="/celebrity/1052359/" rel="v:starring">\u6768\u5e42</a>, <a href="/celebrity/1274761/" rel="v:starring">\u5f20\u8bd1</a>, <a href="/celebrity/1312940/" rel="v:starring">\u96f7\u4f73\u97f3</a>, <a href="/celebrity/1318720/" rel="v:starring">\u8f9b\u82b7\u857e</a>, <a href="/celebrity/1275482/" rel="v:starring">\u91d1\u58eb\u6770</a>, <a href="/celebrity/1376605/" rel="v:starring">\u5218\u7aef\u7aef</a>, <a href="/celebrity/1351719/" rel="v:starring">\u6b66\u5f3a</a>, <a href="/celebrity/1342478/" rel="v:starring">\u6768\u8f76</a>, <a href="/celebrity/1314374/" rel="v:starring">\u674e\u5a9b</a>, <a href="/celebrity/1315721/" rel="v:starring">\u5434\u6653\u4eae</a>, <a href="/celebrity/1317230/" rel="v:starring">\u674e\u6d2a\u6d9b</a>, <a href="/celebrity/1332806/" rel="v:starring">\u5218\u5cf0\u8d85</a>, <a href="/celebrity/1274820/" rel="v:starring">\u8881\u6587\u5eb7</a>, <a href="/celebrity/1371225/" rel="v:starring">\u9a6c\u8d6b</a>, <a href="/celebrity/1326377/" rel="v:starring">\u5218\u4ead\u4f5c</a>, <a href="/celebrity/1322077/" rel="v:starring">\u59dc\u6653\u51b2</a>, <a href="/celebrity/1376606/" rel="v:starring">\u9648\u9f50\u5a01</a>, <a href="/celebrity/1319834/" rel="v:starring">\u738b\u4ec1\u541b</a>]
for a in movie.body.findAll(rel="v:starring"):
    print a.string.encode("utf-8")
张震
杨幂
张译
雷佳音
辛芷蕾
金士杰
刘端端
武强
杨轶
李媛
吴晓亮
李洪涛
刘峰超
袁文康
马赫
刘亭作
姜晓冲
陈齐威
王仁君
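
The same loop-over-the-list pattern can be sketched in Python 3, where strings are Unicode by default and no .encode("utf-8") call is needed. This is a minimal sketch under stated assumptions: it parses a small inline fragment (two of the links from the output above) rather than the live page, uses the built-in "html.parser" backend, and uses ascii() only to imitate the escaped form that Python 2 printed. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Inline fragment standing in for the fetched page (assumption: no network access)
html = (
    '<a href="/celebrity/1077991/" rel="v:starring">张震</a>'
    '<a href="/celebrity/1052359/" rel="v:starring">杨幂</a>'
)

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet of matching tags
stars = soup.find_all(rel="v:starring")

# Extract the text of each tag as a plain str
names = [a.get_text() for a in stars]

# Printing item by item shows readable characters
for name in names:
    print(name)

# ascii() escapes non-ASCII characters, similar to what Python 2 showed
# when the whole list was printed at once
escaped = ascii(names[0])
print(escaped)
```

Collecting the text into a list first also makes it easy to join or store the names later, instead of only printing them.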
2017-07-25 13:17 36