Data cleaning with regular expressions: re.findall("pattern for the data you want", fetched_html) extracts the wanted data from the fetched HTML source. Before using it, import the re module at the top of your code with import re.
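A minimal sketch of this idea; the HTML string here is hard-coded for illustration, standing in for what urlopen(...).read().decode() would return:

```python
import re

# pretend this string is the HTML fetched from a page
html = "<title>百度一下,你就知道</title>"

# a capture group (...) makes findall return just the captured text
titles = re.findall(r"<title>(.*?)</title>", html)
print(titles)  # ['百度一下,你就知道']
```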
"
['百度一下,你就知道']

2.3 Custom requests

A custom request object is created with request.Request(). The output is:
Note: data is a list object.
Now change the code above to use a mobile User-Agent:
The following code prints the HTML returned by the request. It occasionally raises an error; that is normal, just run it a few more times.
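Such a request can be sketched as follows; the URL and the mobile User-Agent string are illustrative, not taken from the original text:

```python
import urllib.request

url = "http://www.baidu.com"
headers = {
    # a mobile User-Agent makes the server return the mobile version of the page
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2 like Mac OS X) "
                  "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Mobile/15E148"
}
req = urllib.request.Request(url, headers=headers)
# html = urllib.request.urlopen(req).read().decode("utf-8")  # the actual network call
print(req.get_header("User-agent"))
```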
re.search("查找的字符串","原数据")
re.findall("查找的字符串","原数据")返回的是一个列表
import re

with open(r"C:/Users/MrFlySand/Desktop/testPy/英语.txt", "rb") as f:
    data = f.read().decode()
# print(data)
n1 = len(re.findall("homework", data))
n2 = len(re.findall("pros", data))
n3 = len(re.findall("cons", data))
print("homework:", n1, "\n", "pros:", n2, "\n", "cons:", n3)
print(re.findall("cons", data))

The output is:
Atom: the basic unit of matching in a regular expression.
Metacharacter: a character with special meaning in a regular expression.
importrestr="qq:2602629646,./飞沙MrFlySand"pat=r"qq"print(re.search(pat,str))pat=r"26\w"print(re.search(pat,str))pat=r"[A-Z][a-z]"print(re.search(pat,str))pat=r"[0-9][0-9][0-9]"print(re.search(pat,str))输出内容如下:
importrestr="2602629646,./FlySand飞沙MrFlySand112233"#从前面匹配任意2个字符pat=".."print(re.search(pat,str))#从开头匹配,以26开头+任意数字pat="^26\d"print(re.search(pat,str))#从末尾匹配,任意符号结尾pat=".$"print(re.search(pat,str))#从末尾匹配,任意2个字符+Fly结尾pat="..Fly$"print(re.search(pat,str))#任意个字符pat=".*"print(re.search(pat,str))##M开头,d结尾,中间任意个字符pat="M.*d"print(re.search(pat,str))##从字符串前面往后匹配,str开头就是2。匹配的结果是:0~n个2+3pat="2*3"print(re.search(pat,str))##从字符串前面往后匹配,str开头就是2。匹配的结果是:0~n个1+2pat="1*2"print(re.search(pat,str))#重复0次或者1次前面的原子。匹配的结果是:0~1个2+6pat="26"print(re.search(pat,str))#+重复0次或者1次前面的原子。匹配的结果是:1~n个2+3pat="2+3"print(re.search(pat,str))输出结果如下:
importrestr="2602629646"pat=r"\d{6}"#匹配6个数字print(re.search(pat,str))str="2602629646"pat=r"\d{11}"#匹配11个数字print(re.search(pat,str))str="2602629646"pat=r"\d{6,8}"#匹配6~8个数字print(re.search(pat,str))str="2602629646"pat=r"\d{6,8}"#","和"8"之间有空格,无法正常匹配print(re.search(pat,str))输出内容如下:
The contents of data:
With XPath, we can parse an HTML file as an XML document and then use XPath expressions to look up HTML nodes and elements. This is more convenient than regular expressions.
We need to install the lxml module to support XPath operations.
In a cmd window, run: pip install lxml
from lxml import etree

text = ''' mrflysand'''
# etree.HTML() parses the string into an HTML element object
html = etree.HTML(text)
print(html)
# tostring() serializes the object back to bytes; decode() turns it into a readable string
result = etree.tostring(html, encoding="utf-8").decode()
print(result)

The output is:

The text in the txt file:

import re

# read every line of a document into one string
def readAll(url):
    f = open(url, encoding="utf-8")
    strs = ""
    for line in f.readlines():
        strs = line.strip() + strs
    return strs

strs = readAll("C:/Users/MrFlySand/Desktop/1.txt")
# match words (runs of letters)
pat = re.compile(r"[A-Za-z]+")
data = pat.findall(strs)
# count how often each word appears, using a dict
counts = {}
for word in data:
    if word in counts:
        counts[word] = counts[word] + 1
    else:
        counts[word] = 1
# sort the dict items by their value, descending
items = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for key in items:
    print(key)

The output:

The content of the html file:

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/2.html", parser=parser)
result = etree.tostring(tree, encoding="utf-8").decode()
print(result)

The output is:
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(r'C:/mrflysand.html', parser=parser)

5.4 Getting all tags of one kind

result = tree.xpath("//tag") selects every tag of that kind; for example, result = tree.xpath("//p") selects all p tags.
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/index.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//p")
for i in range(0, 10):
    print(result[i].text)

The HTML code is:
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/index.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//h1[@itemprop='name']")
for i in range(0, len(result)):
    print(result[i].text)

The output is:
Note: in this for loop, i is not followed by .text, but the examples in sections 5.5 and 5.7 do use .text.
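The difference matters: printing an element gives the Element object itself, while .text gives the text inside the tag. A self-contained sketch (the HTML snippet here is invented for illustration):

```python
from lxml import etree

html = etree.HTML("<ul><li><span>mrflysand</span></li></ul>")
result = html.xpath("//span")
print(result[0])       # the Element object itself, e.g. <Element span at 0x...>
print(result[0].text)  # the text inside the tag: mrflysand
```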
tree.xpath("//section/a/h1"),/表示下一级。
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/index.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//section/a/h1")
print(result)
for i in result:
    print(i.text)

The output is:
[
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//li//span")  # // selects descendants at any depth
for i in result:
    print(i.text)

The code in 1.html:
mrflysand
mrflysand1
飞沙
飞沙1

When the Python code is result = tree.xpath("//li/span") instead, the output is:
飞沙
飞沙1

5.7.3 Example 3

The 1.html code is as follows:
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//li//span//@class")
for i in result:
    print(i)

The output is:
abc
a1

5.8 Getting tag content and tag names

5.8.1 Example 1: getting the second-to-last tag

The html code is in section 5.7.3.
//li/span[last()] selects the last span element under li.
//li/span[last()-1] selects the second-to-last span element under li.
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//li/span[last()-1]")
print(result)
for i in result:
    print(i.text)

The output is:
[
mrflysand1
飞沙

5.8.2 Example 2: getting the second-to-last tag

tree.xpath("//li/span") selects every span directly under an li; the matches are 飞沙 and 飞沙1.
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//li/span")
print(result[-2].text)

The output is:
飞沙

5.8.3 Example 3: getting the text of tags with a given class

tree.xpath("//*[@class='abc']") selects, among tags of any name, those whose class value is 'abc'.
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//*[@class='abc']")
print(result[0].text)
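The same query can be tried without an external file; the HTML snippet below is invented for illustration, since the full contents of 1.html are not shown:

```python
from lxml import etree

html = etree.HTML(
    "<ul>"
    "<li><span class='abc'>mrflysand</span></li>"
    "<li><p class='abc'>飞沙</p></li>"
    "<li><span class='xyz'>ignored</span></li>"
    "</ul>"
)
# //*[@class='abc'] matches any tag whose class attribute is exactly 'abc'
result = html.xpath("//*[@class='abc']")
for node in result:
    print(node.tag, node.text)  # span mrflysand / p 飞沙
```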