Data cleaning with regular expressions: re.findall("pattern for the data you want", fetched_html) extracts the wanted data from the fetched HTML source. Before using it, import the re module at the top of your code with import re.
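A minimal sketch of this idea; the HTML string here is hard-coded for illustration, standing in for what urlopen(...).read().decode() would return:

```python
import re

# pretend this string is the HTML fetched from a page
html = "<title>百度一下,你就知道</title>"

# a capture group (...) makes findall return just the captured text
titles = re.findall(r"<title>(.*?)</title>", html)
print(titles)  # ['百度一下,你就知道']
```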
"
['百度一下,你就知道']

2.3 Custom requests

A custom request object is created with request.Request(). The output is:
Note: data is a list object.
Now change the code above to use a mobile User-Agent:
The following code prints the HTML returned by the request. It occasionally raises an error; that is normal, just run it a few more times.
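Such a request can be sketched as follows; the URL and the mobile User-Agent string are illustrative, not taken from the original text:

```python
import urllib.request

url = "http://www.baidu.com"
headers = {
    # a mobile User-Agent makes the server return the mobile version of the page
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2 like Mac OS X) "
                  "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Mobile/15E148"
}
req = urllib.request.Request(url, headers=headers)
# html = urllib.request.urlopen(req).read().decode("utf-8")  # the actual network call
print(req.get_header("User-agent"))
```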
re.search("查找的字符串","原数据")
re.findall("查找的字符串","原数据")返回的是一个列表
import re

with open(r"C:/Users/MrFlySand/Desktop/testPy/英语.txt", "rb") as f:
    data = f.read().decode()
# print(data)
n1 = len(re.findall("homework", data))
n2 = len(re.findall("pros", data))
n3 = len(re.findall("cons", data))
print("homework:", n1, "\n", "pros:", n2, "\n", "cons:", n3)
print(re.findall("cons", data))

The output is:
Atom: the basic unit of matching in a regular expression.
Metacharacter: a character with special meaning in a regular expression.
importrestr="qq:2602629646,./飞沙MrFlySand"pat=r"qq"print(re.search(pat,str))pat=r"26\w"print(re.search(pat,str))pat=r"[A-Z][a-z]"print(re.search(pat,str))pat=r"[0-9][0-9][0-9]"print(re.search(pat,str))输出内容如下:
importrestr="2602629646,./FlySand飞沙MrFlySand112233"#从前面匹配任意2个字符pat=".."print(re.search(pat,str))#从开头匹配,以26开头+任意数字pat="^26\d"print(re.search(pat,str))#从末尾匹配,任意符号结尾pat=".$"print(re.search(pat,str))#从末尾匹配,任意2个字符+Fly结尾pat="..Fly$"print(re.search(pat,str))#任意个字符pat=".*"print(re.search(pat,str))##M开头,d结尾,中间任意个字符pat="M.*d"print(re.search(pat,str))##从字符串前面往后匹配,str开头就是2。匹配的结果是:0~n个2+3pat="2*3"print(re.search(pat,str))##从字符串前面往后匹配,str开头就是2。匹配的结果是:0~n个1+2pat="1*2"print(re.search(pat,str))#重复0次或者1次前面的原子。匹配的结果是:0~1个2+6pat="26"print(re.search(pat,str))#+重复0次或者1次前面的原子。匹配的结果是:1~n个2+3pat="2+3"print(re.search(pat,str))输出结果如下:
importrestr="2602629646"pat=r"\d{6}"#匹配6个数字print(re.search(pat,str))str="2602629646"pat=r"\d{11}"#匹配11个数字print(re.search(pat,str))str="2602629646"pat=r"\d{6,8}"#匹配6~8个数字print(re.search(pat,str))str="2602629646"pat=r"\d{6,8}"#","和"8"之间有空格,无法正常匹配print(re.search(pat,str))输出内容如下:
The contents of data:
With XPath, we can parse an HTML file as an XML document and then use XPath expressions to look up HTML nodes and elements. This is more convenient than regular expressions.
We need to install the lxml module to support XPath operations.
In a cmd window, run: pip install lxml
from lxml import etree

text = ''' mrflysand'''
# etree.HTML() parses the string into an HTML element object
html = etree.HTML(text)
print(html)
# tostring() serializes the object back to bytes; decode() turns it into a readable string
result = etree.tostring(html, encoding="utf-8").decode()
print(result)

The output is:

The text in the txt file:

import re

# read every line of a document into one string
def readAll(url):
    f = open(url, encoding="utf-8")
    strs = ""
    for line in f.readlines():
        strs = line.strip() + strs
    return strs

strs = readAll("C:/Users/MrFlySand/Desktop/1.txt")
# match words (runs of letters)
pat = re.compile(r"[A-Za-z]+")
data = pat.findall(strs)
# count how often each word appears, using a dict
counts = {}
for word in data:
    if word in counts:
        counts[word] = counts[word] + 1
    else:
        counts[word] = 1
# sort the dict items by their value, descending
items = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for key in items:
    print(key)

The output:

The content of the html file:

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/2.html", parser=parser)
result = etree.tostring(tree, encoding="utf-8").decode()
print(result)

The output is:
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(r'C:/mrflysand.html', parser=parser)

5.4 Getting all tags of one kind

result = tree.xpath("//tag") selects every tag of that kind; for example, result = tree.xpath("//p") selects all p tags.
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/index.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//p")
for i in range(0, 10):
    print(result[i].text)

The HTML code is:
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/index.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//h1[@itemprop='name']")
for i in range(0, len(result)):
    print(result[i].text)

The output is:
Note: in this for loop, i is not followed by .text, but the examples in sections 5.5 and 5.7 do use .text.
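The difference matters: printing an element gives the Element object itself, while .text gives the text inside the tag. A self-contained sketch (the HTML snippet here is invented for illustration):

```python
from lxml import etree

html = etree.HTML("<ul><li><span>mrflysand</span></li></ul>")
result = html.xpath("//span")
print(result[0])       # the Element object itself, e.g. <Element span at 0x...>
print(result[0].text)  # the text inside the tag: mrflysand
```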
tree.xpath("//section/a/h1"),/表示下一级。
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/index.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//section/a/h1")
print(result)
for i in result:
    print(i.text)

The output is:
[
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//li//span")  # // selects descendants at any depth
for i in result:
    print(i.text)

The code in 1.html:
mrflysand
mrflysand1
飞沙
飞沙1

When the Python code is result = tree.xpath("//li/span") instead, the output is:
飞沙
飞沙1

5.7.3 Example 3

The 1.html code is as follows:
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//li//span//@class")
for i in result:
    print(i)

The output is:
abc
a1

5.8 Getting tag content and tag names

5.8.1 Example 1: getting the second-to-last tag

The html code is in section 5.7.3.
//li/span[last()] selects the last span element under li.
//li/span[last()-1] selects the second-to-last span element under li.
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//li/span[last()-1]")
print(result)
for i in result:
    print(i.text)

The output is:
[
mrflysand1
飞沙

5.8.2 Example 2: getting the second-to-last tag

tree.xpath("//li/span") selects every span directly under an li; the matches are 飞沙 and 飞沙1.
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//li/span")
print(result[-2].text)

The output is:
飞沙

5.8.3 Example 3: getting the text of tags with a given class

tree.xpath("//*[@class='abc']") selects, among tags of any name, those whose class value is 'abc'.
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse("C:/Users/123/Desktop/1.html", parser=parser)
html = etree.tostring(tree, encoding="utf-8").decode()
result = tree.xpath("//*[@class='abc']")
print(result[0].text)
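The same query can be tried without an external file; the HTML snippet below is invented for illustration, since the full contents of 1.html are not shown:

```python
from lxml import etree

html = etree.HTML(
    "<ul>"
    "<li><span class='abc'>mrflysand</span></li>"
    "<li><p class='abc'>飞沙</p></li>"
    "<li><span class='xyz'>ignored</span></li>"
    "</ul>"
)
# //*[@class='abc'] matches any tag whose class attribute is exactly 'abc'
result = html.xpath("//*[@class='abc']")
for node in result:
    print(node.tag, node.text)  # span mrflysand / p 飞沙
```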