Bilibili Danmaku Scraping and Analysis (yry11)

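First, analyze the structure of Bilibili's web pages, look for patterns, and locate where the danmaku data sits in the page.

Then persist the scraped data and run the analyses on it.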

3. Structural Analysis of the Target Page

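The parameters have the following meanings:

mode: danmaku type (values below 7 are ordinary danmaku)

size: font size

color: text color

pool: danmaku pool ID

author: sender ID

dbid: database record ID (monotonically increasing)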

Use a regular expression to extract the key fields from the XML file. This completes the data acquisition step.
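The extraction step can be sketched as follows. This is a minimal illustration, assuming each danmaku is a `<d>` element whose `p` attribute holds the comma-separated parameters (in the column order used later in this article) and whose body is the comment text. The sample XML, `DANMAKU_RE`, and `parse_danmaku` are hypothetical names made up for the example.

```python
import re

# Hypothetical sample in the assumed layout: <d p="params">text</d>
SAMPLE_XML = (
    '<i>'
    '<d p="12.5,1,25,16777215,1589000000,0,abc123,100001">first comment</d>'
    '<d p="30.0,5,25,255,1589000100,0,def456,100002">second, with comma</d>'
    '</i>'
)

# Group 1 captures the whole p attribute, group 2 the danmaku text.
DANMAKU_RE = re.compile(r'<d p="([^"]*)">(.*?)</d>')

FIELDS = ['stime', 'mode', 'size', 'color', 'date', 'pool', 'author', 'dbid']


def parse_danmaku(xml_text):
    """Return one dict per <d> element found in xml_text."""
    rows = []
    for attrs, text in DANMAKU_RE.findall(xml_text):
        # Pair each parameter with its field name; the text is kept
        # separate so commas inside it cannot shift the columns.
        row = dict(zip(FIELDS, attrs.split(',')))
        row['text'] = text
        rows.append(row)
    return rows


rows = parse_danmaku(SAMPLE_XML)
for row in rows:
    print(row['stime'], row['mode'], row['text'])
```

Capturing the attribute as one group and splitting it afterwards keeps the pattern short and tolerates a different number of parameters.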

1. Web Crawler Design

Import the requests library and use the requests.get method to access the danmaku URL:

import re
import requests

To avoid being blocked by anti-scraping measures, add request headers:

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}

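Then simply send the request.

Combined with the parsing approach described in Section 3, this yields all of the danmaku for the video.

Save the result as a CSV file.

The full code is as follows: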

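def get_data():
    url = input('请输入B站视频链接:')
    # Fetch the video page and extract the cid that identifies its danmaku
    res = requests.get(url, headers=headers)
    cid = re.findall(r'"cid":(\d+)', res.text)[-1]
    # Assumption: the line building the danmaku URL is missing from the
    # source; this XML endpoint is a guess and may need adjusting
    dm_url = 'https://comment.bilibili.com/{}.xml'.format(cid)
    res = requests.get(dm_url, headers=headers)
    xml_content = res.content.decode('utf-8')
    # Assumption: the original pattern was lost in extraction; rebuilt here to
    # yield the 10 columns below (9 fields in the p attribute, then the text)
    re_patern = '<d p="(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?)">(.*?)</d>'
    comments = re.findall(re_patern, xml_content)
    danmus = [','.join(item) for item in comments]
    # Column names (renamed from headers so the request headers are not shadowed)
    columns = ['stime', 'mode', 'size', 'color', 'date', 'pool', 'author', 'dbid', '', 'text']
    danmus.insert(0, ','.join(columns))
    # Save the danmaku data to danmus.csv
    with open('danmus.csv', 'w', encoding='utf_8_sig') as f:
        f.writelines(line + '\n' for line in danmus)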
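Joining fields with ',' by hand produces rows with extra columns whenever the danmaku text itself contains a comma. As an alternative sketch (not the approach used in this article), the standard-library csv module quotes such fields automatically. The column list, the sample rows, and `write_rows` below are hypothetical.

```python
import csv
import io

COLUMNS = ['stime', 'mode', 'size', 'color', 'date', 'pool', 'author', 'dbid', 'text']

# Hypothetical rows; the text of the second one contains a comma.
rows = [
    ['12.5', '1', '25', '16777215', '1589000000', '0', 'abc123', '100001', 'hello'],
    ['30.0', '1', '25', '16777215', '1589000100', '0', 'def456', '100002', 'wait, what'],
]


def write_rows(fileobj, rows):
    """Write a header line and all rows, quoting fields where needed."""
    writer = csv.writer(fileobj)
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# An in-memory buffer stands in for open('danmus.csv', 'w', newline='').
buffer = io.StringIO(newline='')
write_rows(buffer, rows)

# Read the data back to confirm the comma stayed inside one field.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[2][-1])
```

When writing to a real file, opening it with newline='' lets the csv module control line endings itself.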

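Some rows still do not match the uniform format after regex filtering. These rows are skipped when the data is read in, which serves as the cleaning step. The code is as follows: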

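import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv('danmus.csv', on_bad_lines='skip')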

(1) Danmaku word cloud

The implementation is as follows:

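import jieba
import wordcloud
import matplotlib.pyplot as plt

def word_cloud_main():
    # utf_8_sig strips the BOM written when the CSV was saved
    with open('danmus.csv', encoding='utf_8_sig') as f:
        # Skip the header row and keep only the last column (the danmaku text)
        lst = [line.split(',')[-1] for line in f.readlines()[1:]]
    text = "".join(lst)
    words = jieba.cut(text)
    # Count every word of at least two characters
    _dict = {}
    for word in words:
        if len(word) >= 2:
            _dict[word] = _dict.get(word, 0) + 1
    items = list(_dict.items())
    items.sort(key=lambda x: x[1], reverse=True)
    # Font settings so that Chinese text renders correctly
    plt.rcParams['font.family'] = ['sans-serif']
    plt.rcParams['font.size'] = '8'
    plt.rcParams['font.sans-serif'] = ['SimHei']
    print(items)
    w = wordcloud.WordCloud(
        width=1000, height=700,
        background_color="white",
        font_path="msyh.ttc",
        max_words=30,
    )
    w.generate_from_frequencies(_dict)  # build the cloud from the word frequencies
    # Save the word cloud image
    w.to_file("wordcloud.png")

Result: the word cloud is written to wordcloud.png.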
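The frequency count that feeds the word cloud can also be written with collections.Counter. The sketch below assumes the text has already been split into tokens (for example by jieba); the token list and the `top_words` helper are made up for illustration.

```python
from collections import Counter


def top_words(tokens, min_len=2, limit=30):
    """Count tokens of at least min_len characters, most frequent first."""
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(limit)


# Hypothetical tokens standing in for the output of a word segmenter.
tokens = ['哈哈', '前方', '高能', '哈哈', '的', '高能', '哈哈']
top = top_words(tokens)
print(top)
```

A Counter is a dict subclass, so it can be used wherever a plain word-to-frequency mapping is expected.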

(2) Sentiment polarity analysis and pie chart

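from snownlp import SnowNLP
import matplotlib.pyplot as plt

def ans_emotion():
    # Read the danmaku text (last column), skipping the header row
    with open('danmus.csv', encoding='utf_8_sig') as f:
        text = [line.strip().split(',')[-1] for line in f.readlines()[1:]]
    emotions = {
        'positive': 0,
        'negative': 0,
        'neutral': 0,
    }
    for item in text:
        if not item:
            continue  # skip rows with empty text
        score = SnowNLP(item).sentiments  # score between 0 and 1
        if score > 0.6:
            emotions['positive'] += 1
        elif score < 0.4:
            emotions['negative'] += 1
        else:
            emotions['neutral'] += 1
    plt.rcParams['font.size'] = '14'
    plt.pie(list(emotions.values()),
            labels=list(emotions.keys()))  # pie chart labels
    plt.title("弹幕情感极性分析饼图")  # chart title
    plt.savefig('弹幕情感极性分析饼图.png')
    plt.show()

Result: the pie chart is saved to 弹幕情感极性分析饼图.png and displayed.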
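The bucketing rule is easy to check in isolation. The sketch below applies the same 0.6 and 0.4 thresholds to precomputed scores; the `classify` and `tally` helpers and the sample scores are hypothetical, and no sentiment library is needed to run it.

```python
def classify(score, low=0.4, high=0.6):
    """Map a sentiment score in [0, 1] to a polarity label."""
    if score > high:
        return 'positive'
    if score < low:
        return 'negative'
    return 'neutral'


def tally(scores):
    """Count how many scores fall into each polarity bucket."""
    counts = {'positive': 0, 'negative': 0, 'neutral': 0}
    for score in scores:
        counts[classify(score)] += 1
    return counts


# Hypothetical scores; 0.6 and 0.4 sit exactly on the boundaries.
scores = [0.92, 0.15, 0.5, 0.6, 0.4, 0.71]
counts = tally(scores)
print(counts)
```

Because both comparisons are strict, scores of exactly 0.4 or 0.6 count as neutral.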

(3) Danmaku count trend chart

import time
import warnings
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

def line_chart():
    warnings.filterwarnings("ignore")
    col_lst = ['stime', 'mode', 'size', 'color', 'date', 'pool', 'author', 'dbid', '', 'text']
    df.columns = col_lst
    # Convert each Unix timestamp to a YYYY-MM-DD string (local time)
    res_date = pd.Series([time.strftime('%Y-%m-%d', time.localtime(d)) for d in df['date']])
    # Count danmaku per day and sort chronologically
    date_count = res_date.value_counts().sort_index()
    x = list(date_count.index)   # dates
    y = list(date_count.values)  # counts
    print(x)
    print(y)
    fig1, ax = plt.subplots(figsize=(14, 9))
    ax.plot(x, y)
    # Label every 20th date plus the last one; the last valid index is len(x) - 1
    xticks = list(range(0, len(x), 20))
    if xticks[-1] != len(x) - 1:
        xticks.append(len(x) - 1)
    xlabels = [x[i] for i in xticks]
    ax.set_xticks(xticks)
    ax.set_xticklabels(xlabels, rotation=80)
    ymajorLocator = MultipleLocator(10)
    ax.yaxis.set_major_locator(ymajorLocator)
    plt.show()
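The two data steps behind this chart, counting danmaku per day and choosing which dates to label, can be sketched without pandas or matplotlib. The sample timestamps and the `daily_counts` and `tick_positions` helpers are hypothetical. UTC is used here only to make the output reproducible, whereas the chart code converts to local time.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_counts(timestamps):
    """Return (date, count) pairs sorted by date."""
    days = (
        datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d')
        for ts in timestamps
    )
    return sorted(Counter(days).items())


def tick_positions(n, step=20):
    """Indices to label: every step-th point plus the last one."""
    if n == 0:
        return []
    ticks = list(range(0, n, step))
    # The last valid index is n - 1, not n.
    if ticks[-1] != n - 1:
        ticks.append(n - 1)
    return ticks


# Hypothetical Unix timestamps spread over three days.
timestamps = [1589000000, 1589003600, 1589090000, 1589180000]
daily = daily_counts(timestamps)
print(daily)
print(tick_positions(45))
```

ISO-formatted date strings sort chronologically, so a plain sort is enough to order the days.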
