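Background: identify unknown financial entities, i.e. entities that have not been seen before, in the provided financial texts.
1. Getting familiar with the data
Data used:
    import pandas as pd

    # Raw datasets
    train_df = pd.read_csv('./train.csv', encoding='utf-8')
    test_df = pd.read_csv('./test.csv', encoding='utf-8')

A sample of the data: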
2. Cleaning the data
(1) Find all symbols that are not Chinese characters, English letters, or digits
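A minimal sketch of this step, assuming "Chinese" means the CJK range U+4E00 to U+9FA5. The names `NON_ALNUM`, `find_symbols`, and `sample_texts` are hypothetical and not taken from the post:

```python
import re

# Assumption: Chinese characters are those in U+4E00-U+9FA5.
# Anything outside Chinese, ASCII letters, and ASCII digits counts as a symbol.
NON_ALNUM = re.compile(r'[^\u4e00-\u9fa5a-zA-Z0-9]')

def find_symbols(texts):
    """Return the set of characters that are not Chinese, English letters, or digits."""
    symbols = set()
    for text in texts:
        symbols.update(NON_ALNUM.findall(text))
    return symbols

# Hypothetical sample texts standing in for the real corpus.
sample_texts = ["某平台年化收益12%!", "ABC理财(北京)有限公司"]
all_symbols = find_symbols(sample_texts)
print(sorted(all_symbols))
```

On the real data the same function could be applied to the text column, for example `find_symbols(train_df['text'].fillna(''))`.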
    # Symbols that should be kept
    extra_chars = set("!#$%&\\()*+,-./:;<=>@[\\]^_`{|}~!#¥%&?《》{}“”,:‘’。()·、;【】")
    print(extra_chars)

Output:

    {')','\\','+','>','¥','‘','=','【','#',';','^','|','{','@','}','-','/',':','%','“','、','!','',']','_','】','&','~','(',')','*','?','。','[',':',';',',',',','!','.','<','’','`','(','》','·','《','”','$'}

(3) Find the differences between them
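One way to carry out this step is a set difference: the symbols found in the corpus minus the symbols to keep are treated as noise. The sketch below is an assumption, not the post's code. The post calls a function named `stop_words` but does not show its body, so the body here is hypothetical, as are `all_symbols`, `noise_chars`, and the sample values:

```python
# Hypothetical inputs: all_symbols would be collected from the corpus,
# and extra_chars stands in for the keep-list.
all_symbols = {'%', '!', '(', ')', '\u3000', '★', '\t'}
extra_chars = set("!%()")

# Characters present in the data but not on the keep-list are treated as noise.
noise_chars = all_symbols - extra_chars

# str.maketrans with a third argument maps every listed character to None (deletion).
_delete_table = str.maketrans('', '', ''.join(noise_chars))

def stop_words(text):
    """Remove noise characters from text (hypothetical body, name matches the post's call)."""
    return text.translate(_delete_table)

print(stop_words("★某平台\u3000年化收益12%!"))
```

Using a translation table keeps the cleaning step to a single pass over each string, which matters when it is applied to every row with `apply`.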
    # Concatenate title and body into a single text field
    train_df['text'] = train_df['title'].fillna('') + train_df['text'].fillna('')
    test_df['text'] = test_df['title'].fillna('') + test_df['text'].fillna('')

    # Remove noise
    train_df['text'] = train_df['text'].apply(stop_words)
    test_df['text'] = test_df['text'].apply(stop_words)
    train_df = train_df.fillna('')

Visualize train_df:
3. Exploring the data
(1) The raw data may contain some incorrect labels, which we need to find
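The post does not show how incorrect labels are detected. One plausible check, sketched here under that caveat, is to keep only entities that literally occur in the text. It assumes labels are stored as a `;`-separated string, which is the format the split check later in the post relies on. `clean_labels` and the sample rows are hypothetical:

```python
def clean_labels(text, labels):
    """Keep only the entities that literally occur in text.

    labels is a ';'-separated string. Returns the cleaned string,
    or None if no entity survives.
    """
    kept = []
    for entity in labels.split(';'):
        entity = entity.strip()
        # Drop empty entries, entities absent from the text, and duplicates.
        if entity and entity in text and entity not in kept:
            kept.append(entity)
    return ';'.join(kept) if kept else None

# Hypothetical rows standing in for train_df['text'] and train_df['unknownEntities'].
texts = ["某某宝承诺日息2%,后被证实为骗局", "小金库理财平台已跑路"]
raw_labels = ["某某宝;不存在的实体", "大金库"]

label_list = [clean_labels(t, l) for t, l in zip(texts, raw_labels)]
print(label_list)
```

Returning `None` for rows with no valid entity means those rows can then be dropped with an `isnull()` filter.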
    train_df['unknownEntities'] = label_list
    train_df = train_df[~train_df['unknownEntities'].isnull()]  # drop rows with empty labels
    train_df.to_csv('new_train_df.csv')

    new_test_df = test_df[:]  # test set
    new_test_df.to_csv('new_test_df.csv', encoding='utf-8', index=False)

(4) Look at the distribution of sentence lengths
Reload the preliminarily processed data:
Count how many texts fall into each length interval:
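A minimal sketch of such a count using `pd.cut`. The bin edges and the sample frame are assumptions chosen for illustration. In the post the input would be the `text` column of `new_train_df`:

```python
import pandas as pd

# Hypothetical texts standing in for new_train_df['text'].
df = pd.DataFrame({'text': ['a' * 4, 'b' * 300, 'c' * 800, 'd' * 1200, 'e' * 6000]})
lengths = df['text'].str.len()

# Bin edges are an assumption, not the intervals used in the post.
bins = [0, 500, 1000, 2000, 5000, float('inf')]

# pd.cut assigns each length to a right-closed interval such as (0, 500].
counts = pd.cut(lengths, bins=bins).value_counts().sort_index()
print(counts)
```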
Look at the overall summary statistics:
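Summary statistics of this kind can be obtained with `Series.describe()`. The sketch below uses hypothetical sample texts, so its numbers are unrelated to the real dataset:

```python
import pandas as pd

# Hypothetical texts standing in for new_train_df['text'].
texts = pd.Series(['a' * 4, 'b' * 300, 'c' * 800, 'd' * 1200, 'e' * 6000])

# describe() reports count, mean, std, min, the 25%/50%/75% quantiles, and max.
summary = texts.str.len().describe()
print(summary)
```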
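The maximum length is 32212, the minimum length is 4, and 75% of the texts have a length of at most 1357.
The texts are still quite long, so we need to split them into sentences: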
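    # Drop the index column that to_csv wrote as "Unnamed: 0"
    new_train_df = new_train_df.loc[:, ~new_train_df.columns.str.contains("^Unnamed")]

    # Split the data into a training set and a validation set;
    # five-fold splitting could also be tried here
    print('TrainSetSize:', new_train_df.shape)
    new_dev_df = new_train_df[4000:]  # validation set
    frames = [new_train_df[:2000], new_train_df[2000:4000]]
    new_train_df = pd.concat(frames)  # training set
    new_train_df = new_train_df.fillna('')
    new_test_df = new_test_df[:]  # test set

Likewise, the test set must be split in the same way. Note that the test set here has no labels: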
    # Split the data
    def cut_test_set(text_list):
        cut_text_list = []
        cut_index_list = []
        for text in text_list:
            temp_cut_text_list = []
            text_agg = ''
            if len(text)

Five-fold split of the data (optional)

    from sklearn.model_selection import KFold

    train_text_list = train_df['text'].values[:, None]
    train_label_list = train_df['unknownEntities'].values[:, None]
    kf = KFold(n_splits=5)
    for train_index, dev_index in kf.split(train_text_list):
        train_x, dev_x = train_text_list[train_index], train_text_list[dev_index]
        train_y, dev_y = train_label_list[train_index], train_label_list[dev_index]

Check that the split is correct:

    # Test whether the split is correct: every label must occur in its chunk
    flag = True
    for i, text in enumerate(train_cut_text_list):
        label_list = train_cut_label_list[i].split(';')
        for li in label_list:
            if li not in text:
                print(i)
                print(li)
                print(text)
                flag = False
                print()
                break
            if li == '':
                print(li)
                print(text)
                flag = False
                print()
    if flag:
        print("Training set split correctly!")
    else:
        print("Training set split incorrectly!")

    flag = True
    for i, text in enumerate(dev_cut_text_list):
        label_list = dev_cut_label_list[i].split(';')
        for li in label_list:
            if li not in text:
                print(i)
                print(li)
                print(text)
                print()
                flag = False
    if flag:
        print("Validation set split correctly!")
    else:
        print("Validation set split incorrectly!")
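The body of `cut_test_set` is cut off above. As a rough sketch of the kind of splitting it describes, and not the author's implementation, the following packs whole sentences greedily into chunks of bounded length and records which document each chunk came from. The names `split_sentences` and `cut_texts`, the punctuation set, and the default `max_len=500` are all assumptions:

```python
import re

def split_sentences(text):
    """Split after sentence-ending punctuation (ASCII and full-width), keeping it."""
    parts = re.split(r'(?<=[。!?;\uff01\uff1f\uff1b])', text)
    return [p for p in parts if p]

def cut_texts(text_list, max_len=500):
    """Greedily pack whole sentences into chunks of at most max_len characters.

    Returns (chunks, doc_index), where doc_index[i] is the position in
    text_list of the document that chunk i came from.
    """
    chunks, doc_index = [], []

    def emit(chunk, idx):
        chunks.append(chunk)
        doc_index.append(idx)

    for idx, text in enumerate(text_list):
        buffer = ''
        for sent in split_sentences(text):
            # Flush the buffer if adding this sentence would exceed the limit.
            if buffer and len(buffer) + len(sent) > max_len:
                emit(buffer, idx)
                buffer = ''
            # A single sentence longer than max_len is hard-split into pieces.
            while len(sent) > max_len:
                emit(sent[:max_len], idx)
                sent = sent[max_len:]
            buffer += sent
        if buffer:
            emit(buffer, idx)
    return chunks, doc_index

# Hypothetical documents; a tiny max_len makes the behaviour visible.
docs = ["甲公司跑路。乙平台涉嫌诈骗!丙基金暴雷?", "短文本"]
chunks, doc_index = cut_texts(docs, max_len=10)
print(chunks)
print(doc_index)
```

Keeping `doc_index` alongside the chunks makes it possible to merge per-chunk predictions back into one result per original document. For labelled data, each chunk would additionally need only those entities that occur in it, which is what the check above verifies.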