I've been in big data testing for almost two and a half years now. My day-to-day work is verifying the logical correctness of data tables. Lately I've been wondering: can data testing go a step further? Could I use my existing skills to analyze massive data sets end to end, surface data quality problems, or help data analysts find logic flaws? Could I even transition into a data analyst role? I was thinking a lot, and my thoughts were scattered, so I calmed down and stopped daydreaming: build something first, then judge whether I actually have the ability and the aptitude.

I spent a week learning Python's pandas module, working through the tutorial examples hands-on, and gradually it started to click. Once I had a basic grasp of pandas, I looked for a breakthrough point. I couldn't just keep typing example code forever; that would be pointless. I had to apply what I'd learned to the company's actual business, starting simply: analyzing a single metric of a single data table. With that in mind, I began exploring the number_of_room_nights metric of the test.test_resv001_room_daily_df table.

1. Hive SQL data preparation

```sql
-- test_sql
select hotel_code_new,
       substr(stay_date, 1, 7) as stay_date,
       sum(number_of_room_nights) as room_nts
from test.test_resv001_room_daily_df
where dt = '2021-08-31'
  and hotel_code_new in ('CNSZV002', 'CWH', 'CWSW', 'ESL', 'FIJ')
  and resv_status in ('CHECKEDIN', 'CHECKEDOUT')
  and substr(stay_date, 1, 7) in ('2019-01', '2019-02', '2019-03', '2019-04',
                                  '2019-05', '2019-06', '2019-07', '2019-08',
                                  '2019-09', '2019-10', '2019-11', '2019-12')
group by hotel_code_new, substr(stay_date, 1, 7)
order by hotel_code_new, substr(stay_date, 1, 7)
```

What the SQL does: for the hotel_code_new values CNSZV002, CWH, CWSW, ESL and FIJ, it queries the monthly totals of the number_of_room_nights metric for each month of 2019, grouped and ordered by hotel_code_new and month.

2. Code implementation

```python
from impala.dbapi import connect
import warnings
import pandas as pd
import matplotlib.pyplot as plt

test_sql = "..."  # the Hive SQL statement prepared in step 1

def hive_connect(sql):
    warnings.filterwarnings('ignore')
    config_hive_beta = {
        'host': '10.7.89.01',       # hive host address
        'port': 10000,              # hive port
        'user': 'hive001',          # hive username
        'password': 'hive001',      # hive password
        'database': 'tmp',          # database to query in hive
        'auth_mechanism': 'PLAIN',  # taken from hive's hive-site.xml config file
    }
    conn = connect(**config_hive_beta)
    cursor = conn.cursor()
    cursor.execute(sql)
    hive_all_data = cursor.fetchall()
    return hive_all_data

all_data = hive_connect(test_sql)

CNSZV002_dt = []
CWH_dt = []
CWSW_dt = []
ESL_dt = []
FIJ_dt = []
CNSZV002_data = []
CWH_data = []
CWSW_data = []
ESL_data = []
FIJ_data = []

for i in range(len(all_data)):
    if all_data[i][0] == 'CNSZV002':
        CNSZV002_data.append(all_data[i][2])
        CNSZV002_dt.append(all_data[i][1])
    elif all_data[i][0] == 'CWH':
        CWH_data.append(all_data[i][2])
        CWH_dt.append(all_data[i][1])
    elif all_data[i][0] == 'CWSW':
        CWSW_data.append(all_data[i][2])
        CWSW_dt.append(all_data[i][1])
    elif all_data[i][0] == 'ESL':
        ESL_data.append(all_data[i][2])
        ESL_dt.append(all_data[i][1])
    elif all_data[i][0] == 'FIJ':
        FIJ_data.append(all_data[i][2])
        FIJ_dt.append(all_data[i][1])

i0a = pd.MultiIndex.from_tuples([('CNSZV002', x) for x in CNSZV002_dt], names=['dt', ''])
i0b = pd.MultiIndex.from_tuples([('CWH', x) for x in CWH_dt], names=['dt', ''])
i0c = pd.MultiIndex.from_tuples([('CWSW', x) for x in CWSW_dt], names=['dt', ''])
i0d = pd.MultiIndex.from_tuples([('ESL', x) for x in ESL_dt], names=['dt', ''])
i0e = pd.MultiIndex.from_tuples([('FIJ', x) for x in FIJ_dt], names=['dt', ''])

df_0a = pd.DataFrame(CNSZV002_data, index=i0a)
df_0b = pd.DataFrame(CWH_data, index=i0b)
df_0c = pd.DataFrame(CWSW_data, index=i0c)
df_0d = pd.DataFrame(ESL_data, index=i0d)
df_0e = pd.DataFrame(FIJ_data, index=i0e)

df = pd.concat([df_0a, df_0b, df_0c, df_0d, df_0e])
df = df.unstack('dt').fillna(0)
df.columns = df.columns.droplevel()
print(df)
```

3. Structure of all_data returned from the hive database

Query result:

```python
all_data = [
    ('CNSZV002', '2019-06', 2353), ('CNSZV002', '2019-07', 2939), ('CNSZV002', '2019-08', 5148),
    ('CNSZV002', '2019-09', 3850), ('CNSZV002', '2019-10', 4973), ('CNSZV002', '2019-11', 5467),
    ('CNSZV002', '2019-12', 4742),
    ('CWH', '2019-01', 11023), ('CWH', '2019-02', 9824), ('CWH', '2019-03', 13737),
    ('CWH', '2019-04', 14603), ('CWH', '2019-05', 14061), ('CWH', '2019-06', 12839),
    ('CWH', '2019-07', 14638), ('CWH', '2019-08', 14457), ('CWH', '2019-09', 11648),
    ('CWH', '2019-10', 14387), ('CWH', '2019-11', 13758), ('CWH', '2019-12', 11461),
    ('CWSW', '2019-01', 5914), ('CWSW', '2019-02', 4434), ('CWSW', '2019-03', 6003),
    ('CWSW', '2019-04', 6611), ('CWSW', '2019-05', 6586), ('CWSW', '2019-06', 5840),
    ('CWSW', '2019-07', 6624), ('CWSW', '2019-08', 7001), ('CWSW', '2019-09', 5792),
    ('CWSW', '2019-10', 6898), ('CWSW', '2019-11', 6944), ('CWSW', '2019-12', 5404),
    ('ESL', '2019-01', 11008), ('ESL', '2019-02', 11605), ('ESL', '2019-03', 14493),
    ('ESL', '2019-04', 12231), ('ESL', '2019-05', 13571), ('ESL', '2019-06', 12307),
    ('ESL', '2019-07', 13777), ('ESL', '2019-08', 12866), ('ESL', '2019-09', 13276),
    ('ESL', '2019-10', 13223), ('ESL', '2019-11', 14580), ('ESL', '2019-12', 13050),
    ('FIJ', '2019-01', 5855), ('FIJ', '2019-02', 2660), ('FIJ', '2019-03', 3511),
    ('FIJ', '2019-04', 7763), ('FIJ', '2019-05', 7254), ('FIJ', '2019-06', 10641),
    ('FIJ', '2019-07', 11297), ('FIJ', '2019-08', 11672), ('FIJ', '2019-09', 10737),
    ('FIJ', '2019-10', 11867), ('FIJ', '2019-11', 10042), ('FIJ', '2019-12', 8412),
]
```
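The per-hotel lists, five if/elif branches, and hand-built MultiIndex objects above do work, but pandas can produce the same wide table in one step with `pivot`. A minimal sketch of that alternative, run on a small sample of the `all_data` tuples (the column labels `hotel_code_new`, `dt`, and `room_nts` are my own choices for illustration, not names returned by the cursor):

```python
import pandas as pd

# A small sample of the (hotel, month, total) tuples returned by fetchall()
all_data = [
    ('CNSZV002', '2019-06', 2353),
    ('CNSZV002', '2019-07', 2939),
    ('CWH', '2019-06', 12839),
    ('CWH', '2019-07', 14638),
]

# One pivot replaces the five if/elif branches and the manual MultiIndex work:
# months become the row index, hotel codes become the columns.
df = (pd.DataFrame(all_data, columns=['hotel_code_new', 'dt', 'room_nts'])
        .pivot(index='dt', columns='hotel_code_new', values='room_nts')
        .fillna(0))
print(df)
```

This scales to any number of hotel codes without adding branches, which also makes it easier to reuse when the SQL's `in (...)` list changes.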
4. The print(df) output from the code in step 2

```
dt       CNSZV002      CWH    CWSW      ESL      FIJ
2019-01       0.0  11023.0  5914.0  11008.0   5855.0
2019-02       0.0   9824.0  4434.0  11605.0   2660.0
2019-03       0.0  13737.0  6003.0  14493.0   3511.0
2019-04       0.0  14603.0  6611.0  12231.0   7763.0
2019-05       0.0  14061.0  6586.0  13571.0   7254.0
2019-06    2353.0  12839.0  5840.0  12307.0  10641.0
2019-07    2939.0  14638.0  6624.0  13777.0  11297.0
2019-08    5148.0  14457.0  7001.0  12866.0  11672.0
2019-09    3850.0  11648.0  5792.0  13276.0  10737.0
2019-10    4973.0  14387.0  6898.0  13223.0  11867.0
2019-11    5467.0  13758.0  6944.0  14580.0  10042.0
2019-12    4742.0  11461.0  5404.0  13050.0   8412.0
```

Manually verified: this matches the Hive SQL output.

5. Plotting the DataFrame df as trend charts

```python
df = df.cumsum()
plt.figure()
df.plot()
df.plot(kind='bar')
plt.legend(loc='best')
plt.show()
```

Calling df.plot() with no arguments produces the following line chart:

The line chart above shows, for each month, the cumulative sum of the current month's value and all earlier months' values.

Calling df.plot(kind='bar') produces the following bar chart:

The bar chart above shows the same cumulative sums.

The two charts differ only in presentation. Both show, to some degree, how the cumulative number_of_room_nights values are distributed across the 12 months of 2019 for each hotel_code_new: a simple piece of data analysis.

6. Writing the DataFrame df to a CSV file

```python
# the print(df) output above is a DataFrame
df.to_csv('room_nts.csv')
```

The content of room_nts.csv is as follows:

7. Reading the DataFrame back from the CSV file

```python
room_nts = pd.read_csv('room_nts.csv')
print(room_nts)
```

```
   Unnamed: 0  CNSZV002      CWH    CWSW      ESL      FIJ
0     2019-01       0.0  11023.0  5914.0  11008.0   5855.0
1     2019-02       0.0   9824.0  4434.0  11605.0   2660.0
2     2019-03       0.0  13737.0  6003.0  14493.0   3511.0
3     2019-04       0.0  14603.0  6611.0  12231.0   7763.0
4     2019-05       0.0  14061.0  6586.0  13571.0   7254.0
5     2019-06    2353.0  12839.0  5840.0  12307.0  10641.0
6     2019-07    2939.0  14638.0  6624.0  13777.0  11297.0
7     2019-08    5148.0  14457.0  7001.0  12866.0  11672.0
8     2019-09    3850.0  11648.0  5792.0  13276.0  10737.0
9     2019-10    4973.0  14387.0  6898.0  13223.0  11867.0
10    2019-11    5467.0  13758.0  6944.0  14580.0  10042.0
11    2019-12    4742.0  11461.0  5404.0  13050.0   8412.0
```

8. Storing the multi-dimensional DataFrame in an Excel file

```python
df.to_excel('room_nts.xlsx', sheet_name='room_nts')
```

The content of the room_nts sheet in room_nts.xlsx is as follows:
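The "Unnamed: 0" column in the read_csv output appears because to_csv wrote the row index out as an ordinary first column whose header cell is empty, so pandas invents a name for it. Passing index_col=0 restores it as the index instead. A small sketch using an in-memory buffer and a hypothetical two-row excerpt of the file (the 'dt' header is my assumption):

```python
import io
import pandas as pd

# A hypothetical two-row excerpt of room_nts.csv
csv_text = "dt,CNSZV002,CWH\n2019-01,0.0,11023.0\n2019-02,0.0,9824.0\n"

# An empty first header cell forces pandas to invent "Unnamed: 0"
with_extra_col = pd.read_csv(io.StringIO(csv_text.replace('dt', '', 1)))

# index_col=0 turns the first column back into the row index instead
room_nts = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(room_nts)
```

With index_col=0 the round trip preserves the original shape, so the months can be used for label-based lookups (room_nts.loc['2019-01']) rather than sitting in a data column.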
9. Reading the multi-dimensional DataFrame from the Excel file

```python
excel_read_result = pd.read_excel('room_nts.xlsx', sheet_name='room_nts',
                                  index_col=None, na_values=['NA'])
print(excel_read_result)
```

```
   Unnamed: 0  CNSZV002    CWH  CWSW    ESL    FIJ
0     2019-01         0  11023  5914  11008   5855
1     2019-02         0   9824  4434  11605   2660
2     2019-03         0  13737  6003  14493   3511
3     2019-04         0  14603  6611  12231   7763
4     2019-05         0  14061  6586  13571   7254
5     2019-06      2353  12839  5840  12307  10641
6     2019-07      2939  14638  6624  13777  11297
7     2019-08      5148  14457  7001  12866  11672
8     2019-09      3850  11648  5792  13276  10737
9     2019-10      4973  14387  6898  13223  11867
10    2019-11      5467  13758  6944  14580  10042
11    2019-12      4742  11461  5404  13050   8412
```

Summary

The data analysis shared today is fairly basic; it is a first attempt at combining newly learned skills with real business. I still need to keep exploring and learning, to think over what I learn and apply it to actual project work. Only then can I go further.
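Earlier the pivot table was checked against the Hive SQL output by hand; the CSV and Excel round trips can be verified automatically in the same spirit. A minimal sketch using an in-memory buffer instead of a real file and pandas' own comparison helper, pd.testing.assert_frame_equal (not used in the article itself; the two-column df below is a stand-in for the full table):

```python
import io
import pandas as pd

# A small stand-in for the wide table built in the steps above
# (values copied from the CNSZV002 and CWH columns for 2019-01/02)
df = pd.DataFrame(
    {'CNSZV002': [0.0, 0.0], 'CWH': [11023.0, 9824.0]},
    index=pd.Index(['2019-01', '2019-02'], name='dt'),
)

# With no path argument, to_csv returns the CSV text as a string
csv_text = df.to_csv()
round_trip = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Raises AssertionError if any value, label, or dtype differs
pd.testing.assert_frame_equal(df, round_trip)
print('round trip OK')
```

For a tester this is a natural habit to build: every serialization step in a pipeline gets a cheap, repeatable equality check instead of a one-off visual comparison.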