|
The original intent of MySQL's developers was to use mSQL with their own fast low-level (ISAM) routines to connect to tables. After some testing, they concluded that mSQL was neither as fast nor as flexible as they needed.

Although there are various approaches to data mining that seem to offer distinct features and benefits, many may not be powerful enough to meet your corporate knowledge discovery needs. In fact, just a few fundamental questions can quickly clarify the business benefits and the power of a data mining system, setting its advantages in a clear perspective. These questions need to be asked from the viewpoints of both business and technical users. Please note, however, that these questions refer to data mining; please also see the many benefits of the knowledge access paradigm, which uses the patterns discovered by data mining within a Pattern Warehouse(TM).

Here are two sets of "Top Ten Data Mining Questions" from business and technical perspectives. Each question has three parts that together highlight one specific aspect of a data mining system's power and capability.

The Top Ten Data Mining Business Questions

The top ten business questions should be asked by business users about the benefits, quality and usability of the system. They are:

Question 1: Business Benefits
a) How will this system help us?
b) How well does this system work for our industry-specific applications?
c) What information can we get that we do not already have?

It is essential to ask this question again and again. You should, of course, get new refined information, but it is not enough just to know something: you should have information that allows you to "act" within the context of your industry. You should also measure the bottom-line dollar benefits delivered by a data mining system. See the paper "Measuring the Dollar Value of Mined Information" for a framework for this.

Question 2: Technical Know-how
a) How technically sophisticated do we need to be to use it?
b) Can business users operate it without calling the IS group all the time?
c) Is it as easy to use as an internet browser?

Business users should be empowered with direct, on-demand access to refined knowledge. They should not have to know statistics, yet should be given consistent and correct answers. The system interface should be as easy to use as a web browser.

Question 3: Understandability and Explanations
a) Are the results intuitive or difficult to understand?
b) Do we get clear explanations for any information item presented?
c) Will the explanations be in technical statistical terms or in a form that we can understand?

Results should be presented to business users in plain English, accompanied by graphs. The system should be able to explain each piece of information it presents in clear, English-like terms that business users can easily comprehend and use.

Question 4: Follow-up Questions
a) What kinds of follow-up questions can we ask the system?
b) Do we need to go to an analyst for further question answering?
c) How fast can we drill down on the fly to see more patterns?

Responses to follow-up questions must be immediate. Business users should not need intermediaries such as analysts to get more information after they have seen some results. If follow-up questions take time and involve intermediaries, the business user's effectiveness will suffer. Business users should get refined information as they need it, when they need it.

Question 5: Business Users
a) How many business users can this system support?
b) Can the business users tailor their own questions for the system?
c) Can users utilize the knowledge for day-to-day decision making?

The system should be able to use the same fundamental knowledge to support a few hundred business users, each with a different group perspective. Yet all of these users must be given consistent answers as they ask their own questions. The information must be presented so that it can be used for day-to-day actions.

Question 6: Accuracy, Completeness and Consistency
a) How accurate are the results the system delivers?
b) Can some patterns be missed by the system?
c) Are the results always consistent, or can 100 users get 100 different answers?

The system must cover a wide range of patterns and should provide high-quality information. The knowledge provided to business users should be derived from the entire data set (and not from samples) in order to increase accuracy. All business users should access the same knowledge so that they all receive consistent answers, increasing the quality of corporate information.

Question 7: Incremental Analysis
a) Can we automatically analyze weekly/monthly data as it becomes available?
b) Can the system compare the "month to month" results and patterns by itself?
c) Can we get automatic pattern detection over time, every week or month?

The system should analyze data as it becomes available every week or month and perform ongoing trend analysis, highlighting the key items and influence factors behind significant changes. The incremental analysis should be performed automatically in the background, informing the user of significant trends and their underlying causes.

Question 8: Data Handling
a) How much data can the system deal with?
b) Can it work directly on our database, or do we need to extract data?
c) If it works on extracts, how do we know that some patterns are not missed?

The system should handle moderate to large volumes of data on a powerful server; of course, large data volumes should not be expected to be managed on small servers. The system should work directly on the SQL database, without extracts, so that patterns are not missed and performance is improved.

Question 9: Integration
a) How will it integrate into our computing environment?
b) Will it just work on our existing SQL database?
c) How easily will the system work on our intranet?

The system should run smoothly on existing open server platforms (e.g. Unix) and popular DBMS engines (e.g. Oracle, Sybase, Informix, etc.) on the server. The system should present results to users on the corporate intranet. The absence of data conditioning requirements and extract files will make integration much easier.

Question 10: Support Staff
a) What staff do I need to keep this system installed and running?
b) How do we get support and training to get started?
c) What happens after we install the system?

After the initial system design, the support personnel for the system should be kept minimal. One database administrator should be able to manage the DBMS, and one analyst should occasionally help in setting up discovery models, etc. Thereafter, business users should be able to use the system on their own. There should be no need for a large number of resident support analysts to act as intermediaries for the business users.

The Top Ten Data Mining Technical Questions

The top ten technical questions should be asked by technical users about the architecture, power and scalability of the system. They are:

Question 1: Architecture
a) How are computations distributed between the client and the server?
b) Is any data brought from the server to the client?
c) Can the system run in a three-tiered architecture?

The best option is for the discovery to take place entirely on the server. Any attempt to bring data to the client will seriously limit the applicability of the system to larger databases. The best architecture is a thin-client, three-tiered system that uses the power of a large server-based SQL engine but operates on an intranet.

Question 2: Access to Real Data
a) Does the system work on the real SQL database or on samples and extracts?
b) If it samples or extracts, how do we know that it is accurate?
c) If it builds flat files, who manages this activity and cleans up for ongoing analyses, and how can it sample across several tables?

The best option is for a data mining system to work on the real databases and not on samples, extracts and/or flat files. Working on the real database uses the SQL engine's power (e.g. parallel execution) and provides much more accurate results. The system should also be able to access database tables in their native form, reaching across tables by itself.

Question 3: Performance and Scalability
a) How large a database can the system analyze?
b) How long does it take to perform discovery on a large database?
c) Can the system run in parallel on a multi-processor server?

The system should work on databases with a large number of records. It should derive its capabilities from the power of the server and the SQL engine whenever possible. The system should be able to use the built-in parallelism of the SQL engine, but should also be able to use multiple processors for its own parallel non-SQL computations.

Question 4: Multi-Table Databases
a) Does the system work on a single table only, or can it analyze multiple tables?
b) Does the system need to perform a huge join to access all of our tables?
c) If it works on a single table, how can we feed it our existing data schema?

The real world is full of multi-table databases which cannot be joined and meshed into a single view. In fact, the theory of normalization came about because data needs to be in more than one table. Using single tables is an affront to a decade of work on database design. If you challenge the DBA of a really large database to put things in a single table, you will get either a laugh or a blank stare; in many cases the database size will balloon beyond control. The system should be able to mine large multi-table databases directly by itself on the server.

Question 5: Multi-Dimensional Analysis
a) Does the system analyze data along a single dimension only?
b) How are multi-dimensional patterns discovered and expressed by the system?
c) How do we specify the dimensional structure of our data to the system?

The OLAP phenomenon has conclusively demonstrated that the business world's data is not single-dimensional. Hence a data mining system should be able to automatically discover patterns along multiple dimensions. In fact, there are many cases where no single-dimensional view can correctly represent the semantics of influence, because the influence ratios will always be off regardless of how one aggregates. See the paper "OLAP & Data Mining: Bridging the Gap" for a detailed discussion of this.

Question 6: Types and Classes of Patterns Discovered
a) How powerful and general are the patterns the system can discover and express?
b) Can the system mix different pattern types, e.g. influence and affinity patterns?
c) Can the system discover time-based patterns and trends?

The format of the patterns discovered by the system is very general and goes far beyond decision trees or simple affinities. The advantage of this is that the general rules discovered are far more powerful than decision trees. Decision trees are very limited in that they cannot find all the information in a database. Being rule-based keeps the system from being constrained to one part of a search space and makes sure that many more clusters and patterns are found, allowing the system to provide more information and better predictions.

Question 7: System Initiative
a) Does the system use its own initiative to perform discovery, or is it guided by the user?
b) Can the system discover unexpected patterns by itself?
c) Can the system start up by itself on a weekly or monthly basis and perform discovery?

In some cases the user has to interact with and guide the system, e.g. to build a decision tree. However, a better approach is for the system to use its own initiative in the data mining process, forming hypotheses automatically based on the character of the data. The system should start up by itself, select the significant patterns in the data and filter out the unimportant trends. The analyses should be done routinely on a weekly or monthly basis.

Question 8: Treatment of Data Types
a) Are all data types handled in their own form or translated to other types?
b) Can the system find numeric ranges in data by itself?
c) Do a large number of non-numeric values cause problems for the system?

The system should manage all data types in a uniform manner and in their native formats, i.e. numbers, dates and constants should remain numbers, dates and constants internally. Interesting ranges in the data should be discovered by the system, without requiring "number bin" construction by the user. A large number of constant values in the database should not choke the system.

Question 9: Data Dependencies and Hierarchies
a) Can the system be told about the functional dependencies in our database?
b) Does the system understand the concept of data hierarchy?
c) How does the system use dependencies and/or hierarchies for discovery?

The system should be capable of using the functional (and other) dependencies that exist in a database. The use of these dependencies can significantly enhance the power of a discovery; in fact, ignoring them can lead to confusion. The system should understand the concept of hierarchy and should be able to use it for discovery along multiple dimensions.

Question 10: Flexibility and Noise Sensitivity
a) How brittle is the system when dealing with noisy data?
b) How well does the system cope with data exceptions and low-quality data?
c) Can the system provide statements with flexible numeric ranges discovered by itself in the data?

The system should not be sensitive to noise and should internally use fuzzy logic to smooth data brittleness. As the data gathers noise, the system should only reduce the level of confidence associated with the results provided, not suddenly change direction in discovery. The system should still produce the most significant findings from the data set, even if noise is present.

An index is a special kind of file (for InnoDB tables, the index is part of the tablespace) that holds reference pointers to all the records in a table. Indexes are not a cure-all: they speed up data retrieval but slow down data modification, because the index must be refreshed every time a record is modified. |
|
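The index trade-off described above (faster lookups, extra work on every modification) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in rather than MySQL or InnoDB, and the table name `scores`, the column names and the index name `idx_scores_student` are hypothetical. The exact wording of the query plan varies between SQLite versions, so the sketch only prints it.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable plan steps SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the plan description.
    return [row[-1] for row in rows]


# Hypothetical table used only for this illustration (SQLite, not InnoDB).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (id INTEGER PRIMARY KEY, student TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO scores (student, score) VALUES (?, ?)",
    [(f"student{i}", i % 100) for i in range(1000)],
)

lookup = "SELECT id FROM scores WHERE student = ?"

# Without an index on `student`, the lookup has to scan the whole table.
plan_before = query_plan(conn, lookup, ("student42",))

# With an index, the engine can search the index instead of scanning.
conn.execute("CREATE INDEX idx_scores_student ON scores (student)")
plan_after = query_plan(conn, lookup, ("student42",))

# Modifying an indexed column means the index entry is maintained as well;
# that maintenance is the extra write cost the text refers to.
conn.execute("UPDATE scores SET student = ? WHERE id = ?", ("renamed", 43))
found_new = conn.execute(lookup, ("renamed",)).fetchall()
found_old = conn.execute(lookup, ("student42",)).fetchall()

print("plan before index:", plan_before)
print("plan after index: ", plan_after)
print("lookup by new name:", found_new)
print("lookup by old name:", found_old)
```

After the update, the lookup by the new name finds the row through the index and the old name no longer matches, which shows the index being kept in step with the data on each write.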