Use PHP to take Web data analysis to a higher level



Author: Paul Meagher  Joined: 2003-12-07  Views: 129

Design your data analysis to do more than simple raw counts

Effective, multi-level analysis of Web data is a key factor in the survival of Web-based companies. The design of (and decisions about) data analysis tests usually falls to system administrators and in-house application designers, who may have little understanding of statistics beyond raw counts. In this article, Paul Meagher introduces Web developers to the skills and concepts required to apply inferential statistics to Web data streams.

Dynamic Web sites constantly generate large amounts of data: access logs, poll and survey results, customer profile information, orders, and more. Web developers are expected not only to create the applications that generate this data, but also to develop applications and methods that make sense of it. Typically, their response falls short of the growing data analysis demands that come with managing a site. Beyond reporting various descriptive statistics, Web developers generally have no better way to characterize a data stream. Many inferential statistical procedures (methods for estimating population parameters from sample data) could be put to good use, but they are not being applied.

For example, Web access statistics, as currently compiled, are little more than frequency counts grouped in various ways. Poll and survey results are all too often reported only as raw counts and percentages.

Perhaps it is enough for developers to treat the statistical analysis of data streams in this relatively shallow way, and we should not expect more. After all, there are professionals who handle more sophisticated data stream analysis: statisticians and trained analysts. They can be brought in when an organization needs more than descriptive statistics.

The other response is to acknowledge that a growing familiarity with inferential statistics is becoming part of the Web developer's job description. Dynamic sites generate more and more data, and in practice it falls to Web developers and system administrators to turn that data into useful knowledge.

I advocate the latter response. This article aims to help Web developers and system administrators learn (or revisit, if the knowledge has faded) the design and analysis skills required to apply inferential statistics to Web data.

Web data and experimental design

Applying inferential statistics to Web data streams takes more than learning the mathematics behind various statistical tests. Just as important is the ability to relate the data collection process to key distinctions in experimental design: What is the scale of measurement? How representative is the sample? What is the population? What hypothesis is being tested?

To apply inferential statistics to a Web data stream, you need to view the results as the outcome of an experimental design and then choose the analysis procedure suited to that design. Even if it seems excessive to treat Web polls and access log data as experiments, doing so is important for two reasons:

1. It helps you choose an appropriate statistical test.
2. It helps you draw appropriate conclusions from the data you collect.

An important aspect of experimental design is the choice of measurement scale for data collection.

Measurement scales

A measurement scale is simply a procedure for assigning symbols, letters, or numbers to the phenomenon of interest. For example, the kilogram scale lets you assign numbers to an object to indicate its weight, based on the standardized displacement of a measuring instrument.

There are four important measurement scales:

Ratio - The kilogram scale is an example of a ratio scale. The symbols assigned to an object's attributes have numeric meaning. You can perform operations on them (such as computing ratios) that you cannot perform on values obtained with less powerful measurement scales.

Interval - On an interval scale, the distance (or interval) between any two adjacent units of measurement is equal, but the zero point is arbitrary. Examples include measurements of longitude and tidal height, and the numbering of calendar years. Interval values can be added and subtracted, but multiplying or dividing them is not meaningful.

Ordinal (rank) - An ordinal scale applies to a set of ordered data, that is, values and observations belonging to a ranking or rating scale. A common example is a like/dislike poll that assigns numbers to attitudes (from 1 = strongly dislike to 5 = strongly like). Ordered categories usually have a natural order, but the gaps between adjacent points on the scale need not be equal. You can count and rank ordinal data, but you cannot measure it.

Nominal - The nominal scale is the weakest form of measurement. It amounts to assigning items to groups or categories. The measurement carries no quantitative information and implies no ordering of the items. The main numerical operation on nominal data is the frequency count of each category.

The following table compares the characteristics of each measurement scale:

Measurement scale   Numbers have absolute meaning?               Mathematical operations allowed
Ratio               Yes                                          Most mathematical operations
Interval            Intervals only; the zero point is arbitrary  Addition and subtraction
Ordinal             No                                           Counting and ranking
Nominal             No                                           Counting only

In this article, I focus on data collected with a nominal scale of measurement and on inferential techniques suited to nominal data.

Almost everyone involved with the Web (designers, customers, and system administrators) is familiar with the nominal scale. Web polls and access logs are alike in that they often use a nominal scale as their measurement scale. In a Web poll, you typically measure people's preferences by asking them to choose among response options (for example, "Do you prefer brand A, brand B, or brand C?") and summarize the data by counting the frequency of each response. Similarly, a common way to measure site traffic is to assign each hit or visit to a day of the week and then count the hits or visits that fall on each day. You can also count hits by browser type, by operating system type, by the visitor's country or region, or by any other classification scheme you care to use.

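The frequency counting described above can be sketched in a few lines of PHP. This is a minimal illustration, not part of the article's software: the function name summarizeNominal and the sample responses are hypothetical, and the only library call relied on is PHP's built-in array_count_values.

```php
<?php
/**
 * Summarize nominal data as a frequency count and percentage per category.
 * (Hypothetical helper for illustration only.)
 */
function summarizeNominal(array $responses): array
{
    $counts  = array_count_values($responses); // category => frequency
    $total   = count($responses);
    $summary = [];
    foreach ($counts as $category => $count) {
        $summary[$category] = [
            'count'   => $count,
            'percent' => round(100 * $count / $total, 2),
        ];
    }
    return $summary;
}

// Made-up poll responses: one category label per vote.
$responses = ['Keiths', 'Olands', 'Keiths', 'Other', 'Schooner', 'Keiths', 'Olands', 'Other'];

$summary = summarizeNominal($responses);
foreach ($summary as $category => $row) {
    printf("%-10s %3d (%.2f%%)\n", $category, $row['count'], $row['percent']);
}
```

The same function would work for access-log data if each hit is first mapped to a category label such as a day of the week or a browser type.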
Because Web polls and access statistics involve counting the number of times data falls into particular categories, they can be analyzed with similar nonparametric statistical tests (tests that let you make inferences based on the shape of a distribution rather than on population parameters).

In his Handbook of Parametric and Nonparametric Statistical Procedures (1997, p. 19), David Sheskin distinguishes parametric from nonparametric tests as follows: the distinction between procedures classified as parametric and as nonparametric is based primarily on the level of measurement represented by the data being analyzed. As a general rule, tests that evaluate categorical/nominal data and ordinal/rank-order data are classified as nonparametric tests, while tests that evaluate interval or ratio data are classified as parametric tests.

Nonparametric tests are also useful when some of the assumptions underlying a parametric test are in doubt; when parametric assumptions are not met, nonparametric tests can be more powerful at detecting population differences. For the Web poll example, I use a nonparametric analysis procedure because Web polls usually record voters' preferences on a nominal scale.

I am not suggesting that Web polls and Web access statistics should always use a nominal scale of measurement, or that nonparametric statistics are the only way to analyze such data. It is easy to imagine polls and surveys that, for example, ask users to give a numeric rating (from 1 to 100) for each option, for which parametric statistical tests would be more appropriate.

Nevertheless, many Web data streams consist of category counts, and data measured on a more powerful scale can be turned into nominal data by defining an interval (such as ages 17 to 21) and assigning each data point that falls within it to a category (such as "young adult"). The ubiquity of frequency data, already part of a Web developer's experience, makes nonparametric statistics a good starting point for learning how to apply inferential techniques to data streams.

To keep this article to a reasonable length, I limit the discussion of Web data stream analysis to Web polls. Keep in mind, however, that many Web data streams can be represented as nominal counts, and that the inferential techniques I discuss let you do more than report simple count data.

From sample to population

Suppose you run a weekly poll on your site, www.novascotiabeerdrinkers.com, asking members for their opinions on various topics.
You have created a poll that asks members for their favorite brand of beer. There are three well-known beer brands in Nova Scotia: Keiths, Olands, and Schooner. To make the poll as inclusive as possible, you include "Other" among the response options.

You receive 1,000 responses and observe the results in Table 1. (The results shown in this article are for demonstration purposes only and are not based on any actual survey.)

Table 1. Beer poll

Keiths         Olands         Schooner       Other
285 (28.50%)   250 (25.00%)   215 (21.50%)   250 (25.00%)

These data seem to support the conclusion that Keiths is the most popular brand among Nova Scotia residents. Can you draw that conclusion from these numbers? In other words, can you make inferences about the population of Nova Scotia beer drinkers based on results obtained from the sample?

Many factors related to how the sample was collected could make such an inference about relative popularity incorrect. Perhaps the sample contains a disproportionate number of Keiths brewery employees; perhaps you did not fully prevent multiple voting, and one person biased the results; perhaps the people who chose to vote differ from those who chose not to; perhaps voters who are online differ from those who are not.

Most Web polls suffer from these difficulties of interpretation, which arise whenever you try to draw conclusions about population parameters from sample statistics. From an experimental design perspective, one of the first questions to ask before collecting data is: Can you take steps to help ensure that the sample is representative of the population under study?

If drawing conclusions about the population under study is your motivation for running a Web poll (rather than entertaining site visitors), then you should implement techniques that ensure one vote per person (so voters must log in with a unique identity to vote) and that the sample of voters is randomly selected (for example, randomly select a subset of members and send them an e-mail encouraging them to vote). Ultimately, the goal is to eliminate, or at least reduce, the various biases that could weaken your ability to draw conclusions about the population.

Testing hypotheses

Suppose the sample of Nova Scotia beer drinkers is unbiased. Can you now conclude that Keiths is the most popular brand?

To answer this, consider a related question: If you were to obtain another sample of Nova Scotia beer drinkers, would you expect to see exactly the same results? In practice, you would expect some variation in the results observed from sample to sample.

Given this expected sampling variability, you might wonder whether random sampling variability explains the observed differences better than a real difference in the population does. In statistical terms, the sampling-variability explanation is called the null hypothesis (commonly denoted H0). In this example, it is the statement that the expected number of responses is the same across all response categories:

H0: #Keiths = #Olands = #Schooner = #Other

If you can rule out the null hypothesis, you have made some progress toward the original question of whether Keiths is the most popular brand. The alternative hypothesis you would then accept is that the proportions of the various responses differ in the population.

This "test the null hypothesis first" logic applies at several stages of poll data analysis. Having ruled out this null hypothesis, so that you know the response frequencies are not all the same, you can go on to test a more specific null hypothesis, such as that there is no difference between Keiths and Schooner, or between Keiths and all other brands.

You proceed by testing the null hypothesis rather than evaluating the alternative hypothesis directly because it is easier to build a statistical model of what you would expect to observe under the null hypothesis. Next, I demonstrate how to model what is expected under the null hypothesis so that the observed results can be compared with the results expected under null conditions.

Modeling the null hypothesis: the chi-square statistic

So far, you have summarized the Web poll results with a table that reports the frequency count (and percentage) of each response option. To test the null hypothesis (that there is no difference among the cell frequencies), you compute an overall measure of how far each cell deviates from the value you would expect under null conditions.

In the beer poll example, the expected frequency under the null hypothesis is:

Expected frequency = Number of observations / Number of response options
Expected frequency = 1000 / 4
Expected frequency = 250

To measure how much the responses in each cell differ overall from the expected frequency, you could add up all the differences to get a single figure reflecting how far the observed frequencies are from the expected frequencies: (285 - 250) + (250 - 250) + (215 - 250) + (250 - 250).

If you do this, you will find that the sum is 0, because deviations from the mean always sum to 0. To get around this problem, square each difference (this is where the "square" in chi-square comes from).

Finally, to make the value comparable across samples with different numbers of observations (in other words, to standardize it), divide each squared difference by the expected frequency. The formula for the chi-square statistic is therefore as follows, where O stands for "observed frequency" and E for "expected frequency":

Figure 1. Formula for the chi-square statistic

\chi^2 = \sum \frac{(O - E)^2}{E}

Computing the chi-square statistic for the beer poll yields a value of 9.80. To test the null hypothesis, you need to know the probability of obtaining a value this extreme under random sampling variability alone. To find that probability, you need to know what the sampling distribution of chi-square looks like.

The sampling distribution of chi-square

Figure 2. Chi-square plots

In each plot, the horizontal axis shows the size of the obtained chi-square value (ranging from 0 to 10). The vertical axis shows the probability (or relative frequency of occurrence) of each chi-square value.

As you study these plots, note that the shape of the probability function changes as you change the degrees of freedom (df) of the experiment. For poll data, the degrees of freedom are calculated by taking the number of response options (k) in the poll and subtracting 1 (df = k - 1).

In general, as you increase the number of response options in the experiment, the probability of obtaining larger chi-square values goes up. This is because adding response options adds to the number of variance values, (observed - expected)^2, being summed. So as you add response options, the probability of obtaining a large chi-square value increases and the probability of obtaining a small one decreases.

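The calculation described above can be sketched as follows. This is an illustrative sketch, not the article's ChiSquare1D class: the function names chiSquareObtained and chiSquareTailProb are hypothetical, and the probability of a value at least as extreme as the one obtained is computed here with a standard power series for the incomplete gamma function, which is an assumption about method rather than the approach taken in the article's Distribution.php.

```php
<?php
/**
 * One-way chi-square statistic: the sum of (O - E)^2 / E over all cells.
 * With no expected frequencies given, the equal-frequency null hypothesis
 * is assumed (E = total / number of cells).
 */
function chiSquareObtained(array $observed, ?array $expected = null): float
{
    $observed = array_values($observed);
    $k = count($observed);
    $expected = ($expected === null)
        ? array_fill(0, $k, array_sum($observed) / $k)
        : array_values($expected);

    $chiSq = 0.0;
    foreach ($observed as $i => $o) {
        $chiSq += ($o - $expected[$i]) ** 2 / $expected[$i];
    }
    return $chiSq;
}

/**
 * Upper-tail probability P(chi-square >= $x) for $df degrees of freedom,
 * using the power series for the regularized lower incomplete gamma function.
 */
function chiSquareTailProb(float $x, int $df): float
{
    if ($x <= 0.0) {
        return 1.0;
    }
    $a = $df / 2;
    $z = $x / 2;

    // log Gamma(a + 1), built up from Gamma(1) = 1 (even df)
    // or Gamma(1/2) = sqrt(pi) (odd df) with Gamma(t + 1) = t * Gamma(t).
    $isEven   = ($df % 2 === 0);
    $logGamma = $isEven ? 0.0 : 0.5 * log(M_PI);
    for ($t = $isEven ? 1.0 : 0.5; $t <= $a; $t += 1.0) {
        $logGamma += log($t);
    }

    // P(a, z) = z^a e^(-z) / Gamma(a + 1) * sum over n of z^n / ((a+1)...(a+n))
    $term = 1.0;
    $sum  = 1.0;
    for ($n = 1; $n <= 1000; $n++) {
        $term *= $z / ($a + $n);
        $sum  += $term;
        if ($term < 1e-15 * $sum) {
            break;
        }
    }
    $lowerTail = exp($a * log($z) - $z - $logGamma) * $sum;

    return max(0.0, 1.0 - $lowerTail);
}

$observed = [285, 250, 215, 250];      // counts from Table 1
$df       = count($observed) - 1;      // df = k - 1
$chiSq    = chiSquareObtained($observed);

printf("chi-square = %.2f, df = %d, tail probability = %.2f\n",
    $chiSq, $df, chiSquareTailProb($chiSq, $df));
```

For the beer poll counts, this sketch reproduces the chi-square value of 9.80 on 3 degrees of freedom. The series is adequate for the moderate values seen in poll data; a production implementation would more likely rely on a tested statistics library.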
Because more terms are summed as response options are added, the shape of the chi-square sampling distribution changes with the value of df. Note also that what matters is usually not the probability of the exact chi-square value obtained but the total area under the curve to the right of the obtained value. This tail probability tells you whether obtaining a value as extreme as the one you observed is likely (a large tail area) or unlikely (a small tail area). (In practice, I do not use these plots to compute tail probabilities, because a mathematical function can return the tail probability for a given chi-square value. That is the approach taken in the chi-square software discussed later in this article.)

To understand how these plots are derived, consider how you might simulate the plot corresponding to df = 2 (that is, k = 3). Imagine putting the numbers 1, 2, and 3 in a hat, shaking it, drawing a number, and recording the number drawn as one trial. Run 300 trials of this experiment and count how often 1, 2, and 3 come up.

Each time you run the experiment, you should expect a slightly different frequency distribution, reflecting sampling variability, without the distribution straying outside the range of what is probable.

The following Multinomial class implements this idea. You initialize the class with the following values: the number of experiments to run, the number of trials per experiment, and the number of options per trial. The result of each experiment is recorded in an array called Outcomes.

Listing 1. Contents of the Multinomial class

<?php
// Multinomial.php
class Multinomial {
    public $NExps;              // number of experiments to run
    public $NTrials;            // number of trials per experiment
    public $NOptions;           // number of options per trial
    public $Outcomes = array(); // frequency counts for each experiment

    public function __construct($NExps, $NTrials, $NOptions) {
        $this->NExps    = $NExps;
        $this->NTrials  = $NTrials;
        $this->NOptions = $NOptions;
        for ($i = 0; $i < $this->NExps; $i++) {
            $this->Outcomes[$i] = $this->runExperiment();
        }
    }

    public function runExperiment() {
        // Start every option at zero so that each count is defined
        $Outcome = array_fill(1, $this->NOptions, 0);
        // One random choice per trial
        for ($i = 0; $i < $this->NTrials; $i++) {
            $choice = rand(1, $this->NOptions);
            $Outcome[$choice]++;
        }
        return $Outcome;
    }
}
?>

Note that the runExperiment method is a very important part of this script: it ensures that the choice made in each trial is random and keeps track of which options were chosen in the simulated experiment.

To find the sampling distribution of the chi-square statistic, simply take the outcome of each experiment and compute its chi-square statistic. Because of random sampling variability, the chi-square statistic varies from experiment to experiment.

The following script writes the chi-square statistic obtained for each experiment to an output file so that it can be charted later.

Listing 2. Writing the obtained chi-square statistics to an output file

... Outcomes[$i]);
    // Load obtained chi-square value into sampling distribution array
    $distribution[$i] = $chi->ChiSqObt;
    // Write obtained chi-square value to file
    fputs($output, $distribution[$i] . "\n");
}
fclose($output);
?>

To visualize the results of running this experiment, the simplest approach for me was to load the data.txt file into the open source statistics package R, run the histogram command, and edit the chart in a graphics editor, as follows:

x = scan("data.txt")
hist(x, 50)

As you can see, the histogram of these chi-square values approximates the continuous chi-square distribution for df = 2.

Figure 3. Approximation of the continuous distribution for df = 2

In the following sections, I focus on explaining how the chi-square software used in this simulation works. Normally, chi-square software is used to analyze real nominal data (such as Web poll results, weekly traffic reports, or customer brand preference reports) rather than the simulated data used here. You may also be interested in other output the software generates, such as summary tables and tail probabilities.

Chi-square instance variables

The PHP-based chi-square package I developed consists of classes for analyzing frequency data classified along one or two dimensions (ChiSquare1D.php and ChiSquare2D.php). I limit my discussion to how the ChiSquare1D.php class works and how to apply it to one-dimensional Web poll data.

Before proceeding, I should note that classifying data along two dimensions (for example, beer preference by gender) lets you begin to explain your results by looking for systematic relationships or conditional probabilities in the contingency table cells.

Although much of this discussion will help you understand how the ChiSquare2D.php software works, there are additional experimental, analytic, and visualization issues, not discussed here, that you need to deal with before using that class.

Listing 3 shows a fragment of the ChiSquare1D.php class, which consists of:

1. An included file
2. A list of class instance variables

Listing 3. Fragment of the chi-square class with the included file and instance variables

The top of this script includes a file called Distribution.php. The include path is set in the init.php file, which is assumed to have been included in the calling script.

The Distribution.php file contains methods for generating sampling distribution statistics for several commonly used sampling distributions (the t, F, and chi-square distributions). The ChiSquare1D.php class needs access to the chi-square methods in Distribution.php to compute the tail probability of an obtained chi-square value.

The list of instance variables in this class is worth noting because they define the result object produced by the analysis. This result object contains all the important details of the test, including three important chi-square statistics: ChiSqObt, ChiSqProb, and ChiSqCrit. To see how each instance variable is computed, consult the class's constructor method, where all of these values originate.

Constructor: the backbone of the chi-square test

Listing 4 gives the code of the chi-square constructor, which forms the backbone of the chi-square test.

Listing 4. The chi-square constructor

...
        $this->ObsFreq   = $ObsFreq;
        $this->ExpProb   = $ExpProb;
        $this->Alpha     = $Alpha;
        $this->NumCells  = count($this->ObsFreq);
        $this->DF        = $this->NumCells - 1;
        $this->Total     = $this->getTotal();
        $this->ExpFreq   = $this->getExpFreq();
        $this->ChiSqObt  = $this->getChiSqObt();
        $this->ChiSqCrit = $this->getChiSqCrit();
        $this->ChiSqProb = $this->getChiSqProb();
        return true;
    }
}
?>

Four aspects of the constructor are worth noting:

1. The constructor accepts an array of observed frequencies, an alpha probability cutoff score, and an optional array of expected probabilities.
2. The first six lines involve relatively simple assignments and computed values that are recorded so that a complete result object is available to the calling script.
3. The last four lines do the bulk of the work of obtaining the chi-square statistics, which are what you are most interested in.
4. The class implements only the chi-square test logic. No output methods are associated with it.

You can study the class methods included in the code download for this article to learn more about how each value in the result object is computed (see Resources).

Handling output issues

The code in Listing 5 shows how easy it is to perform a chi-square analysis with the ChiSquare1D.php class. It also demonstrates how output issues are handled.

The script invokes a wrapper script called ChiSquare1D_HTML.php. The purpose of this wrapper script is to separate the logic of the chi-square procedure from its presentation. The _HTML suffix indicates that the output is intended for a standard Web browser or another device that displays HTML.

Another purpose of the wrapper script is to organize the output in a way that is easy to understand. To this end, the class contains two methods for displaying the results of the chi-square analysis.
The showTableSummary method displays the first output table shown after the code (Table 2), while showChiSquareStats displays the second output table (Table 3).

Listing 5. Organizing output with the wrapper script

... showTableSummary($Headings);
echo "

";
$Chi->showChiSquareStats();
?>

The script generates the following output:

Table 2. Expected frequencies and variances obtained by running the wrapper script

           Keiths   Olands   Schooner   Other   Total
Observed   285      250      215        250     1000
Expected   250      250      250        250     1000
Variance   4.90     0.00     4.90       0.00    9.80

Table 3. Chi-square statistics obtained by running the wrapper script

Statistic    DF   Obtained   Probability   Critical value
Chi-square   3    9.80       0.02          7.81

Table 2 shows the expected frequency and the variance measure, (O - E)^2 / E, for each cell. The sum of the variance values equals the obtained chi-square value (9.80), which is displayed in the lower-right cell of the summary table.

Table 3 reports various chi-square statistics. It includes the degrees of freedom used in the analysis and again reports the obtained chi-square value. The obtained chi-square value is re-expressed as a tail probability, 0.02 in this example. This means that, under null conditions, the probability of observing a chi-square value as extreme as 9.80 is 2 percent (a fairly low probability).

If you decide to reject the null hypothesis (that the results can be attributed to random sampling variability from the null distribution), most statisticians would not argue with you. It is more likely that your poll results reflect a real difference in brand preference among the population of Nova Scotia beer drinkers.

To confirm this conclusion, you can compare the obtained chi-square value with the critical value.

Why is the critical value important? The critical value is based on the significance level (that is, the alpha cutoff level) set for the analysis. By convention, the alpha cutoff is set to 0.05 (the value used in the analysis above). This setting is used to find the location (or critical value) in the chi-square sampling distribution beyond which the tail area equals the alpha cutoff (0.05).

In this analysis, the obtained chi-square value is greater than the critical value. This means that the threshold for retaining the null hypothesis has been exceeded. The alternative hypothesis, that the proportions differ in the population, is statistically more likely to be correct.

In automated analysis of data streams, the alpha cutoff setting could be used to set an output filter for knowledge-discovery algorithms (such as Chi-square Automatic Interaction Detection, CHAID), which by themselves provide no detailed guidance in discovering truly useful patterns.

Re-polling

Another interesting application of the one-way chi-square test is re-polling to find out whether people's responses have changed.

Some time later, you decide to run another Web poll of Nova Scotia beer drinkers.

You again ask for their favorite beer brand and observe the following results:

Table 4. New beer poll

Keiths         Olands         Schooner       Other
385 (27.50%)   350 (25.00%)   315 (22.50%)   350 (25.00%)

The old data looked like this:

Table 1. Old beer poll (repeated)

Keiths         Olands         Schooner       Other
285 (28.50%)   250 (25.00%)   215 (21.50%)   250 (25.00%)

The obvious difference between the poll results is that the first poll had 1,000 respondents while the second had 1,400. The main effect of these additional respondents is that the frequency count of each response option increased by 100.

To analyze the new poll, you could use the default method of computing expected frequencies. Alternatively, you can initialize the analysis with the expected probability of each outcome, based on the proportions observed in the previous poll. In the second case, you load the previously obtained proportions into an expected-probability array ($ExpProb) and use them to compute the expected frequency of each response option.

Listing 6 shows the beer poll analysis code for detecting a change in preferences:

Listing 6. Detecting a change in preferences

... showTableSummary($Headings);
echo "

";
$Chi->showChiSquareStats();
?>

Tables 5 and 6 show the HTML output generated by the beer_repoll_analysis.php script:

Table 5. Expected frequencies and variances obtained by running beer_repoll_analysis.php

           Keiths   Olands   Schooner   Other   Total
Observed   385      350      315        350     1400
Expected   399      350      301        350     1400
Variance   0.49     0.00     0.65       0.00    1.14

Table 6. Chi-square statistics obtained by running beer_repoll_analysis.php

Statistic    DF   Obtained   Probability   Critical value
Chi-square   3    1.14       0.77          7.81

Table 6 shows that, under null conditions, the probability of obtaining a chi-square value of 1.14 is 77 percent. You cannot rule out the null hypothesis that the preferences of Nova Scotia beer drinkers have not changed since the last poll. Any differences between the observed and expected frequencies can be explained as expected sampling variability within the same population of Nova Scotia beer drinkers. Considering that the new poll results were produced simply by adding the constant 100 to each of the original poll results, this null finding is not surprising.

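The re-poll comparison can be sketched as follows, assuming the proportions from the first poll serve as the expected probabilities. The function name chiSquareAgainstProportions and the shape of its return value are hypothetical choices made for illustration; they are not the interface of the article's ChiSquare1D class.

```php
<?php
/**
 * One-way chi-square against expected proportions from an earlier poll.
 * Returns the expected frequencies, per-cell contributions, the
 * chi-square value, and the degrees of freedom.
 */
function chiSquareAgainstProportions(array $observed, array $expProb): array
{
    $total    = array_sum($observed);
    $expected = [];
    $cells    = [];
    foreach ($observed as $label => $o) {
        $e = $total * $expProb[$label];          // expected frequency
        $expected[$label] = $e;
        $cells[$label]    = ($o - $e) ** 2 / $e; // contribution of this cell
    }
    return [
        'expected' => $expected,
        'cells'    => $cells,
        'chiSq'    => array_sum($cells),
        'df'       => count($observed) - 1,
    ];
}

$oldPoll = ['Keiths' => 285, 'Olands' => 250, 'Schooner' => 215, 'Other' => 250];
$newPoll = ['Keiths' => 385, 'Olands' => 350, 'Schooner' => 315, 'Other' => 350];

// Proportions observed in the first poll become the expected probabilities.
$oldTotal = array_sum($oldPoll);
$expProb  = array_map(function ($n) use ($oldTotal) {
    return $n / $oldTotal;
}, $oldPoll);

$result = chiSquareAgainstProportions($newPoll, $expProb);
printf("chi-square = %.2f on %d df\n", $result['chiSq'], $result['df']);
```

With the counts from Tables 1 and 4, the expected frequencies come out at 399, 350, 301, and 350, and the chi-square value rounds to 1.14, well below the critical value of 7.81 reported in Table 6.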
However, you can imagine that the results had changed, and that they suggested another brand of beer was becoming more popular (note the size of the variance reported at the bottom of each column in Table 5). You can further imagine that such a finding would have significant financial implications for the breweries in question, since bar owners tend to stock the best-selling beer.

These results would be scrutinized by brewery owners, who would ask questions about the appropriateness of the analysis procedure and the experimental methodology; in particular, they would ask about the representativeness of the samples. If you plan to run a Web experiment whose results may have important practical implications, you need to pay as much attention to the experimental method used to collect the data as to the analytic technique used to draw inferences from it.

So this article not only lays a foundation for improving your understanding of Web data; it also offers advice on how to safeguard your choice of statistical test and make the conclusions you draw from the data more defensible.

What you have learned

In this article, you have learned how to apply inferential statistics to the frequency data so commonly used to summarize Web data streams, with a focus on the analysis of Web poll data. The simple one-way chi-square procedure discussed here can, however, also be applied effectively to other types of data streams (access logs, survey results, customer profile information, and customer orders) to turn raw data into useful knowledge.

In applying inferential statistics to Web data, I also introduced the idea of treating data streams as the results of Web experiments, so that experimental design considerations are more likely to be brought to bear when making inferences. Often you cannot make inferences because you lack sufficient control over the data acquisition process.
However, this can change if you apply the principles of experimental design to the Web data collection process (for example, by randomly selecting voters for your Web poll).

Finally, I demonstrated how to simulate the sampling distribution of chi-square for different degrees of freedom, rather than merely asserting where it comes from. In doing so, I also demonstrated a workaround (simulating the sampling distribution of an experiment with a small $NTrials value) for the prohibition against using the chi-square test when the expected frequency of a measurement category is less than 5 (in other words, in small-N experiments). So rather than using only the df of the study to compute the probability of a sample result, for a small number of trials I may also need to use the $NTrials value as a parameter in finding the probability of the observed chi-square result.

It is worth considering how you might analyze small-N experiments, because you often want to analyze your data before data acquisition is complete: when each observation is expensive, when observations take a long time to obtain, or simply because you are curious. When attempting this level of Web data analysis, it is good to keep the following two questions in mind:

* Are you justified in making inferences under small-N conditions?
* Can simulation help you determine what inferences to draw in these circumstances?
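One way to explore the small-N questions above is to estimate the tail probability directly by simulation, using the number of trials as a parameter. The sketch below is an assumption about how that might be done, not the article's code: the function names chiSquareEqual and simulatedTailProb, the fixed seed, and the 12-vote example are all hypothetical.

```php
<?php
/** Chi-square statistic under the equal-frequency null hypothesis. */
function chiSquareEqual(array $counts): float
{
    $e = array_sum($counts) / count($counts);
    $chiSq = 0.0;
    foreach ($counts as $o) {
        $chiSq += ($o - $e) ** 2 / $e;
    }
    return $chiSq;
}

/**
 * Estimate P(chi-square >= $obtained) under the null hypothesis by
 * simulating $nExps experiments of $nTrials random choices among
 * $nOptions equally likely options.
 */
function simulatedTailProb(float $obtained, int $nExps, int $nTrials, int $nOptions, int $seed = 42): float
{
    mt_srand($seed); // fixed seed so that repeated runs give the same estimate
    $asExtreme = 0;
    for ($exp = 0; $exp < $nExps; $exp++) {
        $counts = array_fill(1, $nOptions, 0);
        for ($trial = 0; $trial < $nTrials; $trial++) {
            $counts[mt_rand(1, $nOptions)]++;
        }
        // Small tolerance guards against floating-point ties
        if (chiSquareEqual($counts) >= $obtained - 1e-9) {
            $asExtreme++;
        }
    }
    return $asExtreme / $nExps;
}

// Hypothetical small-N poll: 12 votes split 7 / 3 / 2 among three options.
$obtained = chiSquareEqual([7, 3, 2]);
printf("chi-square = %.2f, simulated tail probability = %.3f\n",
    $obtained, simulatedTailProb($obtained, 5000, 12, 3));
```

Because the simulated experiments use the same number of trials as the real poll, the estimate does not depend on the continuous chi-square approximation, which is the point of the workaround. The estimate itself varies with the seed and the number of simulated experiments.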

Reposted from: http://www.phpe.net/articles/377.shtml
