Talk:Statistics

From Bvio.com

This page is for discussion of the article about statistics. Comments and questions about the special page about Wikipedia site statistics (number of pages, edits, etc.) should be directed to Wikipedia talk:Special pages.


Miscellaneous

I was taught statistics starting with the definition "a statistic is a function of data" as the first sentence of the Part 1B Stats course at Cambridge. I think the definition was useful and so it should be included. BozMo(talk). Done

On special:Statistics, what are 'junk pages'? They seem to equal total pages minus (non-talk comma pages + talk pages). How many of these are #REDIRECTs? --Damian Yerrick

Why is the Main Page article counter different than the one in Statistics? --Chuck Smith

It's been some number of years since I studied statistics, but the terms used throughout the article did ring some bells very quietly in the back of my mind. The singular exception was ANOVA, so I followed the link to seek an explanation: Analysis of variance. That was familiar! I was then surprised by the number of hits that Google gave me about ANOVA (197,000). Still, I believe that the full expression is far more meaningful than the acronym, and I don't think that we should be encouraging the use of cute but meaningless acronyms. Eclecticology, Thursday, May 2, 2002

The three topics of statistics -- experimental design, description/exploration and inference -- are excellently described. The ongoing discussion between data miners and modellers (e.g. Statistical Modeling: The Two Cultures, Leo Breiman and discussants, Statistical Science 2001;16:199-231) might deserve some more attention. Johannes Hüsing

I wonder if we can improve on the phrase "uncertain observations"? It's not the observations that are uncertain; it's what they entail about the population from which they came, the uncertainty resulting from the random way in which the observations came from the population. Michael Hardy 20:00 17 Jul 2003 (UTC)

Well, unless you're talking about measurement error, in which case the observations are uncertain. Anyway, I agree that the article needs a major rewrite. Oh, I guess that's not what you said... - dcljr 00:15, 9 Aug 2004 (UTC)
Even with measurement error, it's not the observations that are uncertain. You know what number your measuring instrument gave you; what you're uncertain about is what it should have given you. Michael Hardy 01:09, 9 Aug 2004 (UTC)
Hmm. A subtle distinction, indeed. But whatever. As a statistician yourself, surely you can provide us with a better introductory paragraph than the current version.... (See also item "What is statistics?" below.) - dcljr 05:46, 10 Aug 2004 (UTC)

Suggest update to US National Statistical Services to FedStats

Under "National Statistical Services", it appears that for a particular country, that country's main national statistics site is listed, except for the United States. For the US, the American Statistical Association is listed, which is primarily a professional association for statisticians. I would suggest that the FedStats web site, http://www.fedstats.gov, be listed as the web link for the US. The FedStats web site is the US government's gateway portal to its underlying Federal statistical system, with links to more than 100 agencies with statistical information.

Puzzled by definition

Why is human knowledge part of the definition -- is it really necessary? CSTAR 03:26, 10 May 2004 (UTC)

I wouldn't call it a science either. — Miguel 06:28, 2004 May 10 (UTC)

Why not? Cf. Nelder JA (1999). From statistics to statistical science. The Statistician 48(2), 257-269. Johannes Hüsing

What is statistics?

I don't like the introductory paragraph. I haven't come up with anything better, but here's a "definition of statistics" I used when I taught the subject to undergraduates:

[Statistics] is a logic and methodology for the measurement of uncertainty and for an examination of the consequences of that uncertainty in the planning and interpretation of experimentation or observation.
— Stephen M. Stigler, The History of Statistics (Belknap/Harvard, 1986)

Of course, I followed it with a lot of explanation...

I propose interested parties list their own preferred definition of statistics (serious ones, I mean) here and maybe we can come up with a consensus on the best one. (And then monkeys... well, nevermind.)

- dcljr 05:46, 10 Aug 2004 (UTC)

For me, statistics is a methodology for the collection, interpretation and presentation of information - I don't feel strongly about the words "methodology" or "information", but I don't like "uncertainty" in the primary definition. You can have statistics on the numbers of Olympic Gold Medal winners so far; they may be right or wrong, but I have yet to see anyone put error bands on them. To me "uncertainty" is part of the collection, interpretation and presentation in many cases, but not always a necessary part. --Henrygb 23:39, 12 Aug 2004 (UTC)
Your discomfort with the word uncertainty seems to stem from the difference between descriptive statistics (your definition) and inferential statistics ("mine"). (continued below)
Hmm. Or not. I just looked at your contributions, Henrygb. Anyway, I still say to do (or describe) meaningful statistics you have to have the idea of uncertainty or randomness in there somewhere. - dcljr 23:07, 31 Aug 2004 (UTC)
In descriptive stats, you usually just take the data as given; whether it's the whole population or just a sample, you can summarize it graphically and numerically in much the same ways. My background is mathematical statistics, so I usually don't even think of the descriptive side when I think statistics. It's my own bias. Anyway, we should try to address both aspects. - dcljr 22:55, 31 Aug 2004 (UTC)

I came to statistics through management science, the applied branch of operations research, and econometrics, an applied branch of mathematical statistics, with a big dose of John Tukey's pragmatism. I wound up with a perspective that some find unusual. For one thing, management science gave me a decision theoretical outlook. Part of that is reserving the word "uncertain" for situations that lack probability distributions. Data are raw materials; there's no information until you interpret descriptive or inferential statistics. I'm not sure what level to shoot for here, but here goes. I've done things like this with more example and less technical stuff but that takes more time or space, and I wanted to be brief.

Before you get to description, you have to know about the population the data represent (if any - most online polls, for example, represent no one except those who happened to participate). That includes some sampling theory. Then there's data entry and preparation, including quality checks, etc.

Assuming the data are numeric rather than categoric (counts of people belonging to various political parties, for example), the biggest challenge in description is to get people to pay attention to more than the median or mean. Box plots (aka box-and-whisker diagrams or plots) are critical for understanding data whose center is taken to be the median. The standard deviation is critical if you're assuming the normal distribution (I like to call it Gaussian but that's a small point) and using the mean, etc. Otherwise, you're trapped into the talking head focus on a single number that conveys very little useful information.
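
To make the point about looking past a single number concrete, here is a minimal sketch of the summaries a box plot is built from, set next to the mean and standard deviation. The data values and the function name five_number_summary are made up for illustration, and the "inclusive" quartile method is only one of several conventions in use.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot displays."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population of interest;
    # other quartile conventions give slightly different Q1 and Q3.
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]


# Hypothetical data with one large value, chosen only for illustration.
sample = [2, 4, 4, 5, 7, 9, 30]

summary = five_number_summary(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)

print("five-number summary:", summary)
print("mean:", round(mean, 2), "sd:", round(sd, 2))
```

With skewed data like this, the mean sits well above the median, which is the kind of detail a lone "average" hides.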

Once I get past description, statistics is about figuring out how much risk you are willing to take. Sometimes that's a guesstimate (choosing between pizza places in a town you've never visited before), sometimes it's as precise as you can make it (choosing the person who will perform open heart surgery on a loved one or yourself). In formal inference, that value is alpha and the decision about whether to reject the applicable null hypothesis comes down to whether the estimated risk that rejecting the null is a Type-I error (the p-value) is larger or smaller than the risk you are willing to take. If p>alpha, there is too much risk of a Type-I error to reject the null given your ex-ante choice of alpha. If alpha>=p, the risk of a Type I error is small enough (according to your ex-ante choice) to reject the null.
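
The decision rule at the end of that paragraph can be sketched in a few lines. This is only an illustration of the comparison described above; the function name reject_null and the example numbers are hypothetical, and the p-value is assumed to have been computed already by whatever test applies.

```python
def reject_null(p_value, alpha):
    """Apply the rule from the comment above: reject when alpha >= p.

    alpha is the risk of a Type-I error chosen before looking at the data;
    p_value is assumed to come from some test that has already been run.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return p_value <= alpha


# Hypothetical results at the conventional alpha = 0.05.
print(reject_null(0.03, 0.05))  # p is below alpha
print(reject_null(0.20, 0.05))  # p is above alpha
```

Fixing alpha before seeing the data is what makes the comparison meaningful; choosing it afterwards defeats the purpose of the ex-ante threshold.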

A single paragraph along those lines might be something like:

"Statistics is the art and science of seeking to understand a population and predict its future by collecting and using data that represent the population. Data collection includes sampling, data entry, and checking. Using data in statistics has two parts. Descriptive statistics includes estimates of most likely data values, their variation, and graphs. Inferential statistics looks for associations and causal relationships between variables that help to explain observed and predict future values."

That doesn't say anything about data mining, an approach that was taboo in my econometric youth. I haven't kept up with the subject, though, so I'm in no position to say anything about it here. If it's an outgrowth of resampling theory, for example, I'd be sympathetic even though that probably puts me outside mainstream econometrics, but I don't know enough to comment one way or another. --George Brower


Ah, now this paragraph (George's above) is, I think, mainly coming from a practical perspective of statistics as a set of procedures and "best practices" (i.e., what I would call applied statistics). (No offense, oversimplifying your viewpoint like that...) I come at statistics from a more theoretical standpoint (much to the chagrin of my students), emphasizing why those practices work and (ultimately, like in grad school) how to assess their efficacy and develop new and better ones. But my perspective is probably more suited to the mathematical statistics article (part of the reason I created it in the first place — in time I hope it will grow into something "useful").

I accept that this article should remain almost entirely "applied". At the very least we should allude to the following in the first paragraph:

  • data collection (sampling, etc.)
  • data summary (descriptive stats)
  • data interpretation (inference, relationship)

A more detailed outline, which might be the basis of constructing the opening paragraphs (i.e., preferably above the table of contents):

  • basics
    • population
    • sample
    • randomness (uncertainty) and probability (frequentist/subjectivist viewpoints should probably be alluded to but not explained in any detail)
  • focus
  • data collection
    • sampling
    • experimental design
  • data summary: descriptive statistics
    • graphical
    • numerical
  • data interpretation: inferential statistics
    • estimation
    • prediction
    • hypothesis testing
  • relationships and modeling
    • correlation
    • regression/ANOVA
    • time series
    • data mining? (I don't know much about it either!)

Obviously, and not surprisingly given my previous admissions, this reads like a course syllabus. But it does stress what you can actually do with statistics. If we could somehow pack all that information (if only obliquely, and certainly not necessarily in that order) into the opening paragraphs without hopelessly confusing everyone, that would be great!

Subsequent sections can flesh out what it all means and point to "main articles" about each topic for more detail. (Still, obviously I'm envisioning a much lengthier article!)

I think we should also mention above the table of contents the use of "statistics" or "stats" as a synonym for "data" and why that's not quite right.

These are my thoughts at the moment, anyway...

- dcljr 22:55, 31 Aug 2004 (UTC)

My attempt at article lead section

I just discovered the term lead section for what I've been variously calling preamble, intro[duction], introductory paragraphs, and stuff above the table of contents. <g>

Anyway, I'm sure some people thought it would be impossible to include all that stuff (see my previous comment) in the lead, but here's my attempt. I got almost everything in there.

Statistics is a broad mathematical discipline which studies ways to collect, summarize and draw conclusions from sample data. It is applicable to a wide variety of academic disciplines from the physical and social sciences to the humanities, as well as to business, government and industry.
Once data is collected, either through a formal sampling procedure or by recording responses to treatments in an experimental setting (cf experimental design), or by repeatedly observing a process over time (time series), graphical and numerical summaries may be obtained using descriptive statistics.
Randomness and uncertainty in the observations is modeled using probability in order ultimately to draw inferences about the larger population. These inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression).
The framework described above is sometimes referred to as applied statistics. In contrast, mathematical statistics (or simply statistical theory) is the subdiscipline of applied mathematics which uses probability theory and analysis to place statistical practice on a firm theoretical basis.
The word statistics (or stats) is also used colloquially to refer to data collected on an entire population rather than a subset of it. Formally, however, statistics is almost always based on samples. In fact, the word statistic (singular) may be defined as a quantity calculated from sample observations.

I found that I just couldn't find a good way to stick in the frequentist/subjectivist thing. My concern about that was mainly to point out the difference between "classical" and "bayesian" approaches. Perhaps another short "non-sequitur" paragraph could deal with that. Also, I didn't say anything about ANOVA (which is closely related to hypothesis testing, regression and experimental design, so I didn't feel too bad about not mentioning it by name) or data mining (maybe just doesn't belong in the lead). Oh, and not all the links lead to useful articles at this point. (continued below)

I think there is no need to mention the frequentist/subjectivist split in an article on statistics. As far as "best practices" go, you can use whatever philosophy you like, or none at all, to come up with good statistical practice. In mathematical statistics, everyone must agree that, as mathematical theorems, frequentist and bayesian theorems are all "true". Finally, for a while I have held the opinion that frequentism as a philosophy of probability stems from the erroneous identification of the definition of probability on the one hand, and the measurement of a probability on the other hand. Whatever the meaning one ascribes to the word "probability", there is essentially only one way to determine it empirically, and that is to observe a large random sample and make inferences about it using statistics. — Miguel 07:53, 2 Sep 2004 (UTC)
But the two probability interpretations do lead to (almost) completely different approaches to inference. It probably should be mentioned somewhere, just not in the lead. BTW, despite being educated almost entirely from the frequentist perspective, I'm always a little uncomfortable when relative frequency is presented in textbooks as the "definition" of probability. (IOW, I agree with you.) - dcljr 19:31, 2 Sep 2004 (UTC)

Comments? Suggestions? (...I ask with much trepidation) - dcljr 20:49, 1 Sep 2004 (UTC)

Well it's better than what's there now. The reference to human knowledge in the first sentence of the current article is weird (I can't decide whether it's redundant or just wrong). Your additions will be the object of further modifications, but I suggest you blow away the current lead section. CSTAR 23:44, 1 Sep 2004 (UTC)
Okay, I'll leave it here for a few days so others can comment. If there are no strong objections, I'll move it to the article. - dcljr 19:31, 2 Sep 2004 (UTC)
Be bold in updating pages. Miguel 17:33, 3 Sep 2004 (UTC)
In my opinion: I am happy with your first paragraph except for the word "sample"; the rest of your paragraphs should be in the contents; statistics is not "formally" about samples; nor is your distinction between mathematical statistics and applied statistics particularly clear. --Henrygb 01:04, 5 Sep 2004 (UTC)
Is this a Bayesian/frequentist (/decision theory) thing? As I recall, all the classes I've taken and all (?) the textbooks I've seen talk about the subject in terms of samples — both applied and theoretical approaches. I guess I still don't understand what alternative you're proposing. (If not "uncertainty", if not "samples", then what?? Hmm... Are you the person who added the note about decision theory in the opening paragraph?) And when you say "formally", how formal are we talking? "Let X1, X2, ..., Xn be a random sample" formal? "Let X be a random vector with covariance matrix T" formal? "Let X be absolutely continuous with respect to Lebesgue measure μ" formal? Anyway, as I've already mentioned, I don't think this should be an article about statistical theory. Speaking of which, that's what I mean by mathematical statistics: the theory as opposed to the applications (applied = what you do with statistics; theory = why it works). I'm not sure how I could make that paragraph more clear. Suggestions? - dcljr 18:41, 7 Sep 2004 (UTC)
No. I mean things both like "the population of the United Kingdom is about 59.5 million", and like "the difference between the mean and the median is less than or equal to one standard deviation", neither of which have anything to do with samples, but are about data. Statistics covers both of these, as well as sampling. --Henrygb 00:44, 11 Sep 2004 (UTC)

I'm responding to Henrygb's last comment above (at 00:44, 11 Sep 2004), but the indentation is getting a bit extreme, so it's back to the left margin... Okay. Your examples actually wouldn't (necessarily) be covered by the term "statistics" in my book (especially in an article that's trying to explain what statistics is, as opposed to other, similar disciplines/practices):

  • "the population of the United Kingdom is about 59.5 million"

This figure is a "statistic" only in the colloquial sense of the word. It's presumably based on a census. That's not statistics (as in, "I have a degree in statistics"). In fact, you may be familiar with the controversy over using statistical methods in the U.S. census (see the Census article). It's not allowed under most people's interpretation of the relevant clause in the Constitution. (This only serves to illustrate the difference in the concepts; I'm not saying it's an airtight argument.) One could argue that graphical and numerical summaries of populations fall under the term "descriptive statistics", but no one objects to the use of those techniques to interpret census data. My point is, when the word "statistics" is used by statisticians (or by someone teaching the subject, etc.) it almost always means "inferential statistics", which uses information about a sample to infer something about a larger population. Of course, confusing the whole issue is the use of the word "statistics" by governments to refer to census data and summaries thereof (e.g., "Statistical Abstract of the United States" or the "Bureau of Labor Statistics"). The difference here is akin to the difference between the colloquial use of the term geography to refer to the "lay of the land" of an area, and the academic subject of geography, which studies many other things. In any case, the issue(s) you raise (and I've discussed) here should certainly not be ignored, but should be dealt with directly in the article.

  • "the difference between the mean and the median is less than or equal to one standard deviation"

That statement can be made in probability; you don't need statistics at all for that one. Certainly statistics relies heavily on probability, but they are different fields (just as engineering and physics are very different fields, even though the former relies heavily on the concepts and methods of the latter). This is why a great many Wikipedia articles start out, "In probability and statistics..." and not just "In statistics...." I don't want to offend you, Henrygb, but may I ask what your academic background is, especially as it relates to statistics? As you can see above, at first I thought your objections were based on a philosophical difference among statisticians (Bayesians, etc.), then I thought maybe you were objecting at a deep mathematical/theoretical level. I'd like to know what exactly you're basing your views on. - dcljr 05:17, 13 Sep 2004 (UTC)
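
For readers who want to see the mean/median statement in action, here is a small numerical check. It treats each list as a complete set of values and therefore uses the population standard deviation; the data sets and the function name mean_median_gap_in_sd are invented for illustration, and a check on a few examples is of course not a proof.

```python
import statistics


def mean_median_gap_in_sd(values):
    """Return |mean - median| expressed in population standard deviations."""
    sd = statistics.pstdev(values)  # population SD (divisor n)
    if sd == 0:
        return 0.0  # all values equal, so mean and median coincide
    return abs(statistics.mean(values) - statistics.median(values)) / sd


# Hypothetical data sets, including strongly skewed ones.
examples = [
    [1, 2, 3],
    [0, 0, 1],
    [0, 0, 0, 0, 100],
    [1, 1, 2, 3, 5, 8, 13, 21, 34],
]

for data in examples:
    print(data, round(mean_median_gap_in_sd(data), 3))
```

Every ratio printed stays at or below 1, in line with the quoted statement, and nothing in the calculation involves a sample drawn from a larger population.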

A strange request, but I'll play. I have a mathematics degree from the University of Cambridge having concentrated on what was called "applicable mathematics" (i.e. numerical analysis, probability, statistics, mathematical economics, coding theory etc.). I am now a member of the (British) Government Statistical Service. Your turn.
I am saying statistics is about data and its handling, presentation and use for drawing inferences, and that the use of samples is only one part of that. What you describe as the "colloquial sense of the word" (which presumably also refers to topics like baseball statistics) is not only the origin of statistics but one of its major contemporary meanings. While random variables and distributions in probability have descriptive statistics, so too do data sets which are not random. Indeed I would suggest that what you think of as statistics is much more probability based than the broader concept I am considering. Look at the list of statistical topics and my guess is that the majority of the articles do not mention sampling. --Henrygb 00:13, 14 Sep 2004 (UTC)
So... when you're doing inference and not using sampling, then you must be using either Bayesian analysis or some decision-theoretic approach, right? Not classical inference (t-test, ANOVA...). Anyway, never mind. I give up. If others want to weigh in on this subject, please do. Henrygb, at my User page you can see both my statistics credentials (User:dcljr) and my (latest) revised lead section (User:dcljr/Statistics#Preamble; I know you won't agree with one sentence in there). I haven't done anything to the article yet because I'd like to flesh out a little more of the main article text to complement the extensive lead section I'm proposing. Then others can have at it. - dcljr 06:15, 21 Sep 2004 (UTC) I removed the offending statement from my lead section draft in my last edit. - dcljr 06:36, 21 Sep 2004 (UTC)

Probability

I can't make heads or tails from this paragraph:

However, this can often lead to misunderstandings and dangerous behaviour, because people are unable to distinguish between, e.g., a probability of 10-4 and a probability of 10-9, despite the very practical difference between them. If you expect to cross the road about 105 or 106 times in your life, then reducing your risk of being run over per road crossing to 10-9 will make you safe for your whole life, while a risk per road crossing of 10-4 will make it very likely that you will have an accident, despite the intuitive feeling that 0.01% is a very small risk.

What is meant by 10-4 or 10-9? Is that meant to be scientific notation (ten to the -4th and 10 to the -9th)?

The example makes little sense either. Why 105 or 106 road crossings and not 100, say. And I don't think reducing the risk to 10-9 means it will make you safe for your whole life, rather than that it will be very unlikely that you will be run over.

Unfortunately, the only statistics I learnt was in high school, so I'm not certain how to improve this article myself.

--Martin Wisse 06:51, 2 Nov 2004 (UTC)
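
On the questions above: the "0.01%" in the quoted paragraph suggests the figures are powers of ten with the superscripts lost, i.e. 10^-4 and 10^-9 per crossing and 10^5 or 10^6 crossings in a lifetime. Under that reading, and assuming each crossing is independent with the same risk (a simplification), the lifetime risk can be sketched as follows. The function name lifetime_risk is made up for illustration.

```python
import math


def lifetime_risk(per_event_risk, n_events):
    """Probability of at least one accident in n independent events.

    Computes 1 - (1 - p)**n using log1p/expm1 so that very small p
    does not get lost to floating-point rounding.
    """
    return -math.expm1(n_events * math.log1p(-per_event_risk))


# Assumed reading of the quoted figures as powers of ten.
for p in (1e-4, 1e-9):
    for n in (10**5, 10**6):
        print(f"p = {p:g}, crossings = {n}: lifetime risk = {lifetime_risk(p, n):.6f}")
```

On these assumptions a per-crossing risk of 10^-4 makes an accident close to certain over a lifetime, while 10^-9 leaves a lifetime risk of roughly one in a thousand or less: small, but not zero, which supports the objection that "safe for your whole life" overstates it.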
