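Peter J. Huber, a pioneer of robust regression, a renowned American statistician, and a former science adviser to the US President, said in a lecture at the Institute of Mathematical Statistics of the Chinese Academy of Sciences in Beijing in November 1997: "Many statisticians with a mathematical background are accustomed to applying the deterministic mode of thinking of mathematics to the non-deterministic problems of statistics. They have thereby made some serious mistakes and caused much confusion in ideas and methods." He also looked forward to a force from outside mathematics that could drive change in both statistics and mathematics.
For close to 14 years I have worked on rebuilding the logic and algorithms of threshold regression analysis, also known as segmented regression or piecewise regression. I have persisted because I believe that no system of mathematical axioms can deduce this methodology, and that the current methodology contains serious theoretical errors. The two questions in this field that trouble me most are the following:
Intuitively, I do not think this correspondence can be expected, because both the minimum combined prediction residual and the statistics of the corresponding set of random threshold models are random "point" measurements. The correspondence between them is like the random point correspondence between height and weight in a homogeneous group of people observed at a given sample size. If our aim is to use the random variable "height" to make a statistical decision about some attribute of the random variable "weight", we obviously cannot use min(height) or max(height) to reach a stable and reliable decision about that attribute of "weight". Such "optimization" is absolutely unacceptable in statistics, because if we could use min(X) or max(X) to make a statistical decision about Y, where both X (perhaps an optimizer) and Y (perhaps a set of parameters of a set of threshold models) are random variables, then all the fundamentals of Statistics would collapse. In fact, as early as 1962, John Tukey warned of the dangers of "optimization" in statistics in his famous long paper "The Future of Data Analysis".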
If the continuity between two adjacent threshold models is not inferred probabilistically, the method is not a statistical one but a mathematical game that imposes an arbitrary assumption of certainty on something uncertain.
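Annals of Statistics (7 revisions. First meaningful comment: this paper attempts to challenge the large body of Statistics and Mathematics, but at its current level of English writing it is not enough to convince readers. Final comment: we suggest submitting to a somewhat lower-tier journal.)
Computational Statistics and Data Analysis (2 revisions. Only comment: the author's claims are somewhat presumptuous.)
The American Statistician (1 submission. Only comment: unable to judge whether the views and methods of this paper are correct.)
I have put the two questions above to Xiao-Li Meng (孟晓犁), chair of the Harvard statistics department, and to Tony Cai (蔡天文), currently an associate editor of Annals of Statistics. However, neither of these two outstanding statisticians with mathematical backgrounds was willing to respond. So the two puzzles remain unresolved for me. I believe that no mathematical statistician with a mathematical background can give an affirmative proof of them, because they are two fallacies in the field of statistics: errors in analytic logic and in mathematical algorithms caused by missing concepts.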
In a mathematician's eyes, a sample is a given set in which nothing varies, so the set is treated as a certainty. However, a sample is a random set that varies relative to the population. Nothing about it is certain.
The optimization relies on the idea of "one-to-one correspondence" to make the model selection. It is a shame for a mathematician to proceed this way, since nothing in a random sample is a one-to-one correspondence. Every correspondence in a random sample is random.
All models are wrong, but some are useful
--- Statistician George E P Box, in "Science and statistics", Journal of the American Statistical Association 71:791-799, quoted in Holling, C S, Stephen R Carpenter, William A Brock, and Lance H Gunderson, “Discoveries for Sustainable Futures”, Ch. 15 in Gunderson, Lance H and C S Holling, Panarchy: Understanding transformations in human and natural systems, Island Press (2002), p. 409
nightrider commented:
Reply to TNEGI//ETNI's comment:
Please refer to my response below inline between the dotted lines as such:
--------------
my response
---------------
Reply to nightrider's comment:
Thank you very much for your time and attention. I would like to take this opportunity to clarify something that I may not have expressed clearly in this blog article, though it has been clearly stated in my papers in two JSM proceedings.
> The "segmented regression or piecewise regression" you mentioned refers to this http://en.wikipedia.org/wiki/Segmented_regression, right? <
To be exact, my concept of "segmented regression or piecewise regression" (I prefer the latter as the formal term in the field) does not come from that website, but from several top journals in Statistics, such as JASA and Annals of Statistics.
The classical method in this field was developed from 1959 to 1979, and then turned to splines as the modern form, with the enforced continuity assumption and smoothing techniques. Although the methodology for piecewise regression has been developed continuously since then, the basic assumptions and the computation techniques have remained almost the same. What has improved is only the computation techniques for estimating each threshold (or change-point, or node) and for smoothing the connections in splines in different situations. No one had ever doubted the theoretical issues behind the assumptions and the computation techniques until I began to doubt them in 2007.
-----------------
Good that you provide a little background information. But you still have not stated clearly what your objection is.
------------------
> Of course the line can be replaced with nonlinear parametric curves.<
No, sometimes we don't need a smooth non-linear curve to describe the entire process, but need a threshold at which to change something, e.g. an investment policy. A smooth curve may not help to find the critical point for making a decision.
---------------------
You misunderstood my statement. I meant that the curves between the break points or discontinuities can be smooth parametric curves, linear or not. After all, the discontinuity is what you are after, isn't it? You need only a finite number of discontinuities, don't you? So the rest of the curve has to be continuous or smooth, doesn't it?
-------------------------
> Does your first question concern with the legitimacy of the least square method for deducing the parameters? <
No, the LSM is correct for estimating model parameters covering a specific whole sample. What I criticize is the computation technique based on an optimization approach for deciding on the piecewise models, and the assumption of enforced continuity for estimating the thresholds and smoothing the connection between any two adjacent piecewise models in a whole sample space.
In the current methodology, we usually do not know where a threshold or node is, so we have to search for it in the sample space using a real sample. This means that we have to treat each real sample point as a possible threshold or node. Thus, if the sample size is n and there is only one threshold, we will have n pairs of piecewise models and therefore n combined sums of squared residuals. Then which pair should we expect to be the right one? The current method takes the smallest of the n combined sums of squared residuals (this is an optimization approach) to select the model, and then estimates a theoretical threshold by setting Model_1 = Model_2 (this is the so-called enforced continuity) in the selected pair of piecewise models.
It sounds extremely solid from a mathematical point of view, right? However, if the connection variability at an unknown sampling threshold cannot be assumed to be zero, we cannot use the equation Model_1 = Model_2 to estimate the unknown threshold or node. This will be an ultimate obstacle for a mathematician in Statistics. It means that the current methodology is a dead end! We have to find another way.
--------------------------
You need to be more specific in explaining the present methodology of "estimating a theoretical threshold by taking Model_1 = Model_2" and your objection concerning "connection variability". Could you give a reference for a thorough, mathematically rigorous treatment of the present methodology and a link to your "papers in two JSM's proceedings"? The discussion would be much more efficient and concrete if we could look at the mathematics.
-------------------
> Is the "enforced continuity" in your second question referring to the whole of the regression curve consisting of the segments (straight line or not) having to be continuous? <
Yes!
-------------------
Now you are confusing me. If the curve is piecewise, then discontinuities are allowed and continuity is not enforced. Judging from your comments above, your answer here should be "No".
---------------------
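TNEGI//ETNI commented:
Reply to 3722's comment:
> All models are incorrect, but some models are useful. <
In my opinion, this may be the fallacy of an ignorant person. Instead of striving to find a path that is as adequate as possible, up to the ultimately correct one, he excuses himself from responsibility in a sophistical tone.
TNEGI//ETNI commented:
Reply to nightrider's comment: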
Thank you very much for your time and attention. I would like to take this opportunity to clarify something that I may not have expressed clearly in this blog article, though it has been clearly stated in my papers in two JSM proceedings.
> The "segmented regression or piecewise regression" you mentioned refers to this http://en.wikipedia.org/wiki/Segmented_regression, right? <
To be exact, my concept of "segmented regression or piecewise regression" (I prefer the latter as the formal term in the field) does not come from that website, but from several top journals in Statistics, such as JASA and Annals of Statistics.
The classical method in this field was developed from 1959 to 1979, and then turned to splines as the modern form, with the enforced continuity assumption and smoothing techniques. Although the methodology for piecewise regression has been developed continuously since then, the basic assumptions and the computation techniques have remained almost the same. What has improved is only the computation techniques for estimating each threshold (or change-point, or node) and for smoothing the connections in splines in different situations. No one had ever doubted the theoretical issues behind the assumptions and the computation techniques until I began to doubt them in 2007.
> Of course the line can be replaced with nonlinear parametric curves.<
No, sometimes we don't need a smooth non-linear curve to describe the entire process, but need a threshold at which to change something, e.g. an investment policy. A smooth curve may not help to find the critical point for making a decision.
> Does your first question concern with the legitimacy of the least square method for deducing the parameters? <
No, the LSM is correct for estimating model parameters covering a specific whole sample. What I criticize is the computation technique based on an optimization approach for deciding on the piecewise models, and the assumption of enforced continuity for estimating the thresholds and smoothing the connection between any two adjacent piecewise models in a whole sample space.
In the current methodology, we usually do not know where a threshold or node is, so we have to search for it in the sample space using a real sample. This means that we have to treat each real sample point as a possible threshold or node. Thus, if the sample size is n and there is only one threshold, we will have n pairs of piecewise models and therefore n combined sums of squared residuals. Then which pair should we expect to be the right one? The current method takes the smallest of the n combined sums of squared residuals (this is an optimization approach) to select the model, and then estimates a theoretical threshold by setting Model_1 = Model_2 (this is the so-called enforced continuity) in the selected pair of piecewise models.
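As a rough illustration of the search just described, here is a minimal Python sketch for the single-threshold, straight-line case. It is not taken from any of the papers or journals mentioned in this thread. The function names `fit_line` and `search_single_threshold`, the `min_pts` parameter (which skips splits that leave too few points to fit a line, so fewer than n candidates are actually tried), and the use of NumPy are all assumptions made for the example.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b, ssr)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(coef[0]), float(coef[1]), float(resid @ resid)

def search_single_threshold(x, y, min_pts=3):
    """Try every admissible split of the sorted sample, keep the pair of
    lines with the smallest combined sum of squared residuals, then set
    Model_1 = Model_2 to get the 'enforced continuity' threshold."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    best = None
    # Candidate split k: first segment is x[:k], second is x[k:].
    for k in range(min_pts, n - min_pts + 1):
        a1, b1, s1 = fit_line(x[:k], y[:k])
        a2, b2, s2 = fit_line(x[k:], y[k:])
        total = s1 + s2
        if best is None or total < best["ssr"]:
            best = {"k": k, "ssr": total,
                    "model_1": (a1, b1), "model_2": (a2, b2)}
    if best is None:
        raise ValueError("not enough points for two segments")
    (a1, b1), (a2, b2) = best["model_1"], best["model_2"]
    # Enforced continuity: a1 + b1*t = a2 + b2*t  =>  t = (a1 - a2)/(b2 - b1).
    # Parallel fitted lines have no intersection, so report None.
    best["threshold"] = (a1 - a2) / (b2 - b1) if b1 != b2 else None
    return best
```

Note that the sketch returns a single point estimate from one sample. It says nothing about the sampling variability of the selected split or of the intersection, which is the issue raised in this comment.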
It sounds extremely solid from a mathematical point of view, right? However, if the connection variability at an unknown sampling threshold cannot be assumed to be zero, we cannot use the equation Model_1 = Model_2 to estimate the unknown threshold or node. This will be an ultimate obstacle for a mathematician in Statistics. It means that the current methodology is a dead end! We have to find another way.
> Is the "enforced continuity" in your second question referring to the whole of the regression curve consisting of the segments (straight line or not) having to be continuous? <
Yes!
3722 commented:
All models are incorrect, but some models are useful. (I forget who said it.)
nightrider commented:
TNEGI//ETNI:
I am trying to understand your two questions. As it appears that you have expended so much time and effort trying to understand and challenge what you call mistakes in statistics, would it not be helpful for you and for your audience to state the problems clearly and rigorously first? What you have written here does not suggest that you have done that. If what appears here is what you wrote to the journals and the experts, then at least to me it is not clear what you are trying to say. I will have to say that some of the review comments you quoted are not that far off the mark regarding the clarity of your presentation.
As an attempt at clarification, allow me to ask you a few questions. The "segmented regression or piecewise regression" you mentioned refers to this http://en.wikipedia.org/wiki/Segmented_regression, right? Of course the line can be replaced with nonlinear parametric curves. Does your first question concern the legitimacy of the least squares method for deducing the parameters? Is the "enforced continuity" in your second question referring to the whole of the regression curve consisting of the segments (straight line or not) having to be continuous?
You are a great hero in the sports you mention below, and I hope you win them. However, if you are a great statistician, please leave your answers to those questions, since I have said that this blog is a challenge to anyone in the field of Statistics; otherwise, dream of being whatever you wish to be.
pillar commented:
I tried to challenge Federer at tennis, but he did not answer; I tried to defeat Kobe at basketball, but he did not show up; I tried to race Bolt in the 100m dash, but he ignored me. So I have decided to record this here, so that mankind will witness that such a great sportsman has lived.
TNEGI//ETNI commented:
Reply to needtime's comment:
The mathematics in Statistics should not contradict itself!
"I have published what I tried to communicate on my own blog, as a challenge to the entire system. This challenge will remain here so that people can witness this tragedy in the history of science." That's a huge statement. It is only logical that whoever makes such a statement should know that the best place to discuss these issues is the leading scientific journals, not here with laymen.