One simple experiment I have on the go has given me pause to reflect on the data — largely thanks to the time gap between finding one data point and the next. But I’ve appreciated having an opportunity to stop and think about individual data points carefully. Not because the underlying data is that novel. Rather, I’ve specifically considered how each of those data points affects my regression and the numeric estimates of the true relationship for the data. It’s one thing to take a statistics class and learn the formulae, but it’s another to see and intuitively understand what they mean.

In a slow, effortful experiment, each data point comes at a cost. There might be better — or worse — regions for sampling the domain of the independent variable, to look for the next data point. What I begin to wonder is:

How will the next data point change my estimate of the line of best fit?

What would my estimates look like if that data point were somewhere different? If random effects push a point far from my current regression fit, is it really that bad? Or if it’s right on the line, is that really helpful and might it depend on where along the line it falls?

For me, this reflection offered the potential for a better design for my experiments. And after taking the time to visualize the data, it also gave me a better and more intuitive understanding of the parameters of the linear regression I am using.

The answer you can’t find…

I’d like to note that, prior to writing this, I tried searching for many of those questions — mostly with image searches, since I was looking for someone who had made one or more of the graphs in this post. Not with the same data, of course, but with any regression and data, so I could see what is happening. However, I never managed to find someone with the same question, nor a graph sufficiently similar to what I had in mind.

(A subset of search queries tried: P-value evolution; heatmaps of p-values; effect of single data point on p-value; robustness of regression to additional data; ordinary least squares robustness; etc… )

It might just be buried and inaccessible in some statistics journal, or I may lack the correct technical jargon to elicit the desired materials from Google, DuckDuckGo, or Bing. But I cannot fathom that no one else has asked this question, or answered it.

One possible factor in the dearth of answers: I suspect this is also a somewhat taboo approach to the query, since I am trying to find out how a regression will change based on a hypothetical data point, in order to learn how robust the regression parameters are. That could be the jumping-off point for data manipulation. And data manipulation is a no-no. But the point here is seeking the truth and understanding it, particularly since I started this (like most good questions) out of curiosity. There is also a good pedagogical argument: graphs like these would have helped me when I was first learning introductory statistics, as visual guides to my understanding.

My data

Back to the question, and to trying to solve it for my data!

Currently I have 10 data points, illustrated below with arbitrary x-units and y-units. Running a linear regression on those 10 points gives a coefficient of determination of 63% and a p-value (against the hypothesis that no relationship exists) indicating only a 1 in 155 probability that I would see a relationship of this magnitude, in this many data points, by chance, were no relationship present.
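For anyone who wants to follow along, here is a minimal sketch of such a fit using SciPy’s linregress. The x and y arrays are placeholders rather than my actual data, so the printed numbers will not reproduce the 63% and 1-in-155 figures above.

```python
# Minimal sketch: ordinary least squares fit of 10 points with scipy.stats.linregress.
# The arrays below are placeholder data, not my real measurements.
import numpy as np
from scipy import stats

x = np.array([100, 250, 400, 550, 700, 850, 1000, 1250, 1500, 1800], dtype=float)
y = np.array([120, 180, 150, 260, 230, 310, 280, 390, 360, 450], dtype=float)

fit = stats.linregress(x, y)
print(f"slope     = {fit.slope:.3f}")
print(f"intercept = {fit.intercept:.1f}")
print(f"R^2       = {fit.rvalue**2:.1%}")                    # coefficient of determination
print(f"p-value   = {fit.pvalue:.4g}  (about 1 in {1/fit.pvalue:.0f})")
print(f"SE(slope) = {fit.stderr:.4f}")
```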

Here’s a plot with the data, fit, and 95% confidence intervals plotted:

Graph of 10 data points with regression fit and 95% confidence interval

That looks like a reasonable fit. But there’s ample randomness in the data, so I’d like to get a better sense of how fixed these parameters really are. What if the next data point is really “random”? How will the next data point change my estimate of the line of best fit?

Try adding a point or two

Given the data already in hand, I can calculate what will happen if we add one more data point at some fixed position. I’ve shown two such next data points in the graph below.

Graph of original data plus point A or point B to see the effect on parameters

One point, at x = 1400 (on an x-domain running from 0 to 2000 x-units) and well above the 95% confidence interval, increases the slope by 16% and decreases the p-value by 33%. Another point, also at x = 1400 but well below the 95% confidence interval, decreases the slope by 21% and increases the p-value by 607%.
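For the curious, here is a sketch of the comparison behind those numbers: append one candidate point, re-fit, and compare against the original fit. The data and the two candidate points at x = 1400 are illustrative stand-ins, so the percentages printed will not match those quoted above.

```python
# Sketch: effect of one additional point on the regression, using placeholder data.
import numpy as np
from scipy import stats

x = np.array([100, 250, 400, 550, 700, 850, 1000, 1250, 1500, 1800], dtype=float)
y = np.array([120, 180, 150, 260, 230, 310, 280, 390, 360, 450], dtype=float)
base = stats.linregress(x, y)

# Two hypothetical next points at x = 1400: one above and one below the current fit.
candidates = {"A (above the CI)": (1400.0, 520.0), "B (below the CI)": (1400.0, 220.0)}
for label, (xn, yn) in candidates.items():
    new = stats.linregress(np.append(x, xn), np.append(y, yn))
    print(f"point {label}: "
          f"slope change = {(new.slope - base.slope) / base.slope:+.0%}, "
          f"p-value change = {(new.pvalue - base.pvalue) / base.pvalue:+.0%}")
```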

What seems noteworthy to me is how robust the 95% confidence interval is: both of the new fits stayed well within the bounds of the original confidence interval, despite the added points appearing as outliers. What might be more surprising is how much more sensitive the p-value is (to where the next data point lies) compared to the slope, intercept, standard error of the slope, or the coefficient of determination (R²). The p-value changed by a larger range of factors than any other parameter.

Try adding every next data point

Generally it doesn’t make sense to poke around, trying one or two data points at a time to see where the regression could go. What would be more helpful is to have a map of where the regression will go.

The beauty of being able to create a simple program in Python (or another language) is that I can check all the possible data points. Using the same approach I used to check how a single additional point changes the outcome, I just need to iterate through every possible point, or a reasonable sampling of them. The only added requirement is that I record all of the associated parameters in matrices, so I can visualize the results as a 2D colour mesh or as a contour plot.
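Here is a rough sketch of that sweep for the slope, again with placeholder data; the grid bounds and resolution are arbitrary choices, and the same loop can record the intercept, standard error, R², and p-value alongside.

```python
# Sketch: sweep a grid of candidate (x, y) positions for the next point, re-fit
# each time, and store the percent change in slope for a colour-mesh/contour plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([100, 250, 400, 550, 700, 850, 1000, 1250, 1500, 1800], dtype=float)  # placeholder data
y = np.array([120, 180, 150, 260, 230, 310, 280, 390, 360, 450], dtype=float)
base = stats.linregress(x, y)

xs = np.linspace(0, 2000, 120)            # candidate x positions
ys = np.linspace(0, 600, 120)             # candidate y positions
slope_pct = np.empty((ys.size, xs.size))  # rows follow y, columns follow x

for i, yn in enumerate(ys):
    for j, xn in enumerate(xs):
        new = stats.linregress(np.append(x, xn), np.append(y, yn))
        slope_pct[i, j] = (new.slope - base.slope) / base.slope * 100

fig, ax = plt.subplots()
mesh = ax.pcolormesh(xs, ys, slope_pct, cmap="viridis_r", shading="auto")
# With a single contour colour, matplotlib draws negative levels dashed by default.
cs = ax.contour(xs, ys, slope_pct, levels=np.arange(-80, 81, 10), colors="black")
ax.clabel(cs, fmt="%d%%")
ax.scatter(x, y, color="white", edgecolor="black", zorder=3)  # original points
fig.colorbar(mesh, label="% change in slope")
plt.show()
```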

So let’s have a look at each of the maps in turn.

How is the slope affected?

Graph with an underlying 2D colour map showing the percent change of slope for a data point added at x,y

The graph above uses a colour gradient (“viridis”, reversed) to display the percent change in the slope ((slopeⱼ₊₁-slopeⱼ)/slopeⱼ · 100) of the regression fit, which will occur if an additional data point is placed at a given x,y coordinate. The regions are demarcated at 10% intervals, with dashed black lines for negative percent changes, solid black lines for positive percent changes, and solid white lines for no change (the 0% change interval).

As expected, given that my regression has a positive slope, additional points in the upper right or lower left ‘quadrants’ tend to increase the slope, whereas additional points in the lower right or upper left ‘quadrants’ tend to decrease the slope.

Interestingly, there are two lines along which data points can be placed without affecting the slope of the regression. The first is the regression line itself, as one might expect. The second is a vertical line at x = 885, which is the mean x-value. Additional data points at the mean x-value won’t change the slope for any y-value; however, they will change the intercept.
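A quick numerical check of that observation, using the same placeholder data as in the earlier sketches:

```python
# Check: a point added at the mean x-value leaves the slope untouched for any
# y-value, while the intercept shifts. Placeholder data, as before.
import numpy as np
from scipy import stats

x = np.array([100, 250, 400, 550, 700, 850, 1000, 1250, 1500, 1800], dtype=float)
y = np.array([120, 180, 150, 260, 230, 310, 280, 390, 360, 450], dtype=float)
base = stats.linregress(x, y)

for y_new in (0.0, 300.0, 10_000.0):
    new = stats.linregress(np.append(x, x.mean()), np.append(y, y_new))
    print(f"y_new = {y_new:>8}: slope {new.slope:.6f} (was {base.slope:.6f}), "
          f"intercept {new.intercept:.1f} (was {base.intercept:.1f})")
```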

How is the y-intercept affected?

Graph with an underlying 2D colour map showing the difference in y-intercept for a data point added at x,y

The graph above uses a colour gradient (“viridis”) to display the difference in the y-intercept (interceptⱼ₊₁-interceptⱼ) of the regression fit, which will occur if an additional data point is placed at a given x,y coordinate. The regions are demarcated at intervals of 25 y-units, with dashed black lines for negative changes, solid black lines for positive changes, and solid white lines for no change (interceptⱼ₊₁ = interceptⱼ). The minor ticks on the y-axis correspond to 25 y-unit increments.

In this instance I opted not to use a relative scale for the changes in the intercept (interceptⱼ₊₁-interceptⱼ), for two reasons: 1. Absolute changes are more tangible and directly correspond to a position on the y-axis, for which I have included minor ticks on the same scale as the contours. 2. The y-axis could be shifted up or down with little meaning lost; however, such shifts would substantially change a relative plot.

As with the slope map, additional points in the upper right or lower left ‘quadrants’ tend to decrease the y-intercept, whereas additional points in the lower right or upper left ‘quadrants’ tend to increase it. Unsurprisingly, the trends for increasing and decreasing are roughly opposite to what we observed for the slope.

Interestingly, there are again two lines along which data points can be placed without changing the parameter under consideration. The first is the regression line itself, as one might expect. The second is a vertical line at x = 1206. Additional data points at that x-value won’t change the intercept for any y-value; however, they will change the slope.
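For anyone curious where that second line sits for their own data, one rough approach is to search numerically for the x-position at which the fitted intercept becomes insensitive to the added point’s y-value. The sketch below does this with the placeholder data used earlier; the bracket passed to the root finder is simply a guess that happens to contain the answer there.

```python
# Sketch: locate the vertical line along which an added point leaves the
# intercept unchanged, by finding where the intercept stops depending on the
# added point's y-value. Placeholder data; bracket values are assumptions.
import numpy as np
from scipy import optimize, stats

x = np.array([100, 250, 400, 550, 700, 850, 1000, 1250, 1500, 1800], dtype=float)
y = np.array([120, 180, 150, 260, 230, 310, 280, 390, 360, 450], dtype=float)

def intercept_sensitivity(x0):
    """Change in fitted intercept when the added point's y jumps from 0 to 1000."""
    low = stats.linregress(np.append(x, x0), np.append(y, 0.0)).intercept
    high = stats.linregress(np.append(x, x0), np.append(y, 1000.0)).intercept
    return high - low

x_star = optimize.brentq(intercept_sensitivity, x.mean() + 1, 3 * x.max())
print(f"the intercept is unchanged for points added along x = {x_star:.0f}")
```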

How is the standard error of the slope affected?

Graph with an underlying 2D colour map showing the percent change in the standard error (slope) for a data point added at x,y

The graph above uses a colour gradient (“viridis”) to display the percent change in the standard error of slope ((SEⱼ₊₁ - SEⱼ)/SEⱼ · 100) of the regression fit, which will occur if an additional data point is placed at a given x,y coordinate. The regions are demarcated at 10% intervals, with dashed black lines for negative percent changes, solid black lines for positive percent changes, and solid white lines for no change (the 0% change interval).

Values close to the line of best fit will decrease the standard error of the slope, as the data becomes more tightly clustered; it’s worth noting that there is more latitude available at the extremes of the x-domain. Above and below the line of best fit are parabolic-looking curves along which an added data point would not affect the standard error of the slope; these lines of no change look much like the 95% confidence interval plotted earlier, though set further apart from the regression line.

One other noteworthy feature is that, within the valley where the standard error of the slope decreases, the changes are more gradual with distance from the fit than they are for points where the standard error of the slope increases.

(Add a cross section plot here later?)

How is the coefficient of determination (R²) affected?

Graph with an underlying 2D colour map showing the percent change in the coefficient of determination (R^2) for a data point added at x,y

The graph above uses a colour gradient (“viridis”, reversed) to display the percent change in the coefficient of determination (R²) ((R²ⱼ₊₁-R²ⱼ)/R²ⱼ · 100) of the regression fit, which will occur if an additional data point is placed at a given x,y coordinate. The regions are demarcated at 10% intervals, with dashed black lines for negative percent changes, solid black lines for positive percent changes, and solid white lines for no change (the 0% change interval).

The contour for no change in R² shows a pair of curves — not quite straight lines — crossing at roughly the mean x-value (of the 10 original data points). As in most of these plots, we see the greatest “improvement” (increase) in R² near the extremes of the x-domain, and notably for values which would skew the fit towards a greater slope.

Again, the changes in the bottom of the valley are more gradual (per unit distance from the regression line) than on the hillsides further away, where the percent change in R² is far greater.

How is the p-value affected?

Graph with an underlying 2D colour map showing the relative change in p-value, logarithmically scaled, for a data point added at x,y

The graph above uses a colour gradient (“viridis”) to display the base-2 log of the relative change in p-value (log₂(pⱼ₊₁/pⱼ)) of the regression fit, which will occur if an additional data point is placed at a given x,y coordinate. The regions are demarcated at integer intervals, corresponding to factor of 2 changes, with dashed black lines for decreasing p-values (pⱼ₊₁/pⱼ < 1), solid black lines for increasing p-values (pⱼ₊₁/pⱼ > 1), and solid white lines for no change (pⱼ₊₁/pⱼ = 1).

The p-value was of most interest to me, since it is the parameter most sensitive to the position of the next data point. One consequence of this is that, to make the visualization interpretable, I had to use a log scale of the relative change in value (log₂(pⱼ₊₁/pⱼ)). Although p ranges from 0 to 1 just as R² does, it represents a probability rather than a proportion: 1 in 2, 1 in 4, … 1 in 256, and so on. Hence a logarithmic scale better represents the changes observed.
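The corresponding sweep for this map, again with placeholder data, stores log₂(pⱼ₊₁/pⱼ) so that each integer contour step corresponds to a doubling or halving of the p-value:

```python
# Sketch: map of log2(p_new / p_old) for a candidate next point at each grid
# position. Placeholder data and arbitrary grid bounds, as in the earlier sketches.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([100, 250, 400, 550, 700, 850, 1000, 1250, 1500, 1800], dtype=float)
y = np.array([120, 180, 150, 260, 230, 310, 280, 390, 360, 450], dtype=float)
base = stats.linregress(x, y)

xs = np.linspace(0, 2000, 120)
ys = np.linspace(0, 600, 120)
p_log2 = np.empty((ys.size, xs.size))
for i, yn in enumerate(ys):
    for j, xn in enumerate(xs):
        new = stats.linregress(np.append(x, xn), np.append(y, yn))
        p_log2[i, j] = np.log2(new.pvalue / base.pvalue)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(xs, ys, p_log2, cmap="viridis", shading="auto")
cs = ax.contour(xs, ys, p_log2, levels=np.arange(-6, 9), colors="black")  # integer levels = factors of 2
ax.clabel(cs, fmt="%d")
fig.colorbar(mesh, label="log2(p_new / p_old)")
plt.show()
```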

As with R², positions which “improve” (minimize) the p-value are clustered at the extremes of the x-domain and are skewed above and below the current regression fit. That is not too surprising, since an increased slope generally pushes our regression away from the null hypothesis, under which there is no relationship between x and y and the observed slope should approach zero, given a sufficiently large sample.

What is perhaps surprising is that, within this range of data, it is possible to improve my p-value by a factor of 8 (pⱼ₊₁ = pⱼ/8), or worsen it by a factor of 256 (pⱼ₊₁ = 256·pⱼ), depending on where the next point falls. That, to me, felt very unintuitive before looking at this map to see the contours of the change.

Depending on what control one has over sampling in the x-domain, this picture might lead one to simply aim for the extremes, given the minimal improvement expected from data in the middle of the x-domain.

(Add expectation plot here later?)

Peeking at inner workings

Converting my thought experiment into code and graphing the results gave me a chance to peek under the hood of a regression without needing to parse through the formal math of statistics. It offered a better and more direct way to understand how regressions work, and particularly how the data I collect, and the way it is collected, influence the regression parameters.

What was particularly curious was how certain positions on the graph might give an apparent improvement in the p-value or in R², yet pull the regression further from the population values (presuming that my current data is close to correct). In that regard, the maps are a caution not to be overly reliant on those values as complete determinants of the adequacy of a data sample. A “better” p-value or a “better” fit (as measured by R²) might not necessarily indicate a path to the ground truth, especially when the number of data points is small. Really, what might be indicated is simply an increase in slope relative to the null-hypothesis presumption that the true slope is zero.

Knowing and weighing that should help to avoid unintentionally p-hacking via the pattern known as “optional stopping”:

Optional stopping, also often referred to as ‘data peeking’, occurs when a researcher repeatedly computes a hypothesis test as data accumulate, and stops collecting data once a significant result has been obtained or a maximum sample size has been reached [16]. It is one of the most frequently mentioned p-hacking strategies in the literature (e.g. [6,8,12,14,57–59]), and has an admittance rate of 15.6% in John et al.’s [4] survey. Optional stopping differs from other p-hacking strategies in that it actively influences the data collection process. Whereas other p-hacking strategies assume that the researcher selectively analyses variables or observations in an existing dataset, optional stopping leads to an expansion of an initial dataset, while data preprocessing and analysis pipelines remain constant.

AM Stefan & FD Schönbrodt, Royal Society Open Science, 2023. doi:10.1098/rsos.220346

The catch there is that one requires an estimate of effect sizes and noise in order to correctly and reasonably design & pre-plan experiments, and not waste time with an excessive predetermined stopping point. For that I have another JupyterLab notebook which I hope to summarize soon.