The Importance of Good Data Sets When Backtesting (Garbage In Equals Garbage Out)

Last Updated on August 30, 2022

Good data is important in trading. After ten years of day trading, I have learned the importance of good data the expensive way. The famous saying “garbage in, garbage out” is indeed true.

Your backtest is only as good as the data you are testing on. Make sure you are backtesting on reliable and “clean” data. In the long run, it pays off to spend money on a good data source for backtesting.

I have probably lost tens of thousands of dollars on trading strategies that are based on “garbage”. Sad but true.

(Before we go on we’d like to mention that we have a backtesting course that covers all aspects of how to backtest.)

Yahoo!finance is often wrong

The problem is that there is not much you can do about it. Or is there? Through writing this blog I’ve been contacted by several people. Yesterday one of them sent me his own SPY data, which he downloaded from Interactive Brokers (IB) himself. I’ll do some testing on this dataset to see the differences between it and EOD data from Yahoo!finance. No doubt the dataset downloaded from IB is better than what you get from many providers.

First, I’ll show you two errors in SPY which I still can remember (in Yahoo!Finance):

This example is from the 30th of November 2011.

It’s correct that there was a big gap up at the open, but the low is completely wrong. Even a paid data feed includes this low price (in its EOD data, not its intraday data). In many strategies, if you rely on the low of the day to set profit targets, this day will turn out to be a huge winner. But in reality, this low trade never happened. In fact, the low that day was only some 20 cents below the open, not 2 dollars as shown here.
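To make the effect concrete, here is a minimal sketch of how a backtest can book a winner that never existed. The prices, the 1 dollar target, and the function name `short_gap_trade` are all made up for illustration; they are not the actual SPY quotes for that day, and the rule (short at the open, cover at a fixed target if the low reaches it, otherwise cover at the close) is only an assumed example of a strategy that relies on the low.

```python
def short_gap_trade(open_price, low, close, target_offset):
    """Short at the open; cover at (open - target_offset) if the day's low
    reaches that level, otherwise cover at the close.
    Returns the profit per share."""
    target = open_price - target_offset
    exit_price = target if low <= target else close
    return round(open_price - exit_price, 2)


# Hypothetical prices (assumptions, not real quotes).
open_price = 100.00
close = 100.50
reported_low = open_price - 2.00  # erroneous EOD low, 2 dollars under the open
actual_low = open_price - 0.20    # low that really traded, 20 cents under the open

pnl_reported = short_gap_trade(open_price, reported_low, close, target_offset=1.00)
pnl_actual = short_gap_trade(open_price, actual_low, close, target_offset=1.00)

print("P&L with the reported low:", pnl_reported)  # target counted as filled
print("P&L with the actual low:  ", pnl_actual)    # target never reached
```

With the erroneous low the backtest assumes the target was filled and records a gain of 1 dollar per share. With the low that actually traded, the same rule exits at the close for a loss of 50 cents.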

Here is the second example:

This one is from the 9th of April 2012. The chart shows the gap being filled, but it’s fake. The high of the day was 75 cents above the open, not close to 2 dollars as shown in the chart! If you fade the gap, this turns out to be a huge fake winner.




Worth noting is that the CLOSE is basically 100% right. OPEN is also reasonably correct. It is the HIGH and LOW prices of the day which are sometimes (very) wrong.

A comparison between two data providers: Yahoo!finance and Interactive Brokers

Below is a comparison of the manually downloaded quotes from IB and the EOD quotes from Yahoo!finance. It shows the percentage difference in the OPEN to HIGH move and in the OPEN to LOW move (the difference between the OPEN to HIGH from IB and the OPEN to HIGH from Yahoo!finance, and likewise for the lows).

The first chart shows that Yahoo!finance has a lot of high quotes that are a lot higher than IB’s. The second chart shows the same pattern: the low in Yahoo!finance is often a lot lower than IB’s.
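A comparison like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bars are hypothetical, the function names `pct_move` and `compare_sources` are made up, and the sign convention (first source minus second source) is my own choice here.

```python
def pct_move(open_price, extreme):
    """Percentage move from the open to the high (or low) of the day."""
    return (extreme - open_price) / open_price * 100.0


def compare_sources(bars_a, bars_b):
    """For every date present in both sources, return the difference
    (source A minus source B, in percentage points) of the open-to-high
    and open-to-low moves."""
    diffs = {}
    for date in sorted(bars_a.keys() & bars_b.keys()):
        a, b = bars_a[date], bars_b[date]
        high_diff = pct_move(a["open"], a["high"]) - pct_move(b["open"], b["high"])
        low_diff = pct_move(a["open"], a["low"]) - pct_move(b["open"], b["low"])
        diffs[date] = (round(high_diff, 2), round(low_diff, 2))
    return diffs


# Hypothetical daily bars (assumptions, not real quotes).
yahoo = {
    "2020-01-02": {"open": 100.0, "high": 102.0, "low": 98.0},
    "2020-01-03": {"open": 100.0, "high": 101.0, "low": 99.5},
}
ib = {
    "2020-01-02": {"open": 100.0, "high": 100.75, "low": 99.8},
    "2020-01-03": {"open": 100.0, "high": 101.0, "low": 99.5},
}

diffs = compare_sources(yahoo, ib)

# Flag the days where the two sources disagree by more than a tolerance
# (0.25 percentage points is an arbitrary threshold for this sketch).
tolerance = 0.25
flagged = [d for d, (h, l) in diffs.items() if abs(h) > tolerance or abs(l) > tolerance]
print(diffs)
print(flagged)
```

A positive first number means source A shows a higher high relative to the open, and a negative second number means it shows a lower low. The flagged days are the ones worth checking by hand before trusting any backtest result that depends on them.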

The question is: are these differences so brutal that it makes a theoretically good strategy useless?

Yesterday I wrote about opening gaps in SPY. And yes, the results are a lot worse. This morning I tested on all three options: EOD from Yahoo!finance, intraday data collected from IB, and intraday data from a third source. All three yield significantly different numbers! When using EOD data from IQFeed I basically get the same result as with Yahoo!finance.

Conclusion about good data sets when backtesting:

So the conclusion must be: if you’re testing on only the CLOSE and OPEN data, you’re (mostly) on solid ground no matter your data provider.

If you’re using the HIGH and LOW from EOD data, you must be careful. Always test such strategies by paper trading on the quotes you actually see, or by trading as small as you can for a period.


  • Interactive Brokers data is not reliable, so I heard. They apparently don’t use real streaming data; they use snapshots, so data gets missed. Also, I see a lot of spikes on my charts. I prefer to use Yahoo Finance end-of-day data for swing trading.

    • Hi Jon,

      I don’t know much about IB’s data. But I can assure you EOD data at Yahoo!finance has A LOT of mistakes. I know of several. These mistakes do add up! That’s why I prefer live trading when testing.

      • I didn’t know that. I have been using Yahoo SPY data and data for some other ETFs, and they seem to match up; well, it works for me. What mistakes do you mean? I have to account for splits, but that’s all? I only know that EEM had a weird spike a few months back, but that’s the only thing that probably was bad data.

        I heard on Elitetrader that IB uses snapshots. I don’t care to confirm it, and since I’m not day trading it doesn’t matter a lot to me.

        • Hi, the high and low of the day are sometimes very wrong. As far as I can see the open and close are ok, but not high and low. Hence, I try to avoid using those two.

        • Jon, I understand what you mean, but it is not relevant. IB compresses intraday data into snapshots of about a third of a second (i.e. 3 snapshots per second). But you get OHLC for each snapshot. This means that the daily OHLC computed from the snapshots is good and reliable.

  • Hey Oddmund,

    Just recently started reading your blog and I’ve found your posts insightful for sparking personal research efforts.

    I recently started developing a system highly dependent on accurate high/low data. What data sources would you recommend that provide EOD data with accurate high/lows? What about for intraday data?