The most popular numbers

Update

I managed to solve the ‘data grabbing’ problem discussed at the bottom of this post. I don’t know what I was thinking trying to do it in Visual Basic – it’s actually really easy to do in R! The problem is that it’s going to take my computer weeks to get all the data, since it has to visit 1 million web pages to collect it. I’m putting every computer I can get my hands on onto this and I’ll post an update as soon as I have the results.

 

We’ve all ended up on the ‘weird’ section of YouTube before. And we’ve all ended up going down a rabbit hole on Wikipedia and reading up on some ridiculously obscure topic.

One of my favourite ‘odd’ sections of the internet has to be the ‘On-Line Encyclopedia of Integer Sequences’ (OEIS). Since 1964, this project has been cataloguing every mathematically relevant number sequence. Now, it’s huge. There are 267787 sequences in the encyclopedia at the time of writing. Anything you can think of – from the prime numbers, the Fibonacci numbers and the square numbers to ‘the expansion of theta series of {E_7}* lattice in powers of q^(1/2)’ – it’s all there.

I have no idea what that last one means, by the way.

Anyway, that’s a lot of numbers. Let’s see if there are any interesting patterns to be found.

First we need to note the following:

  1. For each sequence, the OEIS will list only the first 40 or so numbers. So, the frequency of a number appearing in the database is always going to decrease as the number increases.
  2. Determining how many times each number appears in the database is actually quite difficult. I tried to write a macro to search for each number and return the number of results found. I almost managed it but am not good enough at Visual Basic. If anyone knows how to do this, please let me know! In the end I managed to find a spreadsheet of the frequency of occurrence for the first 65536 (256×256) numbers. I would eventually like to analyse the first million.
  3. I have not included negative numbers.

So, I have a spreadsheet with the number of times the first 65536 numbers appear in the entire database.

Let’s call each number in the database (i.e. 1 to 65536) ‘i’.

Let’s call the number of times each number appears ‘N(i)’.

If we want to make a meaningful plot out of this, we must notice the first 100 or so numbers are used way more than the larger numbers. The number 1 appears the most (366547 times), but by the time we get above 250 or so, the frequency drops to around 50 per number. To smooth this out, let’s take the log of N(i).
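To get a feel for how much the log compresses the range, here is a small Python sketch (the plots in this post were made in R; Python is used here purely for illustration). It uses only the two figures quoted above, 366547 for the number 1 and roughly 50 for numbers above 250, and it assumes the natural log, since the post doesn’t say which base was used.

```python
import math

# Figures quoted in the post: N(1), and a typical N(i) for i above ~250.
n_one = 366547
n_typical = 50

# How many times bigger the largest value is than a typical one,
# before and after taking logs (natural log is an assumption).
raw_ratio = n_one / n_typical
log_ratio = math.log(n_one) / math.log(n_typical)

print(f"raw ratio: {raw_ratio:.0f} to 1")
print(f"log ratio: {log_ratio:.1f} to 1")
```

Changing the base of the log rescales every value by the same constant, so the ratio, and the look of the plot, stays the same whichever base is used.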

Now, let’s plot a 256×256 heat map of log(N(i)). Each number is represented by a pixel, with the bottom left pixel = 1. i increases as we go right and each row up represents an increase of 256:

[Image: heat all]

Frequency of occurrence of the first 65536 integers in the OEIS. The image is a 256×256 grid, so the integer 1024, for example, is represented by the last pixel on the 4th row up.
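The grid layout can be sketched in a few lines of Python. This is an illustration only, not the R script linked at the end of the post: the function names are made up here, and storing a count of zero as `None` is an assumption, since the post doesn’t say how integers that never appear are handled.

```python
import math

def pixel_for(i, width=256, first=1):
    """Return (row, col) for integer i; row 0 is the bottom row, col 0 the left edge."""
    offset = i - first
    return offset // width, offset % width

def log_grid(counts, width=256):
    """Arrange counts into rows of log values, bottom row first.

    counts[0] belongs to the first integer. A count of 0 is stored as None
    (an assumption), because log(0) is undefined.
    """
    logs = [math.log(n) if n > 0 else None for n in counts]
    return [logs[start:start + width] for start in range(0, len(logs), width)]

# The caption's example: 1024 should be the last pixel on the 4th row up.
row, col = pixel_for(1024)
print(row + 1, col + 1)  # 1-based row and column: 4 256
```

The same `pixel_for` helper works for any width, which is handy when trying other grid sizes.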

 

Well, there’s some interesting looking stuff there. But, there’s an obvious problem. Let’s make the heat map 3D to see it:

 

[Image: 3d all]

The above plot in 3D. In this view we are looking from the upper right corner of the heat plot above. I have included the code to produce these plots in R as they are fun to play around with.

Even though we took the log of the frequency of occurrence, the first few numbers still totally drown out any patterns in the higher numbers. So, let’s repeat the heat plot above with the first row (i.e. the first 256 numbers) removed:

[Image: heat exc first row]

Heat plot of frequency of occurrence of integers 257 to 65536

There’s definitely some interesting stuff going on here:

  • The diagonal line from the bottom right to the top left. This indicates a higher-than-average frequency of sequences with an interval of 255, starting around 500. I have no idea why this is prevalent.
  • Those vertical lines represent sequences with an interval of 256. The most prevalent start with the numbers 64, 128, 192 and 256 (i.e. quarters of 256). This isn’t too surprising as these are products of base 2 arithmetic.
  • There are some diagonal lines with a gradient of -1/3. They are difficult to see at first – one begins at the top left. There seem to be another three, each a quarter of the way down. These indicate sequences with an interval of 253. Again, I have no idea where these come from.
  • There is an odd grouping around the 60000 mark. The grouping seems to be such that as i increases, N(i) decreases quite rapidly. But when i increases by 256, N(i) sharply increases again. This continues for a few cycles. Again, no idea why.
  • There is also quite a high frequency of numbers around the midpoint (i=32768). Again, probably not too surprising as this is likely to be a product of base 2 arithmetic (2^15 = 32768).
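One way to see why these intervals show up as straight lines: moving up one row adds 256, so a sequence stepping by d lands d − 256 columns to the right of its previous point on the next row up. A rough Python sketch of that arithmetic follows. The function name is made up for illustration, and it is only meaningful for intervals close to the row width.

```python
from fractions import Fraction

def line_gradient(interval, width=256):
    """Gradient (rows up per column right) of a constant-interval sequence.

    Only meaningful when the interval is close to the width, so that each
    step lands on the next row up. Returns None for a vertical line.
    """
    shift = interval - width  # columns moved right per row moved up
    if shift == 0:
        return None  # same column on every row: a vertical line
    return Fraction(1, shift)

for d in (253, 255, 256, 257):
    print(d, line_gradient(d))
```

An interval of 255 gives a gradient of −1 (the bottom-right to top-left diagonal), 256 gives a vertical line, and 253 gives −1/3, matching the shallower diagonals described above.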

The way we have plotted the data will only show certain patterns. For example, diagonals will only appear for sequence intervals around 256. Let’s do the same plot for the integers 101 to 10000 (i.e. a 100×99 plot):

[Image: heat 100]

Heat map of frequency of occurrence of the first 10000 integers (excluding 1-100)

I suspect this plot would give just as interesting patterns as the 256X256 plot above. However, we are only using about 1/6 as much of the data so the ‘resolution’ is lower. We can still see some points of interest:

  • There is a clear pattern of verticals with a spacing of 10. These represent sequences with an interval of multiples of 10.
  • There is a clear diagonal from the top left to the bottom right. This represents an interval of 99 between integers.
  • There is a distinct peak around the 2/3 point – it is the number 6666.
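The same bookkeeping shows where things sit in this smaller plot. The Python sketch below assumes the grid starts at 101 in the bottom left and is 100 pixels wide, which is one reading of the description above; the helper name is hypothetical.

```python
def pixel_for(i, width=100, first=101):
    """Return (row, col) for integer i; row 0 is the bottom row, col 0 the left edge."""
    offset = i - first
    return offset // width, offset % width

# Every multiple of 10 lands in one of ten columns spaced 10 apart,
# which is one way evenly spaced verticals could arise.
cols = sorted({pixel_for(i)[1] for i in range(110, 10001, 10)})
print(cols)

# Where the peak at 6666 sits: row 65 of rows 0..98, column 65 of 0..99.
row, col = pixel_for(6666)
print(row, col)
```

Any fixed last digit would produce verticals with the same spacing, so the plot alone doesn’t say which residue is responsible. Row 65 out of 98 is roughly two thirds of the way up, which fits the position of the peak.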

 

Further work and a challenge

I’ve attached the raw data and code. This can be run in R to bring up 3D plots for all the heat maps. I don’t think they really give any more info but they are quite fun to play with.

I would like to do a heat plot for the first 1 million integers. However, as mentioned at the top, I can’t figure out how to write something which automatically downloads the correct values from the OEIS site. If anyone knows how to do this, feel free to message me – I’ve attached a spreadsheet of the URLs from which the data would be taken (all 1 million of them). Clicking on one takes you to the ‘results’ page for the relevant integer. If anyone can find a way to grab the result from that page and put it into the spreadsheet automatically, give me a shout – I’ll happily send £25 to whoever can do this!
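For anyone curious, the general idea can be sketched in Python using only the standard library. This is a rough illustration under stated assumptions, not a tested scraper: the regular expression assumes the results page contains wording like ‘of 1234 results found’, the function names are made up, and the sample string is invented for the demonstration. Check the real page before relying on any of it.

```python
import re
import time
import urllib.request

# Assumption: the results page contains text like "of 1234 results found".
RESULT_COUNT = re.compile(r"of\s+([\d,]+)\s+results?\s+found")

def parse_result_count(html):
    """Pull the total number of results out of a results page, or None if absent."""
    match = RESULT_COUNT.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))

def fetch_count(url, pause=1.0):
    """Download one results page and return its result count.

    The pause keeps the request rate polite. Not called in this sketch,
    since it needs network access.
    """
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    time.sleep(pause)
    return parse_result_count(html)

# Invented sample text, standing in for a downloaded page.
sample = "Displaying 1-10 of 1234 results found."
print(parse_result_count(sample))
```

At one request per second, 1 million pages would take about eleven and a half days of continuous running, so splitting the list of URLs across several machines is a sensible way to shorten the wait.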

Here is the R code for the graphics, you’ll need to install the ‘rgl’ library to use it:

https://www.dropbox.com/s/bhpcbnba8r8enzu/OEIS%20script.R?dl=0

Here is the data from the OEIS which the code uses:

https://www.dropbox.com/s/p3gens2stggq7nn/OEISNC.txt?dl=0

And here is my spreadsheet of URLs to search results for the first 1 million integers. £25 for whoever can make this automatically retrieve the search result for each URL into the adjacent cell!

https://www.dropbox.com/s/o746akd4hcrzm7f/Web%20search.xlsm?dl=0

 

Any comments on my interpretation? Seen something I have missed? Just let me know in the comments.
