About the data gathered

In our latest scan of Dec 2019:

About data deduplication and network correlation

Data deduplication and network correlation both have to be done before reliable statistics can be published. Unfortunately, they are also time-consuming tasks.

Server deduplication

We need to deduplicate data: IRC servers may listen on multiple IP addresses, some even in totally different netblocks (yes, really). It is very important that these duplicates are filtered out. Example: 10 servers together listened on 500+ IP addresses!
On the other hand, server names and network names are not unique, so they cannot be used as decisive factors on their own. Example: there are 900+ servers with the network names 'ROXnet' and 'debian'. These are default network names in configuration files, hence the high number of matches. They are obviously not two big networks.
Fortunately we have come up with a reliable way to deduplicate server data by comparing a combination of attributes: server name, network name, software version, uptime information and more. If all of these match, it is extremely likely to be the same server. We filtered out thousands of duplicates using this method.
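
The idea of matching on several attributes at once can be sketched in a few lines of Python. This is only an illustration, not our actual pipeline: the Observation record, its field names and the 60-second tolerance are assumptions made for the example. Uptime is converted to an approximate boot time so that the same server scanned at two different moments still produces the same fingerprint.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One scan result for one IP address (hypothetical record layout)."""
    ip: str
    server_name: str
    network_name: str
    version: str
    scanned_at: int  # Unix time of the scan, in seconds
    uptime: int      # uptime reported by the server, in seconds


def fingerprint(obs, tolerance=60):
    """Combine several attributes into one key.

    No single field is unique, so all of them must match. The boot time
    (scan time minus uptime) is bucketed to absorb small clock differences;
    two values that straddle a bucket edge would still be split.
    """
    boot_bucket = (obs.scanned_at - obs.uptime) // tolerance
    return (obs.server_name, obs.network_name, obs.version, boot_bucket)


def deduplicate(observations):
    """Group IP addresses that appear to belong to the same server."""
    groups = defaultdict(list)
    for obs in observations:
        groups[fingerprint(obs)].append(obs.ip)
    return dict(groups)


scan = [
    Observation("192.0.2.1", "irc.example.org", "ROXnet", "ircd-1.0", 1_000_000, 5_000),
    Observation("198.51.100.7", "irc.example.org", "ROXnet", "ircd-1.0", 1_000_600, 5_600),
    Observation("203.0.113.9", "irc.example.org", "ROXnet", "ircd-1.0", 1_000_000, 90_000),
]
servers = deduplicate(scan)
```

In this made-up scan the first two addresses share every attribute, including the derived boot time, so they collapse into one server. The third has the same default names but a different boot time and stays separate.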

Network correlation

It is not always easy to detect which servers belong to which network. Some statistics, such as user and channel counts, can only be published after this is done; otherwise they would not be reliable. There's no room for error: if you fail to detect that servers belong to the same network, you will very quickly count users and channels twice or more. This causes counts to be off by tens of thousands, which is not acceptable.
The 2016 scan contained insufficient data to do proper network correlation. Servers on the same network turned out to have different network names, while other, distinct networks shared the same network name. This wouldn't be much of a problem if a significant number of servers did not also block /MAP and /LINKS.
In the 2017 scan we gathered additional data which should hopefully help us in these cases.
Networks running services were easy to correlate, hence the user counts on the Services page.
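
To show why missed correlation inflates the numbers, here is a minimal Python sketch. It assumes that each already deduplicated server has a unique identifier, that every server reports the user count of its whole network, and that link pairs (for example from /LINKS output, where it is not blocked) are available. The function names and sample data are hypothetical, and this is one possible approach rather than a description of our method.

```python
def find(parent, server):
    """Return the representative of the group containing `server`."""
    while parent[server] != server:
        parent[server] = parent[parent[server]]  # path halving
        server = parent[server]
    return server


def correlate(servers, links):
    """Group servers into networks using known server-to-server links."""
    parent = {s: s for s in servers}
    for a, b in links:
        if a in parent and b in parent:
            parent[find(parent, a)] = find(parent, b)
    networks = {}
    for s in servers:
        networks.setdefault(find(parent, s), []).append(s)
    return list(networks.values())


def total_users(global_counts, links):
    """Count each network once, using the highest count any member reported."""
    return sum(
        max(global_counts[s] for s in network)
        for network in correlate(global_counts, links)
    )


# Global user count reported by each server (made-up numbers).
global_counts = {"srv-a": 1200, "srv-b": 1200, "srv-c": 300}
links = [("srv-a", "srv-b")]

naive = sum(global_counts.values())          # counts the first network twice
correlated = total_users(global_counts, links)
```

With these numbers the naive sum is 2700 while the correlated total is 1500. A single missed link is enough to count a whole network twice.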

I want to see more data / more graphs!

Follow us on Twitter if you want to stay informed. Send us a tweet if you have a suggestion or request.
Note that reliable statistics with user and channel counts can only be published after network correlation is done. This is hard, so we'll see when that happens.
Important: we will not give away data that may identify individual networks, servers or users.

Can I get a copy of the data set?

These data sets are currently available. If you use them, please credit ircstats.org. We will not give out a copy of the raw data. See the About page for our strict rules on privacy.