About the data gathered
In our latest scan of Dec 2022:
- About 4 billion IPv4 addresses were scanned (the entire IPv4 address space) on ports 6667 and 6697.
- In the IPv6 address space almost 7 million addresses were scanned (using the IPv6 hitlist) on ports 6667 and 6697.
- 7,492 servers were found.
- Our data gatherer was able to fully connect to 6,076 of these IRC servers. The other servers rejected our connection due to password or IP restrictions.
- Most statistics require us to fully connect. However, statistics such as SSL/TLS and CAP could be gathered on all servers.
- The numbers above are after deduplication: because IRC servers may listen on several IP addresses, thousands of duplicates had to be filtered out.
- Bouncers like psyBNC and BitlBee are also filtered out.
About data deduplication and network correlation
Data deduplication and network correlation are both necessary to get reliable statistics. Unfortunately, they are also time-consuming tasks.
Server deduplication
We need to deduplicate data: IRC servers may listen on multiple IP addresses, some even in totally different netblocks (yes, really). It is very important that these duplicates are filtered out. Example: 10 servers even listened on 500+ IP addresses in total!
On the other hand, server names and network names are not unique, so they cannot be used as decisive factors. For example, in 2017 there were 900+ servers with the network names 'ROXnet' and 'debian'. These are default network names in configuration files, hence the high number of matches. They are obviously not two big networks.
Fortunately we have come up with a reliable way to deduplicate server data using things like: server names, network names, software version, uptime information and more. If all these are the same then it's extremely likely it's the same server. We filtered out thousands of duplicates using this method.
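The idea can be sketched in Python as follows. This is only a minimal illustration, not the actual IRCStats code: the record layout and the field names (`server_name`, `network_name`, `version`, `boot_time`) are assumptions, and `boot_time` stands in for the uptime information mentioned above (scan time minus reported uptime).

```python
from typing import Iterable

# Hypothetical record layout: one dict per scanned IP address.
FINGERPRINT_FIELDS = ("server_name", "network_name", "version", "boot_time")


def fingerprint(record: dict) -> tuple:
    """Build a comparison key from fields that should match on every IP of one server."""
    return tuple(record.get(field) for field in FINGERPRINT_FIELDS)


def deduplicate(records: Iterable[dict]) -> list:
    """Keep one entry per fingerprint and collect all IPs that share it."""
    unique = {}
    for record in records:
        key = fingerprint(record)
        if key not in unique:
            # Copy the record so the caller's data is not modified.
            unique[key] = {**record, "ips": [record["ip"]]}
        else:
            unique[key]["ips"].append(record["ip"])
    return list(unique.values())


# Made-up example data: the first two entries are the same server on two IPs,
# the third shares the default-looking names but has a different boot time.
scan = [
    {"ip": "192.0.2.1", "server_name": "irc.example.org",
     "network_name": "ExampleNet", "version": "ircd-1.0", "boot_time": 1669852800},
    {"ip": "198.51.100.7", "server_name": "irc.example.org",
     "network_name": "ExampleNet", "version": "ircd-1.0", "boot_time": 1669852800},
    {"ip": "203.0.113.5", "server_name": "irc.example.org",
     "network_name": "ExampleNet", "version": "ircd-1.0", "boot_time": 1670000000},
]
servers = deduplicate(scan)
```

Comparing several fields at once is what keeps servers with identical default names apart: a match on name alone would merge the third entry into the first.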
Network correlation
Networks running services are easy to correlate, hence the user counts on the Services page.
However, on networks without central services it is not so easy to determine which servers belong to which network. Some statistics, such as user and channel counts, can only be published after this is done; otherwise they would not be reliable. There's no room for error: if you fail to detect servers belonging to the same network, you will very quickly count users and channels twice or more. This will cause counts to be off by tens of thousands, which is not acceptable.
The 2016 scan contains insufficient data to do proper network correlation. Servers on the same network turned out to have different network names, while other, distinct networks shared the same network name. This wouldn't be much of a problem if a significant number of servers did not also block /MAP and /LINKS.
From the 2017 scan onward additional data was gathered, but no significant effort was made to do network correlation for networks without services.
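To illustrate why link information matters for user counts, here is a rough Python sketch. It is not something IRCStats is known to do: it assumes that /LINKS-style output has already been reduced to pairs of deduplicated server ids (the ids `s1` to `s4`, the `links` pairs and the user counts are all made up), and it groups servers into networks by taking connected components with a small union-find.

```python
from collections import defaultdict


def group_networks(server_ids, links):
    """Return sets of server ids that are connected through the given links.

    Every id used in `links` must also appear in `server_ids`.
    """
    parent = {sid: sid for sid in server_ids}

    def find(sid):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[sid] != sid:
            parent[sid] = parent[parent[sid]]
            sid = parent[sid]
        return sid

    for a, b in links:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for sid in server_ids:
        groups[find(sid)].add(sid)
    return list(groups.values())


# Hypothetical data: s1, s2 and s3 are linked; s4 stands alone.
server_ids = ["s1", "s2", "s3", "s4"]
links = [("s1", "s2"), ("s2", "s3")]
# Each server reports the global user count of the network it is on.
global_users = {"s1": 120, "s2": 120, "s3": 120, "s4": 15}

networks = group_networks(server_ids, links)

# Summing per server counts the 120 users three times.
naive_total = sum(global_users.values())
# Counting once per detected network avoids that.
correlated_total = sum(max(global_users[s] for s in net) for net in networks)
```

With these made-up numbers the naive total is 375 while the per-network total is 135, which is the double counting described above in miniature. When servers block /MAP and /LINKS, the `links` input is simply missing and this approach has nothing to work with.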
SSL/TLS statistics
IRCStats started scanning and publishing results in 2016. However, the first year SSL/TLS data was gathered was 2017. In 2017, 2018 and 2019 the scan was done on port 6667 across the entire IPv4 address range. The IRC servers found listening on port 6667 were then scanned on port 6697, and SSL/TLS data was gathered.
In 2020 it was done differently: the entire IPv4 internet was scanned on BOTH port 6667 and port 6697, thereby picking up IRC servers that don't listen on port 6667 (IRC plaintext). This naturally has a strong effect on the SSL/TLS statistics from 2020 onward, as it turned out that in Dec 2020 about 10% of the IRC servers listened only on port 6697 and not on 6667.
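The effect of the two scan methods can be shown with a toy calculation in Python. The server population below is invented for illustration (10 servers, one of which listens only on 6697, mirroring the roughly 10% figure), and `tls_share` is a hypothetical helper, not part of any IRCStats tooling.

```python
def tls_share(servers, two_stage):
    """Fraction of discovered servers that listen on port 6697.

    `servers` is a list of sets of open ports. With `two_stage=True` only
    servers listening on 6667 are discovered (the 2017-2019 method); otherwise
    servers listening on either port are discovered (the 2020 method).
    """
    if two_stage:
        found = [ports for ports in servers if 6667 in ports]
    else:
        found = [ports for ports in servers if 6667 in ports or 6697 in ports]
    with_tls = sum(1 for ports in found if 6697 in ports)
    return with_tls / len(found)


# Made-up population: 5 plaintext-only, 4 on both ports, 1 on 6697 only.
servers = [{6667}] * 5 + [{6667, 6697}] * 4 + [{6697}]

old_method = tls_share(servers, two_stage=True)   # 4 of 9 discovered servers
new_method = tls_share(servers, two_stage=False)  # 5 of 10 discovered servers
```

The same population yields a higher SSL/TLS share under the newer method, so figures from before and after 2020 are not directly comparable.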
IPv6
Scanning the entire IPv6 address space is impossible (it would be 2 to the power of 128, so 340282366920938463463374607431768211456 IP addresses). Hence, we use the IPv6 hitlist, and in addition we mass-resolve all IRC server hostnames from the previous year and use their AAAA records. The first IPv6 scan happened in 2022.
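As a rough Python sketch of how such a target list could be assembled (the function names, the input format and the example addresses are assumptions, not the actual scanner): hitlist entries and resolved AAAA records are normalised with the standard `ipaddress` module so that different spellings of the same address collapse into one target.

```python
import ipaddress
import socket

# Sanity check of the figure quoted above.
assert 2 ** 128 == 340282366920938463463374607431768211456


def resolve_aaaa(hostname):
    """Return the IPv6 addresses of a hostname, or an empty list if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET6)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return [info[4][0] for info in infos]


def build_targets(hitlist, hostnames, resolver=resolve_aaaa):
    """Merge hitlist addresses and resolved hostnames into a sorted, duplicate-free list."""
    targets = set()
    for text in hitlist:
        targets.add(ipaddress.IPv6Address(text))
    for hostname in hostnames:
        for text in resolver(hostname):
            targets.add(ipaddress.IPv6Address(text))
    return sorted(targets)


# Example with documentation-range addresses and a stand-in resolver,
# so that it runs without network access.
def fake_resolver(hostname):
    return ["2001:db8::2", "2001:db8::3"]


targets = build_targets(
    hitlist=["2001:db8::1", "2001:DB8:0:0:0:0:0:2"],
    hostnames=["irc.example.org"],
    resolver=fake_resolver,
)
```

Passing the resolver as a parameter is just a convenience for testing; in a real run the default `resolve_aaaa` would perform the DNS lookups.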
I want to see more data / more graphs!
Follow us on Twitter if you want to stay informed. Send us a tweet if you have a suggestion or request.
Note that reliable statistics with user and channel counts can only be published after network correlation is done. This is hard, so we'll see when that happens.
Important: we will not give away data that may identify individual networks, servers or users.
Can I get a copy of the data set?
These data sets are currently available. If you use them, please credit ircstats.org.
- Servers data (JSON): server software in use, with the number of servers deployed for each version.
- CAP data (JSON): CAP capabilities offered by servers. The "parent array" contains numbers and percentages of the tokens in use on all servers. The arrays under each parent contain numbers and percentages by server software in use (this only includes servers that allowed us to fully connect).
- TLS protocol data (JSON): SSL/TLS protocol offered on port 6697 (if any). The "parent array" contains numbers and percentages of the SSL/TLS protocols available on all servers. The arrays under each parent contain numbers and percentages by server software in use (this only includes servers that allowed us to fully connect).
- TLS certificate data (JSON): Validity of SSL certificates offered on port 6697. The "parent array" contains numbers and percentages of the validity statistics of all servers that offer SSL/TLS on 6697. The arrays under each parent contain numbers and percentages by server software in use for servers with SSL/TLS on 6697 (this only includes servers that allowed us to fully connect).
- Services data (JSON): services package installed, with the number of networks and the number of users on these networks for each version.