A quick study of open proxy detection data between January 2022 and January 2024 was conducted to evaluate the life expectancy and permanence of openly accessible Internet traffic rerouters (also often referred to as traffic origin anonymisers). The specific objective of the study was to determine basic criteria for identifying IPs which were statistically likely to continue serving as open access traffic anonymisers in the medium term (at least one week).
The author operates scanning infrastructure for the detection of open proxies on the Internet. A data dump of verified detections over a 2-year time period (January 2022 to January 2024) was extracted for analysis. The data interval was kept relatively recent in order to ensure any derived conclusions were of contemporary relevance. Additionally, very short-lived IPs (less than 24 hours of lifespan) were removed from the dataset, as these accounted for nearly one third of the total number of IPs and would not have meaningfully contributed towards the objectives of the study. After filtering, the dataset consisted of 130,714 unique proxy IPs: 64,355 predominantly SOCKS, 69,257 predominantly HTTP, and a small number of hybrid cases in which one proxy protocol was predominantly active on an IP but another protocol made a brief appearance on it. The author acknowledges that a form of ‘survivorship bias’ is likely present in the source data, but for the purposes of the study the bias should be insubstantial.
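The filtering step described above can be sketched as follows. This is only an illustrative sketch, not the author's actual pipeline: the record layout (one `(ip, first_seen, last_seen)` tuple per IP) and the function name `filter_short_lived` are assumptions made for the example.

```python
from datetime import datetime, timedelta

def filter_short_lived(detections, min_lifespan=timedelta(hours=24)):
    """Keep only IPs whose observed lifespan is at least `min_lifespan`.

    `detections` is assumed (hypothetically) to be an iterable of
    (ip, first_seen, last_seen) tuples, one per IP.
    """
    kept = {}
    for ip, first_seen, last_seen in detections:
        # Lifespan is taken as the gap between first and last verified detection.
        if last_seen - first_seen >= min_lifespan:
            kept[ip] = (first_seen, last_seen)
    return kept

# Documentation-range addresses (RFC 5737), purely for illustration.
sample = [
    ("192.0.2.1", datetime(2022, 1, 1, 0, 0), datetime(2022, 1, 1, 12, 0)),
    ("192.0.2.2", datetime(2022, 1, 1), datetime(2022, 1, 9)),
]
filtered = filter_short_lived(sample)
```

In this sample only `192.0.2.2` is retained, since the first address was observed for just 12 hours.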
Unsurprisingly, many open proxy services have short lifespans. For an open proxy IP with a detection age of at least 24 hours but less than a week, there is a 32.6% likelihood that the proxy goes permanently offline within one week. In this short-lived category the SOCKS proxy mortality is somewhat higher at 36.2%, compared with HTTP proxy mortality at 29.2%. Interestingly, at the opposite end of the life expectancy projection for this “new open proxy” category, the SOCKS proxies have a slightly higher likelihood of lasting another year than HTTP proxies (9.65% vs 7.69%, respectively).
Once open proxies survive past the first week, things start to settle. Proxies with a slightly longer active detection age of 7-13 days have an 83.2% chance of remaining active for at least one additional week. If an open proxy IP manages to remain active for at least 30 days, the chances improve even more: at this point the likelihood of its survival for at least one additional week goes up to 93.6%, and the likelihood of the proxy surviving a whole additional month is 77.6%.
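The figures above are conditional survival probabilities: given that a proxy has already reached a certain age, how likely is it to last a further period? A minimal sketch of such an estimate is shown below. It assumes each IP's total lifespan in days is known and ignores censoring (proxies still alive at the end of the observation window). It also conditions on a minimum age rather than on an age bracket such as 7-13 days, so it is not necessarily the exact method used in the study. The function name and the sample lifespans are made up for illustration.

```python
def survival_probability(lifespans_days, age_days, horizon_days):
    """Estimate P(lifespan >= age + horizon | lifespan >= age) by counting.

    Returns None when no proxy in the sample has reached `age_days`.
    """
    # Proxies that reached the given age form the "at risk" group.
    at_risk = [d for d in lifespans_days if d >= age_days]
    if not at_risk:
        return None
    # Of those, count the ones that lasted at least `horizon_days` longer.
    survivors = sum(1 for d in at_risk if d >= age_days + horizon_days)
    return survivors / len(at_risk)

# Hypothetical lifespans in days, not taken from the study's dataset.
lifespans = [2, 5, 8, 10, 15, 40, 100, 400]
p_week_after_7 = survival_probability(lifespans, age_days=7, horizon_days=7)
p_week_after_30 = survival_probability(lifespans, age_days=30, horizon_days=7)
```

With this toy sample, four of the six proxies that reached 7 days last another week, and all three that reached 30 days do.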
How would these semi-static registrations stack up against dynamic RBLs? Real-world blackhole lists naturally use custom expiration logic and methods to ensure good-quality data, but consider a hypothetical basic database which simply removes an IP after 1 week of inactivity. Mapped against the 2-year dataset, the coverage would look as per the table (Fig 2) below.
| Proxy activity age | Probability of another week of activity | Coverage vs simple RBL |
In the above hypothetical comparison, the coverage of a “semi-static” list which includes every IP from the youngest age bracket (0-6 days of activity) upwards would obviously be the same as the RBL’s database (ignoring for the moment the fact that a real-world RBL is updated in real time, and thus should always have more up-to-date information). The usefulness of a periodic IP or subnet dump might primarily be to serve as an optimisation: a hardcoded list of IPs does not require any additional network resources to look up against, nor is it subject to temporary outages in the way a real-time DNS lookup across the Internet could be. Once a proxy has been active for at least 7 days, the coverage against the RBL should still be a reasonable 67.5% (i.e. over two thirds of the likely active exploitation space). Beyond a 30-day lifespan the coverage tradeoff becomes a more questionable proposition. For example, an open proxy which has been actively exploitable for 90 days might on one hand have a high (96.6%) probability of being active for at least another week, but on the other hand such proxies are relatively rare: the coverage of such a subset compared to the hypothetical RBL would be as little as 29.5%.
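One possible way to compute such a coverage figure for a single snapshot date is sketched below. The study does not spell out its exact calculation, so this is an assumption-laden illustration: it treats each IP as continuously active between its first and last detection, models the hypothetical RBL as "every IP seen within the last 7 days", and defines the semi-static list as the subset of those IPs whose activity age meets a minimum threshold. The function name and record layout are hypothetical.

```python
from datetime import datetime, timedelta

def coverage_vs_simple_rbl(records, snapshot, min_age, expiry=timedelta(days=7)):
    """Fraction of the simple RBL's entries that an age-filtered list would cover.

    `records` maps ip -> (first_seen, last_seen); continuous activity between
    the two timestamps is assumed. Returns None if the RBL is empty.
    """
    # Hypothetical RBL: IPs already seen by `snapshot` and not yet expired.
    rbl = {
        ip for ip, (first, last) in records.items()
        if first <= snapshot and last >= snapshot - expiry
    }
    # Semi-static list: RBL entries whose activity age at snapshot is >= min_age.
    semi_static = {
        ip for ip in rbl
        if min(records[ip][1], snapshot) - records[ip][0] >= min_age
    }
    return len(semi_static) / len(rbl) if rbl else None

# Made-up records using documentation-range addresses.
records = {
    "192.0.2.1": (datetime(2023, 1, 1), datetime(2023, 3, 1)),
    "192.0.2.2": (datetime(2023, 2, 20), datetime(2023, 3, 10)),
    "192.0.2.3": (datetime(2023, 1, 1), datetime(2023, 1, 10)),
}
snapshot = datetime(2023, 3, 1)
cov_0 = coverage_vs_simple_rbl(records, snapshot, timedelta(days=0))
cov_30 = coverage_vs_simple_rbl(records, snapshot, timedelta(days=30))
```

With a zero-day threshold the two lists are identical, and raising the threshold trades coverage for a higher expected persistence of the remaining entries. Averaging the result over many snapshot dates would give a figure comparable in spirit to the table, though not necessarily identical to the author's numbers.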
As an interesting aside, the actuarial odds of lasting longer than a year appear to be noticeably better for SOCKS proxies than for HTTP proxies in every single age group (see Figs 3 and 4 above). For example, once a proxy reaches an age of 30 days, a SOCKS proxy has a 21.3% likelihood of surviving for at least one additional year, whereas the same probability for HTTP proxies is only 14.3%. Speculation on causation falls outside the scope of this particular study but could turn into an interesting area for future research.
Conclusions and Recommendations
As an active open proxy IP’s age increases, its statistical life expectancy also generally improves. Thus proxy age can serve as a useful probabilistic marker and predictor of future exploitation activity.
For the purposes of mitigating fraud and abuse via Internet traffic anonymisers, there could be a reasonable tradeoff in employing IP filters for IPv4 and IPv6 addresses which are likely to remain actively exploited in the near to medium-term future. Whilst real-time blackhole lists (RBLs) and other sources of real-time information should be a top priority in any mitigation stack, the resources of such detection systems could likely be used more efficiently by probabilistically filtering unwanted traffic upstream of them.
UPDATE: In a bid to facilitate such filtering experimentation, two separate proxy IP data feeds (7-day active and 30-day active proxies) have been made available at https://github.com/mannfredcom/semi-static-proxy-ips/. Updated weekly, the CSV feeds contain verified “semi-static” proxy exit IP data (both IPv4 and IPv6) with accompanying information. Proxy entry protocol and port are omitted from the data for safety reasons. Additionally, CSV lists of subnets with an unusually large number of unique anonymiser IPs are provided, as these may be useful for some use cases.
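As a rough illustration of how such a feed could be used for local filtering, the sketch below loads IPs from CSV text into an in-memory set and checks candidate addresses against it. The feed's actual file names and column layout are not described here, so the column name `ip` and the sample rows are assumptions and would need adjusting to the real files.

```python
import csv
import io
import ipaddress

def load_ip_set(csv_text, ip_column="ip"):
    """Parse CSV text and return a set of IPv4/IPv6 address objects.

    `ip_column` is a hypothetical column name; rows without a parsable
    address in that column are skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    ips = set()
    for row in reader:
        try:
            ips.add(ipaddress.ip_address(row[ip_column].strip()))
        except (KeyError, ValueError, AttributeError):
            continue  # missing column, empty cell, or malformed address
    return ips

def is_listed(ips, candidate):
    """Return True if `candidate` (a string) is in the loaded set."""
    return ipaddress.ip_address(candidate) in ips

# Made-up sample in an assumed layout, using documentation-range addresses.
sample_csv = (
    "ip,first_seen\n"
    "192.0.2.10,2024-01-01\n"
    "2001:db8::1,2024-01-02\n"
    "not-an-ip,2024-01-03\n"
)
listed = load_ip_set(sample_csv)
```

Parsing addresses with `ipaddress` rather than comparing raw strings means that differently formatted but equivalent IPv6 addresses still match, and the lookup needs no network round trip.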
The predictability of a proxy’s life expectancy could likely be improved substantially by modelling ASN-specific and protocol/port-specific profiles.
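A first step towards such profiles might be to compute the same kind of conditional survival estimate separately per group. The sketch below is hypothetical and not part of the study: it assumes each record carries a profile key (for instance an ASN and protocol pair) plus a total lifespan in days, and it ignores censoring and small-sample effects.

```python
from collections import defaultdict

def survival_by_profile(records, age_days, horizon_days):
    """Per-profile estimate of P(lifespan >= age + horizon | lifespan >= age).

    `records` is an iterable of (profile_key, lifespan_days) pairs, where
    profile_key is any hashable value such as an (asn, protocol) tuple.
    Profiles with no proxy reaching `age_days` are omitted from the result.
    """
    at_risk = defaultdict(int)
    survived = defaultdict(int)
    for key, lifespan in records:
        if lifespan >= age_days:
            at_risk[key] += 1
            if lifespan >= age_days + horizon_days:
                survived[key] += 1
    return {key: survived[key] / n for key, n in at_risk.items()}

# Made-up records; AS64500 and AS64501 are reserved documentation ASNs.
profile_records = [
    (("AS64500", "socks"), 40),
    (("AS64500", "socks"), 31),
    (("AS64501", "http"), 90),
    (("AS64501", "http"), 5),
]
profiles = survival_by_profile(profile_records, age_days=30, horizon_days=7)
```

In practice many profiles would contain only a handful of IPs, so some form of smoothing or a minimum sample size would probably be needed before relying on per-profile estimates.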
Additionally, the current detection data likely contains by-products of one-off mass exploitation events and botnet-related activities. Accounting for such outliers would likely lead to a more generalisable prediction model.