'packageRank' is an R package that helps put package download counts into context. It does so via two core functions, cranDownloads() and packageRank(), a set of filters that reduce download count inflation, and a host of other assorted functions.
You can read more about the package in the sections below:

I Download Counts describes how cranDownloads() gives cranlogs::cran_downloads() a more user-friendly interface and makes visualizing those data easy via its generic R plot() method.
II Download Rank Percentiles describes how packageRank() makes use of rank percentiles. This nonparametric statistic computes the percentage of packages with fewer downloads than yours: a package in the 74th percentile has more downloads than 74% of packages. This facilitates comparison and helps you locate your package in the overall distribution of CRAN package downloads.
III Inflation Filters describes four filter functions that remove software and behavioral artifacts that inflate nominal download counts. This functionality is available in packageRank() and packageLog().
IV Availability of Results discusses when results become available, how to use logInfo() to check their availability, and the effect of time zones.
V Data Fix A discusses two problems with download counts. The first stems from problems with the logs from the end of 2012 through the beginning of 2013. These are fixed in fixDate_2012() and fixCranlogs().
VI Data Fix B discusses a problem with 'cranlogs' that doubles or triples the number of R application downloads between 2023-09-13 and 2023-10-02. This is fixed in fixRCranlogs().
VII Data Note discusses the spike in the downloads of the Windows version of the R application on Sundays and Wednesdays between 06 November 2022 and 19 March 2023.

VIII et cetera discusses country code top-level domains (e.g., countryPackage() and packageCountry()), the use of memoization, and the internet connection timeout problem.
To install 'packageRank' from CRAN:
install.packages("packageRank")
To install the development version from GitHub:
# You may need to first install 'remotes' via install.packages("remotes").
remotes::install_github("lindbrook/packageRank", build_vignettes = TRUE)
cranDownloads()
uses all the same arguments as
cranlogs::cran_downloads()
:
cranlogs::cran_downloads(packages = "HistData")
> date count package
> 1 2020-05-01 338 HistData
The only difference is that cranDownloads()
adds four features:
cranDownloads(packages = "GGplot2")
## Error in cranDownloads(packages = "GGplot2") :
## GGplot2: misspelled or not on CRAN.
cranDownloads(packages = "ggplot2")
> date count cumulative package
> 1 2020-05-01 56357 56357 ggplot2
Note that this also works for inactive or 'retired' packages in the Archive:
cranDownloads(packages = "vr")
## Error in cranDownloads(packages = "vr") :
## vr: misspelled or not on CRAN/Archive.
cranDownloads(packages = "VR")
> date count cumulative package
> 1 2020-05-01 11 11 VR
With cranlogs::cran_downloads(), you specify a time frame using the from and to arguments. The downside of this is that you must use "yyyy-mm-dd". For convenience's sake, cranDownloads() also allows you to use "yyyy-mm" or yyyy ("yyyy" also works).
Let's say you want the download counts for 'HistData' for February 2020. With cranlogs::cran_downloads(), you'd have to type out the whole date and remember that 2020 was a leap year:
cranlogs::cran_downloads(packages = "HistData", from = "2020-02-01",
  to = "2020-02-29")
With cranDownloads(), you can just specify the year and month:
cranDownloads(packages = "HistData", from = "2020-02", to = "2020-02")
Let's say you want the download counts for 'rstan' for 2020. With cranlogs::cran_downloads(), you'd type something like:
cranlogs::cran_downloads(packages = "rstan", from = "2020-01-01",
  to = "2020-12-31")
With cranDownloads()
, you can use:
cranDownloads(packages = "rstan", from = 2020, to = 2020)
or
cranDownloads(packages = "rstan", from = "2020", to = "2020")
These additional date formats help to create convenient shortcuts. Let's say you want the year-to-date download counts for 'rstan'. With cranlogs::cran_downloads(), you'd type something like:
cranlogs::cran_downloads(packages = "rstan", from = "2023-01-01",
  to = Sys.Date() - 1)
With cranDownloads()
, you can just pass the current year to
from =
:
cranDownloads(packages = "rstan", from = 2023)
And if you wanted the entire download history, pass the current year to
to =
:
cranDownloads(packages = "rstan", to = 2023)
Note that the Posit/RStudio logs begin on 01 October 2012.
cranDownloads(packages = "HistData", from = "20190115",
to = "20190135")
## Error in resolveDate(to, type = "to") : Not a valid date.
cranDownloads(packages = "HistData", when = "lastweek")
> date count cumulative package
> 1 20200501 338 338 HistData
> 2 20200502 259 597 HistData
> 3 20200503 321 918 HistData
> 4 20200504 344 1262 HistData
> 5 20200505 324 1586 HistData
> 6 20200506 356 1942 HistData
> 7 20200507 324 2266 HistData
The "spell check" or validation of packages described above requires some additional background downloads. While those data are cached via the 'memoise' package, this does add time the first time cranDownloads() is run. For faster results, which bypass those features, set pro.mode = TRUE. The downside is that you'll see zero downloads for packages on dates before they're published on CRAN, you'll see zero downloads for misspelled/nonexistent packages, and you can't just use the to = argument alone.
For example, 'packageRank' was first published on CRAN on 2019-05-16 (you can verify this via packageHistory("packageRank")). If you use cranlogs::cran_downloads() or cranDownloads(pro.mode = TRUE) for dates before then, you'll see zero downloads:
cranDownloads("packageRank", from = "20190510", to = "20190516", pro.mode = TRUE)
> date count cumulative package
> 1 20190510 0 0 packageRank
> 2 20190511 0 0 packageRank
> 3 20190512 0 0 packageRank
> 4 20190513 0 0 packageRank
> 5 20190514 0 0 packageRank
> 6 20190515 0 0 packageRank
> 7 20190516 68 68 packageRank
You'll notice this particularly when you include newer packages in cranDownloads().
If you misspell a package:
cranDownloads("vr", from = "20190510", to = "20190516", pro.mode = TRUE)
> date count cumulative package
> 1 20190510 0 0 vr
> 2 20190511 0 0 vr
> 3 20190512 0 0 vr
> 4 20190513 0 0 vr
> 5 20190514 0 0 vr
> 6 20190515 0 0 vr
> 7 20190516 0 0 vr
If you just use to = without a value for from =, you'll get an error:
cranDownloads(to = 2024, pro.mode = TRUE)
Error: You must also provide a date for "from".
cranDownloads()
makes visualizing package downloads easy by using
plot()
:
plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"))
If you pass a vector of package names for a single day, plot() returns a dotchart:
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020-03-01", to = "2020-03-01"))
If you pass a vector of package names for multiple days, plot() uses 'ggplot2' facets:
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"))
To plot those data in a single frame, set multi.plot = TRUE
:
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"), multi.plot = TRUE)
To plot those data in separate plots on the same scale, set graphics = "base" and you'll be prompted for each plot:
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"), graphics = "base")
To do the above on separate, independent scales, set same.xy = FALSE
:
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"), graphics = "base", same.xy = FALSE)
To use the base 10 logarithm of the download count in a plot, set
log.y = TRUE
:
plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"),
log.y = TRUE)
Note that for the sake of the plot, zero counts are replaced by ones so that the logarithm can be computed (this does not affect the data returned by cranDownloads()).
cranlogs::cran_downloads(packages = NULL) computes the total number of package downloads from CRAN. You can plot these data by using:
plot(cranDownloads(from = 2019, to = 2019))
Note that I sometimes get a "Gateway Timeout (HTTP 504)" error when using this function for long time periods. This may be due to traffic but alternatively could be related to 'cranlogs' issue #56. As a workaround, annualDownloads() downloads the data for each year individually and then reassembles them into a single data frame. This, of course, takes more time but seems to be more reliable.
plot(annualDownloads(start.yr = 2013, end.yr = 2023))
Note that in the plot above, three historical outlier days are highlighted: "2014-11-17", "2018-10-21" and "2020-02-29". The first was due to a disproportionate download of six packages: 'BayHaz', 'clhs', 'GPseq', 'OPI', 'YaleToolkit' and 'survsim'. The second date was due to downloads of 'tidyverse' (~700x the second place package 'Rcpp'). The third is possibly related to some kind of scripting error that overlooked the fact that it was a leap day. You can validate this using packageLog().
cranlogs::cran_download(packages = "R")
computes the total number of
downloads of the R application (note that you can only use âRâ or a
vector of packages names, not both!). You can plot these data by using:
plot(cranDownloads(packages = "R", from = 2019, to = 2019))
If you want the total count of R downloads, set r.total = TRUE
:
plot(cranDownloads(packages = "R", from = 2019, to = 2019), r.total = TRUE)
Note that there have been spikes in downloads of the Windows version of R on Sundays (since Sunday 06 November 2022) and on Wednesdays (since Wednesday 18 January 2023); details below in R Windows Sunday and Wednesday downloads.
To add a smoother to your plot, use smooth = TRUE
:
plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
smooth = TRUE)
With graphs that use 'ggplot2', se = TRUE will add a confidence interval:
plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
  from = "2020", to = "2020-03-20"), smooth = TRUE, se = TRUE)
In general, loess is the chosen smoother. Note that with base graphics, lowess is used when there are 7 or fewer observations. Thus, to control the degree of smoothness, you'll typically use the span argument (the default is span = 0.75). With base graphics with 7 or fewer observations, you control the degree of smoothness using the f argument (the default is f = 2/3):
plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
from = "2020", to = "20200320"), smooth = TRUE, span = 0.75)
plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
from = "2020", to = "20200320"), smooth = TRUE, graphics = "ggplot2",
span = 0.33)
To annotate a graph with a package's release dates (base graphics only):
plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
package.version = TRUE)
To annotate a graph with R release dates:
plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
r.version = TRUE)
To plot growth curves, set statistic = "cumulative"
:
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "20200320"), statistic = "cumulative",
multi.plot = TRUE, points = FALSE)
To visualize a package's downloads relative to "all" other packages over time:
plot(cranDownloads(packages = "HistData", from = "2020", to = "2020-03-20"),
  population.plot = TRUE)
This longitudinal view plots the date (x-axis) against the base 10 logarithm of the selected package's download counts (y-axis). To get a sense of how the selected package's performance stacks up against "all" other packages, a set of smoothed curves representing a stratified random sample of packages is plotted in gray in the background (this is the "typical" pattern of downloads on CRAN for the selected time period).^{1}
The default unit of observation for both cranDownloads() and cranlogs::cran_downloads() is the day. The graph below plots the daily downloads for 'cranlogs' from 01 January 2022 through 15 April 2022.
plot(cranDownloads(packages = "cranlogs", from = 2022, to = "20220415"))
To view the data from a less granular perspective, change plot.cranDownloads()'s unit.observation argument from "day" to "week", "month", or "year".
The graph below plots the data aggregated by month (with an added smoother):
plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-04-15"),
  unit.observation = "month", smooth = TRUE, graphics = "ggplot2")
Three things to note. First, if the last/current month (far right) is still in progress (it's not yet the end of the month), that observation will be split in two: one point for the in-progress total (empty black square), another for the estimated total (empty red circle). The estimate is based on the proportion of the month completed. In the example above, the 635 observed downloads from April 1 through April 15 translate into an estimate of 1,270 downloads for the entire month (30 / 15 * 635). Second, if a smoother is included, it will only use "complete" observations, not in-progress or estimated data. Third, all points are plotted along the x-axis on the first day of the month.
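The in-progress estimate described above is simple proportional scaling. A minimal sketch of that arithmetic (the helper name is hypothetical, not part of 'packageRank'):

```r
# Estimate a full month's downloads from a partial count by scaling
# the observed total by the fraction of the month completed.
# Hypothetical helper for illustration; not a 'packageRank' function.
estimateMonthTotal <- function(observed, days.elapsed, days.in.month) {
  days.in.month / days.elapsed * observed
}

# April 2022: 635 downloads observed in the first 15 of 30 days.
estimateMonthTotal(observed = 635, days.elapsed = 15, days.in.month = 30)
#> [1] 1270
```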
The graph below plots the data aggregated by week (weeks begin on Sunday).
plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-06-15"),
  unit.observation = "week", smooth = TRUE)
Four things to note. First, if the first week (far left) is incomplete (the "from" date is not a Sunday), that observation will be split in two: one point for the observed total on the start date (gray empty square) and another point for the backdated total. Backdating involves completing the week by pushing the nominal start date back to include the previous Sunday (blue asterisk). In the example above, the nominal start date (01 January 2022) is moved back to include data through the previous Sunday (26 December 2021). This is useful because with a weekly unit of observation the first observation is likely to be truncated and would not give the most representative picture of the data. Second, if the last week (far right) is in progress (the "to" date is not a Saturday), that observation will be split in two: the observed total (gray empty square) and the estimated total based on the proportion of the week completed (red empty circle). Third, just like the monthly plot, smoothers only use complete observations, including backdated data but excluding in-progress and estimated data. Fourth, with the exception of the first week's observed count, which is plotted at its nominal date, points are plotted along the x-axis on Sundays, the first day of the week.
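The backdating step described above is just date arithmetic: move the nominal start date back to the Sunday that begins its week. A minimal sketch (the helper name is hypothetical, not part of 'packageRank'):

```r
# Push a date back to the Sunday that starts its week.
# format(x, "%w") returns the day of the week as 0-6, with Sunday = 0.
# Hypothetical helper for illustration; not a 'packageRank' function.
previousSunday <- function(date) {
  date <- as.Date(date)
  date - as.integer(format(date, "%w"))
}

# The nominal start date 01 January 2022 (a Saturday) backdates to
# Sunday 26 December 2021.
previousSunday("2022-01-01")
#> [1] "2021-12-26"
```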
For what it's worth, below are my go-to commands for graphs. They take advantage of the RStudio IDE's plot history panel, which allows you to cycle through and compare graphs. Typically, I'll look at the data for the last year or so at the three available units of observation: day, week and month. I use base graphics, via graphics = "base", to take advantage of prompts and "nicer" axes annotation. This also allows me to easily add graphical elements afterwards as needed, e.g., abline(h = 100, lty = "dotted").
plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023),
graphics = "base", package.version = TRUE, smooth = TRUE,
unit.observation = "day")
plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023),
graphics = "base", package.version = TRUE, smooth = TRUE,
unit.observation = "week")
# Note that I disable smoothing for monthly data
plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023),
graphics = "base", package.version = TRUE, smooth = FALSE,
unit.observation = "month")
Perhaps the biggest downside of using cranDownloads(pro.mode = TRUE) is that you might draw mistaken inferences when plotting the data, since it adds false zeroes to your data.
Using the example of 'packageRank', which was published on 2019-05-16:
plot(cranDownloads("packageRank", from = "2019-05", to = "2019-05",
  pro.mode = TRUE), smooth = TRUE)
plot(cranDownloads("packageRank", from = "2019-05", to = "2019-05",
  pro.mode = FALSE), smooth = TRUE)
After spending some time with nominal download counts, the "compared to what?" question will come to mind. For instance, consider the data for the 'cholera' package from the first week of March 2020:
plot(cranDownloads(packages = "cholera", from = "2020-03-01",
  to = "2020-03-07"))
Do Wednesday and Saturday reflect surges of interest in the package or surges of traffic to CRAN? To put it differently, how can we know if a given download count is typical or unusual?
To answer these questions, we can start by looking at the total number of package downloads:
plot(cranDownloads(from = "20200301", to = "20200307"))
Here we see that thereâs a big difference between the work week and the weekend. This seems to indicate that the download activity for âcholeraâ on the weekend seems high. Moreover, the Wednesday peak for âcholeraâ downloads seems higher than the midweek peak of total downloads.
One way to better address these observations is to locate your package's download counts in the overall frequency distribution of download counts. 'packageRank' allows you to do so via packageDistribution(). Below are the distributions of the logarithm of download counts for Wednesday and Saturday. Each vertical segment (along the x-axis) represents a download count. The height of a segment represents that download count's frequency. The location of 'cholera' in the distribution is highlighted in red.
plot(packageDistribution(package = "cholera", date = "2020-03-04"))
plot(packageDistribution(package = "cholera", date = "2020-03-07"))
While these plots give us a better picture of where 'cholera' is located, comparisons between Wednesday and Saturday are still impressionistic: all we can confidently say is that the download counts for both days were greater than the mode.
To facilitate interpretation and comparison, I use the rank percentile of a download count instead of the simple nominal download count. This nonparametric statistic tells you the percentage of packages that had fewer downloads. In other words, it gives you the location of your package relative to the locations of all other packages. More importantly, by rescaling download counts to lie on the bounded interval between 0 and 100, rank percentiles make it easier to compare packages within and across distributions.
For example, we can compare Wednesday ("2020-03-04") to Saturday ("2020-03-07"):
packageRank(package = "cholera", date = "20200304")
> date packages downloads rank percentile
> 1 20200304 cholera 38 5,556 of 18,038 67.9
On Wednesday, we can see that 'cholera' had 38 downloads, came in 5,556th place out of the 18,038 different packages downloaded, and earned a spot in the 68th percentile.
packageRank(package = "cholera", date = "20200307")
> date packages downloads rank percentile
> 1 20200307 cholera 29 3,061 of 15,950 80
On Saturday, we can see that 'cholera' had 29 downloads, came in 3,061st place out of the 15,950 different packages downloaded, and earned a spot in the 80th percentile.
So contrary to what the nominal counts tell us, one could say that interest in 'cholera' was actually greater on Saturday than on Wednesday.
To compute rank percentiles, I do the following. For each package, I tabulate the number of downloads and then compute the percentage of packages with fewer downloads. Here are the details, using 'cholera' from Wednesday as an example:
pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04")
downloads <- pkg.rank$freqtab
round(100 * mean(downloads < downloads["cholera"]), 1)
> [1] 67.9
To put it differently:
(pkgs.with.fewer.downloads <- sum(downloads < downloads["cholera"]))
> [1] 12250
(tot.pkgs <- length(downloads))
> [1] 18038
round(100 * pkgs.with.fewer.downloads / tot.pkgs, 1)
> [1] 67.9
In the example above, 38 downloads puts 'cholera' in 5,556th place among 18,038 observed packages. This rank is "nominal" because it's possible that multiple packages have the same number of downloads. As a result, a package's nominal rank, but not its rank percentile, can be affected by its name. For example, because packages with the same number of downloads are sorted in alphabetical order, 'cholera' benefits from the fact that it is 31st in the list of 263 packages with 38 downloads:
pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04")
downloads <- pkg.rank$freqtab
which(names(downloads[downloads == 38]) == "cholera")
> [1] 31
length(downloads[downloads == 38])
> [1] 263
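The tie-breaking point can be seen with a toy frequency table: packages tied on downloads get different nominal (alphabetical) ranks but identical rank percentiles. The counts below are invented for illustration:

```r
# Toy named vector of download counts, with 'cholera' and 'pkgB'
# tied at 38 downloads. These counts are invented for illustration.
toy.downloads <- c(pkgA = 100, cholera = 38, pkgB = 38, pkgC = 2)

# Rank percentile: percentage of packages with fewer downloads.
# The tied packages share the same percentile.
round(100 * mean(toy.downloads < toy.downloads["cholera"]), 1)
#> [1] 25
round(100 * mean(toy.downloads < toy.downloads["pkgB"]), 1)
#> [1] 25
```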
To visualize packageRank()
, use plot()
.
plot(packageRank(packages = "cholera", date = "2020-03-04"))
plot(packageRank(packages = "cholera", date = "2020-03-07"))
The graphs above, which are customized here to be on the same scale, plot the rank order of packages' download counts (x-axis) against the logarithm of those counts (y-axis). Each highlights (in red) a package's position in the distribution along with its rank percentile and download count. In the background, the 75th, 50th and 25th percentiles are plotted as dotted vertical lines. The package with the most downloads, 'magrittr' in both cases, is at the top left (in blue). The total number of downloads is at the top right (in blue).
'cranlogs' computes the number of package downloads by simply counting log entries. While straightforward, this approach can run into problems. Putting aside the question of whether package dependencies should be counted, what I have in mind here is what I believe to be two types of "invalid" log entries. The first, a software artifact, stems from entries that are smaller, often orders of magnitude smaller, than a package's actual binary or source file. The second, a behavioral artifact, emerges from efforts to download all of CRAN. In both cases, a reliance on nominal counts will give you an inflated sense of the degree of interest in your package. For those interested, an early but detailed analysis and discussion of both types of inflation is included as part of this R-hub blog post.
When looking at package download logs, the first thing you'll notice are wrongly sized log entries. They come in two sizes. The "small" entries are approximately 500 bytes in size. The "medium" entries vary in size, falling somewhere between a "small" entry and a full download (i.e., "small" <= "medium" <= full download). "Small" entries manifest themselves as standalone entries, paired with a full download, or as part of a triplet alongside a "medium" and a full download. "Medium" entries manifest themselves as either standalone entries or as part of a triplet.
The example below illustrates a triplet:
packageLog(date = "20200701")[4:6, (4:6)]
> date time size package version country ip_id
> 3998633 20200701 07:56:15 99622 cholera 0.7.0 US 4760
> 3999066 20200701 07:56:15 4161948 cholera 0.7.0 US 4760
> 3999178 20200701 07:56:15 536 cholera 0.7.0 US 4760
The "medium" entry is the first observation (99,622 bytes). The full download is the second entry (4,161,948 bytes). The "small" entry is the last observation (536 bytes). At a minimum, what makes a triplet a triplet (or a pair a pair) is that all members share system configuration (e.g., IP address, etc.) and have identical or adjacent time stamps.
To deal with the inflationary effect of "small" entries, I filter out observations smaller than 1,000 bytes (the smallest package on CRAN appears to be 'LifeInsuranceContracts', whose source file weighs in at 1,100 bytes). "Medium" entries are harder to handle. I remove them using a filter function that looks up a package's actual size.
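The "small" entry rule amounts to dropping log rows below a size threshold. A minimal sketch on a toy data frame mimicking the triplet shown above (the column names follow the packageLog() output; the 1,000-byte threshold matches small.filter):

```r
# Toy log: the 'cholera' triplet from the example above, i.e., a
# "medium" entry, the full download and a "small" entry.
toy.log <- data.frame(size = c(99622, 4161948, 536),
                      package = "cholera",
                      version = "0.7.0")

# small.filter's rule: keep only entries of at least 1,000 bytes.
# This drops the 536-byte "small" entry.
toy.log[toy.log$size >= 1000, ]
```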
While wrongly sized entries are fairly easy to spot, seeing the effect of efforts to download all of CRAN requires a change of perspective. While details and further evidence can be found in the R-hub blog post mentioned above, I'll illustrate the problem with the following example:
packageLog(packages = "cholera", date = "20200731")[8:14, (4:6)]
> date time size package version country ip_id
> 132509 20200731 21:03:06 3797776 cholera 0.2.1 US 14
> 132106 20200731 21:03:07 4285678 cholera 0.4.0 US 14
> 132347 20200731 21:03:07 4109051 cholera 0.3.0 US 14
> 133198 20200731 21:03:08 3766514 cholera 0.5.0 US 14
> 132630 20200731 21:03:09 3764848 cholera 0.5.1 US 14
> 133078 20200731 21:03:11 4275831 cholera 0.6.0 US 14
> 132644 20200731 21:03:12 4284609 cholera 0.6.5 US 14
Here, we see that seven different versions of the package were downloaded as a sequential bloc. A little digging shows that these seven versions represent all versions of âcholeraâ available on that date:
packageHistory(package = "cholera")
> Package Version Date Repository
> 1 cholera 0.2.1 20170810 Archive
> 2 cholera 0.3.0 20180126 Archive
> 3 cholera 0.4.0 20180401 Archive
> 4 cholera 0.5.0 20180716 Archive
> 5 cholera 0.5.1 20180815 Archive
> 6 cholera 0.6.0 20190308 Archive
> 7 cholera 0.6.5 20190611 Archive
> 8 cholera 0.7.0 20190828 CRAN
While there are "legitimate" reasons for downloading past versions (e.g., research, container-based software distribution, etc.), I'd argue that examples like the above are "fingerprints" of efforts to download CRAN. While this is not necessarily problematic, it does mean that when your package is downloaded as part of such efforts, that download is more a reflection of an interest in CRAN itself (a collection of packages) than of an interest in your package per se. And since one of the uses of counting package downloads is to assess interest in your package, it may be useful to exclude such entries.
To do so, I try to filter out these entries in two ways. The first identifies IP addresses that download "too many" packages and then filters out campaigns, large blocs of downloads that occur in (nearly) alphabetical order. The second looks for campaigns not associated with "greedy" IP addresses and filters out sequences of past versions downloaded in a narrowly defined time window.
To get an idea of how inflated your package's download count may be, use filteredDownloads(). Below are the results for 'ggplot2' for 15 September 2021.
filteredDownloads(package = "ggplot2", date = "2021-09-15")
> date package downloads filtered.downloads delta inflation
> 1 2021-09-15 ggplot2 113842 111662 2180 1.95 %
While there were 113,842 nominal downloads, applying all the filters reduced that number to 111,662, an inflation of 1.95%.
Excluding the time it takes to download the log file (typically the bulk of the computation time), the above example takes approximately 15 additional seconds to run on a single core of a 3.1 GHz Dual-Core Intel Core i5 processor.
There are 4 filters. You can control them using the following arguments (listed in order of application):

- ip.filter: removes campaigns of "greedy" IP addresses.
- small.filter: removes entries smaller than 1,000 bytes.
- sequence.filter: removes blocs of past versions.
- size.filter: removes entries smaller than a package's binary or source file.
For filteredDownloads()
, they are all on by default. For
packageLog()
and packageRank()
, they are off by default. To apply
them, simply set the argument for the filter you want to TRUE:
packageRank(package = "cholera", small.filter = TRUE)
Alternatively, for packageLog()
and packageRank()
you can simply set
all.filters = TRUE
.
packageRank(package = "cholera", all.filters = TRUE)
Note that all.filters = TRUE is contextual. Depending on the function used, you'll get either the CRAN-specific or the package-specific set of filters. The former sets ip.filter = TRUE and small.filter = TRUE; it works independently of packages at the level of the entire log. The latter sets sequence.filter = TRUE and size.filter = TRUE; it relies on package-specific information (e.g., the size of the source or binary file).
Ideally, we'd like to use both sets. However, the package-specific set is computationally expensive because those filters need to be applied individually to all packages in the log, which can involve tens of thousands of packages. While not unfeasible, this currently takes a long time. For this reason, when all.filters = TRUE, packageRank(), ipPackage(), countryPackage(), countryDistribution() and packageDistribution() use only CRAN-specific filters, while packageLog(), packageCountry(), and filteredDownloads() use both CRAN- and package-specific filters.
To understand when results become available, you need to be aware that 'packageRank' has two upstream, online dependencies. The first is Posit/RStudio's CRAN package download logs, which record traffic to the "0-Cloud" mirror at cloud.r-project.org (formerly Posit/RStudio's CRAN mirror). The second is Gábor Csárdi's 'cranlogs' R package, which uses those logs to compute the download counts of both the R application and R packages.
The CRAN package download logs for the previous day are typically posted by 17:00 UTC. The results for 'cranlogs' usually become available soon thereafter (sometimes as much as a day later).
Occasionally, problems with "today's" data can emerge due to the upstream dependencies (illustrated below).
CRAN Download Logs > 'cranlogs' > 'packageRank'
If there's a problem with the logs (e.g., they're not posted on time), both 'cranlogs' and 'packageRank' will be affected. If this happens, you'll see things like unexpected zero counts for your package(s) (actually, you'll see a zero download count for both your package and for all of CRAN), data from "yesterday", or a "Log is not (yet) on the server" error message.
'cranlogs' > packageRank::cranDownloads()
If there's a problem with 'cranlogs' but not with the logs, only packageRank::cranDownloads() will be affected. In that case, you might get a warning that only "previous" results will be used. All other 'packageRank' functions should work since they either directly access the logs or use some other source. Usually, these errors resolve themselves the next time the underlying scripts are run ("tomorrow", if not sooner).
To check the status of the download logs and 'cranlogs', use logInfo(). This function checks whether 1) "today's" log is posted on Posit/RStudio's server and 2) "today's" results have been computed by 'cranlogs'.
logInfo()
$`Today's log/result`
[1] "20230201"
$`Today's log posted?`
[1] "Yes"
$`Today's results on 'cranlogs'?`
[1] "No"
$status
[1] "Today's log is typically posted by 09:00 PST (01 Feb 17:00 GMT)."
Because you're typically interested in today's log file, another thing that affects availability is your time zone. For example, let's say that it's 09:01 on 01 January 2021 and you want to compute the rank percentile for 'ergm' for the last day of 2020. You might be tempted to use the following:
packageRank(packages = "ergm")
However, depending on where you make this request, you may not get the data you expect. In Honolulu, USA, you will. In Sydney, Australia, you won't. The reason is that you've somehow forgotten a key piece of trivia: Posit/RStudio typically posts yesterday's log around 17:00 UTC the following day.
The expression works in Honolulu because 09:01 HST on 01 January 2021 is 19:01 UTC on 01 January 2021. So the log you want has been available for 2 hours. The expression fails in Sydney because 09:01 AEDT on 01 January 2021 is 22:01 UTC on 31 December 2020. The log you want won't actually be available for another 19 hours.
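You can verify this arithmetic with base R's time zone support (the zone names below assume a standard tz database is installed):

```r
# 09:01 local time on 01 January 2021, in Honolulu and in Sydney.
honolulu <- as.POSIXct("2021-01-01 09:01", tz = "Pacific/Honolulu")
sydney <- as.POSIXct("2021-01-01 09:01", tz = "Australia/Sydney")

# In UTC, Honolulu is already past the typical 17:00 UTC posting
# time; Sydney is still on 31 December.
format(honolulu, "%Y-%m-%d %H:%M", tz = "UTC")
#> [1] "2021-01-01 19:01"
format(sydney, "%Y-%m-%d %H:%M", tz = "UTC")
#> [1] "2020-12-31 22:01"
```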
To make life a little easier, 'packageRank' does two things. First, when the log for the date you want is not available (due to time zone rather than server issues), you'll just get the last available log. If you specified a date in the future, you'll either get an error message or a warning with an estimate of when the log you want should be available.
Using the Sydney example and the expression above, you'd get the results for 30 December 2020:
packageRank(packages = "ergm")
> date packages downloads rank percentile
> 1 2020-12-30 ergm 292 873 of 20,077 95.6
If you had specified the date, youâd get an additional warning:
packageRank(packages = "ergm", date = "2021-01-01")
>         date packages downloads           rank percentile
> 1 2020-12-30     ergm       292  873 of 20,077       95.6
Warning message:
2020-12-31 log arrives in approx. 19 hours at 02 Jan 04:00 AEDT. Using last available!
Keep in mind that 17:00 UTC is not a hard deadline. Barring server issues, the logs are usually posted a little before that time. I don't know when the script starts, but the posting time seems to be a function of the number of entries: closer to 17:00 UTC when there are more entries (e.g., weekdays); before 17:00 UTC when there are fewer entries (e.g., weekends). Again, barring server issues, the 'cranlogs' results are usually available shortly after 17:00 UTC.
Here's what you'd see using the Honolulu example:
logInfo(show.available = TRUE)
$`Today's log/result`
[1] "2020-12-31"
$`Today's log posted?`
[1] "Yes"
$`Today's results on 'cranlogs'?`
[1] "Yes"
$`Available log/result`
[1] "Posit/RStudio (2020-12-31); 'cranlogs' (2020-12-31)."
$status
[1] "Everything OK."
The function uses your local time zone, which depends on R's ability to compute your local time and time zone (e.g., Sys.time() and Sys.timezone()). My understanding is that there may be operating system or platform-specific issues that could undermine this.
'packageRank' fixes two data problems. The first addresses a problem that affects logs from when the data were first collected (late 2012 through the beginning of 2013). To understand the problem, we need to know that the Posit/RStudio download logs, which begin on 01 October 2012, are stored as separate files whose name/URL embeds the date:
http://cran-logs.rstudio.com/2022/2022-01-01.csv.gz
For the logs in question, this convention was broken in three ways: 1) some logs are effectively duplicated (same log, multiple names), 2) at least one is mislabeled, and 3) the logs from 13 October through 28 December are offset by +3 days (e.g., the file with the name/URL "2012-12-01" contains the log for "2012-11-28"). As a result, we actually lose the last three logs of 2012. Details are available here.
Unsurprisingly, all this leads to erroneous download counts. What is surprising is that these errors are compounded by how 'cranlogs' computes package downloads.
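The +3 day offset can be illustrated with simple date arithmetic. This is only a sketch of the idea behind the remapping; the actual correction is implemented in fixDate_2012():

```r
# For affected late-2012 dates, the file named for a given date holds
# the log from three days earlier. Sketch of the remapping idea only.
requested <- as.Date("2012-12-01")
requested - 3  # the log actually contained in the "2012-12-01" file
# [1] "2012-11-28"
```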
'packageRank' functions like packageRank() and packageLog() are affected by the second and third defects (mislabeled and offset logs) because they access logs via their filename/URL. fixDate_2012() addresses the problem by remapping problematic logs so that you get the log you expect.
While unaffected by the second and third defects, functions that rely on cranlogs::cran_downloads() (e.g., packageRank::cranDownloads(), 'adjustedcranlogs' and 'dlstats') are susceptible to the first defect (duplicate names). My understanding is that this is because 'cranlogs' uses the date in a log rather than the filename/URL to retrieve logs.
To put it differently, 'cranlogs' can't detect multiple instances of logs with the same date. I found 3 logs with duplicate filename/URLs, and 5 additional instances of overcounting (including one of tripling).
fixCranlogs() addresses this overcounting problem by recomputing the download counts using the actual log(s) when any of the eight problematic dates are requested. Details about the 8 days and fixCranlogs() can be found here.
The second data problem is of more recent vintage. From 2023-09-13 through 2023-10-02, the download counts for the R application returned by cranlogs::cran_downloads(packages = "R") are, with two exceptions, twice what one would expect when looking at the actual log(s). The two exceptions are: 1) 2023-09-28, where the counts are identical but for a "rounding error" possibly due to an NA, and 2) 2023-09-30, where there is actually a threefold difference.
Here are the relevant ratios of counts comparing 'cranlogs' results with counts based on the underlying logs:
    2023-09-12 2023-09-13 2023-09-14 2023-09-15 2023-09-16 2023-09-17 2023-09-18 2023-09-19
osx          1          2          2          2          2          2          2          2
src          1          2          2          2          2          2          2          2
win          1          2          2          2          2          2          2          2
    2023-09-20 2023-09-21 2023-09-22 2023-09-23 2023-09-24 2023-09-25 2023-09-26 2023-09-27
osx          2          2          2          2          2          2          2          2
src          2          2          2          2          2          2          2          2
win          2          2          2          2          2          2          2          2
    2023-09-28 2023-09-29 2023-09-30 2023-10-01 2023-10-02 2023-10-03
osx   1.000000          2          3          2          2          1
src   1.000801          2          3          2          2          1
win   1.000000          2          3          2          2          1
Details and code for replication can be found in issue #69.
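Ratios like these can, in principle, be computed by comparing what 'cranlogs' reports against a tally of the raw R application log for the same day. The sketch below assumes the Posit/RStudio URL pattern for R logs (the "-r.csv.gz" suffix) and the presence of an os column in the log; treat both as assumptions rather than verified facts:

```r
# Compare 'cranlogs' counts for R with counts tallied from the raw log.
# URL pattern and column names are assumptions, not verified here.
day <- "2023-09-15"
url <- paste0("http://cran-logs.rstudio.com/", substr(day, 1, 4), "/",
              day, "-r.csv.gz")
raw <- data.table::fread(url)
log.count <- table(raw$os)  # downloads per platform in the actual log
cl <- cranlogs::cran_downloads(packages = "R", from = day, to = day)
cl.count <- tapply(cl$count, cl$os, sum)
cl.count[names(log.count)] / log.count  # ~2 during the affected window
```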
fixRCranlogs() corrects the problem.
Note that there was a similar issue for package download counts around the same period, but that is now fixed in 'cranlogs'. For details, see issue #68.
The graph above of R downloads shows the daily downloads of the R application broken down by platform (Mac, Source, Windows). In it, you can see the typical pattern of midweek peaks and weekend troughs.
Between 06 November 2022 and 19 March 2023, this pattern was broken. On Sundays (06 November 2022 to 19 March 2023) and Wednesdays (18 January 2023 to 15 March 2023), there were noticeable, repeated orders-of-magnitude spikes in the daily downloads of the Windows version of R.
plot(cranDownloads("R", from = "2022-10-06", to = "2023-04-14"))
axis(3, at = as.Date("2022-11-06"), labels = "2022-11-06", cex.axis = 2/3,
  padj = 0.9)
axis(3, at = as.Date("2023-03-19"), labels = "2023-03-19", cex.axis = 2/3,
  padj = 0.9)
abline(v = as.Date("2022-11-06"), col = "gray", lty = "dotted")
abline(v = as.Date("2023-03-19"), col = "gray", lty = "dotted")
These download spikes did not seem to affect either the Mac or Source versions. I show this in the graphs below. Each plot, which is individually scaled, breaks down the data in the graph above by day (Sunday or Wednesday) and platform.
The key thing is to compare the data in the period bounded by the vertical dotted lines with the data before and after. If a Sunday or Wednesday is orders-of-magnitude unusual, I plot that day with a filled rather than an empty circle. Only Windows, in the final two graphs below, earns this distinction.
For those interested in directly using the download logs, this section describes some issues that may be of use.
While the IP addresses in the Posit/RStudio logs are anonymized, packageCountry() and countryPackage() make use of the fact that the logs include ISO country codes or top-level domains (e.g., AT, JP, US).
Note that coverage extends to only about 85% of observations (approximately 15% of country codes are NA), and that there seem to be a couple of typos for country codes: "A1" (A + the number one) and "A2" (A + the number two). According to Posit/RStudio's documentation, this coding was done using MaxMind's free database, which no longer seems to be available and may be a bit out of date.
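To see these codes yourself, you can tabulate the country field of a raw log. The snippet below assumes the standard log URL pattern and a country column (both assumptions about the log format):

```r
# Tabulate country codes in one day's package download log, keeping NAs.
# URL pattern and 'country' column are assumptions about the log format.
url <- "http://cran-logs.rstudio.com/2022/2022-01-01.csv.gz"
cran_log <- data.table::fread(url)
head(sort(table(cran_log$country, useNA = "ifany"), decreasing = TRUE))
```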
To avoid the bottleneck of downloading multiple log files, packageRank() is currently limited to individual calendar dates. To reduce the bottleneck of re-downloading logs, which can approach 100 MB, 'packageRank' makes use of memoization via the 'memoise' package. Here's the relevant code:
fetchLog <- function(url) data.table::fread(url)
mfetchLog <- memoise::memoise(fetchLog)

if (RCurl::url.exists(url)) {
  cran_log <- mfetchLog(url)
}
# Note that data.table::fread() relies on R.utils::decompressFile().
This means that logs are intelligently cached; those that have already been downloaded in your current R session will not be downloaded again.
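The effect of memoization is easy to see with a toy function: the first call pays the full cost, and subsequent calls with the same argument return from the cache. This is purely illustrative; slow_fetch() is not part of 'packageRank':

```r
# Toy illustration of memoise: the second call is served from the cache.
slow_fetch <- function(x) { Sys.sleep(2); x }
fast_fetch <- memoise::memoise(slow_fetch)
system.time(fast_fetch("2022-01-01"))  # ~2 seconds: does the work
system.time(fast_fetch("2022-01-01"))  # ~0 seconds: cached result
```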
With R 4.0.3, the timeout value for internet connections became more explicit. Here are the relevant details from that release's "New features":
The default value for options("timeout") can be set from environment variable
R_DEFAULT_INTERNET_TIMEOUT, still defaulting to 60 (seconds) if that is not set
or invalid.
This change can affect functions that download logs, especially over slower internet connections or when you're dealing with large log files. To fix this, fetchCranLog() will, if needed, temporarily set the timeout to 600 seconds.
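A common pattern for this kind of temporary change is to save the old option, raise it, and restore it on exit. A minimal sketch of that pattern (the actual internals of fetchCranLog() may differ):

```r
# Temporarily raise options("timeout") for a large download, restoring
# the previous value afterward. A sketch; not packageRank's actual code.
fetch_with_timeout <- function(url, timeout = 600) {
  old <- options(timeout = max(timeout, getOption("timeout")))
  on.exit(options(old), add = TRUE)
  data.table::fread(url)
}
```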
Footnotes

Specifically, within each 5% interval of rank percentiles (e.g., 0 to 5, 5 to 10, ..., 95 to 100), a random sample of 5% of packages is selected and tracked.