Agricultural Development Data Curation

Welcome to EPAR’s data access page. Here you will find links to agricultural survey data, more than 150 indicators constructed from these data disaggregated by gender, crop, country, farm type, etc.; the cleaning and construction decisions; code that will allow you to replicate, alter or tailor the indicators with different cleaning and construction decisions; and visualizations of the indicators.

EPAR has developed STATA do-files for the construction of a set of agricultural development indicators using data from the Living Standards Measurement Study – Integrated Surveys on Agriculture (LSMS-ISA) as well as multiple other surveys. These LSMS-ISA data and additional information on each survey can be downloaded from the World Bank. Our analysis focuses on three LSMS-ISA surveys: the Ethiopia ESS, the Nigeria GHS, and the Tanzania NPS.

To supplement the LSMS-ISA data, which are weighted to be nationally representative, we also use data from four additional Agricultural Development Baseline (AgDev) surveys. These surveys let us create estimates that are representative at the regional or state level rather than the national level. The four surveys are:

Ethiopia Agricultural Commercialization Clusters Survey (ACC)
India Rice Monitoring Survey – 2016 (RMS)
Nigeria Baseline and Varietal Monitoring Survey (NIBAS)
Tanzania Baseline and Varietal Monitoring Survey (TBS)

Currently, these data are only available as estimates in the Indicator Estimate spreadsheet.

We are sharing our code and documenting our construction decisions both to facilitate analyses of these rich datasets and to make estimates of relevant indicators available to a broader audience of potential users. Our final estimates are available through AgQuery, our online data tool. These estimates are for rural households only, but the filter can be removed or changed in the .do files; links to the code are available under Code Distribution. Estimates for Nigerian yields are being revised to account for potential errors in conversion factors. Our general construction decisions are described below, and the data we generate are available in the Data Dissemination section. To find out more about this project, please visit EPAR’s Technical Report #335.

Published Research

Beyond the indicators themselves, these data and the process of creating them have led to published papers written by EPAR staff.

Wineman, A., Njagi, T., Anderson, C. L., Reynolds, T., Wainaina, P., Njue, E., Biscaye, P., Ayieko, M. W. (2020). “A case of mistaken identity? Measuring rates of improved seed adoption in Tanzania using DNA fingerprinting,” Journal of Agricultural Economics. https://doi.org/10.1111/1477-9552.12368
Wineman, A., Anderson, C. L., Reynolds, T., Biscaye, P. (2019) “Methods of crop yield measurement on multi-cropped plots: Examples from Tanzania,” Food Security. https://doi.org/10.1007/s12571-019-00980-5
Anderson, L., Reynolds, T. (2019). Measurement choices with consequences: How we measure yield, crop diversity and smallholders can mischaracterize contributions of agrobiodiversity to smallholder livelihoods. Invited contribution to Mainstreaming Agrobiodiversity in Sustainable Food Systems, Bioversity International.

Visualizations

We have several interactive visualizations which utilize these data to investigate questions about smallholder farmers, household categorizations, and the stability of indicators over time. Examples of visualizations created using these data include:

Sixteen Categories of African Agricultural Households: This divides households into categories using four binary measures (poverty, asset base, market orientation, and diversification) and shows how the categories change over time.
Year Over Year Threshold Variability: This examines the movement of households across different thresholds (e.g. 2 or 4 hectares) using several common indicators (e.g. land) used in definitions of smallholder households.
Detailed Household Comparisons: These allow users to input their own smallholder definitions and compare them for particular countries. Currently we have:

For more visualizations, including some using non-LSMS-ISA data, please check our Visualizations page.

Data Dissemination

Part of EPAR’s mission includes a “commitment to open access tools and broadly accessible dissemination.” As part of this mission, we release all of the data we create for further research, including those based on the series of LSMS-ISA data released by the World Bank. To further aid in this goal we have created AgQuery, our online data tool that can be used to view our set of indicators online or download subsets of the data as a CSV file.

The STATA .dta files generated by our data curation project can be downloaded below:

Download the data from all LSMS-ISA countries
Download the Indicator Estimates Spreadsheet (Excel)

Code Distribution

We created these estimates using the STATA software package. Our code is available below, either as a direct download or from our GitHub page.

AgQuery was created using Python and can easily be modified for your own data projects. The code is available below through direct download or our GitHub repository:

General Construction Decisions

General Data Cleaning

We take the cleaned data as given from the World Bank LSMS-ISA team. In the process of working with the data, however, we address any illogical entries that surface (that is, entries that make it difficult to merge data files or that produce impossible values for the final indicators), though this is rare.

Examples:

Where both a resident man and woman are listed as married heads of household (i.e., the spouse is categorized as another head), we recategorize the woman as a spouse, not another head.
Where respondents reported the number of animals sold and the value received in a manner that indicates these columns had been accidentally switched (e.g., 150 cows sold for one Ethiopian Birr), we switch these values back before estimating livestock income.

Outliers and Winsorizing

Extreme outliers can influence the average value of an indicator. Outliers are dealt with in a variety of ways including trimming, winsorizing, replacing with the median or mean, multiple imputation, etc. The choice of method to deal with outliers can make an important difference in the result, depending on the variable’s distribution.

We apply the same approach to potential outlier observations across all indicators. Before computing the summary statistics of final constructed indicators, we identify outliers using the 1st and 99th percentiles of the indicator’s distribution and winsorize values that fall beyond these thresholds. For example, we apply the value at the 99th percentile to any observation that is larger than this threshold. Depending on what is logical for a given indicator, we winsorize at both the 1st and 99th percentiles or at the 99th percentile only (if there is no illogically small value for the variable).

In addition to winsorizing the top 1% of final indicators, we winsorize the bottom 1% of non-zero values for selected indicators: area harvested, area planted, family labor days, hired labor days, and income components (net crop income, net livestock income, net fishing income, non-agricultural wage income, agricultural wage income, net self-employment income, transfers income, and “other” income). For these indicators, we take all observations below the 1st percentile of the distribution and set them to the value at the 1st percentile.

When constructing ratios (e.g., yield, labor productivity), we winsorize the numerator and denominator before taking the ratio (as, in most cases, we report estimates for the numerator and denominator variables separately), and we also winsorize the top 1% of the constructed ratio. We do not winsorize variables that represent proportions, and we do not winsorize by sub-population.

Specific winsorizing decisions are documented in the “Winsorizing Decision” column of the “Summ. of Indicator Construction” tab in the Indicator Estimates Spreadsheet linked above (filename EPAR_UW_335_AgDev_Indicator_Estimates.xlsx). After winsorizing, we generate new winsorized versions of the variables with the prefix “w_” to distinguish them from the un-winsorized variables. Both winsorized and un-winsorized versions of the variables are included in our final .dta files.
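
The winsorizing rules above can be sketched in a few lines of Python. This is an illustration only, not EPAR's STATA code: the function names, the linear-interpolation percentile, the unweighted thresholds, and the sample values are all assumptions made for the example.

```python
def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation.

    Assumption: STATA's percentile definition may differ slightly.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def winsorize(values, lower_pct=None, upper_pct=99.0, nonzero_lower=True):
    """Clamp values to percentile thresholds.

    lower_pct=None winsorizes the top tail only. With nonzero_lower=True the
    lower threshold is computed from non-zero values and zeros are left as-is.
    """
    high = percentile(values, upper_pct)
    low = None
    if lower_pct is not None:
        base = [v for v in values if v != 0] if nonzero_lower else list(values)
        if base:
            low = percentile(base, lower_pct)
    result = []
    for v in values:
        if v > high:
            v = high
        elif low is not None and v < low and not (nonzero_lower and v == 0):
            v = low
        result.append(v)
    return result


# Hypothetical indicator: 100 plausible values plus one extreme outlier.
net_crop_income = list(range(1, 101)) + [10_000]
# The "w_" prefix mirrors the naming convention for winsorized variables.
w_net_crop_income = winsorize(net_crop_income, lower_pct=1)
```

For a ratio such as yield, the same helper would be applied to the numerator and the denominator first, and then once more to the ratio itself with `lower_pct=None`.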

Particularly for estimating the average value of a variable, the method chosen for dealing with outliers can make a considerable difference.

Also, as a result of following the same cleaning and winsorizing protocols across indicators and instruments, some estimates may be affected by large outlier observations that remain after winsorizing. The means for certain estimates may therefore appear abnormally high. For this reason, we encourage users to consider the median value and the overall distribution of the variable.

Weighting

We apply survey weights provided for the three LSMS-ISA surveys (Ethiopia ESS, Nigeria GHS, and Tanzania NPS) and three of the four AgDev surveys (India RMS, Tanzania TBS, Nigeria NIBAS) to generate representative estimates. Survey weights were not available for the Ethiopia ACC AgDev Baseline survey. Weights are used to ensure that statistics estimated with the sample are unbiased estimates of the population parameters. Thus, weighted statistics can only be reported at geographical levels where the sample is representative of the population. In the AgDev instruments the sample is representative at the state or zone level and statistics are reported by state or zone. In the LSMS-ISA, the sample is nationally representative but not always regionally representative; thus, we only report the statistics at the national level. For the Ethiopia ACC, where no survey weights are available, we implicitly use a weight of “1” for each observation.

In addition to household weights, we also apply area weights, animal weights, or individual weights to selected indicators to report representative estimates for a hectare, animal, or individual, respectively. These weights are constructed by multiplying the household weight by the given object, e.g., hectares of area or number of animals. The weights for each estimate are noted in the “Weight” column of the “Summ. of Indicator Construction” tab of the Indicator Estimates spreadsheet linked above.
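
As a minimal sketch of how an area weight differs from a household weight, the Python snippet below computes a weighted mean of yield two ways. The field names and numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def weighted_mean(values, weights):
    """Weighted mean of values; raises if the weights sum to zero."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical households: survey weight, cultivated hectares, yield (kg/ha).
households = [
    {"hh_weight": 100.0, "hectares": 2.0, "yield_kg_ha": 1000.0},
    {"hh_weight": 300.0, "hectares": 0.5, "yield_kg_ha": 2000.0},
]

yields = [h["yield_kg_ha"] for h in households]
hh_weights = [h["hh_weight"] for h in households]
# Area weight = household weight multiplied by the hectares it represents.
area_weights = [h["hh_weight"] * h["hectares"] for h in households]

yield_per_household = weighted_mean(yields, hh_weights)   # average household
yield_per_hectare = weighted_mean(yields, area_weights)   # average hectare
```

The two results differ because the second household is numerous in the population but farms little land, so it counts for less when the estimate is meant to represent a hectare rather than a household.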

Sub-populations

All estimates are reported for the sub-population of rural households only. Estimates for urban households or the full population can be produced from our final .dta files; we do not remove urban households from the dataset, though the AgDev baseline instruments only surveyed rural households.

For many indicators, we further restrict the sub-population for our estimates to a subset of households engaged in a given activity. For example, we only report crop yields among households engaged in producing the particular crop. For some indicators, we report estimates for both the full sub-population of rural households and the subset of rural households engaged in a particular activity. Specific sub-population decisions are documented in the “Sub-Population for Estimate” column of the “Summ. of Indicator Construction” tab.

For all indicators, when the sample size of the relevant sub-population is less than 30, we urge caution in interpreting estimates, as the sample is likely too small to yield valid estimates.

Yield

In constructing yield by both area planted and area harvested, we rescale crop areas down to fit within the area of the plot, treating pure stand crops and intercropped crops differently. In general, we take the area planted or harvested of pure stand crops at the farmer-reported value and then rescale all intercropped crops to fit on the remainder of the plot. That is, for intercropped crops we adjust the area planted or harvested down in proportion to the area reported for each crop, so that the sum of the areas of all crops on a plot does not exceed the area of the plot. We only rescale down; we do not rescale crop areas up if the sum of the area planted or harvested is less than the plot area.

In general, there are three points at which we rescale these areas:

Capping: When a GPS measurement of the plot area is available, we cap any individual crop area planted or harvested on that plot at the plot area.
- We cap the area planted at the plot size for Nigeria GHS and the area harvested at the plot size for Nigeria GHS and Tanzania NPS
- The area planted for Tanzania NPS and Ethiopia ESS, and the area harvested for Ethiopia ESS, are reported as a percentage of the plot size (not to exceed 100%), so no capping is necessary
- GPS measurement of the plots is not available for the AgDev instruments, so no capping is possible for area planted or area harvested
Area planted rescaling:
- We rescale area planted for Tanzania NPS, Nigeria GHS, Ethiopia ESS, Tanzania TBS, and Ethiopia ACC in the manner described above
- Plot size is not available for Nigeria NIBAS, so rescaling is not possible
- Intercropping information is not available for India RMS, so we rescale all crops to fit on the plot regardless of pure stand or intercropping status
Area harvested rescaling:
- We rescale area harvested for Tanzania NPS, Nigeria GHS, Tanzania TBS, and Ethiopia ACC in the manner described above
- For Ethiopia ESS, the area harvested is reported as a percentage of the area planted. Because we construct area harvested from the rescaled area planted, no further rescaling is necessary
- Area harvested is not reported in Nigeria NIBAS
- Intercropping is not reported in India RMS, so area harvested is rescaled in the same manner as area planted for this instrument
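
The capping and proportional rescaling described in this section can be sketched as follows. This Python example is a simplified illustration under stated assumptions, not a translation of the STATA do-files: the record layout (`crop`, `area`, `intercropped`) is hypothetical, capping is applied before rescaling, and pure stand areas are left unchanged after capping even if they alone exceed the plot.

```python
def rescale_crop_areas(plot_area, crops):
    """Fit reported crop areas onto a plot of size plot_area (hectares).

    crops: list of dicts with keys 'crop', 'area', and 'intercropped'.
    Returns a new list of dicts with adjusted 'area' values.
    """
    # Capping: no single crop may occupy more than the plot itself.
    capped = [dict(c, area=min(c["area"], plot_area)) for c in crops]

    pure_total = sum(c["area"] for c in capped if not c["intercropped"])
    inter_total = sum(c["area"] for c in capped if c["intercropped"])
    remainder = max(plot_area - pure_total, 0.0)

    # Only rescale down: shrink intercropped areas proportionally when they
    # do not fit on what is left of the plot; never scale them up.
    factor = 1.0
    if inter_total > remainder:
        factor = remainder / inter_total

    return [
        dict(c, area=c["area"] * factor) if c["intercropped"] else c
        for c in capped
    ]


# Hypothetical 1-hectare plot: one pure stand crop and two intercropped crops.
plot = [
    {"crop": "maize", "area": 0.4, "intercropped": False},
    {"crop": "beans", "area": 0.6, "intercropped": True},
    {"crop": "cassava", "area": 0.6, "intercropped": True},
]
rescaled = rescale_crop_areas(1.0, plot)
```

In this example the intercropped crops report 1.2 hectares but only 0.6 hectares remain after the pure stand maize, so each intercropped area is halved and the plot total comes to 1.0 hectare.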

Gender of the Head of Household

We disaggregate the majority of our estimates by the gender of the head of household. The head of household is defined differently depending on the instrument, but is reported by the survey respondent and is often regarded by household members as the main decision-maker in the household. The primary survey respondent is often the head of household.

The Tanzania NPS defines the head of household as “the member of the given household who holds the role of decision maker in that household; other residents normally recognize this individual as their head. In most cases the household head should take part in the economy, control, and the welfare of the household in general.”
The Ethiopia ESS defines the head of household as “the person commonly regarded by the household members as their head. The head would usually be the main income earner and decision-maker for the household, but you should accept the decision of the household members as to who is their head.”
The Nigeria GHS defines the head of household as “a person defined as such for the purpose of the survey, irrespective of reason (the oldest by age, decision maker in the household, a person who earns the most income, based on tradition, etc.).”
The Ethiopia ACC defines the head of household as “the member of the household who makes most of the economic decisions. You may accept the judgment of the respondent regarding who is the head.”
We do not have information on how the head of household is defined for the remaining AgDev instruments: India RMS, Tanzania TBS, and Nigeria NIBAS.

Gender of the Plot Manager

We also disaggregate certain estimates by the gender of the plot manager. In most instruments, respondents are asked to identify the household members who manage or make decisions about each cultivated plot, and may list one or more household members. We label plots as female-only, male-only, or mixed gender based on the genders of the household members listed as managers. The questions about plot management differ across instruments.

Ethiopia ESS: Who makes primary decisions concerning the plot (1, then up to 2 additional)?
Nigeria GHS: Who in the household manages this plot (up to 2) and other household members who are decision-makers on this plot (up to 4)?
Tanzania NPS: Who decided what to plant on this plot in the long rainy season (LRS) (up to 3)?
Ethiopia ACC: Which family member had main responsibility for farming this plot (up to 2)?
Tanzania TBS: Who is the primary decision-maker for cultivation on this plot (1)?
India RMS: no question on plot managers
Nigeria NIBAS: Who in the household manages this plot (1)?

The same question on plot managers is used for individual-level indicators disaggregated by gender of the plot manager.
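
The plot labeling rule can be illustrated with a short Python function. The function name and the string codes for gender are assumptions for this sketch; the surveys use their own coding.

```python
def plot_manager_gender(manager_genders):
    """Label a plot from the genders of its listed managers.

    manager_genders: iterable of 'female' / 'male' strings, one per listed
    manager. Returns None when no manager with a known gender is listed.
    """
    genders = {g for g in manager_genders if g in ("female", "male")}
    if genders == {"female"}:
        return "female-only"
    if genders == {"male"}:
        return "male-only"
    if genders == {"female", "male"}:
        return "mixed"
    return None


# A plot with two listed managers of different genders is labeled "mixed".
label = plot_manager_gender(["female", "male"])
```

For instruments that allow only one manager per plot, the "mixed" category cannot arise under this rule.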

Crop Disaggregation

We report most crop production-related estimates disaggregated by crop. We focus on 12 crops: maize, rice, wheat, sorghum, millet, cowpea, groundnut, beans, yam, sweet potato, cassava, and banana. We also include any crops outside this list of 12 that are among the top 10 crops by area planted for each instrument. For LSMS-ISA instruments with multiple waves, we aggregate, weight, and average the top crop area across survey waves. For the Tanzania TBS, Nigeria NIBAS, and India RMS, where we report estimates by state/zone, we calculate the top 10 crops by area planted by state/zone. Not all crops are included in each instrument. Data on production from other crops are included along with these focus crops in estimates for “All crops.”

Other crops in the top 10 by area planted and not included in the focus crops:

Ethiopia ESS: teff, barley, coffee, sesame, horsebean, nueg
Nigeria GHS: cocoa, soy bean
Tanzania NPS: cotton, sunflower, pigeon pea
Ethiopia ACC: white teff, sesame, barley, mixed teff, coffee
Tanzania TBS: coffee, mango, mung bean, avocado, pigeon pea, potato, sunflower, cashew
India RMS: jute
Nigeria NIBAS: sesame, squash, rubber, soy bean, potato

Currency Conversion

All estimates with units in local currency are converted to 2016 Purchasing Power Parity dollars (PPP $), for consistency across instruments and years. We selected 2016 for the common currency year as a plurality of instruments (5 of 13) were conducted in 2016. We first convert local currency values in the survey year to 2016 values by adjusting for inflation. We use World Bank data on the Consumer Price Index (CPI) (https://data.worldbank.org/indicator/FP.CPI.TOTL), and calculate an inflation rate following the formula (2016 CPI – survey year CPI)/(survey year CPI). For surveys conducted over 2 years (as in the LSMS-ISA), we take the CPI from the later survey year. We multiply the local currency value in the survey year by (1 + rate of inflation) to obtain a value in 2016 local currency units. We then convert 2016 local currency values to 2016 PPP $ by dividing by the 2016 private consumption PPP conversion factor or the GDP PPP conversion factor. The PPP conversion factors are provided by the World Bank and are extrapolated from the 2011 International Comparison Program (ICP) benchmark. The World Bank CPI and PPP data are occasionally revised. The conversions in our analysis are based on data accessed on 9 February 2018.
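
The two-step conversion can be written out as a small Python function. The function and parameter names are hypothetical, and the CPI and PPP values in the example are made-up round numbers for illustration, not World Bank figures.

```python
def to_2016_ppp(value_lcu, cpi_survey_year, cpi_2016, ppp_factor_2016):
    """Convert a survey-year local currency value to 2016 PPP dollars.

    Step 1: inflate to 2016 local currency using the CPI-based rate
            (2016 CPI - survey year CPI) / (survey year CPI).
    Step 2: divide by the 2016 PPP conversion factor (local currency
            units per PPP dollar).
    """
    inflation_rate = (cpi_2016 - cpi_survey_year) / cpi_survey_year
    value_2016_lcu = value_lcu * (1 + inflation_rate)
    return value_2016_lcu / ppp_factor_2016


# Illustrative numbers only: CPI rose from 80 to 100, PPP factor of 5.
value_ppp = to_2016_ppp(1000.0, cpi_survey_year=80.0, cpi_2016=100.0,
                        ppp_factor_2016=5.0)
```

Here the inflation rate is 0.25, so 1,000 local currency units become 1,250 in 2016 terms and 250 PPP $ after dividing by the conversion factor. For a survey conducted in 2016 the rate is zero and only the PPP step applies.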

Prices

Generally, when we value something for which a price was not observed, we infer prices from unit values by dividing the value of sales by the quantity sold. If no prices are observed and we cannot infer them from sales, we use the median per-unit value at the smallest (most local) geographic unit for which we have at least 10 observations of market prices (or, sometimes, respondent-estimated values). When the country-level median is used (indicating no lower geographic units with at least 10 observations), we set no minimum number of observations. This follows the World Bank protocol for imputing prices. The imputed price is specific to a given item-unit combination (for example, a kg of sorghum, a chicken, a basket of fish). These imputed prices are relevant for estimating the value of crop production and sales, the value of livestock (and livestock products) production and sales, the value of fishing production and sales, and costs of crop, livestock, and fishing production. We are not able to impute values for item-unit combinations with no observed price or sales in the dataset.
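
The geographic fallback for imputed prices can be sketched in Python as below. The level names (`ea`, `district`, `region`), the record layout, and the function name are assumptions for illustration; the actual geographic units vary by survey.

```python
from statistics import median

# Hypothetical geographic levels, ordered from most local to least local.
GEO_LEVELS = ["ea", "district", "region"]


def impute_price(observations, target, item_unit, min_obs=10):
    """Median price for item_unit at the most local level with enough data.

    observations: dicts with 'item_unit', 'price', and one key per geo level.
    target: dict giving the geo codes of the household needing a price.
    Falls back to the country-level median with no minimum count, and
    returns None when the item-unit combination was never observed.
    """
    relevant = [o for o in observations if o["item_unit"] == item_unit]
    for i, _ in enumerate(GEO_LEVELS):
        # Match this level and every broader one, so that identical local
        # codes in different regions are not pooled together.
        keys = GEO_LEVELS[i:]
        prices = [o["price"] for o in relevant
                  if all(o[k] == target[k] for k in keys)]
        if len(prices) >= min_obs:
            return median(prices)
    national = [o["price"] for o in relevant]
    return median(national) if national else None
```

With this structure, a household in a sparsely observed enumeration area picks up the district median, then the regional median, and only then the national median.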

Plot Sizes

Across all surveys, plot sizes are converted to hectares. In surveys where plots were never measured by GPS, we rely entirely on the respondents’ estimates. In surveys where plots were sometimes measured by GPS, we use the GPS-measured values where available and the respondent estimates for unmeasured plots. In the Tanzania NPS, where all plot areas are reported in acres, we refer to the farmer estimates where measures are missing. In the Ethiopia ESS, where the units for plot areas are quite diverse, we use conversion factors provided with the data set when available. For units missing from the conversion file, we use the local median measured area per unit to estimate the area of plots that were not themselves measured. In the Nigeria GHS, where the units for plot areas are not as diverse, we use the respondent’s estimates when the unit is acres or hectares, and apply a conversion factor provided with the data set when the unit is “heaps” or “ridges”.
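
For the simplest case described above, where every area is reported in acres as in the Tanzania NPS, the choice between GPS and farmer-reported area might look like the following Python sketch. The function and argument names are hypothetical, and missing values are represented as `None`.

```python
ACRES_TO_HECTARES = 0.40468564224  # one international acre in hectares


def plot_area_hectares(gps_acres, farmer_acres):
    """Return plot area in hectares, preferring the GPS measurement.

    Falls back to the farmer's estimate when the GPS value is missing,
    and returns None when neither is available.
    """
    acres = gps_acres if gps_acres is not None else farmer_acres
    if acres is None:
        return None
    return acres * ACRES_TO_HECTARES


# A plot measured at 2 acres by GPS but estimated at 3 acres by the farmer.
measured = plot_area_hectares(2.0, 3.0)
# An unmeasured plot falls back to the farmer's 3-acre estimate.
unmeasured = plot_area_hectares(None, 3.0)
```

Surveys with more diverse units would need a unit-specific conversion factor in place of the single constant used here.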