In curating data we were forced to make certain decisions to create the most accurate estimates possible. This page is a record of our general data construction practices for our data curation project.

General Data Cleaning

We take the cleaned data as given from the World Bank LSMS-ISA team. In the process of working with the data, however, we address any illogical entries that surface (i.e., entries that make it difficult to merge data files or that produce impossible values for the final indicators), though this is rare.


  1. Where both a resident man and woman are listed as married heads of household (i.e., the spouse is categorized as another head), we revised this to categorize the woman as a spouse, not another head.

  2. Where respondents reported the number of animals sold and the value received in a manner that indicates these columns had been accidentally switched (e.g., 150 cows sold for one Ethiopian Birr), these values were switched before estimating livestock income.

Outliers and winsorizing

Extreme outliers can influence the average value of an indicator. Outliers are dealt with in a variety of ways including trimming, dropping, replacing with median or mean, multiple imputation, etc. The choice of method to deal with outliers can make an important difference in the result depending on the distribution of the variable.

We apply the same approach to dealing with potential outlier observations across all indicators. Before computing the summary statistics of final constructed indicators, we identify outliers using the 1st and 99th percentile of the indicator’s distribution. We then winsorize values that were either smaller or larger than these thresholds. For example, we apply the value at the 99th percentile to any observations that are larger than this threshold. Depending on what is logical for a given indicator, we winsorize at both the 1st and 99th percentiles or just the 99th percentile (if there is no illogically small value for the variable). In addition to winsorizing the top 1% of final indicators, we also winsorize the bottom 1% of non-0 values for selected indicators: area harvested, area planted, family labor days, hired labor days, and income components (net crop income, net livestock income, net fishing income, non-ag wage income, ag wage income, net self-employment income, transfers income, and "other" income). For these indicators, we take all observations below the 1st percentile of the distribution and set them to the value at the 1st percentile. When constructing ratios (e.g., yield, labor productivity), we winsorize the numerator and denominator prior to taking the ratio (as in most cases we report estimates for the numerator and denominator variables separately), and also winsorize the top 1% of the constructed ratio. We do not winsorize variables that represent proportions. We do not winsorize by sub-population, but exclude missing observations from the distribution during the winsorizing process.
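The winsorizing rule described above can be sketched as follows. This is an illustrative sketch only, not the project's own code: the function name `winsorize`, the `bottom_nonzero` flag, and the percentile definition (linear interpolation via `statistics.quantiles` with `method="inclusive"`) are assumptions. The sketch also assumes that the lower threshold is computed over non-zero values and that zeros are left untouched.

```python
import math
from statistics import quantiles


def _is_missing(v):
    """Treat None and NaN as missing observations."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def winsorize(values, bottom_nonzero=False):
    """Cap values above the 99th percentile at the 99th-percentile value.

    If bottom_nonzero is True, also raise non-zero values below the 1st
    percentile of the non-zero distribution to that percentile (assumed
    reading of the rule). Missing values are excluded from the
    distribution and returned unchanged. Needs at least two observed values.
    """
    observed = [v for v in values if not _is_missing(v)]
    upper = quantiles(observed, n=100, method="inclusive")[98]  # 99th pct
    lower = None
    if bottom_nonzero:
        nonzero = [v for v in observed if v != 0]
        lower = quantiles(nonzero, n=100, method="inclusive")[0]  # 1st pct
    result = []
    for v in values:
        if _is_missing(v):
            result.append(v)
        elif v > upper:
            result.append(upper)
        elif lower is not None and v != 0 and v < lower:
            result.append(lower)
        else:
            result.append(v)
    return result


# Hypothetical indicator with one extreme outlier and one missing value.
raw = list(range(1, 101)) + [10000, None]
w_raw = winsorize(raw, bottom_nonzero=True)
```

In this made-up example the outlier of 10000 is replaced by the 99th-percentile value, the smallest observation is raised to the 1st-percentile value, and the missing entry stays missing.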

Specific winsorizing decisions are documented in the "Winsorizing Decision" column of the "Summ. of Indicator Construction" tab. After winsorizing, we generate new winsorized versions of the variables with the prefix "w_" to distinguish them from the un-winsorized variables. Both winsorized and un-winsorized versions of the variables are included in our final .dta files.

Particularly for estimating the average value of a variable, the method chosen for dealing with outliers can make a considerable difference. For this reason, attention should also be given to the median value and the overall distribution of the variable.

Also, as a result of following the same cleaning and winsorizing protocols across indicators and instruments, some estimates may be affected by large outlier observations that remain after winsorizing. The means for certain estimates may therefore appear abnormally high. We encourage users to consider the full set of summary statistics and not just the means when considering particular estimates.


We apply provided survey weights for the three LSMS-ISA surveys (ESS, GHS, TZNPS) and the India RMS AgDev Baseline to generate nationally-representative estimates. Survey weights were not available for the Ethiopia ACC and Tanzania TBS AgDev Baselines. Weights are used to ensure that statistics estimated with the sample are unbiased estimates of the population parameters. Thus, weighted statistics can only be reported at geographical levels where the sample is representative of the population. In the India baseline, the sample is representative at the state level and statistics are reported by state. In the LSMS-ISA, the sample is nationally representative but not always regionally representative; thus, we only report the statistics at the national level. For the two instruments where no survey weights are available, we implicitly use a weight of “1” for each observation.

In addition to household weights, we also apply area weights, animal weights, or individual weights to selected indicators to report representative estimates for a given hectare, animal, or individual, respectively. These weights are constructed by multiplying the household weight by the given object, e.g., hectares of area or number of animals. The weights for each estimate are noted in the "Weight" column of the "Summ. of Indicator Construction" tab.
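As a rough illustration of how an area weight changes an estimate, the sketch below builds the weight as household weight times hectares and compares the resulting mean with the household-weighted mean. The function name `weighted_mean` and all numbers are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean over pairs where both the value and weight are present."""
    pairs = [(v, w) for v, w in zip(values, weights)
             if v is not None and w is not None]
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


# Two hypothetical households (all numbers are made up).
household_weight = [100.0, 300.0]   # survey weights
hectares = [2.0, 1.0]               # area planted per household
yield_kg_per_ha = [1000.0, 2000.0]

# Area weight = household weight x hectares, as described above.
area_weight = [w * a for w, a in zip(household_weight, hectares)]

mean_per_household = weighted_mean(yield_kg_per_ha, household_weight)
mean_per_hectare = weighted_mean(yield_kg_per_ha, area_weight)
```

Here the household-weighted mean describes the average household, while the area-weighted mean describes the average hectare, so the two estimates differ whenever yield is correlated with area.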


All estimates are reported for the sub-population of rural households only. Estimates for urban households or the full population can be produced from our final .dta files; we do not remove urban households from the dataset, though the AgDev baseline instruments only surveyed rural households. It is essential to ensure that sub-populations are handled correctly, and in particular that the code is written so that all observations are included in the calculation of the standard errors.

For many indicators, we further restrict the sub-population for our estimates to a subset of households engaged in a given activity. For example, we only report crop yields among households engaged in producing the particular crop. For some indicators, we report estimates for both the full sub-population of rural households and the subset of rural households engaged in a particular activity. Specific sub-population decisions are documented in the "Sub-Population for Estimate" column of the "Summ. of Indicator Construction" tab.

For all indicators, when the sample size of the relevant sub-population is less than 30, we urge caution in interpreting estimates, as it is likely that the statistical power will be too small to obtain valid estimates.


In constructing the yield for both area planted and area harvested, we rescale the areas down to fit onto the area of the plot, treating purestand crops and intercropped crops differently. In general, we take the area planted or harvested of purestand crops at the farmer-reported value, and then rescale all intercropped crops to fit on the remainder of the plot. That is, for intercropped crops we adjust the area planted or harvested down in proportion to the area reported for each crop, so that the sum of the areas of all crops on a plot does not exceed the area of the plot. We only rescale down: we do not rescale crop areas up if the sum of the area planted or harvested is less than the plot area.
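The rescaling rule in the preceding paragraph can be sketched for a single plot as follows. This is a minimal sketch under stated assumptions: the function name `rescale_crop_areas` and the tuple layout are hypothetical. The sketch also assumes that when purestand areas alone fill or exceed the plot, intercropped areas are scaled to zero. That case is not described in the text.

```python
def rescale_crop_areas(plot_area, crops):
    """Rescale intercropped areas down to fit the remainder of the plot.

    crops is a list of (crop_name, reported_area, is_purestand) tuples.
    Purestand areas are kept at the reported value; intercropped areas
    are scaled down proportionally, and never scaled up.
    """
    purestand_total = sum(area for _, area, pure in crops if pure)
    intercrop_total = sum(area for _, area, pure in crops if not pure)
    # Space left for intercropped crops (assumed to be at least zero).
    remainder = max(plot_area - purestand_total, 0.0)
    if intercrop_total > 0:
        factor = min(1.0, remainder / intercrop_total)  # only rescale down
    else:
        factor = 1.0
    return [(name, area if pure else area * factor, pure)
            for name, area, pure in crops]


# Hypothetical 1.0 ha plot: 0.4 ha purestand maize, 0.9 ha of intercrops.
crops = [("maize", 0.4, True), ("beans", 0.6, False), ("cowpea", 0.3, False)]
rescaled = rescale_crop_areas(1.0, crops)
```

In this made-up example the intercropped beans and cowpea are scaled by two thirds so that they occupy the 0.6 ha left over after the maize, and the total crop area equals the plot area.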

In general, there are three different times that we rescale these areas:

  • Capping: When a GPS measurement of the plot area is available, we cap any individual crop area planted or harvested on that plot at the plot area.

    • We cap the area planted at the plot size for Nigeria LSMS and the area harvested at the plot size for Nigeria LSMS and Tanzania LSMS

    • The area planted for Tanzania LSMS and Ethiopia LSMS, and the area harvested for Ethiopia LSMS are reported as a percentage of the plot size (not to exceed 100%) so no capping is necessary

    • GPS measurement of the plots is not available for the AgDev instruments, so no capping is possible for area planted or area harvested

  • Area planted rescaling:

    • We rescale area planted for Tanzania LSMS, Nigeria LSMS, Ethiopia LSMS, Tanzania AgDev, and Ethiopia AgDev in the manner described above

    • Plot size is not available for Nigeria AgDev, so rescaling is not possible

    • Intercropping information is not available for India AgDev, so we rescale all crops to fit on the plot regardless of purestand or intercropping status

  • Area harvested rescaling:

    • We rescale area harvested for Tanzania LSMS, Nigeria LSMS, Tanzania AgDev, and Ethiopia AgDev in the manner described above

    • For Ethiopia LSMS, the area harvested is reported as a percentage of the area planted. Because we use the rescaled area planted to construct area harvested, no further rescaling is necessary

    • Area harvested is not reported in Nigeria AgDev

    • Intercropping is not reported in India AgDev, so area harvested is rescaled in the same manner as area planted for this instrument

Gender of the head of household

We disaggregate the majority of our estimates by the gender of the head of household. The head of household is defined differently depending on the instrument, but is reported by the survey respondent and is often regarded by household members as the main decision-maker in the household. The primary survey respondent is often the head of household.

  • The TZA NPS defines the head of household as "the member of the given household who holds the role of decision maker in that household; other residents normally recognize this individual as their head. In most cases the household head should take part in the economy, control, and the welfare of the household in general."

  • The ETH ESS defines the head of household as "the person commonly regarded by the household members as their head. The head would usually be the main income earner and decision-maker for the household, but you should accept the decision of the household members as to who is their head."

  • The NGA GHS defines the head of household as "a person defined as such for the purpose of the survey, irrespective of reason (the oldest by age, decision maker in the household, a person who earns the most income, based on tradition, etc.)."

  • The ETH ACC defines the head of household as "the member of the household who makes most of the economic decisions. You may accept the judgment of the respondent regarding who is the head."

  • We do not have information on how the head of household is defined for the IND RMS, TZA TBS, or NGA NIBAS.

Gender of plot manager

We also disaggregate certain estimates by the gender of the plot manager. In most instruments, respondents are asked to identify the household members who manage or make decisions about each cultivated plot, and may list one or more household members. We label plots as female-only, male-only, or mixed gender according to the genders of the household members listed as managers. The questions about plot management differ across instruments.

  • ETH ESS: Who makes primary decisions concerning the plot (1, then up to 2 additional).

  • NGA GHS: Who in the HH manages this plot (up to 2) and other HH members who are decision-makers on this plot (up to 4)

  • TZA NPS: Who decided what to plant on this plot in LRS (up to 3)

  • ETH ACC: Which family member had main responsibility for farming this plot (up to 2)

  • TZA TBS: Who is the primary decision-maker for cultivation on this plot? (1)

  • IND RMS: no question on plot managers

  • NGA NIBAS: Who in the household manages this plot? (1)

The same question on plot managers is used for individual-level indicators disaggregated by gender of the individual plot manager.
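The plot labelling described in this section can be sketched as a small function. The function name `classify_plot`, the gender codes, and the handling of plots with no listed manager (returned as `None`) are assumptions made for illustration.

```python
def classify_plot(manager_genders):
    """Label a plot by the genders of its listed managers.

    manager_genders is a list of "female"/"male" strings, one per listed
    manager. Other entries (e.g. None) are ignored. Returns None when no
    manager gender is known (assumed handling).
    """
    genders = {g for g in manager_genders if g in ("female", "male")}
    if not genders:
        return None
    if genders == {"female"}:
        return "female-only"
    if genders == {"male"}:
        return "male-only"
    return "mixed"


# Hypothetical plots with one or more listed managers.
labels = [
    classify_plot(["female"]),
    classify_plot(["male", "male"]),
    classify_plot(["female", "male"]),
    classify_plot([]),
]
```

For instruments that allow only one manager per plot, the "mixed" category cannot occur under this rule.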

Crop disaggregation

We report most crop production-related estimates disaggregated by crop. We focus on 12 focus crops: maize, rice, wheat, sorghum, millet, cowpea, groundnut, beans, yam, sweet potato, cassava, and banana. We also include any crops not in this list of 12 that are among the top 10 crops by area planted for each instrument. For LSMS instruments with multiple waves, we aggregate, weight, and average the top crop area across waves. For the Tanzania AgDev, Nigeria AgDev, and India RMS, where we report estimates by state/zone, we calculate the top 10 crops by area planted by state/zone. Not all crops are included in each instrument. Data on production from other crops is included along with these focus crops in estimates for "All crops."

Top 10 crops by area planted:

  • ETH ESS: teff, barley, coffee, sesame, horsebean, nueg
  • NGA GHS: cocoa, soy bean
  • TZA NPS: cotton, sunflower, pigeon pea
  • ETH ACC: white teff, sesame, barley, mixed teff, coffee
  • TZA TBS: coffee, mango, mung bean, avocado, pigeon pea, potato, sunflower, cashew
  • IND RMS: jute
  • NGA NIBAS: sesame, squash, rubber, soy bean, potato

Currency conversion

All estimates with units in local currency are converted to 2016 PPP $, for consistency across instruments and years. We selected 2016 as the common currency year because a plurality of instruments (5 of 13) were conducted in 2016. We first convert local currency values in the survey year to 2016 values by adjusting for inflation. We use World Bank data on the CPI and calculate an inflation rate following the formula (2016 CPI - Survey Year CPI)/(Survey Year CPI). For surveys conducted over 2 years (as in the LSMS-ISA), we take the CPI from the later survey year. We multiply the local currency value in the survey year by (1 + rate of inflation) to obtain a value in 2016 local currency units. We then convert 2016 local currency values to 2016 PPP $ by dividing by the 2016 private consumption PPP conversion factor or the GDP PPP conversion factor. The PPP conversion factors are provided by the World Bank and are extrapolated from the 2011 International Comparison Program (ICP) benchmark. The World Bank CPI and PPP data are occasionally revised. The conversions in our analysis are based on data accessed on 9 February 2018.
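The two conversion steps above can be written out as a short worked example. The function name `to_2016_ppp` and every number in the example are hypothetical. They are not actual World Bank CPI or PPP values.

```python
def to_2016_ppp(value_lcu, cpi_survey_year, cpi_2016, ppp_factor_2016):
    """Convert a survey-year local currency value to 2016 PPP $.

    Step 1: inflate to 2016 local currency using the CPI-based rate
            (2016 CPI - survey-year CPI) / survey-year CPI.
    Step 2: divide by the 2016 PPP conversion factor.
    """
    inflation_rate = (cpi_2016 - cpi_survey_year) / cpi_survey_year
    value_2016_lcu = value_lcu * (1 + inflation_rate)
    return value_2016_lcu / ppp_factor_2016


# Made-up inputs: 1000 local currency units, CPI of 80 in the survey
# year and 100 in 2016, and a PPP factor of 10 local units per PPP $.
example = to_2016_ppp(1000.0, cpi_survey_year=80.0, cpi_2016=100.0,
                      ppp_factor_2016=10.0)
```

With these made-up inputs the inflation rate is 0.25, the 2016 local currency value is 1250, and the result is 125 PPP $. For a survey conducted in 2016 the inflation rate is zero and only the PPP step applies.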


Generally, when we value something for which a price was not observed, we infer prices from unit values by dividing value of sales by quantity sold. If no prices are observed and we cannot infer them from sales, we use the median per-unit value at the smallest (most local) geographic unit for which we have at least 10 observations of market prices (or, sometimes, respondent-estimated values). When the country-level median is used (indicating no lower geographic units with at least 10 observations), we set no minimum number of observations. This follows the World Bank protocol for imputing prices. The imputed price is specific to a given item-unit combination (for example, a kg of sorghum, a chicken, a basket of fish). These imputed prices are relevant for estimating the value of crop production and sales, the value of livestock (and livestock products) production and sales, the value of fishing production and sales, and costs of crop, livestock, and fishing production. We are not able to impute values for item-unit combinations with no observed price or sales in the dataset.
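The fallback through geographic levels can be sketched as below for a single item-unit combination. The function name `impute_price`, the representation of a location as a tuple ordered from the largest to the smallest unit, and the example geography are assumptions made for illustration.

```python
from statistics import median


def impute_price(observations, location, min_obs=10):
    """Median price at the most local level with at least min_obs prices.

    observations is a list of (location_tuple, price) pairs for one
    item-unit combination, with location tuples ordered from largest to
    smallest unit, e.g. (region, district, village). Falls back to the
    country-level median with no minimum; returns None if there are no
    observed prices at all.
    """
    for depth in range(len(location), 0, -1):  # most local level first
        prefix = location[:depth]
        prices = [p for loc, p in observations if loc[:depth] == prefix]
        if len(prices) >= min_obs:
            return median(prices)
    all_prices = [p for _, p in observations]
    return median(all_prices) if all_prices else None


# Hypothetical prices for one item-unit (e.g. a kg of sorghum).
obs = ([(("R1", "D1", "V1"), 5.0)] * 3) + ([(("R1", "D1", "V2"), 7.0)] * 8)
village_v1_price = impute_price(obs, ("R1", "D1", "V1"))
other_region_price = impute_price(obs, ("R2", "D9", "V9"))
```

In this made-up example village V1 has only 3 observations, so the price falls back to the district median, which draws on 11 observations. A household in a region with no observations receives the country-level median.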

Plot sizes

Across all surveys, plot sizes are converted to hectares. In surveys where plots were never measured by GPS, we rely entirely on the respondents’ estimates. In surveys where plots were sometimes measured by GPS, we use the GPS-measured values where available and the respondent estimates for unmeasured plots. In the Tanzania NPS, where all plot areas are reported in acres, we refer to the farmer estimates where measures are missing. In the Ethiopia ESS, where the units for plot areas are quite diverse, we use conversion factors provided with the data set when available and estimate the size of units missing from the conversion file by referring to the local median per-unit measured area to estimate the area of plots that were not themselves measured. In the Nigeria GHS, where the units for plot areas are not as diverse, we use the respondent’s estimates when the unit is acres or hectares, and apply a conversion factor provided with the data set when the unit is “heaps” or “ridges”.

An alternative approach to measuring plot area for instruments where a sufficient number of GPS-measured and respondent-estimate pairs of area values are available would be to apply multiple imputation techniques.