Data Cleaning

In surveys, interviewers typically enter some data incorrectly, or issues with recipes are discovered only after the interviews have been conducted. Hence respondent and consumption data will likely need correction as a post-processing step.

Dita implements an interview-set transformer that corrects interview data based on uploaded correction files. These correction files have the following structure:

Figure 1. Correction24
Correction YAML basic structure
respondents:
- alias: "EB_9999"
  #...
# and so on ...
foodByName:
- name: "Maiswaffeln"
  sid: "at.gd/2.0:food/02423"
# and so on ...
composites:
- coordinates:
    #...
  rename: # optional: new name or empty or omitted
  deletions:
    #...
  additions:
    #...
# and so on ...

Respondent Correction

Respondent corrections fix faulty data entered during the interview, such as the respondent ID, sex, or date of birth. They also allow removing the interviews of respondents who later wished to withdraw from the survey.

Figure 2. Respondent correction

Some examples:

Fixing respondent-id typo
respondents:
- alias: "EB9999"
  newAlias: "EB_9999"

Fixing respondent-data
respondents:
- alias: "EB_9999"
  dateOfBirth: 1999-11-22
  sex: MALE

Respondent withdrawal
respondents:
- alias: "EB_9999"
  withdraw: true
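To illustrate how the three respondent examples above might be applied, here is a minimal Python sketch. It is not Dita's implementation: the `Respondent` class and the function `apply_respondent_corrections` are hypothetical, corrections are assumed to be already parsed into dictionaries, and withdrawal is simplified to dropping the respondent record rather than removing the respondent's interviews.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class Respondent:
    # Hypothetical in-memory model; field names are assumptions.
    alias: str
    date_of_birth: date
    sex: str


def apply_respondent_corrections(respondents, corrections):
    """Return corrected respondents; withdrawn respondents are dropped."""
    by_alias = {c["alias"]: c for c in corrections}
    result = []
    for r in respondents:
        c = by_alias.get(r.alias)
        if c is None:
            result.append(r)  # no correction for this respondent
            continue
        if c.get("withdraw", False):
            continue  # respondent withdrew from the survey
        result.append(replace(
            r,
            alias=c.get("newAlias", r.alias),
            date_of_birth=c.get("dateOfBirth", r.date_of_birth),
            sex=c.get("sex", r.sex),
        ))
    return result
```

Fields that a correction entry omits keep their original values, which mirrors the examples above where only the changed attributes are listed.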

Consumption Correction

Food

Corrects food consumptions that have only a name and no identifier (sid).

Figure 3. Food by name correction
Fixes food with missing identifier (having a name but no sid)
foodByName:
- name: "Maiswaffeln"
  sid: "at.gd/2.0:food/02423"
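The effect of a `foodByName` entry can be sketched in Python as follows. This is an illustration only: the function `apply_food_by_name` is hypothetical, consumptions are modelled as plain dictionaries, and exact, case-sensitive name matching is an assumption rather than documented Dita behaviour.

```python
def apply_food_by_name(consumptions, food_by_name):
    """Fill in the sid of consumptions that have a name but no sid."""
    sid_by_name = {entry["name"]: entry["sid"] for entry in food_by_name}
    corrected = []
    for c in consumptions:
        # Only touch entries without an identifier (assumed exact name match).
        if not c.get("sid") and c.get("name") in sid_by_name:
            c = {**c, "sid": sid_by_name[c["name"]]}
        corrected.append(c)
    return corrected
```

Entries that already carry a sid are left untouched in this sketch, even if their name appears in the correction list.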

Composite

Correction of composite consumptions supports three basic changes:

  • Renaming of the composite consumption entry

  • ADD Ingredient:

    • requires identifier (sid) of food to add

    • requires amountGrams of food to add

    • requires facets of food to add

  • DELETE Ingredient:

    • requires identifier (sid) of food to remove

After those changes are applied, all the ingredient amounts are recalculated such that the composite’s total amount consumed stays the same (as compared to before the correction).
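One way to picture this recalculation is proportional rescaling. The Python sketch below assumes exactly that: after deletions and additions, every remaining amount is multiplied by the same factor so that the sum equals the original total. Whether Dita rescales proportionally is an assumption, and the function `correct_composite` with its dictionary-based ingredient model (one entry per sid) is hypothetical.

```python
def correct_composite(ingredients, deletions, additions):
    """Apply deletions and additions, then rescale to the original total.

    ingredients: dict mapping sid -> grams (assumes one entry per sid)
    deletions:   iterable of sids to remove
    additions:   list of dicts with "sid" and "amountGrams"
    """
    total_before = sum(ingredients.values())
    removed = set(deletions)
    corrected = {sid: g for sid, g in ingredients.items() if sid not in removed}
    for a in additions:
        corrected[a["sid"]] = corrected.get(a["sid"], 0.0) + a["amountGrams"]
    total_after = sum(corrected.values())
    if total_after == 0:
        raise ValueError("no ingredients left to rescale")
    # Assumed strategy: scale all amounts by one common factor.
    scale = total_before / total_after
    return {sid: g * scale for sid, g in corrected.items()}
```

For example, a composite of 400 g of ingredient A and 100 g of B totals 500 g. Deleting A and adding 300 g of C leaves 400 g, so both amounts are scaled by 1.25, giving 125 g of B and 375 g of C.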

Figure 4. Composite correction
Fixes a composite consumption by deleting and adding specific ingredients
composites:
- coordinates:
    sid: "at.gd/2.0:recp/00514"
    respondentId: "EB_9999"
    interviewOrdinal: 1
    mealHourOfDay: "13:00:00"
    source: "wave1/Interview-12345.xml"
  deletions:
    # DELETE food/02280 Fond, Fleisch {assocRecp=465} 413.56g (82.71%)
  - sid: "at.gd/2.0:food/02280"
  additions:
    # ADD food/01399 Wasser, Leitung 302.72g
  - sid: "at.gd/2.0:food/01399"
    amountGrams: 302.72
    facets: ""
    # ADD food/01581 Streuwürze 6.05g
  - sid: "at.gd/2.0:food/01581"
    amountGrams: 6.05
    facets: ""

Changes a composite consumption name
composites:
- coordinates:
    sid: "at.gd/2.0:recp/00514"
    respondentId: "EB_9999"
    interviewOrdinal: 1
    mealHourOfDay: "13:00:00"
    source: "wave1/Interview-12345.xml"
  rename: "New Name"
  deletions: []
  additions: []

Consumption Identification

Figure 5. Composite coordinates

Consumption entries have no identifier per se, so multiple coordinates are used to narrow down a specific entry:

  • sid: SemanticIdentifier of the recipe in question

  • respondentId

  • interviewOrdinal

  • mealHourOfDay

  • source: path of the interview source file in question

Special care needs to be taken when uploading new interview data, as this may render those coordinates invalid. It may also render any of the above corrections invalid!
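The coordinate lookup, and why stale coordinates are a problem, can be sketched in Python. The helpers `find_matches` and `find_unique` are hypothetical, entries are modelled as plain dictionaries, and both the exact-equality comparison on all five coordinates and the decision to treat zero or several matches as an error are assumptions made for illustration.

```python
COORDINATE_KEYS = (
    "sid", "respondentId", "interviewOrdinal", "mealHourOfDay", "source",
)


def find_matches(entries, coordinates):
    """Return all entries whose five coordinates equal the given ones."""
    return [
        e for e in entries
        if all(e.get(k) == coordinates[k] for k in COORDINATE_KEYS)
    ]


def find_unique(entries, coordinates):
    """Return the single matching entry, or fail if coordinates are stale."""
    matches = find_matches(entries, coordinates)
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match, found {len(matches)}")
    return matches[0]
```

If newly uploaded interview data changes, for instance, the `source` path, the lookup finds no entry and the correction can no longer be applied.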

Working with multiple Correction Files

Multiple correction files can be uploaded, each representing a Correction24 data structure. Dita automatically collects these into a single Correction24 object for interview data post-processing.

Here are some templates:

only correcting respondents
respondents:
- alias: "EB_9999"
  #...
# and so on ...
foodByName: []
composites: []

only correcting composite consumptions
respondents: []
foodByName: []
composites:
- coordinates:
    #...
  rename: # optional: new name or empty or omitted
  deletions:
    #...
  additions:
    #...
# and so on ...
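Collecting several correction files into one structure can be pictured as concatenating the three lists. The Python sketch below assumes the files are already parsed into dictionaries and that entries are simply appended in upload order. The function `merge_corrections` is hypothetical, and how Dita resolves conflicting entries across files is not covered here.

```python
def merge_corrections(files):
    """Concatenate the three top-level lists of all correction files."""
    merged = {"respondents": [], "foodByName": [], "composites": []}
    for f in files:
        for key in merged:
            # A missing or empty section contributes nothing.
            merged[key].extend(f.get(key) or [])
    return merged
```

With this approach, a file that only corrects respondents and a file that only corrects composites combine into one structure holding both sets of entries.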