Laying foundations for effective machine learning in law enforcement. Majura – a labelling schema for child exploitation materials

Janis Dalins, Yuriy Tyshetskiy, Campbell Wilson, Mark J. Carman, Douglas Boudry

Research output: Contribution to journal › Article › Research › peer-review

Abstract

The health impacts of repeated exposure to distressing concepts such as child exploitation materials (CEM, aka ‘child pornography’) have become a major concern to law enforcement agencies and associated entities. Existing methods for ‘flagging’ materials largely rely upon prior knowledge, whilst predictive methods are unreliable, particularly when compared with equivalent tools used for detecting ‘lawful’ pornography. In this paper we detail the design and implementation of a deep-learning based CEM classifier, leveraging existing pornography detection methods to overcome infrastructure and corpora limitations in this field. Specifically, we further existing research through direct access to numerous contemporary, real-world, annotated cases taken from Australian Federal Police holdings, demonstrating the dangers of overfitting due to the influence of individual users’ proclivities. We quantify the performance of skin tone analysis in CEM cases, showing it to be of limited use. We assess the performance of our classifier and show it to be sufficient for use in forensic triage and ‘early warning’ of CEM, but of limited efficacy for categorising against existing scales for measuring child abuse severity. We identify limitations currently faced by researchers and practitioners in this field, whose restricted access to training material is exacerbated by inconsistent and unsuitable annotation schemas. Whilst adequate for their intended use, we show existing schemas to be unsuitable for training machine learning (ML) models, and introduce a new, flexible, objective, and tested annotation schema specifically designed for cross-jurisdictional collaborative use. This work, combined with a world-first ‘illicit data airlock’ project currently under construction, has the potential to bring a ‘ground truth’ dataset and processing facilities to researchers worldwide without compromising quality, safety, ethics and legality.

Original language: English
Pages (from-to): 40-54
Number of pages: 15
Journal: Digital Investigation
Volume: 26
DOI: 10.1016/j.diin.2018.05.004
Publication status: Published - Sep 2018

Keywords

  • Annotation schema
  • Child exploitation
  • Digital forensics
  • Forensic triage
  • Neural networks

Cite this

@article{5e2fa9454ad6449681d331dd66eda82d,
title = "Laying foundations for effective machine learning in law enforcement. {Majura} – a labelling schema for child exploitation materials",
abstract = "The health impacts of repeated exposure to distressing concepts such as child exploitation materials (CEM, aka ‘child pornography’) have become a major concern to law enforcement agencies and associated entities. Existing methods for ‘flagging’ materials largely rely upon prior knowledge, whilst predictive methods are unreliable, particularly when compared with equivalent tools used for detecting ‘lawful’ pornography. In this paper we detail the design and implementation of a deep-learning based CEM classifier, leveraging existing pornography detection methods to overcome infrastructure and corpora limitations in this field. Specifically, we further existing research through direct access to numerous contemporary, real-world, annotated cases taken from Australian Federal Police holdings, demonstrating the dangers of overfitting due to the influence of individual users’ proclivities. We quantify the performance of skin tone analysis in CEM cases, showing it to be of limited use. We assess the performance of our classifier and show it to be sufficient for use in forensic triage and ‘early warning’ of CEM, but of limited efficacy for categorising against existing scales for measuring child abuse severity. We identify limitations currently faced by researchers and practitioners in this field, whose restricted access to training material is exacerbated by inconsistent and unsuitable annotation schemas. Whilst adequate for their intended use, we show existing schemas to be unsuitable for training machine learning (ML) models, and introduce a new, flexible, objective, and tested annotation schema specifically designed for cross-jurisdictional collaborative use. This work, combined with a world-first ‘illicit data airlock’ project currently under construction, has the potential to bring a ‘ground truth’ dataset and processing facilities to researchers worldwide without compromising quality, safety, ethics and legality.",
keywords = "Annotation schema, Child exploitation, Digital forensics, Forensic triage, Neural networks",
author = "Dalins, Janis and Tyshetskiy, Yuriy and Wilson, Campbell and Carman, Mark J. and Boudry, Douglas",
year = "2018",
month = "9",
doi = "10.1016/j.diin.2018.05.004",
language = "English",
volume = "26",
pages = "40--54",
journal = "Digital Investigation",
issn = "1742-2876",
publisher = "Elsevier",
}

Laying foundations for effective machine learning in law enforcement. Majura – a labelling schema for child exploitation materials. / Dalins, Janis; Tyshetskiy, Yuriy; Wilson, Campbell; Carman, Mark J.; Boudry, Douglas.

In: Digital Investigation, Vol. 26, 09.2018, p. 40-54.

TY - JOUR
T1 - Laying foundations for effective machine learning in law enforcement. Majura – a labelling schema for child exploitation materials
AU - Dalins, Janis
AU - Tyshetskiy, Yuriy
AU - Wilson, Campbell
AU - Carman, Mark J.
AU - Boudry, Douglas
PY - 2018/9
Y1 - 2018/9

AB - The health impacts of repeated exposure to distressing concepts such as child exploitation materials (CEM, aka ‘child pornography’) have become a major concern to law enforcement agencies and associated entities. Existing methods for ‘flagging’ materials largely rely upon prior knowledge, whilst predictive methods are unreliable, particularly when compared with equivalent tools used for detecting ‘lawful’ pornography. In this paper we detail the design and implementation of a deep-learning based CEM classifier, leveraging existing pornography detection methods to overcome infrastructure and corpora limitations in this field. Specifically, we further existing research through direct access to numerous contemporary, real-world, annotated cases taken from Australian Federal Police holdings, demonstrating the dangers of overfitting due to the influence of individual users’ proclivities. We quantify the performance of skin tone analysis in CEM cases, showing it to be of limited use. We assess the performance of our classifier and show it to be sufficient for use in forensic triage and ‘early warning’ of CEM, but of limited efficacy for categorising against existing scales for measuring child abuse severity. We identify limitations currently faced by researchers and practitioners in this field, whose restricted access to training material is exacerbated by inconsistent and unsuitable annotation schemas. Whilst adequate for their intended use, we show existing schemas to be unsuitable for training machine learning (ML) models, and introduce a new, flexible, objective, and tested annotation schema specifically designed for cross-jurisdictional collaborative use. This work, combined with a world-first ‘illicit data airlock’ project currently under construction, has the potential to bring a ‘ground truth’ dataset and processing facilities to researchers worldwide without compromising quality, safety, ethics and legality.

KW - Annotation schema
KW - Child exploitation
KW - Digital forensics
KW - Forensic triage
KW - Neural networks
UR - http://www.scopus.com/inward/record.url?scp=85047981760&partnerID=8YFLogxK
U2 - 10.1016/j.diin.2018.05.004
DO - 10.1016/j.diin.2018.05.004
M3 - Article
VL - 26
SP - 40
EP - 54
JO - Digital Investigation
JF - Digital Investigation
SN - 1742-2876
ER -