Is Facial Recognition Biased?
By Jayant, Bhavya - EZ Technology
Mankind has witnessed three industrial revolutions: the first driven by the steam engine, followed by electricity and then digital computing. We are on the verge of a fourth industrial revolution that will be driven primarily by Artificial Intelligence and Big Data. Artificial Intelligence relies heavily on data to develop the algorithms that allow computer systems to reason and make decisions.
Face Recognition: Modern Day Biometric Security Solution
The advent of these advanced technologies has given us a range of security techniques that prevent unauthorized access to valuable data and give our clients a sense of security. However, selecting the appropriate biometric security solution has become a major decision for businesses and enterprises across a wide range of industries. One biometric security system that has emerged under the umbrella of Artificial Intelligence is the face recognition system.
Thanks to its ease of implementation and widespread adoption, face recognition is rapidly becoming the go-to choice for modern biometric solutions. Facial recognition identifies a human face without requiring any physical contact. Its algorithms are designed to match a person's facial features against the images or facial data stored in a database.
Facial Recognition: The Next Big Thing?
Facial recognition has been researched for many years, but its real-world deployment has grown at an unprecedented pace. The technology has become efficient enough that we can now unlock our phones with it. Countries have also started using facial recognition for surveillance, to track down criminals and prevent crime. In principle, tracking down criminals becomes much easier: set up cameras in public spaces and check whether a wanted or suspicious person shows up.
Recent Studies Suggest Otherwise
Recent studies and research have suggested that the leading facial recognition software packages are biased. Yes! You read it right. Leading facial recognition packages tend to be more accurate for white, male faces than for people of color or for women.
A 2019 study found that many commercial algorithms currently used for surveillance show a high false positive rate for minority communities. There have been cases around the world where an innocent person was arrested because of a false positive produced by these surveillance systems. One such incident happened in January 2020 in Detroit, when police ran facial recognition on surveillance footage of a theft and wrongly arrested a Black man.
Let us try to identify what lies at the core of this bias in face recognition software. Facial recognition applications are broadly divided into two parts: Verification and Identification.
- Verification confirms that a faceprint matches the stored faceprint.
- This is usually used at airports and to unlock your smartphone.
- The verification part of facial recognition is not biased; in fact, it is extremely accurate. Here, artificial intelligence is as skillful as the sharpest-eyed humans.
- The real issue is the Identification part of Facial Recognition, which is used for surveillance.
Disparate False Positive Rates
A false positive rate of 60 per 10,000 samples for minority groups might not seem like much, but compared with a false positive rate of fewer than 5 per 10,000 samples for white people, the difference is clear. The false positive rate of an identification model needs to be as low as possible, because identification is typically used for crowd surveillance. If you monitor around 5,000 people a day, a rate of 60 per 10,000 means roughly 30 false matches every day, which adds up to hundreds of people being falsely accused within a few weeks [1].
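The arithmetic behind these figures can be checked with a few lines of Python. This is only a rough sketch: the function name is made up for illustration, it treats every scan as independent, and it assumes the whole crowd belongs to one group, with 5 per 10,000 taken as the upper bound of the "fewer than 5" figure.

```python
def expected_false_positives(people_scanned: int, false_positives_per_10k: float) -> float:
    """Expected number of people wrongly flagged, assuming independent scans."""
    return people_scanned * false_positives_per_10k / 10_000


daily_crowd = 5_000  # people monitored per day, as in the text

# Rates quoted above: 60 per 10,000 versus (at most) 5 per 10,000.
higher_rate_per_day = expected_false_positives(daily_crowd, 60)
lower_rate_per_day = expected_false_positives(daily_crowd, 5)

print(higher_rate_per_day)       # 30.0 false matches per day
print(lower_rate_per_day)        # 2.5 false matches per day
print(higher_rate_per_day * 30)  # 900.0 over a 30-day month
```

At the higher rate, the same camera produces twelve times as many false matches, which is where the disparity becomes a practical problem.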
Finding A Solution
Once the issue was identified, AI researchers started working on ways to reduce the biases present in these facial recognition models. In June 2020, IBM announced it would no longer offer a facial recognition service, while other service providers have acknowledged the issue and started working on a solution [2]. There has also been a public backlash against crowd surveillance.
The main reason for the high false positive rate for minority groups is that the data these models were trained on had an uneven distribution of faces across racial groups.
To avoid such errors, new databases and techniques have been used:
- Techniques that augment the feature space of underrepresented classes were used to make the dataset more balanced.
- Recently, Generative Adversarial Networks (GANs) were also trained to generate face features to augment classes with fewer samples.
- People have also started shifting to more balanced datasets like Racial Faces in the Wild (RFW) and Balanced Faces in the Wild (BFW) to reduce the bias [3].
Facial recognition accuracy has improved greatly in the past few years. Researchers have built better models and constructed better datasets to produce highly accurate, low-bias models. Big service providers have acknowledged the problem and are continually researching more accurate surveillance models. The future of facial recognition looks brighter now that service providers and clients are more aware of the technology's drawbacks.
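To make the idea of rebalancing concrete, here is a minimal sketch that evens out group sizes by duplicating samples from the smaller groups. Note that this is plain random oversampling, a simpler stand-in for the feature-space augmentation and GAN-based generation mentioned above, and the file names and group labels are invented for the example.

```python
import random
from collections import Counter


def oversample_to_balance(samples, group_of, seed=0):
    """Duplicate random samples from smaller groups until all groups match the largest."""
    rng = random.Random(seed)
    by_group = {}
    for sample in samples:
        by_group.setdefault(group_of(sample), []).append(sample)

    target = max(len(items) for items in by_group.values())
    balanced = []
    for items in by_group.values():
        balanced.extend(items)
        # Top up this group with randomly chosen duplicates.
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced


# Hypothetical, deliberately skewed dataset: 8 images of one group, 2 of another.
dataset = [(f"img_a{i}.png", "group_a") for i in range(8)]
dataset += [(f"img_b{i}.png", "group_b") for i in range(2)]

balanced = oversample_to_balance(dataset, group_of=lambda s: s[1])
counts = Counter(group for _, group in balanced)
print(counts["group_a"], counts["group_b"])  # 8 8
```

Duplicating samples adds no new information, which is one reason researchers moved toward generating new features and collecting better-balanced datasets.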
Known Security Issues in the Python Dependency Management System and How to Tackle Them
By Jayant, Anjali - EZ Technology
We, at EZ, believe that the purpose of technology is to assist us, not replace us. Therefore, before becoming dependent on any programming language, we make sure we understand its flaws and make conscious efforts to overcome them. As a programming language, Python offers innumerable libraries and frameworks, a mature and supportive community, versatility, efficiency, reliability, speed, and more. We work with Python so extensively that its security flaws often get ignored. Read on to learn about the security loopholes found in the PyPI ecosystem and how we can overcome them.
What is PIP?
- PIP, the package installer for Python, is the default Python package manager. It lets developers install and reuse code written by third-party developers.
- PIP downloads packages from PyPI.org, the Python Package Index, which is the package repository for the Python programming language. PyPI helps in finding and installing software written for Python.
- By design, the PyPI ecosystem allows any user to share and reuse Python software packages, which are downloaded recursively, along with their dependencies, by PIP.
Security Risks During the Installation of Python Packages
Bagmar et al. provide a detailed study of the security threats in the Python ecosystem, based largely on the PyPI repository database.
- Python files bundled with a package are executed automatically: setup.py runs when PIP install is invoked on a source distribution, and __init__.py runs when the package is imported.
- Through these files, arbitrary Python code, which may contain exploits, gets executed at various points.
- Exploits come in two modes, which are given below:
- Directly from the source, using editable mode installation, and importing the malicious package.
- Installation using sudo (administrator) privileges.
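Because setup.py is ordinary Python, one defensive habit is to read it before installing. The sketch below parses a setup.py with the standard `ast` module, without executing it, and lists calls that deserve a closer look. The function name and the list of "suspicious" call names are assumptions chosen for illustration; this is a crude heuristic, not a malware scanner.

```python
import ast

# Assumed watch list: call names that can run commands, code, or network requests.
SUSPICIOUS_CALLS = {"exec", "eval", "system", "popen", "Popen", "check_output", "urlopen"}


def suspicious_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, call name) pairs for watched calls, without running the code."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Plain calls like exec(...) are ast.Name; os.system(...) is ast.Attribute.
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in SUSPICIOUS_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)


# Harmless stand-in for a setup.py that runs a shell command at install time.
example_setup = '''
import os
from setuptools import setup
os.system("echo this would run at install time")
setup(name="demo", version="0.1")
'''

print(suspicious_calls(example_setup))  # [(4, 'system')]
```

A clean result proves nothing, since code can be obfuscated, but a hit is a good reason to stop and inspect the package, especially before installing with sudo.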
Factors that help us determine the impact of exploiting Python packages
There are four main factors that can help us understand the impact of exploiting Python packages, which are given below:
- Package Reach: The number of other packages that require a given package, either directly or transitively. Packages with high reach offer attackers a larger payoff, which makes them attractive targets.
- Maintainer Reach: The combined reach of all the packages a maintainer maintains. Influential maintainers are potential targets for security attacks.
- Implicitly Trusted Packages: The number of distinct nodes traversed while searching for the longest path from a given starting node in the dependency graph. The more packages are implicitly trusted, the higher the security risk.
- Implicitly Trusted Maintainers: This metric gives a vulnerability score based on the accounts of other packages' maintainers.
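The first two metrics can be sketched on a toy dependency graph. Everything here is hypothetical: the package names, the maintainer names, and the choice to combine a maintainer's packages by taking the union of their reach sets.

```python
from collections import deque

# Hypothetical graph: package -> packages it directly depends on.
DEPENDS_ON = {
    "app_a": ["http_lib"],
    "app_b": ["http_lib", "json_lib"],
    "http_lib": ["tls_lib"],
    "json_lib": [],
    "tls_lib": [],
}
# Hypothetical maintainer -> packages they maintain.
MAINTAINERS = {"alice": ["tls_lib", "json_lib"], "bob": ["app_a"]}


def package_reach(package, depends_on):
    """Set of packages that require `package` directly or transitively."""
    dependents = {}
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(pkg)

    seen, queue = set(), deque([package])
    while queue:  # breadth-first walk up the reversed dependency edges
        for parent in dependents.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def maintainer_reach(maintainer, maintainers, depends_on):
    """Union of the reach of every package the maintainer owns (an assumed definition)."""
    reach = set()
    for pkg in maintainers[maintainer]:
        reach |= package_reach(pkg, depends_on)
    return reach


print(sorted(package_reach("tls_lib", DEPENDS_ON)))                  # ['app_a', 'app_b', 'http_lib']
print(sorted(maintainer_reach("alice", MAINTAINERS, DEPENDS_ON)))    # ['app_a', 'app_b', 'http_lib']
print(sorted(maintainer_reach("bob", MAINTAINERS, DEPENDS_ON)))      # []
```

In this toy graph, compromising `tls_lib` or the account of its maintainer affects three downstream packages, while compromising `app_a` affects none.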
Most Common Python Package Impersonation Attacks
Package impersonation attacks are user-centric attacks that aim to trick users into downloading a malicious package.
There are various ways of fooling users into downloading malicious packages, some of which are given below:
- Typosquatting: Publishing a package whose name is an intentional minor misspelling of a genuine package name.
- Altering Word Order: Changing the order of the words in a genuine package name.
- Python3 vs Python2: Adding the number “3” to the name of a genuine package to imitate a version with Python 3 support.
- Removing Hyphenation: Removing the hyphen from a genuine package name.
- Built-In Packages: There are multiple instances of packages being uploaded to PyPI under the names of built-in Python modules.
- Jellyfish Attack: In this attack, a typosquatted package is imported inside another package, so the malicious code arrives as a dependency.
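Several of these tricks can be caught by comparing a requested name against a list of names you already trust. The following is a minimal sketch under stated assumptions: the `POPULAR` list, the function names, and the 0.85 similarity threshold are all illustrative choices, and the checks are heuristics that will produce both false alarms and misses.

```python
import difflib
import re

# Hypothetical list of genuine package names you already trust.
POPULAR = ["requests", "python-dateutil", "jellyfish", "numpy"]


def _parts(name):
    """Split a package name on hyphens, underscores, and dots."""
    return [p for p in re.split(r"[-_.]+", name.lower()) if p]


def impersonation_reason(candidate, popular=POPULAR, threshold=0.85):
    """Return (genuine_name, reason) if `candidate` looks like an impersonation, else None."""
    cand = candidate.lower()
    if cand in popular:
        return None  # exact match with a trusted name

    cand_joined = "".join(_parts(cand))
    for real in popular:
        real_joined = "".join(_parts(real))
        if cand_joined == real_joined:
            return (real, "hyphenation changed")
        if sorted(_parts(cand)) == sorted(_parts(real)):
            return (real, "word order changed")
        if re.sub(r"[23]", "", cand_joined) == re.sub(r"[23]", "", real_joined):
            return (real, "version digit added")
        if difflib.SequenceMatcher(None, cand, real).ratio() >= threshold:
            return (real, "similar spelling")
    return None


for name in ["requests", "reqeusts", "pythondateutil", "dateutil-python", "python3-dateutil"]:
    print(name, "->", impersonation_reason(name))
```

A check like this fits naturally into a review step before a new dependency is added to a requirements file.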
License Violations in the PyPI Ecosystem
PyPI does not perform any automated checks for OSS license violations. A violation can occur when a package imports another package that has a less permissive license.
Suggested Preventive Measures
- Specifying dependencies in the metadata of uploaded packages should be mandatory and strictly enforced.
- A permission model, similar to the one on mobile phones, can be applied when installing packages.
- Having a trusted-package or trusted-maintainer badge on popular packages might be helpful.
- Showing package statistics during installation.
- License fields should be chosen from a fixed list rather than entered as free text.
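Two of these measures, mandatory dependency metadata and a fixed list of licenses, are easy to picture as an upload-time check. The sketch below is hypothetical: PyPI does not expose such a function, the metadata is modeled as a plain dictionary, and the allowed set is just a small sample of SPDX license identifiers.

```python
# A small sample of SPDX license identifiers; a real list would be much longer.
ALLOWED_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-only", "MPL-2.0"}


def check_upload_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata passes."""
    problems = []
    if metadata.get("license") not in ALLOWED_LICENSES:
        problems.append("license must be one of: " + ", ".join(sorted(ALLOWED_LICENSES)))
    if not isinstance(metadata.get("dependencies"), list):
        problems.append("dependencies must be declared as a list (use [] for none)")
    return problems


good = {"name": "demo", "license": "MIT", "dependencies": []}
bad = {"name": "demo", "license": "free for everyone"}

print(check_upload_metadata(good))       # []
print(len(check_upload_metadata(bad)))   # 2
```

Restricting the license field to known identifiers is also what makes automated license-compatibility checks possible in the first place.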
Worried about Insider Threats?
By Jayant, Bhavya - EZ Technology
In the changing dynamics of the world economy, data and information have become priceless possessions. According to an article in The Economist [1], the world’s most valuable resource is no longer oil, but data. With data becoming a valuable resource, securing it and ensuring that it is not misused has become a matter of grave concern. Hence, it is imperative to stay a step ahead of our adversaries and look for security problems associated with storing and handling data.
Cyber Security is the convergence of people, processes, and technology to protect organizations, individuals, and networks from digital attacks. External attacks such as phishing and malware are comparatively easier to prevent, but stopping an insider attack is an incredibly daunting task. Insider attacks originate within the organization, and the attackers are generally closely associated with the workplace, whether directly, indirectly, physically, or logically. Despite this, insider attacks are among the most underestimated attacks in cybersecurity. Training a model to help prevent them is extremely difficult because the dataset is imbalanced: insider attacks are rare anomalies, so there is not enough data to train a model on.
Applying Machine Learning to cybersecurity and data security has always been a challenge, and the scarcity of annotated data aggravates it further. The lack of a balanced dataset makes machine learning all the more difficult. In the past, techniques such as random oversampling, undersampling, and SMOTE were used to balance the dataset, and synthetic data was created to handle skewed data. However, none of those techniques proved effective.
We, at EZ, work relentlessly to improve and devise new techniques, so that our clients can rest assured about the security of the valuable information they entrust us with. Recently, while reading a paper on Cybersecurity and Deep Learning [2], we found a new way to detect and prevent insider attacks. The proposed solution is split into three parts, namely, behavior extraction, conditional GAN-based data augmentation, and anomaly detection.
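For readers unfamiliar with SMOTE, the core idea is to create a synthetic minority sample by interpolating between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified, standard-library-only illustration of that idea rather than the reference algorithm; the function name, the parameters, and the feature vectors are made up.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the line segment from base to other
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical 2-feature vectors for the rare "insider" class.
insider_samples = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
new_samples = smote_like(insider_samples, n_new=6)
print(len(new_samples))  # 6
```

Every synthetic point lies on a line between two existing samples, so the method can only fill in the region the minority class already occupies, which limits how much it helps when real insider examples are extremely scarce.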
In behavior extraction, features are extracted from the dataset. Context-based behavior profiling is used, in which each user is assessed as a possible insider based on their entire activity log, with all the features contributing to the user's behavior profile. Then, a conditional Generative Adversarial Network (GAN) is used to generate data and reduce the negative effect of skewed data. GAN models consist of two parts, a generator and a discriminator. The generator (G) produces synthetic data and tries to fool the discriminator, while the discriminator (D) tries to distinguish whether the data comes from the real distribution. The research paper uses fully connected neural networks for both the generator and the discriminator.
The final part of the proposed solution is to use multiclass classification instead of binary classification. Anomaly detection based on multiclass classification considers labeled samples of training data as multiple normal and non-malicious classes. The multinomial classifier tries to discriminate the anomalous samples from the rest of the classes, which helps in building a more robust classifier. One additional advantage of multiclass classification is that if a new insider activity emerges, there is no need to change the existing framework. t-distributed Stochastic Neighbor Embedding (t-SNE), a manifold-learning-based visualization method, can be used to perform a qualitative analysis of the generated data. XGBoost, MLP, and 1-D CNN models were used in the research paper; XGBoost performed better on all the datasets.
Intrigued to know more about Cyber Security and the unconventional ways to prevent insider attacks? Read the reference articles listed below:
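For reference, the game between the generator and the discriminator is commonly written as the conditional GAN minimax objective below. This is the standard textbook formulation, given here as an assumption; the paper's exact loss may differ. Here x is a real sample, y is the class label used as the condition, and z is a random noise vector.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x \mid y)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z \mid y) \mid y)\bigr)\bigr]
```

The discriminator tries to maximize this value by scoring real samples high and generated samples low, while the generator tries to minimize it by producing samples the discriminator scores as real.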
- Mayra Macas, & Chunming Wu. (2020). Review: Deep Learning Methods for Cybersecurity and Intrusion Detection Systems.
- Gautam Raj Mode, & Khaza Anuarul Hoque. (2020). Crafting Adversarial Examples for Deep Learning-Based Prognostics (Extended Version).
- Ishai Rosenberg, Asaf Shabtai, Yuval Elovici, & Lior Rokach. (2021). Adversarial Machine Learning Attacks and Defense Methods in the Cyber Security Domain.
- Li, D., & Li, Q. (2020). Adversarial Deep Ensemble: Evasion Attacks and Defenses for Malware Detection. IEEE Transactions on Information Forensics and Security, 15, 3886–3900.
- Simran K, Prathiksha Balakrishna, Vinayakumar Ravi, & Soman KP. (2020). Deep Learning-based Frameworks for Handling Imbalance in DGA, Email, and URL Data Analysis.
[1] "Regulating the internet giants - The world's most valuable resource ...." 6 May. 2017
[2] "Multi-class Classification Based Anomaly Detection of Insider Activities." 15 Feb. 2021
[3] "Deep Learning Methods for Cybersecurity and Intrusion Detection ...." 4 Dec. 2020
[4] "Crafting Adversarial Examples for Deep Learning-Based ...." 21 Sep. 2020
[5] "[2007.02407] Adversarial Machine Learning Attacks and Defense ...." 5 Jul. 2020
[6] "Adversarial Deep Ensemble: Evasion Attacks and Defenses for ...." 30 Jun. 2020
[7] "Deep Learning-based Frameworks for Handling Imbalance in DGA ...." 31 Mar. 2020