Mini-Project 4

Due: Thu Dec 01

Points: 100

The goal of this project is to gain a deeper understanding of mobile application privacy and to develop skills for empirical investigation. Specifically, students will use the Amandroid (also called Argus-SAF) static program analysis tool to identify privacy leaks in Android applications. For example, it can identify if privacy-sensitive information from the deviceID() source API flows to a network send() sink API. Amandroid will be applied to a small corpus of applications, and a report will classify and describe the findings.

What to submit: This assignment should be submitted to GradeScope. You should include a PDF document that provides your written answers to the questions for each part.

Collaboration: Each student must submit their own solution. That said, students are encouraged to discuss the project with each other. Program analysis tools can sometimes be tricky to get working, and class discussion will facilitate understanding. The only rule is that the “findings” should be unique. One way to ensure uniqueness is for collaborating students to work on different sets of applications. Finally, if you discuss the project or any insights with another student, you must list their name on your solution PDF in a clearly denoted “Acknowledgements” section.

Dataset: From a May 2019 snapshot of the top 500 most popular free apps in the Google Play Store, I’ve taken an approximation of the apps used for the PoliCheck study. These applications were detected to have privacy leaks during dynamic analysis. I then took the smallest 400 apps (because Amandroid is quite slow) and distributed them into 16 buckets in round-robin order, so each bucket has about the same distribution of file sizes. I then created .zip files for each bucket: mp4-appset-[0-F].zip. There are 25 apps in each dataset. You should choose the app dataset that corresponds to the first hexadecimal digit of the SHA256 hash of your last name in lowercase. For example, I would choose mp4-appset-1.zip:

% echo "enck" | openssl sha256
1617c29f14346b3f61efc05bbc03d301e02716d73dfc61c8450c1176525695e8

In your report, specify the command you used and the output. If you decide to collaborate with someone who has a collision (same app dataset), email the TA and instructor for an exception.
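As a cross-check on the openssl command above, the same selection can be sketched in Python. This is only an illustration: the function name dataset_digit is hypothetical, and uppercasing the digit to match the [0-F] file naming is an assumption. Note that echo appends a trailing newline to its argument, so the sketch hashes the name plus "\n" by default to mirror that command.

```python
import hashlib


def dataset_digit(last_name: str, trailing_newline: bool = True) -> str:
    """Return the first hex digit of the SHA-256 hash of a lowercased last name.

    trailing_newline=True mirrors `echo "name" | openssl sha256`, because
    echo appends a newline before the text reaches openssl.
    """
    text = last_name.lower()
    if trailing_newline:
        text += "\n"
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Uppercasing is an assumption based on the mp4-appset-[0-F].zip naming.
    return digest[0].upper()


if __name__ == "__main__":
    name = "enck"  # replace with your own last name
    print(f"mp4-appset-{dataset_digit(name)}.zip")
```

If the Python result and the openssl result disagree, the trailing newline is the first thing to check.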

Sidenote: I ranked apps by the size of the .apk file. It would have been better to use the size of the (potentially multiple) .dex files inside the .apk, as some apps have smaller code sizes but more graphic and audio resources.
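For intuition, the round-robin distribution described in the Dataset paragraph can be sketched as follows. This is a minimal sketch, not the script actually used to build the datasets: sorting by size before dealing the apps out is an assumption, and the app names and sizes are made up.

```python
def round_robin_buckets(apps, num_buckets=16):
    """Deal (name, size_bytes) pairs into buckets like cards, smallest first.

    Sorting first means every bucket receives a similar mix of small and
    large apps, which is the property the assignment describes.
    """
    buckets = [[] for _ in range(num_buckets)]
    for index, app in enumerate(sorted(apps, key=lambda a: a[1])):
        buckets[index % num_buckets].append(app)
    return buckets


if __name__ == "__main__":
    # Hypothetical corpus of 400 apps with made-up sizes.
    apps = [(f"app{i:03d}.apk", 1_000_000 + 5_000 * i) for i in range(400)]
    buckets = round_robin_buckets(apps)
    print([len(b) for b in buckets])  # 400 apps over 16 buckets: 25 each
```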

Downloading App Datasets: Using your NCSU Google account, you can download the app datasets using the link posted to Moodle for Mini-Project 4.

Environment Setup

Amandroid is a static program analysis tool for Android applications. It works directly on .apk files, so you do not need the source code! This is because Android applications are written in Java and compiled into a special bytecode called Dalvik. Researchers have figured out how to retarget Dalvik bytecode back to Java bytecode while retaining most of the semantics. Therefore, it is much easier to perform static program analysis on Dalvik bytecode than on machine code (e.g., an ARM binary). Note that Amandroid cannot analyze native code (e.g., ARM binary libraries) in an Android application.

Amandroid builds a Data Dependence Graph (DDG) by first constructing a control flow graph and calculating points-to information for each instruction. There are different ways the DDG can be used to answer interesting program analysis questions. For this assignment, you will be using the DDG to perform a specific type of program analysis called taint analysis, also known as data-flow analysis. Conceptually, a taint analysis defines a set of taint sources and a set of taint sinks. The taint sources are code locations of information that you care about. The taint sinks are code locations of where you care that information goes. The taint analysis then tracks the propagation of the information from a taint source to a taint sink. Amandroid allows you to specify taint sources and sinks via a text configuration file, or to define your own custom source and sink manager. For this assignment, you only need to use the text file.
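To make the source/sink idea concrete, here is a toy sketch of taint tracking as reachability over a data dependence graph. It is not how Amandroid is implemented: the graph, the node names, and the function find_taint_paths are all hypothetical, and a real analysis works on instructions and points-to facts rather than hand-written strings.

```python
from collections import deque


def find_taint_paths(edges, sources, sinks):
    """Return one shortest path from each source to each sink it can reach.

    edges maps a node to the list of nodes its data flows into.
    """
    paths = []
    for source in sources:
        # Breadth-first search, remembering how each node was reached.
        parents = {source: None}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, []):
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        for sink in sinks:
            if sink in parents and sink != source:
                # Walk back from the sink to the source to rebuild the path.
                path, node = [], sink
                while node is not None:
                    path.append(node)
                    node = parents[node]
                paths.append(list(reversed(path)))
    return paths


if __name__ == "__main__":
    # Hypothetical data dependence graph for a tiny app.
    edges = {
        "deviceID()": ["id"],
        "id": ["payload"],
        "payload": ["send()"],
        "getLocation()": ["loc"],
        "loc": ["label"],  # shown in the UI only, never reaches a sink
    }
    for path in find_taint_paths(edges, ["deviceID()", "getLocation()"], ["send()"]):
        print(" -> ".join(path))
```

In this toy graph only the device identifier reaches the network sink; the location value never does, so no path is reported for it.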

Amandroid is an open source project. Its web page describes how to download, set up, and run the tool. Follow these instructions to set up your environment. We will be using Amandroid as a command-line tool.

Note: Pay attention to the command line options. Amandroid has several modules. You only need to worry about taint analysis for privacy-sensitive information. Options can be found using the command line tool. For example, java -jar argus-saf_***-version-assembly.jar t will show information about the latest options for taint mode.

The following command line options will be useful for this assignment.

Putting it all together results in the following:

java -Xmx8g -jar argus-saf_***-version-assembly.jar t -a COMPONENT_BASED -mo DATA_LEAKAGE -o /output/path /path/to/appname.apk

The output of this command will print intermediate results, but they can also be found in /output/path/appname/result/AppData.txt. In this file you will see a list of all detected sources, followed by a list of all detected sinks, and finally zero or more paths between a source and sink.

Note: You may want to write a script to process the output of Amandroid. Alternatively, you could explore how to extend Amandroid with your own custom output.

Note: You may wish to ignore some of the pre-defined taint sources or sinks. If you are not sure what an API is used for, check the developer documentation for Android. A custom source and sink list file can be specified in the Amandroid configuration file (-i option).

Question 1: High Level Statistics (50 points)

Run Amandroid on any 10 applications from your assigned dataset (you may wish to write a script to batch the execution). Note that Amandroid can consume a large amount of RAM for its analysis, so you will likely need to increase the amount of RAM that Amandroid may use (anticipate allocating 8-16 GB if possible). If your system only has 8 GB of memory, run Amandroid on the 10 smallest applications from your dataset. Most apps should finish in less than 10 minutes, but some may take upwards of an hour, depending on the hardware used for the analysis. You can stop Amandroid and move to another application if it takes over 1 hour to process the application.
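One possible way to batch the execution is sketched below. It reuses only the options shown in the command earlier in this document and stops each run after one hour. The jar path is passed in by you (the exact file name depends on the version you downloaded), the helper names are hypothetical, and treating a zero exit code as success is an assumption you should verify against the actual output.

```python
import subprocess
import sys
from pathlib import Path

TIMEOUT_SECONDS = 60 * 60  # give up on an app after one hour


def build_command(jar, apk, out_dir, heap="8g"):
    """Build the taint-analysis command line used in this assignment."""
    return ["java", f"-Xmx{heap}", "-jar", str(jar), "t",
            "-a", "COMPONENT_BASED", "-mo", "DATA_LEAKAGE",
            "-o", str(out_dir), str(apk)]


def run_batch(jar, apk_dir, out_dir, limit=10):
    """Run Amandroid on the `limit` smallest .apk files in apk_dir."""
    apks = sorted(Path(apk_dir).glob("*.apk"), key=lambda p: p.stat().st_size)
    results = {}
    for apk in apks[:limit]:
        try:
            proc = subprocess.run(build_command(jar, apk, out_dir),
                                  timeout=TIMEOUT_SECONDS)
            # Assumption: a zero exit code means the analysis completed.
            results[apk.name] = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
        except subprocess.TimeoutExpired:
            results[apk.name] = "timeout"
    return results


if __name__ == "__main__":
    if len(sys.argv) == 4:
        for name, status in run_batch(*sys.argv[1:4]).items():
            print(f"{name}: {status}")
    else:
        print("usage: python run_batch.py <jar> <apk_dir> <out_dir>")
```

Recording the status of every app in one place also gives you the raw material for Question 1.1.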

If Amandroid fails to run for fewer than 5 of your 10 applications, simply note this in your experimental results. If it does not run for 5 or more of your 10 applications, please contact the TA and instructor for guidance.

Note: Allocate additional memory to Amandroid using the -Xmx flag. For instance, to allocate 8 GB of RAM, use -Xmx8g. Leave at least 1 GB for the host operating system.

Note: As stated above, it may be helpful to write a small script to process the output of Amandroid.

Question 1.1 (10 points): Running Amandroid

In your report, describe your experience running Amandroid. Did the analysis complete successfully for every app, or for only a few? Include any other issues or experiences you encountered.

Question 1.2 (20 points): Categorizing Taint Sources and Sinks

For every app that completed successfully, create a table that aggregates the taint analysis findings. Your report should semantically group identified taint sources into the following categories: geographic location, microphone, and device identifiers. Your report should also semantically group identified taint sinks into the following categories: file, network. For each source and sink group, provide a count of the number of identified sources and sinks in each category on a per-app basis. If you wrote a short script to process the output of Amandroid, include the script in the report.
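The semantic grouping could be sketched with simple keyword matching, as below. This is one possible starting point, not a required method: the keyword lists are illustrative and incomplete, the "other" bucket is an addition for anything that does not match, and the example signature strings are written by hand rather than copied from real Amandroid output. Check the Android developer documentation before trusting any keyword.

```python
from collections import Counter

# Illustrative keyword lists (assumptions); extend them for your own dataset.
SOURCE_KEYWORDS = {
    "geographic location": ["LocationManager", "getLatitude", "getLongitude"],
    "microphone": ["AudioRecord", "MediaRecorder"],
    "device identifiers": ["getDeviceId", "getSubscriberId", "getSimSerialNumber"],
}
SINK_KEYWORDS = {
    "file": ["FileOutputStream", "FileWriter"],
    "network": ["java/net/", "org/apache/http", "HttpURLConnection"],
}


def categorize(signature, keyword_map):
    """Return the first category whose keyword appears in the signature."""
    for category, keywords in keyword_map.items():
        if any(keyword in signature for keyword in keywords):
            return category
    return "other"


def count_categories(signatures, keyword_map):
    """Count how many signatures fall into each category."""
    return Counter(categorize(s, keyword_map) for s in signatures)


if __name__ == "__main__":
    # Hand-written example signatures, not real tool output.
    sources = [
        "Landroid/telephony/TelephonyManager;.getDeviceId:()Ljava/lang/String;",
        "Landroid/media/AudioRecord;.read:([BII)I",
    ]
    print(dict(count_categories(sources, SOURCE_KEYWORDS)))
```

Review whatever lands in "other" by hand; some of those APIs may belong in one of the required categories.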

Next, creatively depict this information in a graph, showing the trends across all of the applications.

Question 1.3 (20 points): Identifying Taint Paths

For every app that completed successfully, construct a table that shows the number of data paths between source groups and sink groups using the categories defined in Question 1.2. Accompany the table with a short description offering insights and observations into the identified taint flows for each app. If you wrote a short script to process the output of Amandroid, include the script in the report.
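Once each taint path has been reduced to a (source group, sink group) pair, the per-app table can be built with a small counting helper. The sketch below assumes you have already produced those pairs yourself (for example, by hand or with your own script); the function names and the sample data are hypothetical.

```python
from collections import Counter

SOURCE_GROUPS = ["geographic location", "microphone", "device identifiers"]
SINK_GROUPS = ["file", "network"]


def path_matrix(paths):
    """Count paths per (source group, sink group); missing pairs count as 0."""
    counts = Counter(paths)
    return {src: {snk: counts[(src, snk)] for snk in SINK_GROUPS}
            for src in SOURCE_GROUPS}


def format_table(matrix):
    """Render the matrix as a fixed-width text table."""
    row_format = "{:<22}{:>8}{:>10}"
    lines = [row_format.format("source group", *SINK_GROUPS)]
    for src in SOURCE_GROUPS:
        lines.append(row_format.format(src, *(matrix[src][snk] for snk in SINK_GROUPS)))
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up paths for a single app.
    paths = [
        ("device identifiers", "network"),
        ("device identifiers", "network"),
        ("geographic location", "file"),
    ]
    print(format_table(path_matrix(paths)))
```

Keeping the zero cells in the table makes it easier to compare apps side by side.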

Next, creatively depict this information in a graph, showing the trends across all of the applications.

Question 2: Determining Privacy Violations (50 points + 10 bonus points)

The existence of a data flow from a privacy-sensitive source to a network sink does not necessarily imply a privacy violation. A privacy violation occurs when the user is not reasonably aware that the flow occurs. That flow may be obvious from the user interface (e.g., location is sent when the user clicks “find my location”) or from the description of the app (e.g., a map application is expected to send location). The flow might also be stated in a privacy policy or EULA shown when the app first loads.

For this question, you will pick two applications from Question 1 that have potential privacy violations. If you don’t think that any of your applications from Question 1 are good candidates for this question, contact the TA for alternative applications.

Question 2.1 (10 points): End User License Agreement and Privacy Policy

For each of your two applications, find the EULA and privacy policy on the developer website. What data does the app collect and/or process about you? How does the company handle the data? Is it retained for a period of time? Is any data about you marketed, sold, and/or shared with third parties?

Question 2.2 (20 points): Analyzing App Behavior

Download and install Android Studio and complete the first time setup. At the “Welcome to Android Studio” screen, click on the More Actions dropdown and select Virtual Device Manager. There should be a Pixel 3a virtual device already set up. Click on the play button under actions to start the Android virtual device. After the device has started up, drag and drop the two APKs from the data set listed above into the virtual device.

For each application, launch the app and navigate through the user interface. Did the application request any personal data? List the type of data, the action you were performing in the app, and whether you believe the collection of the personal data was justified. If you believe the collection was not justified, explain why you believe the collection of that data is a privacy violation. Remember: If the flow was stated in the privacy policy, the flow is not a privacy violation.

Note: These apps should be considered as untrusted. Do NOT sign up or sign in to any app in your dataset. If an app requests you to sign in to use it at all, just note it in your report and move on to the next app.

Question 2.3 (20 points): Analyzing the taint analysis

Compare your results above with the taint analysis report generated by Amandroid. Are there taint paths that indicate a possible privacy violation? How easy is it to identify which taint path corresponds with which part of the app?

Question 2.4 (10 bonus points): Analysis of Source Code

For each app you selected (5 points per app), disassemble the app using one of the decompilers listed below. Using the provided taint analysis results from Amandroid, locate at least one network flow in the source code and comment on your findings in the report.

Here are a few Android application decompilers / disassemblers to consider: