Mini-Project 4: Privacy #
The goal of this project is to gain a deeper understanding of mobile application privacy and to develop skills for empirical investigation.
Specifically, students will use the Amandroid (also called Argus-SAF) static program analysis tool to identify privacy leaks in Android applications.
For example, it can identify whether privacy-sensitive information from the deviceID() source API flows to a network send() sink API. Amandroid will be applied to a small corpus of applications, and a report will classify and describe the findings.
Points: Mini-Project 4 has a maximum of 100 points and an additional 10 bonus points.
Due: Mini-Project 4 is due on Tue, Dec 3 at 11:59pm.
Collaboration:
- You may not collaborate on this mini-project. The project should be done individually.
- You may search the Internet for help, but you may not copy (either copy-and-paste or manual typing) code from another source without proper attribution.
- Students are encouraged to discuss the project with each other. Program analysis tools can sometimes be tricky to get working, and class discussion will facilitate understanding. The only rule is that the “findings” should be unique. One way to ensure uniqueness is for collaborating students to work on different sets of applications.
- If you discuss the project or any insights with another student, you must list their name and their contribution to your work (e.g., “helped me with getting Amandroid running on macOS for Q1.1”) on your solution PDF in a clearly denoted “Acknowledgements” section.
Posting Solutions: You are explicitly forbidden from posting your solution in a public forum (e.g., GitHub). If you need to share your solution as part of a job interview, you should create a private repository and grant that individual access. Please ask the instructor if you have any questions or concerns.
Dataset: The dataset is drawn from a May 2019 snapshot of the top 500 most popular free apps in the Google Play Store, which approximates the set of apps used for the PoliCheck study.
These applications were detected to have privacy leaks during dynamic analysis.
The smallest 400 apps (chosen because Amandroid is quite slow) were distributed into 16 buckets in round-robin order, so each bucket has about the same distribution of file sizes. Each bucket was packed into its own .zip file: mp4-appset-[0-F].zip.
There are 25 apps in each dataset.
You should choose the app dataset that corresponds to the first hexadecimal digit of the SHA2-256 hash of your last name in lowercase followed by the current year (with no trailing newline). For example, I would choose mp4-appset-0.zip:
% echo -n "enck2024" | openssl sha256
09338b2151ccc7426e7d5a83b94d1c26ff6304f929b30fcedb496fc1317927bd
In your report, specify the command you used and the output. If you exchange information with someone who has a collision (same app dataset), clearly state this and the exchanged information in your “Acknowledgements” section.
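If you want to cross-check the openssl result, the selection rule can be sketched in Python as below. This is only a convenience check under stated assumptions: the function names are made up for illustration, and the bucket letter is uppercased on the assumption that the file names follow the mp4-appset-[0-F].zip pattern literally. The report should still show the command you ran and its output.

```python
import hashlib


def first_hex_digit(text: str) -> str:
    """Return the first hex digit of the SHA-256 digest of text (UTF-8, no newline)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[0]


def appset_name(last_name: str, year: int) -> str:
    """Map a last name and year to a dataset file name (hypothetical helper)."""
    # Lowercase last name followed by the year, as described above.
    key = f"{last_name.lower()}{year}"
    # ASSUMPTION: bucket letters in the file names are uppercase ([0-F]).
    return f"mp4-appset-{first_hex_digit(key).upper()}.zip"


if __name__ == "__main__":
    # Should agree with the first digit printed by the openssl command above.
    print(appset_name("enck", 2024))
```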
Downloading App Datasets: Using your NCSU Google account, you can download the app datasets using the link posted to Moodle for Mini-Project 4.
Submission #
What to submit: This assignment should be submitted to GradeScope. The submission should include a PDF document that provides your written answers to the questions for each part.
Introduction and Setup #
Amandroid is a static program analysis tool for Android applications.
It works directly on .apk files, so there is no need to have the source code!
This is because Android applications are written in Java and compiled into a special bytecode called Dalvik.
Researchers have figured out how to retarget Dalvik bytecode back to Java bytecode while retaining most of the semantics.
Therefore, it is much easier to perform static program analysis on Dalvik bytecode than machine code (e.g., an ARM binary).
Note that Amandroid cannot analyze native code (e.g., ARM binary libraries) in an Android application.
Amandroid builds a Data Dependence Graph (DDG) by first constructing a control flow graph and calculating points-to information for each instruction. There are different ways the DDG can be used to answer interesting program analysis questions. For this assignment, you will be using the DDG to perform a specific type of data-flow analysis called taint analysis. Conceptually, a taint analysis defines a set of taint sources and a set of taint sinks. The taint sources are the code locations that produce information you care about. The taint sinks are the code locations where you care whether that information arrives. The taint analysis then tracks the propagation of information from a taint source to a taint sink. Amandroid allows you to specify taint sources and sinks via a text configuration file, or to define your own custom source and sink manager. For this assignment, you only need to use the text file.
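To make the source-to-sink idea concrete, here is a toy sketch that treats a dependence graph as a plain dictionary and reports a path whenever a sink is reachable from a source. It is not how Amandroid is implemented; the graph, node names, and function name are invented for illustration, and a real DDG also carries context and points-to information.

```python
from collections import deque


def find_taint_paths(edges, sources, sinks):
    """Return one shortest path for every (source, sink) pair that is connected.

    edges maps a node to the nodes that depend on its data.
    """
    paths = []
    for src in sources:
        # Breadth-first search from the source, remembering each node's parent.
        parent = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, []):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        # Walk the parent links backwards from each reachable sink.
        for sink in sinks:
            if sink in parent:
                path, node = [], sink
                while node is not None:
                    path.append(node)
                    node = parent[node]
                paths.append(path[::-1])
    return paths


# Hypothetical toy graph: the device ID reaches send(), the timestamp does not.
toy_ddg = {
    "deviceID()": ["id"],
    "id": ["msg"],
    "msg": ["send()"],
    "currentTime()": ["stamp"],
    "stamp": ["writeLog()"],
}
toy_paths = find_taint_paths(toy_ddg, sources=["deviceID()"], sinks=["send()", "writeLog()"])
print(toy_paths)
```

Only the deviceID() to send() flow is reported, because writeLog() depends solely on data that was never marked as a source.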
Amandroid is an open source project. Its web page describes how to download, set up, and run the tool. Follow these instructions to set up your environment. We will be using Amandroid as a command-line tool.
Note: pay attention to the command line options.
Amandroid has several modules.
You only need to worry about taint analysis for privacy sensitive information.
Options can be found using the command line tool.
For example, java -jar argus-saf_***-version-assembly.jar t will show information about the latest options for taint mode.
The following command line options will be useful for this assignment:
- java -Xmx8g -jar ...: The -Xmx option tells Java how much RAM to allocate to the analysis. For example, -Xmx8g specifies 8 GB of RAM. It is recommended to set this as high as your hardware can reasonably support to increase the chance that taint analysis will complete successfully. Leave at least 1 GB for the host.
- -mo: You will need to specify the type of taint analysis you wish to perform. For this project, you are interested in the DATA_LEAKAGE analysis.
- -i: You may wish to define a custom configuration file (e.g., to define a custom source and sink list file).
- -o: Specify an output path for this analysis. You will want to use different output paths for each app.
- -a COMPONENT_BASED: This option was added to handle Inter Component Communication (ICC) in a more scalable way. The documentation suggests that if you use this approach, your configuration should turn off resolve_icc. I expect that Amandroid will perform the analysis of each component separately, and then attempt to connect ICC using the program graphs created for each component. Hence, using the -a COMPONENT_BASED option should make the analysis take less RAM and finish faster. As project documentation does not always keep up with code, some experimentation may be needed here.
Putting it all together results in the following:
java -Xmx8g -jar argus-saf_***-version-assembly.jar t -a COMPONENT_BASED -mo DATA_LEAKAGE -o /output/path /path/to/appname.apk
This command prints intermediate results as it runs, and the results can also be found in /output/path/appname/result/AppData.txt. In this file you will see a list of all detected sources, followed by a list of all detected sinks, and finally zero or more paths between a source and sink.
Note: You may want to write a script to process the output of Amandroid. Alternatively, you could explore how to extend Amandroid with your own custom output.
Note: You may wish to ignore some of the pre-defined taint sources or sinks. If you are not sure what an API is used for, check the developer documentation for Android. A custom source and sink list file can be specified in the Amandroid configuration file (-i option).
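As a starting point for processing AppData.txt, the sketch below groups the lines of a report under whichever section header was seen last. The header strings in HEADERS and the sample text are assumptions made for illustration; open one of your own AppData.txt files and copy the exact header lines before relying on the counts. Also note that a taint path usually spans several lines, so a line count is not a path count.

```python
def split_sections(text, headers):
    """Group non-empty lines under the most recent header line seen.

    headers is a list of exact header strings (compared after stripping
    whitespace). Lines that appear before the first header are ignored.
    """
    sections = {h: [] for h in headers}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in sections:
            current = line
        elif current is not None:
            sections[current].append(line)
    return sections


# ASSUMPTION: placeholder header strings; replace with the ones in your output.
HEADERS = ["Sources found:", "Sinks found:", "Discovered taint paths are listed below:"]

# Made-up sample text standing in for a real report.
sample = """\
Sources found:
  source-entry-1
  source-entry-2
Sinks found:
  sink-entry-1
Discovered taint paths are listed below:
"""

section_sizes = {h: len(lines) for h, lines in split_sections(sample, HEADERS).items()}
print(section_sizes)
```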
Task 1: High-Level Statistics #
(50 points)
Run Amandroid on any 10 applications from your assigned dataset (you may wish to write a script to batch the execution). Note that Amandroid can consume a large amount of RAM for its analysis, so you will likely need to increase the amount of RAM that Amandroid may use (anticipate allocating 8-16 GB if possible). If your system only has 8 GB of memory, run Amandroid on the 10 smallest applications from your dataset. Most apps should finish in less than 10 minutes, but some may take upwards of an hour, depending on the hardware used for the analysis. You can stop Amandroid and move to another application if it takes over 1 hour to process the application.
- If Amandroid completes successfully for more than 5 of your 10 applications (i.e., only a handful of applications fail), simply note the failures in your experimental results.
- If it fails for 5 or more of your 10 applications, please contact the TA and instructor for guidance (likely an alternative app set).
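One way to batch the runs is sketched below. It builds the same command line shown earlier, picks the smallest .apk files in a directory, and stops any run that exceeds one hour. The function names, directory layout, and the use of the exit code as a success signal are assumptions; the jar path is whatever assembly jar you downloaded. Checking that AppData.txt was actually written is a more reliable success test than the exit code.

```python
import subprocess
from pathlib import Path


def build_command(jar, apk, out_dir, heap="8g"):
    """Assemble the taint-analysis command line used in this assignment."""
    return ["java", f"-Xmx{heap}", "-jar", str(jar), "t",
            "-a", "COMPONENT_BASED", "-mo", "DATA_LEAKAGE",
            "-o", str(out_dir), str(apk)]


def smallest_apks(apk_dir, count=10):
    """Return the count smallest .apk files in apk_dir, smallest first."""
    apks = sorted(Path(apk_dir).glob("*.apk"), key=lambda p: p.stat().st_size)
    return apks[:count]


def run_one(jar, apk, out_root, heap="8g", timeout_s=3600):
    """Run Amandroid on one app; give up after timeout_s seconds."""
    out_dir = Path(out_root) / Path(apk).stem  # separate output path per app
    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        proc = subprocess.run(build_command(jar, apk, out_dir, heap), timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    # ASSUMPTION: a non-zero exit code indicates a failed analysis.
    return "ok" if proc.returncode == 0 else f"exit {proc.returncode}"


# Example use (paths are hypothetical):
#   for apk in smallest_apks("mp4-appset-0"):
#       print(apk.name, run_one("path/to/assembly.jar", apk, "results"))
```

Recording the returned status for each app gives you the raw material for Question 1.1.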
Note: Allocate additional memory to Amandroid using the -Xmx flag. For instance, to allocate 8 GB of RAM, use -Xmx8g. Leave at least 1 GB for the host operating system.
Note: As stated above, it may be helpful to write a small script to process the output of Amandroid.
Question 1.1: Running Amandroid #
(10 of the 50 points)
In your report, describe your experience running Amandroid. Did the analysis complete successfully for every app, or for only a few? Include any other issues or experiences you encountered.
Question 1.2: Categorizing Taint Sources and Sinks #
(20 of the 50 points)
For every app that completed successfully, create a table that aggregates the taint analysis findings.
- Your report should semantically group identified taint sources into the following categories: geographic location, microphone, and device identifiers.
- Your report should also semantically group identified taint sinks into the following categories: file, network.
- For each source and sink group, provide a count of the number of identified sources and sinks in each category on a per-app basis.
If you wrote a short script to process the output of Amandroid, include the script in the report.
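A simple way to do the semantic grouping is substring matching on the API named in each source or sink entry, as in the sketch below. The keyword lists are assumptions and deliberately incomplete: they use Android and Java class names in the slash-separated form that the example strings assume, and some classes are ambiguous (for example, MediaRecorder can also record video). Review every entry that lands in "other" by hand and extend the rules to fit what your apps actually report.

```python
from collections import Counter

# ASSUMPTION: starter keyword lists; extend them after inspecting your results.
SOURCE_RULES = {
    "geographic location": ["Landroid/location/"],
    "microphone": ["Landroid/media/AudioRecord", "Landroid/media/MediaRecorder"],
    "device identifiers": ["Landroid/telephony/TelephonyManager"],
}
SINK_RULES = {
    "file": ["Ljava/io/FileOutputStream", "Ljava/io/FileWriter"],
    "network": ["Ljava/net/", "Lorg/apache/http/"],
}


def categorize(entry, rules):
    """Return the first category whose keyword occurs in entry, else 'other'."""
    for category, keywords in rules.items():
        if any(keyword in entry for keyword in keywords):
            return category
    return "other"


def count_categories(entries, rules):
    """Count how many entries fall into each category."""
    return Counter(categorize(entry, rules) for entry in entries)


# Illustrative entries only; real entries come from your Amandroid output.
demo_sources = [
    "Landroid/telephony/TelephonyManager;.getDeviceId:()Ljava/lang/String;",
    "Landroid/location/LocationManager;.getLastKnownLocation:(Ljava/lang/String;)Landroid/location/Location;",
    "Landroid/content/pm/PackageManager;.getInstalledPackages:(I)Ljava/util/List;",
]
demo_counts = count_categories(demo_sources, SOURCE_RULES)
print(demo_counts)
```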
Next, creatively depict this information in a graph, showing the trends across all of the applications.
Question 1.3: Identifying Taint Paths #
(20 of the 50 points)
For every app that completed successfully, construct a table that shows the number of data paths between source groups and sink groups using the categories defined in Question 1.2. Accompany the table with a short description offering insights and observations into the identified taint flows for each app. If you wrote a short script to process the output of Amandroid, include the script in the report.
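Once each taint path has been reduced to a (source category, sink category) pair, the per-app table is a small counting exercise. The sketch below assumes you have already produced those pairs by some means; the function names and the plain-text layout are illustrative choices rather than a required format.

```python
from collections import Counter

SOURCES = ["geographic location", "microphone", "device identifiers"]
SINKS = ["file", "network"]


def tally_paths(pairs):
    """Build a source-by-sink count matrix from (source, sink) category pairs."""
    counts = Counter(pairs)
    return [[counts[(src, sink)] for sink in SINKS] for src in SOURCES]


def format_table(matrix):
    """Render the matrix as aligned plain text, one row per source category."""
    width = max(len(src) for src in SOURCES)
    lines = [" " * width + "".join(f"{sink:>10}" for sink in SINKS)]
    for src, row in zip(SOURCES, matrix):
        lines.append(f"{src:<{width}}" + "".join(f"{n:>10}" for n in row))
    return "\n".join(lines)


# Made-up pairs for a single hypothetical app.
demo_pairs = [
    ("device identifiers", "network"),
    ("device identifiers", "network"),
    ("geographic location", "file"),
]
demo_matrix = tally_paths(demo_pairs)
print(format_table(demo_matrix))
```

The same matrix can be fed to a plotting library for the graph, for example as one stacked bar per app.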
Next, creatively depict this information in a graph, showing the trends across all of the applications.
Task 2: Determining Privacy Violations #
(50 points + 10 bonus points)
The existence of a data flow from a privacy-sensitive source to a network sink does not necessarily imply a privacy violation. A privacy violation occurs when the user is not reasonably aware that the flow occurs. That flow may be obvious from the user interface (e.g., location is sent when the user clicks “find my location”) or from the description of the app (e.g., a map application is expected to send location). The flow might also be stated in a privacy policy or EULA shown when the app first loads.
For this task, you will pick two applications from Task 1 that have potential privacy violations. If you don’t think that any of your applications from Task 1 are good candidates, contact the TA for alternative applications.
Question 2.1: End User License Agreement and Privacy Policy #
(10 of the 50 points)
For each of your two applications, find the EULA and privacy policy on the developer website. What data does the app collect and/or process about you? How does the company handle the data? Is it retained for a period of time? Is any data about you marketed, sold, and/or shared with third parties?
Question 2.2: Analyzing App Behavior #
(20 of the 50 points)
Download and install Android Studio and complete the first time setup. At the “Welcome to Android Studio” screen, click on the More Actions dropdown and select Virtual Device Manager. There should be a Pixel 3a virtual device already set up. Click on the play button under actions to start the Android virtual device. After the device has started up, drag and drop the two APKs from the data set listed above into the virtual device.
For each application, launch the app and navigate through the user interface. Did the application request any personal data? List the type of data, the action you were performing in the app, and whether you believe the collection of the personal data was justified. If you believe the collection was not justified, explain why you believe the collection of that data is a privacy violation. Remember: If the flow was stated in the privacy policy, the flow is not a privacy violation.
Note: These apps should be considered as untrusted. Do NOT sign up or sign in to any app in your dataset. If an app requests you to sign in to use it at all, just note it in your report and move on to the next app.
Question 2.3: Analyzing the Taint Analysis #
(20 of the 50 points)
Compare your results above with the taint analysis report generated by Amandroid. Are there taint paths that indicate a possible privacy violation? How easy is it to identify which taint path corresponds with which part of the app?
Question 2.4: Analysis of Source Code #
(10 bonus points)
For each app you selected (5 points per app), disassemble the app using one of the decompilers listed below. Using the provided taint analysis results from Amandroid, locate at least one network flow in the source code and comment on your findings in the report.
Here are a few Android application decompilers / disassemblers to consider: