[Augmented Reality with Unity & Azure Spatial Anchors Series - Part 1] - Introduction

Jim McG
4 May 2020  
 17 min read

This is part one of a nine-part series that explores how to create a Unity application for a mobile device, using Augmented Reality and Azure Spatial Anchors.

  • In Part 1 we introduce the project, requirements, technologies and various AR concepts.
  • In Part 2 we follow the steps in the project QuickStart guide.
  • In Part 3 we create a basic Unity ARFoundation application.
  • In Part 4 we look at one of the building blocks of AR - the “Placement Cursor”.
  • In Part 5 we explore how the UI for the project works.
  • In Part 6 we learn how Azure Spatial Anchors can be used in our project.
  • In Part 7 we step up our game by looking at how we can add and move virtual scenery.
  • In Part 8 we develop functionality that saves and restores virtual scenery to the cloud.
  • In Part 9 we learn about the Azure Functions backend for the solution.

The entire project sourcecode that accompanies this series can be found at:

What are we doing?

In this series of articles, we’re going to work through a moderately complex - but thoroughly cool - project that demonstrates how to use Augmented Reality (AR) on a mobile device.

We’ll demonstrate how we can associate 3D scenery with a real-world location and then share that experience across multiple devices.

Over the course of this series, we’re going to create an application that:

  • Uses AR technologies to locate and position our mobile device within the real world.
  • Gives us tools to place and reposition 3D “scenery” objects within the “virtual mirror” world.
  • Has functionality to save and retrieve both the real-world “anchor” and a selection of “virtual scenery” objects to the cloud.

This will allow us to return to the same physical location and observe the virtual scenery - either at a later date or using a completely different device.

The demo application will be created using Unity with ARFoundation, backed by Azure Functions and Azure Spatial Anchors (ASA) for cloud-based AR persistence.

example AR set in woodland setting


Augmented Reality (AR) is a technology that mixes 3D graphics with the real world. If you’ve seen Niantic’s mobile phone game “Pokémon Go”, you’ll have some idea of how the technology works from a consumer perspective.

AR is a seriously exciting technology, which we can reasonably expect to see a great deal more of in the future. From a developer’s perspective, it’s an engaging technology to work with and a subject that may well be worth investing our time in learning.

Long term, it’s not unimaginable to foresee a future where we all have devices reminiscent of 2014’s Google Glass - unobtrusive headsets/glasses that overlay graphics directly into our field of vision.

But for now, until HoloLens-grade technology becomes significantly smaller and less expensive, mobile devices in their current format (phones, tablets, etc.) will continue to be the mainstay of consumer AR technology.

Contemporary Augmented Reality (AR) encompasses a number of different hardware devices, technologies and techniques.

Early uses of AR tended to revolve around the recognition of static images. It was common to find examples where the camera of a mobile device was pointed at a specific image, such as a dollar note, a poster or perhaps a QR code. When the image was recognised, its position could be tracked and typically a virtual 3D model was superimposed over the image.

World tracking

More recently, both iOS and Android have offered AR APIs that provide world tracking.

This works by combining input from various hardware sensors to recognise and map real-world environments. Once an environment has been recognised and tracked, a mobile device can be localised very accurately within it.

In addition to world tracking, there are other AR APIs that can recognise people’s faces or bodies and track their movements.

Note: If you’ve ever seen novelty apps that use a selfie-cam to create real-time emojis or superimpose makeup or dog-ears onto a video of yourself - this is the underlying technology that drives those apps.

Once scenes, objects, or people have been recognised and tracked, virtual 3D models can be overlaid onto the camera image. All of this is done with impressive tracking accuracy.

This technology essentially works by identifying points of contrast in the video images and combining this with input from other hardware sensors, including compass and accelerometer sensors. In the most recent devices, LiDAR/ToF sensors provide additional data with which to map the physical environment.

With this input, the AR platforms can infer and generate a 3D point-cloud representation - known as a “sparse point cloud” - of the world that they have observed, which is stored in local memory.

Matching the points in this internal map against the feature-points generated in real-time by the camera gives the device a way to recognise spaces and to precisely localise its position and orientation within them.

Note: It’s worth clarifying that AR locating technology is completely unrelated to GPS location. GPS uses satellite signals and is only effective where it has a good line-of-sight to the sky (i.e. not much use indoors). Its accuracy is typically at multiple-metre scale.

By contrast, AR positioning is a visual-processing technology, which means that it can only work with what the camera sees and recognises. Unlike GPS, however, it can be significantly more precise, with accuracy typically at centimetre scale.

clipart of boat anchor

What are Spatial Anchors?

With a local real-world environment mapped out, we can identify a single point within the 3D space to be used as a reference point.

These points are called a number of things including “Reference Points” and “AR Anchors” … but recently the industry seems to have settled on naming them “Spatial Anchors”.

Usually, the 3D point cloud exists only locally and for the lifecycle of the application.

Note: To introduce some terminology, this lifecycle of the application is called an AR Session.

In order for a mobile device to recognise the same physical location at a later point in time, the point-cloud needs to be persisted to storage.

For a self-contained application, this could be local device storage - but if we want multiple users, with multiple devices, to share and collaborate in the same physical space, the point-cloud needs to be persisted in a cloud-storage service, which then lets us share this data between devices.

It is this role that Microsoft’s Azure Spatial Anchors (ASA) fulfils.

ASA is a cloud-based service supplied with application-specific SDKs. It is a relatively new service, first announced in February 2019.

At the time of writing, we have an SDK for use with the Unity game engine.

If you would like to see an engaging example of Azure Spatial Anchors in use, check out this amazing video, which demonstrates the upcoming game Minecraft Earth and a shared experience across several devices.


This series is not intended for absolute beginners - but beginners should still be ok to follow along, even if still fairly new to Unity and .NET.

These articles assume that we already have at least some familiarity using:

  • C#.
  • .NET Core for server application development.
  • Unity 2019 for mobile device application development.
  • An IDE (such as Visual Studio or JetBrains Rider).

For the backend part of this project, we assume that we have:

For the Unity part of this project, we assume that we have:

We need a mobile device that supports:

  • Android - ARCore API. Minimum hardware: API level 24 - Nougat (~2016 onwards)
  • iOS - ARKit API. Minimum hardware: must have an A9 processor or higher - iPhone 6S (~2015 onwards)

The demo project was created and tested by using the following software versions:

example AR set in woodland setting

What are the technologies that we’ll be using in this project?

For the mobile device application, we’ll be using the following:

  • Unity - game engine - a platform for making 3D applications:

    • it is mature (first appeared in 2005) and well supported.
    • has support for many different platforms, of which mobile device support is of specific relevance in this series.
    • applications coded in C# use Mono (the original cross-platform .NET implementation).
  • ARFoundation - this is an API wrapper layer provided by Unity, providing a common way to write applications for both Android & iOS. It builds on top of the device-native APIs: ARCore and ARKit.

    ARFoundation provides APIs for:

    • World tracking: track the device’s position and orientation in physical space (uses a combination of device sensors - cameras, compass, accelerometers, etc)
    • Plane detection: detect horizontal and vertical surfaces (e.g. floors, tabletops, walls)
    • Point clouds: the physical world is tracked by points of contrast in the image - also known as feature points.
    • SpatialAnchor (aka Reference points): an arbitrary position and orientation that the device tracks. If we can identify a fixed point in the real-world to be used as a reference, we can then associate 3D models with this point. This means that 3D models can always be positioned in the expected place in the real world.

Although we won’t be using the following features in this particular project, we may find it useful to know that ARFoundation also supports:

  • Light estimation: estimates for average colour temperature and brightness in physical space. (iOS supports colour, Android is monochromatic)
  • Environment probes: a way to use information from the camera with 3D models (e.g. a reflection of the physical world on a metallic-looking 3D object).
  • Face tracking: detect and track human faces.
  • Image tracking: detect and track 2D images (this is what people are often familiar with from demos where a 3D model appears on top of a dollar bill, a QR code, a poster on a wall, etc.)
  • Object tracking: detect real-world objects.

For the cloud-based back-end of this project, we’ll be using:

  • Azure Spatial Anchors (ASA) - a cloud-based SaaS provided by Microsoft. It is used to store and retrieve spatial-point-cloud information. It provides a way to allow multiple devices to share precise location data and collaborate in the same physical space.

  • Azure Functions - a cloud-based “serverless” PaaS. Specifically, we will be using HTTP triggers to provide a RESTful backend service.

    • The service will have one API to store and retrieve a simple ID value
    • The service will have a second API to store and retrieve serialised information about the 3D positioning of virtual scenery. The data will be serialised in the JSON format.
  • Azure Blob Storage - a cloud-based PaaS, which we will be using to store simple text files generated by our APIs.
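
To make the second API a little more concrete, here is a minimal sketch of what serialising scenery placement to JSON could look like. It is written in plain Python rather than Unity C# so that it runs without the engine, and the type and field names (SceneryItem, prefab, local_position, local_yaw_degrees) are invented for illustration; they are not the format used by the project.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SceneryItem:
    """Hypothetical record for one piece of virtual scenery."""
    prefab: str                # illustrative name of the 3D model to spawn
    local_position: tuple      # (x, y, z) in metres, relative to the anchor
    local_yaw_degrees: float   # rotation about the vertical axis, relative to the anchor


def serialise_scenery(items):
    """Turn a list of SceneryItem objects into a JSON string."""
    return json.dumps([asdict(item) for item in items])


def deserialise_scenery(payload):
    """Rebuild the list of SceneryItem objects from a JSON string."""
    return [
        SceneryItem(
            prefab=entry["prefab"],
            # JSON has no tuple type, so lists are converted back to tuples.
            local_position=tuple(entry["local_position"]),
            local_yaw_degrees=entry["local_yaw_degrees"],
        )
        for entry in json.loads(payload)
    ]
```

A payload like this is small enough to be stored as a simple text file, which matches the way Blob Storage is described above.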

About the code in this demo

The code in this project should be considered as “demoware”.

In an attempt to keep the examples clearer and shorter, things such as error handling, logging or unit testing etc. have been purposefully excluded.

The code has been written in a way that always assumes an “execution happy path” and is not intended to exemplify any best practices in software development.

clipart of documents

Primary documentation

Unity AR Foundation

Azure Spatial Anchors

The sample code, above, contains everything we need to know. The Microsoft Mixed Reality team has created it to exemplify a number of use cases, such as different device clients (e.g. HoloLens and Android) and different aspects of the overall ASA service (e.g. coarse location). The only catch is that its broader scope requires a little more “cherry-picking effort” to get to just the bits that are of interest to us in our own project.

Azure Functions and Storage

Other resources

Finally, I’ve authored other articles on this blogsite, which may also be useful references:

Azure Spatial Anchors (ASA) is in Preview

At the time of writing, Azure Spatial Anchors is a preview technology and Microsoft has not yet fully released it as a supported commercial service.

In choosing to explore and experiment with this exciting new technology, we have to appreciate that this is potentially an unrefined experience and that we should set our expectations accordingly:

  • The service is currently free, but this may change in the future.
  • The FAQ documentation identifies that the API has throttling constraints.
  • The SDK and service may be fragile and/or subject to breaking changes.
  • Although there is already a large amount of documentation produced by Microsoft, we should be respectful that the docs are relatively new and may still need to expand and mature.

Note: A knock-on effect of working with fast-moving technology is that this blog series is itself likely to become outdated quickly - so check the publication date and adapt as needed, using official documentation and resources!

Blog Update 22 May: During the Build 2020 conference in May, it was announced onstage that Azure Spatial Anchors is now GA. According to the updated website, pricing details will be made available on 1 July 2020.

clipart woman painted face

AR Concepts

We should take a few minutes to learn about some of the various concepts that are relevant to this demo.

AR Sessions

  • When we start the app, a new native AR world-tracking Session is started.
  • We extend the native tracking session with an Azure Spatial Anchor session and instruct the ASA SDK to make the necessary online connections (using credentials that we provide to the “ASA Manager” Unity component)
  • Upon application startup, the device immediately begins tracking its physical position and orientation in the real world, using a combination of available hardware sensors (e.g. camera, compass, accelerometer, gyroscope, ToF, etc.).

This happens opaquely to application developers; we never get involved with, or get to see, the complexity that is happening behind the scenes in the ARCore/ARKit layers.

  • The device analyses each frame of video from the camera, looking for points of high-contrast (called feature-points).
  • As we move the device around the physical space, it constructs an in-memory point-cloud 3D map of these feature-points. As we continue to move our device around, the internal map is expanded, updated and refined.
    • This is stored internally and is managed opaquely by the ARFoundation/ARCore/ARKit layers.
  • In practice, we need to move our mobile device around the scene for best results.
    • Moving side-to-side creates a parallax effect that helps the platform perceive 3D depth from just a single camera.
    • We will get better results if we “move in an orbit” around a point of interest, whilst facing it.
    • Simply standing in one place and rotating about a fixed point, as if to take a panoramic photo, may not be useful when mapping the environment.
  • The device can track its position, relative to this internal 3D map, as we physically move away from where we started.
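
The parallax point above can be illustrated with the standard two-view triangulation relationship. This is only a sketch of the geometry, assuming an idealised pinhole camera that moves purely sideways between two frames; it is not how ARCore or ARKit are implemented, and the function and parameter names are made up for this example.

```python
def depth_from_parallax(focal_length_px, baseline_m, disparity_px):
    """Estimate the distance to a feature point seen in two frames.

    focal_length_px: camera focal length, in pixels (assumed known)
    baseline_m:      how far the camera moved sideways between frames, in metres
    disparity_px:    how far the feature appeared to shift in the image, in pixels

    Returns the estimated depth in metres, or None when there is no
    sideways movement or no measurable shift to triangulate from.
    """
    if baseline_m <= 0 or disparity_px <= 0:
        return None
    return focal_length_px * baseline_m / disparity_px
```

With these assumptions, a larger sideways movement produces a larger shift for the same feature, which makes the depth estimate less sensitive to small measurement errors. With no sideways movement at all (a baseline of zero), there is nothing to triangulate from, which is one way to see why rotating on the spot is of limited use.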

Tip: During experimentation at my local park, I found that I could easily walk about 50 metres away, turn around and see my 3D objects displaying “in the distance” just fine.

Tracking accuracy does degrade at that range - we can expect to see “scenery drift”. This is because the device is estimating its location relative to the anchor.

However, when we walk back to the anchor, the device will eventually recognise the anchor and “re-localise”. We may see this manifest with “scenery snapping back into the correct position”.

sparse point cloud


AR Coordinates

It is really, really important to understand the 3D coordinate system of an AR Session:

  • The origin of the 3D world (i.e. coordinates 0,0,0) is based on the position and orientation (in Unity this is collectively called a Pose) of the mobile device - at the moment that we started the current AR Session.
    • This means that we should always expect the coordinate origin of the virtual world to correspond to a different real-world Pose, for each and every AR Session!
    • To clarify, ARKit and ARCore normally align the “y” axis of the virtual world with gravity, so “Up” stays consistent - but the position of the origin and the directions of the “x” and “z” axes will almost certainly differ between sessions.
    • If we have an experience that is shared across multiple devices, they will all have different real-world positions with coordinate systems that do not correlate.

To help visualise this:

  • Imagine that we start the app and then walk around the room.
  • We then send a command to draw a new cube at coordinates (0,0,0). We deliberately leave the cube with a default orientation.
  • If we were to look around the room using the app, we would expect to find the cube floating in the exact spot where our mobile device was when the app started.

    … in fact, the cube would probably be turned to face whichever way we happened to be pointing our phone at the start!

  • Now that we understand that the coordinate origin is never consistent, we need to realise that we cannot save the Pose data of any “scenery objects” by using absolute coordinates.
  • Instead, “scenery objects” need to be associated with the spatial-anchor GameObject, as this is the only object in the virtual world that can be expected to have a Pose that is consistent with the real world between AR Sessions.

    • “Scenery objects” are added as children of the spatial-anchor object.
    • When saving coordinate information to the cloud, we need to use coordinates that are relative to the parent anchor - and not global coordinates.
  • An AR Session only exists for the duration of our use. When we stop the app, everything is lost and we need to start over.
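
The idea of saving anchor-relative coordinates can be sketched with a little vector maths. The example below is plain Python rather than Unity C#, and it makes a deliberate simplification: the anchor's orientation is reduced to a single yaw angle (rotation about the vertical axis), whereas a real Pose carries a full 3D rotation. The function names are illustrative only.

```python
import math


def world_to_anchor_local(anchor_pos, anchor_yaw, world_pos):
    """Express a world-space point relative to the anchor (yaw in radians)."""
    dx = world_pos[0] - anchor_pos[0]
    dy = world_pos[1] - anchor_pos[1]
    dz = world_pos[2] - anchor_pos[2]
    c, s = math.cos(anchor_yaw), math.sin(anchor_yaw)
    # Undo the anchor's rotation about the vertical (y) axis.
    return (c * dx - s * dz, dy, s * dx + c * dz)


def anchor_local_to_world(anchor_pos, anchor_yaw, local_pos):
    """Convert an anchor-relative point back into world space."""
    x, y, z = local_pos
    c, s = math.cos(anchor_yaw), math.sin(anchor_yaw)
    # Apply the anchor's rotation, then offset by the anchor's position.
    return (
        anchor_pos[0] + c * x + s * z,
        anchor_pos[1] + y,
        anchor_pos[2] - s * x + c * z,
    )


# Session A: the anchor happens to sit at (2, 0, 3), turned 90 degrees.
local = world_to_anchor_local((2.0, 0.0, 3.0), math.radians(90), (2.0, 0.0, 4.0))

# Session B: the same real-world anchor now has a different Pose, but the
# saved local offset still places the scenery correctly relative to it.
restored = anchor_local_to_world((-5.0, 1.0, 0.0), 0.0, local)
```

Only `local` would be worth saving to the cloud; the world-space values from session A are meaningless in session B. In Unity, parenting the scenery to the anchor object gives the same effect without doing this maths by hand.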

AR Plane detection

In an AR demo, where we see things like 3D cubes being placed “on the floor”, there is actually quite a lot going on that may not be immediately apparent.

In the real world, we deal with flat surfaces all of the time - the floor, worktops, walls etc.

When working with AR, we need to be able to determine how those real-world surfaces correlate to a virtual counterpart.

The AR APIs provide a way to detect horizontal and/or vertical surfaces. This surface detection provides us with something called a 3D Plane.

We don’t necessarily need to see a visualisation of the plane; however, things like dots or partially-transparent blocks can help developers and users better understand what the app is doing. In a version of the application that our consumers use, it may be that we choose not to include any visualisation.

Raycast detection

If we want to interact with a plane or any other virtual 3D object, we need to use something called a RayCast.

Think of a RayCast as being a bit like a laser-pointer that projects from our device (in the direction that we are pointing it) until it collides with something in our virtual world.

When an intersection is detected, we can then get the Pose of that intersection.

We can use the new Pose information as required. For example, we could use it to set the pose of a virtual “cursor”, or we could use it when spawning a new 3D object.

This would allow those objects to be correctly orientated with a Plane and give the illusion that the object is resting atop a real-world surface.
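
For intuition, the “laser-pointer” idea boils down to a ray-plane intersection test. The sketch below shows that geometry in plain Python, treating the detected plane as infinite; in the real project the AR framework performs the raycast for us, and the function name here is made up for illustration.

```python
def raycast_plane(origin, direction, plane_point, plane_normal):
    """Return the point where a ray hits a plane, or None if it misses.

    origin, direction: where the ray starts and which way it points
    plane_point:       any point known to lie on the plane
    plane_normal:      a vector perpendicular to the plane
    """
    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    denom = dot(plane_normal, direction)
    if abs(denom) < 1e-9:
        return None  # ray is parallel to the plane

    to_plane = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(plane_normal, to_plane) / denom
    if t < 0:
        return None  # the plane is behind the ray's starting point

    return tuple(o + t * d for o, d in zip(origin, direction))


# A phone held 1.5 m above the floor, pointing forwards and downwards.
hit = raycast_plane(
    origin=(0.0, 1.5, 0.0),
    direction=(0.0, -1.0, 1.0),
    plane_point=(0.0, 0.0, 0.0),    # the floor passes through the origin
    plane_normal=(0.0, 1.0, 0.0),   # and faces straight up
)
```

The returned point gives the position half of a Pose; the orientation half would typically be derived from the plane's normal, so that a spawned object sits flat on the surface.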

Next, in part two, we follow the steps in the project QuickStart guide.
