<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-02-14T18:09:08+00:00</updated><id>/feed.xml</id><title type="html">Your awesome title</title><subtitle>Write an awesome description for your new site here. You can edit this line in _config.yml. It will appear in your document head meta (for Google search results) and in your feed.xml site description.</subtitle><entry><title type="html">What’s New in the ML.NET CLI</title><link href="/mlnet/tools/2023/01/22/whats-new-in-the-mlnet-cli.html" rel="alternate" type="text/html" title="What’s New in the ML.NET CLI" /><published>2023-01-22T00:00:00+00:00</published><updated>2023-01-22T00:00:00+00:00</updated><id>/mlnet/tools/2023/01/22/whats-new-in-the-mlnet-cli</id><content type="html" xml:base="/mlnet/tools/2023/01/22/whats-new-in-the-mlnet-cli.html"><![CDATA[<p>The ML.NET CLI has gotten some interesting updates. This post will go over the main items that are new.</p>

<p>For a video version of this post, check below.</p>

<iframe width="200" height="113" src="https://www.youtube.com/embed/ZApDu_Q_0-8?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" title="What's New in the ML.NET CLI"></iframe>

<h1 id="new-install-name">New Install Name</h1>

<p>The first thing to make note of is that the newer versions of the ML.NET CLI install under new names. Since the tool got too big to ship as a single .NET tool, it is now split into multiple installs depending on which operating system and CPU architecture you’re running.</p>

<p><img src="https://t1253902.p.clickup-attachments.com/t1253902/754024c3-082a-4088-bd67-22e82c56b0dc/image.png" alt="" /></p>

<p>So getting the newest version will require a new install even if you have the older version installed. In fact, I would recommend uninstalling the older version of the CLI if you already have it. This can be done with the <code class="language-plaintext highlighter-rouge">dotnet tool uninstall mlnet --global</code> command.</p>

<p>Which package you install depends on your machine. I have an M1 MacBook Pro, so I would install the <code class="language-plaintext highlighter-rouge">mlnet-osx-arm</code> version. If you’re on Windows, you will probably be installing the <code class="language-plaintext highlighter-rouge">mlnet-win-x64</code> version.</p>

<p>If you want to update one of the newer versions once it’s installed, you can use the <code class="language-plaintext highlighter-rouge">dotnet tool update</code> command.</p>
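<p>Putting the install steps together, a typical session on an M1 Mac might look like the sketch below. The package name comes from the installer list above, so swap in the one for your platform.</p>

```
# Remove the old single-package tool if you still have it
dotnet tool uninstall mlnet --global

# Install the platform-specific package (example: Apple Silicon Mac)
dotnet tool install mlnet-osx-arm --global

# Later, update that same package in place
dotnet tool update mlnet-osx-arm --global
```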

<h1 id="train-with-a-mbconfig-file">Train with a <code class="language-plaintext highlighter-rouge">mbconfig</code> File</h1>

<p>The new CLI release comes with a couple of new commands. The first we’ll go over is the <code class="language-plaintext highlighter-rouge">train</code> command. This takes a single required argument, which is an <code class="language-plaintext highlighter-rouge">mbconfig</code> file. It will use the information in the <code class="language-plaintext highlighter-rouge">mbconfig</code> file to perform another training run.</p>

<p>This can be good for a few scenarios, including continuous integration where the <code class="language-plaintext highlighter-rouge">mbconfig</code> file is checked into version control and can be run each day to see if a new model can be discovered.</p>
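<p>As a sketch, a nightly CI job could re-run training with a single command. The <code class="language-plaintext highlighter-rouge">--training-config</code> option name below is my recollection of the CLI help, so double-check it against <code class="language-plaintext highlighter-rouge">mlnet train --help</code> on your install, and the file name is just a placeholder.</p>

```
mlnet train --training-config SentimentModel.mbconfig
```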

<h1 id="forecasting">Forecasting</h1>

<p>Along with the <code class="language-plaintext highlighter-rouge">train</code> command, a new scenario has been added: forecasting. Forecasting is primarily used with time series data to forecast values in the future. Similar to the other scenarios, we have a few arguments we can pass in.</p>

<p><img src="https://t1253902.p.clickup-attachments.com/t1253902/cab9c525-5135-423c-9186-ea6ec313c31d/image.png" alt="" /></p>

<p>The <code class="language-plaintext highlighter-rouge">dataset</code> and <code class="language-plaintext highlighter-rouge">label-col</code> arguments are similar to the other scenarios, but forecasting has a couple of others that are required: <code class="language-plaintext highlighter-rouge">horizon</code> and <code class="language-plaintext highlighter-rouge">time-col</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">horizon</code> argument is simply the number of time steps into the future you want the forecasting algorithm to predict.</p>

<p>The <code class="language-plaintext highlighter-rouge">time-col</code> argument is the column that contains the times or dates the algorithm can use.</p>

<p>And we can run this like the other scenarios with the command below. We’ll let it run for only 10 seconds via the <code class="language-plaintext highlighter-rouge">--train-time</code> argument. The data can be found <a href="https://github.com/rishabh89007/Time_Series_Datasets/blob/main/Wind%20Gen.csv">here</a> if you want to run it as well.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mlnet forecasting --dataset C:/dev/wind_gen.txt --horizon 3 --label-col 1 --time-col 0 --train-time 10
</code></pre></div></div>

<p><img src="https://t1253902.p.clickup-attachments.com/t1253902/34d1edb2-7548-4e7f-84cf-a4e6de01c022/image.png" alt="" /></p>

<hr />

<p>A couple of big additions to the CLI and I’m sure more are coming. It is nice to see that the ML.NET team is continuing to keep the CLI’s features on par with Model Builder.</p>]]></content><author><name></name></author><category term="mlnet" /><category term="tools" /><category term="mlnet-updates" /><category term="mlnet" /><category term="mlnet-cli" /><summary type="html"><![CDATA[The ML.NET CLI has gotten some interesting updates. This post will go over the main items that are new.]]></summary></entry><entry><title type="html">Introduction to QnA Maker</title><link href="/cognitive-services/2022/01/17/introduction-to-qna-maker.html" rel="alternate" type="text/html" title="Introduction to QnA Maker" /><published>2022-01-17T00:00:00+00:00</published><updated>2022-01-17T00:00:00+00:00</updated><id>/cognitive-services/2022/01/17/introduction-to-qna-maker</id><content type="html" xml:base="/cognitive-services/2022/01/17/introduction-to-qna-maker.html"><![CDATA[<p>Suppose you have a FAQ page that has a lot of data and want to use that as a first line of customer service support for a chat bot on your main page. How can you integrate this with minimal effort? Enter Microsoft QnA Maker.</p>

<p>For the video version of this post, check below.</p>

<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FqhAPQEE7Gow%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DqhAPQEE7Gow&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FqhAPQEE7Gow%2Fhqdefault.jpg&amp;key=61d05c9d54e8455ea7a9677c366be814&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" scrolling="no" title="YouTube embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>

<h2 id="what-is-microsoft-qna-maker">What is Microsoft QnA Maker</h2>

<p>The official <a href="https://docs.microsoft.com/en-us/azure/cognitive-services/QnAMaker/Overview/overview">docs</a> state that…</p>

<blockquote>
  <p>QnA Maker is a cloud-based Natural Language Processing (NLP) service that allows you to create a natural conversational layer over your data.</p>
</blockquote>

<p>So, basically, this service allows you to create question-and-answer experiences based on the data that you have.</p>

<h2 id="creating-the-azure-resource">Creating the Azure Resource</h2>

<p>First, as with all Cognitive Services, we need to create the Azure resource. When creating a new resource, you can search for “QnA” and select “QnA Maker”.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/ebe0585c-e14b-4dfe-967c-c3232465cd32/Pasted+image+20211208170245.png?format=original" alt="" /></p>

<p>After clicking “Create” on that, we’re on a screen where we have to enter a few things. First, we will supply the subscription, resource group, name, and pricing tier. Note that this does have a free tier, so it can be used for proofs of concept or to simply try out the service to see if it meets your needs.</p>

<p>Next, it will ask for details for Azure Search, which hosts the data that you give QnA Maker. Only the location and pricing tier are needed for this, and it has a free tier as well.</p>

<p>Last, it will ask for details for an App Service, which hosts the QnA Maker runtime used to make queries against your data. The app name and location are required for this.</p>

<p>You can optionally enable Application Insights for this service as well, but feel free to disable it since it isn’t really needed unless you are creating this for actual business purposes and want to see if anything goes wrong.</p>

<p>With all that, we can click the “Review and Create” button to create and deploy the resources.</p>

<h2 id="creating-a-qna-knowledge-base">Creating a QnA Knowledge Base</h2>

<p>With the resource created, we can now go to it. As usual with Cognitive Services, there is a “Keys and Endpoint” item on the left where we can get the key and endpoint. These will be used later when calling the API.</p>

<p>But first, we need to create our QnA knowledge base, and to do that we go to the “Overview” tab in the left navigation. A bit further down there is a link to the QnA portal, where we can create a new knowledge base.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/80eb7912-c070-4c2c-a65c-46d6d31b3d3f/Pasted+image+20211208172042.png?format=original" alt="" /></p>

<p>We can skip step one since we already created the Azure resource. In step two, we will connect the QnA portal to our Azure resource by supplying the subscription and the Azure service name. Luckily, all of this is pre-populated, so they are all dropdowns.</p>

<p>In step three, we will give our knowledge base a name. We’ll name it “ikea” since we will use the Ikea FAQ to populate the knowledge base.</p>

<p>Step four is where we’ll populate the knowledge base. If you already have a FAQ on your website you can put the URL in. Since I’m using the <a href="https://www.ikea.com/us/en/customer-service/faq/">Ikea FAQ</a> we can do that. You can also add a file for this. If you have neither, you can leave this blank and fill out the questions and answers manually on the next page.</p>

<p>Below this you can customize your bot with a chit-chat. This just helps give your QnA bot a bit more personality based on what you select. Here’s a screenshot of an example from <a href="https://docs.microsoft.com/en-us/azure/cognitive-services/qnamaker/how-to/chit-chat-knowledge-base">the docs</a> of each of the chit-chat items that you can choose from.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/0ead30e5-29d8-474e-abf1-c3e86e978466/Pasted+image+20211211073427.png?format=original" alt="" /></p>

<p>For step five, we can create our knowledge base by clicking on the button.</p>

<h2 id="training-and-testing">Training and Testing</h2>

<p>Once we have our knowledge base created, we can see that it read in the Ikea FAQ quite well. To see just how well QnA Maker performs, we can click on the “Save and train” button to train a model on our data.</p>

<p>Once that finishes, we can click on the “Test” button to give the model a test. This opens an integrated chat bot where we can ask questions and receive answers based on the model.</p>

<p>So we can ask “What is the return policy?” and QnA Maker will give us the best answer based on our data.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/59eef268-f07f-4989-a266-fe12bb334009/Pasted+image+20211213175116.png?format=original" alt="" /></p>

<p>Early on we get some good results from QnA Maker. But what if we want to add a new pair?</p>

<h2 id="adding-a-new-qna-pair">Adding a New QnA Pair</h2>

<p>If we want to add new QnA pairs to our existing knowledge base, just click on the <strong>Add QnA Pair</strong> button.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/9573a02d-7358-4902-92e7-a854a1caedff/Pasted+image+20211214121558.png?format=original" alt="" /></p>

<p>We can add an alternate phrasing such as “This is a question”. A phrasing is essentially a question that a user would input into the system and that will get sent to QnA Maker. We can input an answer as well, such as “This is an answer”. Notice that we have a rich text editor, which can be toggled in the upper left. With this, we can add graphics, links, and emojis. Let’s add a smile emoji to our answer.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/12d53b3f-0af3-4f62-9e0f-b1d58c7b305b/Pasted+image+20211215135503.png?format=original" alt="" /></p>

<p>Now we can click the “Save and train” button to train a new model on what we just added. We can then do another test and input “This is a question”, and we should get the answer that we put in as the output.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/7e4c8ab8-f9a3-4933-93cb-0340efee0038/Pasted+image+20211215135614.png?format=original" alt="" /></p>

<h2 id="using-the-api">Using the API</h2>

<p>Before we can actually use the API we need to publish our knowledge base. Simply click the “Publish” tab near the top and then the “Publish” button.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/3c4422f2-841d-489b-b803-f3da8d9c7151/Pasted+image+20211216130209.png?format=original" alt="" /></p>

<p>Once that completes you can either create a bot that will use this knowledge base, or you can use the API directly. We’ll use the API, and the publish page shows how you can call it using Postman or curl. We’ll use Postman here so we can easily test the API out.</p>

<p>To build the URL for the API, take the “Host” item from the deployment details and append the path that appears after “POST” in the first line.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/b89b2b9b-9675-455f-947b-755035c09eb9/Pasted+image+20211218012805.png?format=original" alt="" /></p>

<p>And since it says “POST”, we will make this a POST call.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/0f9d72bd-83ba-42b4-a5d0-d439cd1fc342/Pasted+image+20211218012807.png?format=original" alt="" /></p>

<p>Next, we need to set the authorization header. In the Authorization tab in Postman, set the type to “API Key”. The “Key” item will be “Authorization” and the “Value” will be the API key which is the third part of the deployment details.</p>

<p>Now, we can add in the JSON body. The simplest JSON we can send has only one item, a “question” item, which is the prompt that a user would send to QnA Maker. Let’s add the question that we added earlier.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
 "question":"This is a question"
}
</code></pre></div></div>

<p>Once we hit “Send” in Postman, we will get the below response.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
    "answers": [
        {
            "questions": [
                "This is a question"
            ],
            "answer": "This is an answer😀",
            "score": 100.0,
            "id": 176,
            "source": "Editorial",
            "isDocumentText": false,
            "metadata": [],
            "context": {
                "isContextOnly": false,
                "prompts": []
            }
        }
    ],
    "activeLearningEnabled": false
}
</code></pre></div></div>

<p>The main part to notice here is the “answer”, which is what we expect to get back.</p>
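<p>For completeness, the same call can be scripted instead of using Postman. Below is a minimal Python sketch: the host, knowledge base ID, and endpoint key are placeholders you would copy from your own “Publish” page, and the <code class="language-plaintext highlighter-rouge">EndpointKey</code> prefix on the Authorization value is what my deployment details showed, so verify it against yours.</p>

```python
import json

# Placeholders -- copy the real values from the Publish page of your resource.
host = "https://my-qna-resource.azurewebsites.net"
kb_id = "00000000-0000-0000-0000-000000000000"
endpoint_key = "<your-endpoint-key>"

# The URL is the Host plus the path shown after POST in the deployment details.
url = f"{host}/qnamaker/knowledgebases/{kb_id}/generateAnswer"
headers = {
    "Authorization": f"EndpointKey {endpoint_key}",
    "Content-Type": "application/json",
}
payload = {"question": "This is a question"}

# Uncomment to actually call your published knowledge base (needs `requests`):
# import requests
# response = requests.post(url, headers=headers, json=payload)
# response.raise_for_status()
# print(response.json()["answers"][0]["answer"])
print(json.dumps(payload))
```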

<hr />

<p>Hopefully, this showed how useful QnA Maker can be, especially if you already have a FAQ page with questions and answers. With QnA Maker, that page can be turned into a chat bot or any other automation tool where users or customers need to ask questions.</p>]]></content><author><name></name></author><category term="cognitive-services" /><category term="cognitive-services" /><category term="qna-maker" /><summary type="html"><![CDATA[Suppose you have a FAQ page that has a lot of data and want to use that as a first line of customer service support for a chat bot on your main page. How can you integrate this with minimal effort? Enter Microsoft QnA Maker.]]></summary></entry><entry><title type="html">Use Bing Image Search to Get Training Image Data</title><link href="/cognitive-services/2021/12/06/use-bing-image-search-to-get-training-image-data6.html" rel="alternate" type="text/html" title="Use Bing Image Search to Get Training Image Data" /><published>2021-12-06T00:00:00+00:00</published><updated>2021-12-06T00:00:00+00:00</updated><id>/cognitive-services/2021/12/06/use-bing-image-search-to-get-training-image-data6</id><content type="html" xml:base="/cognitive-services/2021/12/06/use-bing-image-search-to-get-training-image-data6.html"><![CDATA[<p>When going through the FastAI book, <a href="https://amzn.to/3FIWUsw">Deep Learning for Coders</a>, I noticed that in one of the early chapters they mention using the <a href="https://docs.microsoft.com/en-us/bing/search-apis/bing-image-search/overview">Bing Image Search API</a> to retrieve images for training data. While they have a nice wrapper for the API, I thought I’d dive into the API as well and use it to build my own way to download training image data.</p>

<p>Let’s suppose we need to make an image classification model to determine what kind of car is in an image. We’d need quite a few different images for this model, so let’s use Bing Image Search to gather images of Aston Martin cars so we can start building our data.</p>

<p>Check out the below for a video version of this post.</p>

<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FlKCxQ6mxuy0%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DlKCxQ6mxuy0&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FlKCxQ6mxuy0%2Fhqdefault.jpg&amp;key=61d05c9d54e8455ea7a9677c366be814&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" scrolling="no" title="YouTube embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>

<h2 id="why-bing-image-search">Why Bing Image Search</h2>

<p>Before going into the technical side of Bing Image Search, let’s go over why use this in the first place. Why not just download the images ourselves?</p>

<p>Bing Image Search has a few features that we can utilize in our code when getting our images. Some of these features are important to take into account.</p>

<h3 id="automation">Automation</h3>

<p>I’ll be honest, I’m lazy and if I can script something to do a task for me then I’ll definitely spend the time to build the script rather than do the task manually. Rather than manually finding images and downloading them, we can use the Bing Image Search API to do this for us.</p>

<h3 id="image-license">Image License</h3>

<p>We can’t always just take an image from a website and use it however we want. A lot of images online are copyrighted, and if we use one without a license or permission we are in violation of the creator’s copyright. If the creator finds out, they can, more than likely, take legal action against us.</p>

<p>However, with Bing Image Search, we have an option to specify what license the images that get returned to us have. We can do this with the <code class="language-plaintext highlighter-rouge">license</code> query parameter in our API call. This utilizes <a href="https://creativecommons.org/">Creative Commons</a> licenses, so we can specify exactly what type of license our images have. We will specify that we want public images where the copyright is fully waived. There are many Creative Commons license types that Bing Image Search supports, and there’s a full list <a href="https://docs.microsoft.com/en-us/bing/search-apis/bing-image-search/reference/query-parameters#license">here</a>.</p>

<h3 id="image-type">Image Type</h3>

<p>There are quite a few image types that we could download from Bing Image Search. For our purposes, though, we only want photos of Aston Martin cars, so we can set the image type in our API calls to just <code class="language-plaintext highlighter-rouge">photo</code>. If we don’t specify this we could get back animated GIFs, clip art, or drawings of Aston Martin cars.</p>

<h3 id="safe-content">Safe Content</h3>

<p>When downloading images from the internet you never really know what you’re going to get. Bing Image Search can help ease that worry by specifying that you want only safe content to be returned.</p>

<p>Bing can do this filtering for us so we don’t have to worry about it when we make our API call. That’s one less thing to deal with and, because it’s the internet, it’s definitely something to worry about when downloading images.</p>

<h2 id="create-azure-resource">Create Azure Resource</h2>

<p>Before we can use the Bing Image Search API we need to create the resource for it. In the Azure Portal create a new resource and search for “Bing Search”. Then, click on the “Bing Search v7” resource to create it.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/88d0d438-613b-48a0-bdae-79183fdea0fb/Pasted+image+20211127081402.png?format=original" alt="" /></p>

<p>When creating the resource give it a name, what resource group it will be under, and what subscription it will be under. For the pricing tier, it does have a free tier to allow you to give the service a try for evaluation or for a proof of concept. Once that is complete, click “Create”.</p>

<p>When that completes deployment, we can explore the resource page a bit. There’s a “Try me” tab where we can try the Bing Search API and see what results we get, there is some sample code showing quickly how to use the API, and there are a lot of tutorials if we want to look at something more specific, such as the image or video search APIs.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/992b1ab5-e64b-4e5c-a392-0c1b4705b4a8/Pasted+image+20211127083005.png?format=original" alt="" /></p>

<h2 id="retrieve-key-and-endpoint">Retrieve Key and Endpoint</h2>

<p>To use the API in our code we will need the API key and the endpoint to call. There are a couple of ways to get them. First, on the “Overview” page of the resource there’s a link that says “click here to manage keys”. Clicking that will take you to another page where you can get the API keys and the endpoint URL.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/6f75c579-89c9-4226-92ec-c0e4b98cd2d6/Pasted+image+20211128053009.png?format=original" alt="" /></p>

<p>You can also click on the “Keys and Endpoint” section on the left navigation.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/0b347e5f-60a2-4054-a7f9-ae5c67729e67/Pasted+image+20211128052911.png?format=original" alt="" /></p>

<p>Now save the API key and the endpoint since we’ll need those to access the API in the code.</p>
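<p>The code later in this post reads this key out of a <code class="language-plaintext highlighter-rouge">config.json</code> file instead of hard-coding it. A minimal version of that file needs only the <code class="language-plaintext highlighter-rouge">apiKey</code> field; the placeholder value here is, of course, just for illustration.</p>

```json
{
  "apiKey": "<your-bing-search-api-key>"
}
```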

<h2 id="using-the-api">Using the API</h2>

<p>Now we get to the fun stuff where we can get into some code. I’ll be using Python, but you’re welcome to use the language of your choice since this is a simple API call. I’m also using Azure ML since it’s very easy to get a Jupyter Lab instance running, plus most machine learning and data science packages come already installed.</p>

<h3 id="imports">Imports</h3>

<p>First, we need to import some modules. We have five that we will need to import.</p>

<ul>
  <li><strong>JSON</strong>: This will be used to read in a config file for the API key and endpoint</li>
  <li><strong>Requests</strong>: Will be used to make the API calls. This comes pre-installed in an Azure ML Jupyter instance, but you may need to run <code class="language-plaintext highlighter-rouge">pip install requests</code> if you are using another environment.</li>
  <li><strong>Time</strong>: Used to delay API calls so the server doesn’t get hit with too many requests.</li>
  <li><strong>OS</strong>: Used to save images and help clean up image data on the local machine.</li>
  <li><strong>PPrint</strong>: Used to format JSON when printing.</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import requests
import time
import os
import pprint
</code></pre></div></div>

<h3 id="the-api-call">The API Call</h3>

<p>Now, we can start building and making the API call to get the image data.</p>

<h4 id="building-the-endpoint">Building the Endpoint</h4>

<p>To start building the call, we need to get the API key, which is kept in a JSON file for security reasons. We’ll use the <code class="language-plaintext highlighter-rouge">open</code> function to open the file for reading and the <code class="language-plaintext highlighter-rouge">json</code> module to load it. This creates a dictionary where the JSON keys are the dictionary keys you use to get the values.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>config = json.load(open("config.json"))

api_key = config["apiKey"]
</code></pre></div></div>

<p>Now that we have the API key, we can build up the URL for the API call, starting from the endpoint that we got from the Azure Portal.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>endpoint = "https://api.bing.microsoft.com/"
</code></pre></div></div>

<p>To the endpoint, we have to append a path that says we want the Image Search API. To learn more about the exact endpoints we’re using here, <a href="https://docs.microsoft.com/en-us/bing/search-apis/bing-image-search/reference/endpoints">this doc</a> has a lot of good information.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>url = f"{endpoint}v7.0/images/search"
</code></pre></div></div>

<h4 id="building-the-headers-and-query-parameters">Building the Headers and Query Parameters</h4>

<p>Some more information we need to add to our call are the headers and the query parameters. The headers are where we supply the API key, and the query parameters detail what images we want returned.</p>

<p>Requests makes it easy to specify the headers, which is done as a dictionary. We need to supply the <code class="language-plaintext highlighter-rouge">Ocp-Apim-Subscription-Key</code> header for the API key.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>headers = { "Ocp-Apim-Subscription-Key": api_key }
</code></pre></div></div>

<p>The query parameters are also given as a dictionary. We’ll supply the license, image type, and safe search parameters here. Those are optional, but the <code class="language-plaintext highlighter-rouge">q</code> parameter is required: it is the query used to search for images. Here, we’ll search for Aston Martin cars.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>params = {
    "q": "aston martin", 
    "license": "public", 
    "imageType": "photo",
    "safeSearch": "Strict",
}
</code></pre></div></div>

<h4 id="making-the-api-call">Making the API Call</h4>

<p>With everything ready, we can now make the API call and get the results. With <code class="language-plaintext highlighter-rouge">requests</code> we just call the <code class="language-plaintext highlighter-rouge">get</code> method, passing in the URL, the headers, and the parameters. We use the <code class="language-plaintext highlighter-rouge">raise_for_status</code> method to throw an exception if the status code isn’t successful. Then, we get the JSON of the response and store it in a variable. Finally, we use the pretty print method to print the JSON response.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>response = requests.get(url, headers=headers, params=params)
response.raise_for_status()

result = response.json()

pprint.pprint(result)
</code></pre></div></div>

<p>And here’s a snapshot of the response. There’s quite a bit here but we’ll break it down some later in this post.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'_type': 'Images',
 'currentOffset': 0,
 'instrumentation': {'_type': 'ResponseInstrumentation'},
 'nextOffset': 38,
 'totalEstimatedMatches': 475,
 'value': [{'accentColor': 'C6A105',
            'contentSize': '1204783 B',
            'contentUrl': 'https://www.publicdomainpictures.net/pictures/380000/velka/aston-martin-car-1609287727yik.jpg',
            'creativeCommons': 'PublicNoRightsReserved',
            'datePublished': '2021-02-06T20:45:00.0000000Z',
            'encodingFormat': 'jpeg',
            'height': 1530,
            'hostPageDiscoveredDate': '2021-01-12T00:00:00.0000000Z',
            'hostPageDisplayUrl': 'https://www.publicdomainpictures.net/view-image.php?image=376994&amp;picture=aston-martin-car',
            'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.lPqrhQa5EO7xJHf8DMqrJw&amp;pid=Api',
            'hostPageUrl': 'https://www.publicdomainpictures.net/view-image.php?image=376994&amp;picture=aston-martin-car',
            'imageId': '38DBFEF37523B232A6733D7D9109A21FCAB41582',
            'imageInsightsToken': 'ccid_WTqn9r3a*cp_74D633ADFCF41C86F407DFFCF0DEC38F*mid_38DBFEF37523B232A6733D7D9109A21FCAB41582*simid_608053462467504486*thid_OIP.WTqn9r3aKv5TLZxszieEuQHaF5',
            'insightsMetadata': {'availableSizesCount': 1,
                                 'pagesIncludingCount': 1},
            'isFamilyFriendly': True,
            'name': 'Aston Martin Car Free Stock Photo - Public Domain '
                    'Pictures',
            'thumbnail': {'height': 377, 'width': 474},
            'thumbnailUrl': 'https://tse2.mm.bing.net/th?id=OIP.WTqn9r3aKv5TLZxszieEuQHaF5&amp;pid=Api',
            'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&amp;FORM=OIIRPO&amp;q=aston+martin&amp;id=38DBFEF37523B232A6733D7D9109A21FCAB41582&amp;simid=608053462467504486',
            'width': 1920}]}
</code></pre></div></div>

<p>A few things to note from the response:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">nextOffset</code>: The offset to pass in the next request, which lets us page through results across multiple calls.</li>
  <li><code class="language-plaintext highlighter-rouge">value.contentUrl</code>: This is the actual URL of the image. We will use this URL to download the images.</li>
</ul>

<h3 id="paging-through-results">Paging Through Results</h3>

<p>A single API call returns around 30 items by default. How do we get more images? We page through the results using the <code class="language-plaintext highlighter-rouge">nextOffset</code> item in the API response: we pass its value as the <code class="language-plaintext highlighter-rouge">offset</code> query parameter to request the next page of results.</p>

<p>So if I only want at most 200 images, I can use the below code to page through the API results.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>contentUrls = []  # image URLs collected across all pages
new_offset = 0

while new_offset &lt;= 200:
    print(new_offset)
    params["offset"] = new_offset

    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()

    result = response.json()

    # wait a second between calls to be polite to the API
    time.sleep(1)

    new_offset = result["nextOffset"]

    for item in result["value"]:
        contentUrls.append(item["contentUrl"])
</code></pre></div></div>

<p>We initialize the offset to 0 so the initial call returns the first page of results, and the <code class="language-plaintext highlighter-rouge">while</code> loop stops once the offset passes 200. On each iteration we set the <code class="language-plaintext highlighter-rouge">offset</code> parameter to the current offset (0 on the first pass), make the API call, and sleep for one second. We then set the offset to the <code class="language-plaintext highlighter-rouge">nextOffset</code> from the results, save the <code class="language-plaintext highlighter-rouge">contentUrl</code> items into a list, and repeat until we reach the offset limit.</p>
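<p>If you find yourself reusing this pattern, the paging logic can be pulled into a small helper. Below is just a sketch: <code class="language-plaintext highlighter-rouge">fetch_page</code> and <code class="language-plaintext highlighter-rouge">collect_content_urls</code> are hypothetical names, with <code class="language-plaintext highlighter-rouge">fetch_page</code> standing in for the real <code class="language-plaintext highlighter-rouge">requests.get</code> call so the control flow can be exercised without hitting the API.</p>

```python
def collect_content_urls(fetch_page, max_offset=200):
    """Page through results until the offset passes max_offset.

    fetch_page(offset) stands in for the real API call; it must return
    a dict with "nextOffset" and "value" keys, mirroring the Bing Image
    Search response shown above.
    """
    content_urls = []
    offset = 0

    while offset <= max_offset:
        result = fetch_page(offset)

        for item in result["value"]:
            content_urls.append(item["contentUrl"])

        # advance to the next page using the offset the API gives us
        offset = result["nextOffset"]

    return content_urls

# A fake pager that serves pages of 2 items, 100 offsets apart.
def fake_fetch(offset):
    return {
        "nextOffset": offset + 100,
        "value": [
            {"contentUrl": f"https://example.com/img{offset + i}.jpg"}
            for i in range(2)
        ],
    }

urls = collect_content_urls(fake_fetch, max_offset=200)
print(len(urls))  # 6 URLs from offsets 0, 100, 200
```

<p>With the real API, you would pass a function that sets <code class="language-plaintext highlighter-rouge">params["offset"]</code>, calls <code class="language-plaintext highlighter-rouge">requests.get</code>, sleeps, and returns <code class="language-plaintext highlighter-rouge">response.json()</code>.</p>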

<h3 id="downloading-the-images">Downloading the Images</h3>

<p>In the previous API calls all we did was capture the <code class="language-plaintext highlighter-rouge">contentUrl</code> items from each of the images. In order to get the images as training data we need to download them. Before we do that, let’s set up our paths to be ready for images to be downloaded to them. First we set the path and then we use the <code class="language-plaintext highlighter-rouge">os</code> module to check if the path exists. If it doesn’t, we’ll create it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dir_path = "./aston-martin/train/"

if not os.path.exists(dir_path):
    os.makedirs(dir_path)
</code></pre></div></div>

<p>A first attempt might look like the code below: loop through the content URL items, build a file path for each with the <code class="language-plaintext highlighter-rouge">os.path.join</code> method so the path is correct for the system we’re on, download the image with <code class="language-plaintext highlighter-rouge">requests.get</code>, and write the image contents to the file opened with the <code class="language-plaintext highlighter-rouge">open</code> function.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for url in contentUrls:
    path = os.path.join(dir_path, url)

    try:
        with open(path, "wb") as f:
            image_data = requests.get(url)

            f.write(image_data.content)
    except OSError:
        pass
</code></pre></div></div>

<p>However, this turns out to be a bit more complicated than we would hope: the raw URLs don’t make usable file names.</p>

<h3 id="cleaning-the-image-data">Cleaning the Image Data</h3>

<p>If we print the image URLs for all that we get back it would look something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://www.publicdomainpictures.net/pictures/380000/velka/aston-martin-car-1609287727yik.jpg
https://images.pexels.com/photos/592253/pexels-photo-592253.jpeg?auto=compress&amp;cs=tinysrgb&amp;h=750&amp;w=1260
https://images.pexels.com/photos/2811239/pexels-photo-2811239.jpeg?cs=srgb&amp;dl=pexels-tadas-lisauskas-2811239.jpg&amp;fm=jpg
https://get.pxhere.com/photo/car-vehicle-classic-car-sports-car-vintage-car-coupe-antique-car-land-vehicle-automotive-design-austin-healey-3000-aston-martin-db2-austin-healey-100-69398.jpg
https://get.pxhere.com/photo/car-automobile-vehicle-automotive-sports-car-supercar-luxury-expensive-coupe-v8-martin-vantage-aston-land-vehicle-automotive-design-luxury-vehicle-performance-car-aston-martin-dbs-aston-martin-db9-aston-martin-virage-aston-martin-v8-aston-martin-dbs-v12-aston-martin-vantage-aston-martin-v8-vantage-2005-aston-martin-rapide-865679.jpg
https://c.pxhere.com/photos/5d/f2/car_desert_ferrari_lamborghini-1277324.jpg!d
</code></pre></div></div>

<p>Do you notice anything in the URLs? While most of them end in <code class="language-plaintext highlighter-rouge">jpeg</code>, a few have extra parameters on the end. If we try to download with those URLs we won’t get the image, so we need to do a little bit of data cleaning here.</p>

<p>Luckily, there are two patterns we can check for: a <code class="language-plaintext highlighter-rouge">?</code> in the URL and a <code class="language-plaintext highlighter-rouge">!</code> in the URL. With those patterns we can update our download loop as below to get a correct file name for every image.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for url in contentUrls:
    # take the last segment of the URL as the file name
    last_item = url.split("/")[-1]

    # drop anything after a "?" (query parameters)
    second_split = last_item.split("?")

    if len(second_split) &gt; 1:
        last_item = second_split[0]

    # drop anything after a "!" (suffixes like "!d")
    third_split = last_item.split("!")

    if len(third_split) &gt; 1:
        last_item = third_split[0]

    print(last_item)
    path = os.path.join(dir_path, last_item)

    try:
        with open(path, "wb") as f:
            image_data = requests.get(url)

            f.write(image_data.content)
    except OSError:
        pass
</code></pre></div></div>

<p>With this cleaning of the URLs we can get the full images.</p>
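<p>As an aside, the standard library’s <code class="language-plaintext highlighter-rouge">urllib.parse</code> can do the same cleanup with less manual splitting, since it separates the query string for us. A quick sketch (the <code class="language-plaintext highlighter-rouge">filename_from_url</code> helper is my own name, not part of any API):</p>

```python
from urllib.parse import urlparse
import posixpath

def filename_from_url(url):
    """Extract a clean file name from an image URL.

    urlparse splits off the query string (the "?..." part), and
    posixpath.basename takes the last path segment; the "!" suffix
    some hosts append still needs a manual split.
    """
    path = urlparse(url).path
    name = posixpath.basename(path)
    return name.split("!")[0]

print(filename_from_url(
    "https://images.pexels.com/photos/592253/pexels-photo-592253.jpeg?auto=compress&h=750"
))  # pexels-photo-592253.jpeg
print(filename_from_url(
    "https://c.pxhere.com/photos/5d/f2/car_desert_ferrari_lamborghini-1277324.jpg!d"
))  # car_desert_ferrari_lamborghini-1277324.jpg
```

<p>Either approach works; the split-based version above just makes each cleaning step explicit.</p>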

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/4a3cd388-977b-4aaa-9670-1cfe7d366417/Pasted+image+20211129131529.png?format=original" alt="" /></p>

<h3 id="conclusion">Conclusion</h3>

<p>While this probably isn’t as sophisticated as the wrapper that FastAI has, this should help if you need to get training images from Bing Image Search manually. You can also tweak this if needed.</p>

<p>Using Bing Image Search is a great way to get quality, license-appropriate images for training data.</p>]]></content><author><name></name></author><category term="cognitive-services" /><category term="bing-image-search" /><category term="cognitive-services" /><summary type="html"><![CDATA[When going through the FastAI book, Deep Learning for Coders, I noticed that in one of the early chapters they mention using the Bing Image Search API to retrieve images for training data. While they have a nice wrapper for the API, I thought I’d dive into the API as well and use it to build my own way to download training image data.]]></summary></entry><entry><title type="html">The ML.NET Deep Learning Plans</title><link href="/mlnet/2021/09/13/the-mlnet-deep-learning-plans.html" rel="alternate" type="text/html" title="The ML.NET Deep Learning Plans" /><published>2021-09-13T00:00:00+00:00</published><updated>2021-09-13T00:00:00+00:00</updated><id>/mlnet/2021/09/13/the-mlnet-deep-learning-plans</id><content type="html" xml:base="/mlnet/2021/09/13/the-mlnet-deep-learning-plans.html"><![CDATA[<p>One of the most requested features for ML.NET is the ability to create neural network models from scratch to perform deep learning in ML.NET. The ML.NET team has taken that feedback and the feedback from the <a href="https://devblogs.microsoft.com/dotnet/ml-net-june-updates-model-builder/#ml-net-survey-results">customer survey</a> and has come out with a plan to start implementing this feature.</p>

<h2 id="current-state-of-deep-learning-in-mlnet">Current State of Deep Learning in ML.NET</h2>

<p>Currently, in ML.NET, there isn’t a way to create and train neural networks from scratch. There is great support for taking an existing deep learning model and using it for predictions, however. If you have a TensorFlow or ONNX model, it can be used in ML.NET to make predictions.</p>

<p>There is also great support for transfer learning in ML.NET. This allows you to retrain a pretrained model on your own data to produce a model of your own.</p>

<p>However, as mentioned earlier, ML.NET does not yet have the capability to let you create your own deep learning models from scratch. Let’s take a look at what the plans are for this.</p>

<h2 id="future-deep-learning-plans">Future Deep Learning Plans</h2>

<p>In the ML.NET GitHub repo there is an <a href="https://github.com/dotnet/machinelearning/issues/5918">issue</a> that was fairly recently created that goes over the plans to implement creating deep learning models in ML.NET.</p>

<p>There are two reasons for this:</p>

<ol>
  <li>Communicate to the community about what the plans are and that this is being worked on.</li>
  <li>Get feedback from the community on the current plan.</li>
</ol>

<p>While we’ll touch on the main points in the issue in this post, I would highly encourage you to go through it and give any feedback or questions about the plans you may have to help the ML.NET team in their planning or implementation.</p>

<p>The issue details three parts in order to deliver creating deep learning models in ML.NET:</p>

<ol>
  <li>Make consuming of ONNX models easier</li>
  <li>Support TorchSharp and make it production ready</li>
  <li>Create an API in ML.NET to support TorchSharp</li>
</ol>

<p>Let’s go into each of these in more detail.</p>

<h3 id="easier-use-of-onnx-models">Easier Use of ONNX Models</h3>

<p>While you can use ONNX models in ML.NET right now, you do have to know the model’s input and output names. Currently we rely on the <a href="https://netron.app/">Netron</a> application to load an ONNX model and show us those names. While this isn’t bad, the team wants to expose a built-in way to get them instead of having to rely on a separate application.</p>

<p>Of course, the documentation will be updated to reflect the new way to get the input and output names for ONNX models, and I believe examples will follow to show how to do this.</p>

<h3 id="supporting-torchsharp">Supporting TorchSharp</h3>

<p><a href="https://github.com/dotnet/TorchSharp">TorchSharp</a> is the heart of how ML.NET will implement deep learning. Similar to how <a href="https://github.com/SciSharp/TensorFlow.NET">TensorFlow.NET</a> supports scoring TensorFlow models in ML.NET, TorchSharp provides .NET access to the library that underlies PyTorch. <a href="https://pytorch.org/">PyTorch</a> is starting to lead the way in building deep learning models in both research and industry, so it makes sense to support it in ML.NET.</p>

<p>In fact, one of the popular libraries to build deep learning models is <a href="https://course.fast.ai/">FastAI</a>. Not only is FastAI one of the best courses to take when learning deep learning, but the Python library is one of the best in terms of building deep learning models. Under the hood, though, FastAI uses PyTorch to actually build the models that it produces. This isn’t by accident. The FastAI developers decided that PyTorch was the way to go for this.</p>

<p>TensorFlow support is great for making predictions with existing models, but for building new ones from scratch I really think PyTorch, via TorchSharp, is the preferred way, and TorchSharp will help ML.NET lead the way there.</p>

<h3 id="implementing-torchsharp-into-mlnet">Implementing TorchSharp into ML.NET</h3>

<p>The final stage, once TorchSharp has been made production ready, is to create a high-level API in ML.NET to train deep learning models from scratch.</p>

<p>This will be like when Keras came along for TensorFlow. It was an API on top of TensorFlow to help make building the models much easier. I believe ML.NET can do that for TorchSharp.</p>

<p>This will probably be a big undertaking, but it’s definitely worth doing. This is the API people will use to build their models, so taking the time to get it right will pay off in the long run: it will let us build our models as simply as possible and make us more productive.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Creating deep learning models from scratch is, by far, one of the most requested features for ML.NET, and this plan is well positioned to reach that goal. In fact, I think it will surpass it, since it builds on PyTorch, which is where research and industry are heading.</p>

<p>If you have any feedback or questions, definitely feel free to comment on the <a href="https://github.com/dotnet/machinelearning/issues/5918">GitHub issue</a>.</p>]]></content><author><name></name></author><category term="mlnet" /><category term="deep-learning" /><category term="mlnet" /><summary type="html"><![CDATA[One of the most requested features for ML.NET is the ability to create neural network models from scratch to perform deep learning in ML.NET. The ML.NET team has taken that feedback and the feedback from the customer survey and has come out with a plan to start implementing this feature.]]></summary></entry><entry><title type="html">AI Ethics and Fairness Resources</title><link href="/artificial-intelligence/2021/08/25/ai-ethics-and-fairness-resources8.html" rel="alternate" type="text/html" title="AI Ethics and Fairness Resources" /><published>2021-08-25T00:00:00+00:00</published><updated>2021-08-25T00:00:00+00:00</updated><id>/artificial-intelligence/2021/08/25/ai-ethics-and-fairness-resources8</id><content type="html" xml:base="/artificial-intelligence/2021/08/25/ai-ethics-and-fairness-resources8.html"><![CDATA[<p>AI and data ethics and fairness have become a very hot topic lately. From computer vision models that can’t see everyone equally to the debacle at Google’s AI division, it’s something that we all need to look out for when doing any type of work with data.</p>

<p>With that, I’d like to share some resources I’ve found useful when researching this topic. Some are videos that go over how bias can get into data, and others are research papers on how to help mitigate it.</p>

<p>For a video version of this post, check below:</p>

<h1 id="videos">Videos</h1>

<p>There are quite a lot of videos that go over AI ethics. Below are a few of my favorites that have a good amount of information in them.</p>

<ul>
  <li>
    <p><a href="https://www.youtube.com/watch?v=fMym_BKWQzk">The Trouble with Bias</a> by Kate Crawford - This talk was given at the Neural Information Processing Systems (NIPS) conference in 2017. Not only does Kate go over what exactly bias in machine learning models is, but she also covers the harms that it can cause.</p>
  </li>
  <li>
    <p><a href="https://www.youtube.com/watch?v=ZtN6Qx4KddY">Machine Learning and Fairness</a> by Hanna Wallach and Jennifer Wortman Vaughan - This is actually one of my favorite resources on the list. The video goes into several aspects of fairness in machine learning, including the types of bias that can appear in your data as well as ways to help mitigate them, such as the Datasheets for Datasets paper linked in the papers section.</p>
  </li>
  <li>
    <p><a href="https://www.youtube.com/watch?v=I-TSjiXGfSI">Transparency and Intelligibility Throughout the Machine Learning Life Cycle</a> by Jennifer Wortman Vaughan - This goes through the entire machine learning life cycle and shows how best to incorporate transparency at each stage.</p>
  </li>
</ul>

<h1 id="courses">Courses</h1>

<p>There are a couple of courses that go over AI ethics and I believe more will be on the way as time goes on.</p>

<ul>
  <li><a href="https://ethics.fast.ai/">FastAI Ethics</a> - FastAI’s ethics course is probably one of the most comprehensive out there. It has several lectures and each lecture has supplemental materials such as articles and even research papers.</li>
</ul>

<h1 id="books">Books</h1>

<p>Just like courses are coming to teach people about AI ethics, books are also coming to do the same and also to help how you can prevent bias from creeping into your models.</p>

<ul>
  <li><a href="https://www.manning.com/books/interpretable-ai">Interpretable AI</a> by Ajay Thampi - One of the first books I’ve seen on this subject, this book helps you understand why the need for having models that are interpretable and shows how to do it.</li>
</ul>

<h1 id="papers-and-documents">Papers and Documents</h1>

<p>A lot of the information in the other categories come from earlier research done on data bias and AI ethics. As a result of the research some documents have also come out of it to help people creating models to mitigate the amount of bias in their data.</p>

<ul>
  <li>
    <p><a href="https://www.microsoft.com/en-us/research/publication/manipulating-and-measuring-model-interpretability/">Manipulating and Measuring Model Interpretability</a> - This paper goes into how to measure model interpretability. It also helps answer the question about what is interpretability in terms of a machine learning model.</p>
  </li>
  <li>
    <p><a href="https://arxiv.org/pdf/1803.09010.pdf">Datasheets for Datasets</a> - In electronics, there is a datasheet accompanied by each component that describes its characteristics, any testing done on it, etc. This paper proposes the idea of having the same for machine learning data.</p>
  </li>
  <li>
    <p><a href="https://www.microsoft.com/en-us/research/project/ai-fairness-checklist/">AI Fairness Checklist</a> - This document has a checklist that one can follow throughout the lifecycle of creating a model to lookout for fairness.</p>
  </li>
</ul>

<h1 id="tools">Tools</h1>

<p>Thankfully, there are some tools out there that can help us interpret how models are making their predictions as well as assessing fairness within the models.</p>

<ul>
  <li>
    <p><a href="https://fairlearn.org/">Microsoft Fairlearn</a> - This Python tool helps assess fairness in your data and models. There is a <a href="https://www.youtube.com/watch?v=Ts6tB2p97ek">demo</a> available that helps show how it works.</p>
  </li>
  <li>
    <p><a href="https://interpret.ml/">Microsoft InterpretML</a> - Another Python tool to help interpret machine learning models. This one also has a <a href="https://www.youtube.com/watch?v=WwBeKMQ0-I8">demo</a> available.</p>
  </li>
</ul>

<p>Hopefully, this list gave you a good idea about data and AI ethics and fairness. There are definitely many more resources out there and I have been partial to Microsoft for their research and resources.</p>

<p>There will be more posts on ethics and fairness in the future as well, especially covering the two tools from Microsoft, Fairlearn and InterpretML.</p>]]></content><author><name></name></author><category term="artificial-intelligence" /><category term="ai" /><category term="ethics" /><summary type="html"><![CDATA[AI and data ethics and fairness have become a very hot topic lately. From computer vision models that can’t see everyone equally to the debacle at Google’s AI division, it’s something that we all need to look out for when doing any type of work with data.]]></summary></entry><entry><title type="html">What’s New in ML.NET Version 1.6</title><link href="/mlnet/2021/07/16/whats-new-in-mlnet-version-16.html" rel="alternate" type="text/html" title="What’s New in ML.NET Version 1.6" /><published>2021-07-16T00:00:00+00:00</published><updated>2021-07-16T00:00:00+00:00</updated><id>/mlnet/2021/07/16/whats-new-in-mlnet-version-16</id><content type="html" xml:base="/mlnet/2021/07/16/whats-new-in-mlnet-version-16.html"><![CDATA[<p>Another new release of <a href="http://ML.NET">ML.NET</a> is now out! The <a href="https://github.com/dotnet/machinelearning/blob/main/docs/release-notes/1.6.0/release-1.6.0.md">release notes</a> for version 1.6 have all the details, but this post will highlight the more interesting updates from this version. I’ll also include the pull request for each item in case you want to see more details on it or learn how something was implemented.</p>

<p>There were a lot of things added to this release, but the team notes that none of the additions introduce breaking changes.</p>

<p>For the video version of this post, check below.</p>

<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FUJqEYGcwNhU%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DUJqEYGcwNhU&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FUJqEYGcwNhU%2Fhqdefault.jpg&amp;key=61d05c9d54e8455ea7a9677c366be814&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" scrolling="no" title="YouTube embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>

<h2 id="support-for-arm">Support for ARM</h2>

<p>Perhaps the most exciting part of this update is the new <a href="https://github.com/dotnet/machinelearning/pull/5789">support for ARM architectures</a>. This allows most training and inference scenarios in <a href="http://ML.NET">ML.NET</a> to run on ARM devices.</p>

<p>Why is this update useful? Well, ARM architectures are almost everywhere. As mentioned in the <a href="https://devblogs.microsoft.com/dotnet/ml-net-june-updates-model-builder/#ml-net-on-arm">June update blog post</a>, ARM architectures power mobile and embedded devices, which opens up a whole world of opportunities for ML.NET on mobile phones and IoT devices.</p>

<h2 id="dataframe-updates">DataFrame Updates</h2>

<p>The <a href="https://www.nuget.org/packages/Microsoft.Data.Analysis/">DataFrame API</a> is probably one of the more exciting packages currently in its early stages. Why? Well, .NET doesn’t have much that competes with Python’s pandas for data analysis and data wrangling, the kind of preprocessing you may need before sending data into <a href="http://ML.NET">ML.NET</a> to build a model.</p>

<p>Why am I including DataFrame updates in an <a href="http://ML.NET">ML.NET</a> update? Well, the DataFrame API has been <a href="https://github.com/dotnet/machinelearning/pull/5641">moved into the ML.NET repository</a>! The code used to live in the <a href="https://github.com/dotnet/corefxlab">CoreFx Lab repository</a> as an experimental package, but it’s no longer experimental and is now part of <a href="http://ML.NET">ML.NET</a>. This is great news since many more updates to this API are planned.</p>

<p>Other DataFrame updates include:</p>

<ul>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5821">GroupBy operation extended</a> - While the DataFrame API already had a GroupBy operation, this update adds new property groupings and makes it act more like LINQ’s GroupBy operation.</li>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5711">Improved CSV parsing</a> - Implemented the <code class="language-plaintext highlighter-rouge">TextFieldParser</code> that can be used when loading a CSV file. This allows the handling of quotes in columns.</li>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5712">Convert <code class="language-plaintext highlighter-rouge">IDataView</code> to <code class="language-plaintext highlighter-rouge">DataFrame</code></a> - We already had a way to convert a <code class="language-plaintext highlighter-rouge">DataFrame</code> object into an <code class="language-plaintext highlighter-rouge">IDataView</code> so data loaded with the DataFrame API can be used in <a href="http://ML.NET">ML.NET</a>, but now we can do the opposite: load data in <a href="http://ML.NET">ML.NET</a> and convert it into a <code class="language-plaintext highlighter-rouge">DataFrame</code> object to perform further analysis on it.</li>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5834">Improved DateTime parsing</a> - This allows for better parsing of date time data.</li>
  <li>Improvements to the <a href="https://github.com/dotnet/machinelearning/pull/5776">Sort</a> and <a href="https://github.com/dotnet/machinelearning/pull/5778">Merge</a> methods - These updates allow for better handling of null fields when performing a sort or merge.</li>
</ul>

<p>By the way, if you’re looking for a way to help contribute to the <a href="http://ML.NET">ML.NET</a> repository, helping with the DataFrame API is a great way to get involved. They have quite a few issues already that you can take a look at and help out with. It would be awesome if we got this package on par with pandas to help make C# a great ecosystem to perform data analysis.</p>

<p>You can use the <a href="https://github.com/dotnet/machinelearning/issues?q=is%3Aopen+is%3Aissue+label%3AMicrosoft.Data.Analysis">Microsoft.Data.Analysis label</a> on the issues to filter them out so you can see what all they need help with.</p>

<h2 id="code-enhancements">Code Enhancements</h2>

<p>Quite a few of the enhancement updates were code quality improvements. In fact, <a href="https://github.com/feiyun0112">feiyun0112</a> submitted several pull requests that improved the code quality of the repo, making it easier for folks to read and maintain.</p>

<h2 id="miscellaneous-updates">Miscellaneous Updates</h2>

<p>There were also quite a lot of updates that didn’t really tie in to a single theme. Here are some of the more interesting ones.</p>

<ul>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5797">Saving Tensorflow models in the SavedModel format</a> - Allows you to save Tensorflow models to use the <a href="https://www.tensorflow.org/guide/saved_model">SavedModel format</a> instead of freezing the graph to save it.</li>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5782">Ability to specify a temp path</a> - You can now specify the temp path location instead of it always going to the default location. This is specified in the <code class="language-plaintext highlighter-rouge">MLContext</code>.</li>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5851">Update LightGBM to version 2.3.1</a> - Using this new version can give better results when using the LightGBM algorithms.</li>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5624">Label column name suggestions in AutoML</a> - If you may have mistyped the label column name when using the AutoML API, this update will give suggestions for fixing it.</li>
  <li>Fixed <a href="https://github.com/dotnet/machinelearning/blob/main/docs/release-notes/1.6.0/release-1.6.0.md#build--test-updates">several CI issues</a> - Some tests would intermittently fail in CI builds; these updates improve their stability so you can have more confidence in your pull request checks.</li>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5811">Updated doc for cross compiling on ARM</a> - Adds a docker image that can be used.</li>
  <li><a href="https://github.com/dotnet/machinelearning/pull/5815">Updated contribution doc with help wanted tags</a> - Helps direct anyone looking to contribute on where they can find issues.</li>
</ul>

<p>These are just a few of the changes in this release. Version 1.6 has a lot of stuff in it so I encourage you to go through the <a href="https://github.com/dotnet/machinelearning/blob/main/docs/release-notes/1.6.0/release-1.6.0.md">full release notes</a> to see all the items that I didn’t include in this post.</p>

<hr />

<p>What was your favorite update in this release? Was it ARM support or the new DataFrame enhancements? Let me know in the comments!</p>]]></content><author><name></name></author><category term="mlnet" /><category term="mlnet" /><category term="mlnet-updates" /><summary type="html"><![CDATA[Another new release of ML.NET is now out! The release notes for version 1.6 has all the details, but this post will highlight all of the more interesting updates from this version. I’ll also include the pull request for each item in case you want to see more details on it or learn how something was implemented.]]></summary></entry><entry><title type="html">How the Machine Learning Process is Like Cooking</title><link href="/machine-learning/2021/05/17/how-the-machine-learning-process-is-like-cooking.html" rel="alternate" type="text/html" title="How the Machine Learning Process is Like Cooking" /><published>2021-05-17T00:00:00+00:00</published><updated>2021-05-17T00:00:00+00:00</updated><id>/machine-learning/2021/05/17/how-the-machine-learning-process-is-like-cooking</id><content type="html" xml:base="/machine-learning/2021/05/17/how-the-machine-learning-process-is-like-cooking.html"><![CDATA[<p>When creating machine learning models it’s important to follow the machine learning process in order to get the best performing model that you can into production and to keep it performing well.</p>

<p>But why cooking? First, I enjoy cooking. But also, it is something we all do. Now, we all don’t make five course meals every day or aim to be a Michelin star chef. We do follow a process to make our food, though, even if it may be to just heat it up in the microwave.</p>

<p>In this post, I’ll go over the machine learning process and how it relates to cooking to give a better understanding of the process and maybe even a way to help remember the steps.</p>

<p>For the video version, check below:</p>

<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FHqrkbxd69lM&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHqrkbxd69lM&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FHqrkbxd69lM%2Fhqdefault.jpg&amp;key=61d05c9d54e8455ea7a9677c366be814&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" scrolling="no" title="YouTube embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>

<h1 id="machine-learning-process">Machine Learning Process</h1>

<p>First, let’s briefly go over the machine learning process. Here’s a diagram that’s known as the cross-industry standard process for data mining, or simply known as <a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining">CRISP-DM</a>.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1620565377528-SWZ6ZD07Y4D5U4P5CIU2/CRISPDM.png?format=original" alt="CRISP-DM process diagram" /><br /><em>Kenneth Jensen - Own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1)</em></p>

<p>The machine learning process is pretty straightforward when going through the diagram:</p>

<ul>
  <li>
    <p>Business understanding - What exactly is the problem we are trying to solve with data?</p>
  </li>
  <li>
    <p>Data understanding - What exactly is in our data? What does each column mean, and how does it relate to the business problem?</p>
  </li>
  <li>
    <p>Data prep - Data preprocessing and preparation, which can also include feature engineering</p>
  </li>
  <li>
    <p>Modeling - Training a model on our data</p>
  </li>
  <li>
    <p>Evaluation - Evaluating the model’s performance on generalized (unseen) data</p>
  </li>
  <li>
    <p>Deployment - Deploying the model for production use</p>
  </li>
</ul>

<p>Note that a couple of steps can go back and forth. You may do multiple iterations between understanding the business problem and the data, or between data prep and modeling, and you may even return to the business problem when evaluating a model.</p>

<p>Notice that there’s a circle around the whole process, which means you may even have to go back to understanding the problem once a model is deployed.</p>

<p>There are a couple of items I would add to help improve this process, though. First, we need to think about getting our data. Second, I believe we can add a separate item in the process for improving our model.</p>

<h2 id="getting-data">Getting Data</h2>

<p>I would actually add an item before or after defining the business problem, and that’s getting data. Sometimes you already have the data when you define the business problem; other times you have to go get the data after defining the problem. Either way, we need good data. You may have heard the old programming saying, “Garbage in, garbage out”, and it applies to machine learning as well.</p>

<p>We can’t have a good model unless we give it good data.</p>

<h2 id="improving-the-model">Improving the Model</h2>

<p>Once we have a model we can also spend some time improving it even further. There are techniques we can apply to tweak the model so it performs better.</p>

<p>Now that we understand the machine learning process a bit better, let’s see how it relates to cooking.</p>

<h1 id="relating-the-machine-learning-process-to-cooking">Relating the Machine Learning Process to Cooking</h1>

<p>At first glance, you may not see how the machine learning process relates to cooking at all. But let’s go into more detail of the machine learning process and how each step relates to cooking.</p>

<h2 id="business-understanding">Business Understanding</h2>

<p>One of the first things to do for the machine learning process is to get a business understanding of the problem.</p>

<p>For cooking, we know we want to make a dish, but which one? What do we want to accomplish with our dish? Is it for breakfast, lunch, or dinner? Is it for just us or do we want to create something for a family of four?</p>

<p>Knowing these will help us determine what we want to cook.</p>

<h2 id="getting-data-1">Getting Data</h2>

<p>For cooking, getting data is like shopping for ingredients. Just as “Garbage in, garbage out” applies to machine learning, we can’t make a good dish unless we start with good ingredients, and we can’t have a good model unless we give it good data.</p>

<h2 id="data-processing">Data Processing</h2>

<p>Data processing is perhaps the most important step after getting good data. How you process the data determines how well your model performs.</p>

<p>For cooking, this is equivalent to preparing your ingredients. This includes chopping ingredients such as vegetables, and keeping a consistent size when chopping counts too, since it helps the pieces cook evenly. If some pieces are smaller they can burn, and if some pieces are bigger they may not cook through.</p>

<p>Also, just as there are multiple ways to process your data in machine learning, there are different ways to prepare ingredients. In fact, there’s a term for preparing all of your ingredients before you start cooking - <a href="https://en.wikipedia.org/wiki/Mise_en_place">mise en place</a> - which is French for “everything in its place”. You see this on cooking shows all the time, where they have everything ready before they start cooking.</p>

<p>This actually also makes sense for machine learning. We have to have all of our data processing done on the training data before we can give it to the machine learning algorithm.</p>

<h2 id="modeling">Modeling</h2>

<p>Now it’s time for the actual modeling part of the process where we give our data to an algorithm.</p>

<p>In cooking, this is actually where we cook our dish. In fact, we can relate choosing a recipe to choosing a machine learning algorithm. The recipe will take the ingredients and turn out a dish, and the algorithm will take the data and turn out a model.</p>

<p>Different recipes will turn out different dishes, though. Take a salad, for instance. Depending on the recipe and the ingredients, the salad can turn out to be bright and citrusy like this <a href="https://www.foodnetwork.com/recipes/ree-drummond/kale-citrus-salad-2593844">kale citrus salad</a>. Or, it can be warm and savory like this <a href="https://www.foodnetwork.com/recipes/alton-brown/spinach-salad-with-warm-bacon-dressing-recipe-1947599">spinach salad with bacon dressing</a>.</p>

<p>They’re both salads, but they turn into different kinds of salads because of different ingredients and recipes. In machine learning, you can likewise get very different models from different data and algorithms.</p>

<p>What if you have the same ingredients? There are definitely different ways to make the same recipe. Hummus is traditionally made with chickpeas, tahini, garlic, and lemon like in <a href="https://www.foodnetwork.com/recipes/katie-lee/classic-hummus-2333947">this recipe</a>. But there is also <a href="https://www.foodnetwork.com/recipes/alton-brown/hummus-for-real-recipe-2014722">this hummus recipe</a> that has the same ingredients but the recipe is just a bit different.</p>

<h2 id="optimizing-the-model">Optimizing the Model</h2>

<p>Depending on the algorithm the machine learning model is using, we can give it different parameters that optimize the model for better performance. These parameters are called <a href="https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)">hyperparameters</a>, and they are used during the process of fitting the model to your data.</p>

<p>These can be updated manually by choosing values for the hyperparameters yourself. That can be quite tedious, though, and you never know which value to choose. Instead, this can be automated: give a range of values, train the model multiple times with different values from that range, and use the best-performing model that is found.</p>
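<p>As a sketch of that automated approach (a toy example in Python rather than ML.NET, where the made-up <code class="language-plaintext highlighter-rouge">train_and_score</code> function stands in for training a model and measuring its validation score):</p>

```python
import random

# Hypothetical stand-in for "train a model with this hyperparameter
# and report its validation score". In a real project this would fit
# and evaluate an actual model; here the score simply peaks near 0.1.
def train_and_score(learning_rate):
    return 1.0 - abs(learning_rate - 0.1)

# Random search: sample values from a range, "train" with each one,
# and keep the best-performing run that is found.
random.seed(42)
best_score, best_lr = -1.0, None
for _ in range(20):
    lr = random.uniform(0.001, 1.0)   # draw a value from the range
    score = train_and_score(lr)       # "train" with that value
    if score > best_score:            # remember the best run so far
        best_score, best_lr = score, lr

print(f"best learning rate: {best_lr:.3f} (score {best_score:.3f})")
```

<p>Manual tuning would mean hand-picking a few values like 0.01 or 0.5; the loop simply tries many values for you, which is the same idea AutoML tooling applies at a much larger scale.</p>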

<p>How do we optimize a dish, though? Perhaps the best way to get the best taste out of your dish, other than using the best ingredients, is to season it. Specifically, seasoning with salt. In <a href="https://www.youtube.com/watch?v=ITI3J5UWiyQ">this video</a>, <a href="https://www.ethanchlebowski.com/">Ethan Chlebowski</a> suggests</p>

<blockquote>
  <p>…home cooks severely under salt the food they are cooking and is often why the food doesn’t taste good.</p>
</blockquote>

<p>He even quotes this line from the book <a href="https://amzn.to/2Rslw54">Ruhlman’s Twenty</a>:</p>

<blockquote>
  <p>How to salt food is the most important skill to know in the kitchen.</p>
</blockquote>

<p>I’ve even experienced this in my own cooking where I don’t add enough salt. Once I do, the dish tastes 100 times better.</p>

<p>Now, adding salt to your dish yourself is the more manual way of optimizing it with seasoning. Is there a way this can be automated? Actually, there is! Instead of using just salt and adding other spices yourself, you can get seasoning blends that have all the spices mixed in for you!</p>

<h2 id="evaluating-the-model">Evaluating the Model</h2>

<p>Evaluating the model is one of the most important steps because it tells you how well your model will perform on new data, or rather, data it hasn’t seen before. During training your model may show good performance, but giving it new data may reveal that it actually performs badly.</p>
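<p>To make that concrete, here is a deliberately extreme sketch in Python (a toy, not ML.NET): the “model” simply memorizes its training examples, so it looks perfect on data it has seen and falls apart on held-out data.</p>

```python
# Toy dataset of (value, label) pairs. The test set is data the
# "model" has never seen.
train = [(1, "A"), (2, "A"), (3, "B"), (4, "B")]
test = [(5, "B"), (6, "B")]

# An overfit "model": a lookup table that memorizes the training data.
memorized = {x: label for x, label in train}

def predict(x):
    # Perfect recall on training data, a blind guess on anything new.
    return memorized.get(x, "A")

def accuracy(data):
    return sum(predict(x) == label for x, label in data) / len(data)

print(accuracy(train))  # 1.0 -- looks perfect during training
print(accuracy(test))   # 0.0 -- new data reveals it performs badly
```

<p>This is exactly why evaluation uses data held out from training: the training score alone can hide how badly a model generalizes.</p>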

<p>Evaluating your cooked dish is a lot more fun, though. This is where you get to eat it! You will determine if it’s a good dish by how it tastes. Is it good or bad? If you served it to others, what did they think about it?</p>

<h2 id="iterating-on-the-model">Iterating on the Model</h2>

<p>Iterating on the model is a part of the process that may not seem necessary, but it can be an important one. Your data may change over time, which would make your model stale. That is, the model relies on patterns in the data that used to hold but, due to a process change or something similar, no longer do. And since the underlying data changed, the model won’t predict as well as it did.</p>

<p>Similarly, you may have more or even better data that you can use for training, so you can then retrain the model with that to make better predictions.</p>

<p>How can you iterate on a dish that you just prepared? First thing is if it was good or bad. If it was bad, then we can revisit the recipe and see if we did anything wrong. Did we overcook it? Did we miss an ingredient? Did we prepare an ingredient incorrectly?</p>

<p>If it was good, then we can still iterate on it to make it even better.</p>

<p>A lot of chefs and home cooks like to take notes about recipes they’ve made. They write down tricks they’ve learned along the way, as well as deviations from the recipe that they either had to make due to a missing ingredient or simply preferred.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Hopefully, this helps you better understand the machine learning process through the eyes of cooking a dish. It may even help you understand the importance of each step because, in cooking, if one step is missed then you probably won’t be having a good dinner tonight.</p>

<p>And if you’re wondering where AutoML fits into all of this, you can think of it as a meal delivery kit like Hello Fresh or Blue Apron. They do a lot of the work for you and you just have to put it all together.</p>]]></content><author><name></name></author><category term="machine-learning" /><category term="machine-learning" /><summary type="html"><![CDATA[When creating machine learning models it’s important to follow the machine learning process in order to get the best performing model that you can into production and to keep it performing well.]]></summary></entry><entry><title type="html">What’s New in the Model Builder Preview</title><link href="/mlnet/2021/05/10/whats-new-in-the-model-builder-preview.html" rel="alternate" type="text/html" title="What’s New in the Model Builder Preview" /><published>2021-05-10T00:00:00+00:00</published><updated>2021-05-10T00:00:00+00:00</updated><id>/mlnet/2021/05/10/whats-new-in-the-model-builder-preview</id><content type="html" xml:base="/mlnet/2021/05/10/whats-new-in-the-model-builder-preview.html"><![CDATA[<p>The ML.NET graphical tool, Model Builder, continues to get better and better for everyone to work with and, most important, for everyone to get into machine learning. Recently, there have been some really good additions to Model Builder that we will go over in this post. We will go through the entire flow for Model Builder and will highlight each of the new items.</p>

<p>If you prefer to see a video of these updates, check the video below.</p>

<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F7Y4lb_BWUs0&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D7Y4lb_BWUs0&amp;image=http%3A%2F%2Fi.ytimg.com%2Fvi%2F7Y4lb_BWUs0%2Fhqdefault.jpg&amp;key=61d05c9d54e8455ea7a9677c366be814&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" scrolling="no" title="YouTube embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>

<p>The team is testing out these new preview items, so you currently need to opt in through this <a href="http://aka.ms/blog-mb-preview">Microsoft form</a> in order to participate. Once you sign up there, you will receive an email with instructions on how to install the preview version.</p>

<p>For even more information about this version of Model Builder and ML.NET version 1.5.5, check out this Microsoft <a href="https://devblogs.microsoft.com/dotnet/ml-net-and-model-builder-march-updates/#model-builder-preview">blog post</a>.</p>

<p>The data for this post will be this <a href="https://www.kaggle.com/shrutimehta/nasa-asteroids-classification">NASA Asteroids Classification</a> dataset. We will use it to determine whether an asteroid is potentially hazardous or not. That is, whether it would come close enough to Earth to be a threat.</p>

<p>Perhaps the biggest addition to the preview version is the new Model Builder config file. Let’s look at this being used in action.</p>

<p>Once you have the preview version installed, perform the same steps as usual to bring up Model Builder by right-clicking on a project and selecting Add -&gt; Machine Learning. It will now bring up a dialog for your Model Builder project.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829542290-3XG190SK91SS8G1UEWEZ/post1-1.png?format=original" alt="" /></p>

<p>Here we can give our Model Builder project a name. We’ll name it Asteroids and click to continue. Now the regular Model Builder window shows up, but if you look at Solution Explorer, a new file was added: the <code class="language-plaintext highlighter-rouge">mbconfig</code> file. We will look at what’s in this file later.</p>

<p>We can use Model Builder like usual through the first couple of steps. We’ll choose the Classification scenario and will train locally. Then we’ll add the data file, which may take a few seconds since there’s a lot of data in it.</p>

<p>Once it’s loaded we can specify the label column, which will be the “Hazardous” column at the end.</p>

<p>Let’s now explore the updated data options that we get with this preview version. To get there, select the “Advanced data options” link below where you choose the data file. This opens a new dialog where we can update the data options. These will be auto-filled based on what Model Builder determines from the data, but if you want to override them, these options are available.</p>

<blockquote>
  <p>Note that there’s a small bug in the current version for dark theme of Visual Studio. I have created <a href="https://github.com/dotnet/machinelearning-modelbuilder/issues/1410">an issue</a> to let the team know about it. For this section, I’ll use the light theme.</p>
</blockquote>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829555385-LE6PHMVZAGNL2VMZ7SS3/post1-2.png?format=original" alt="" /></p>

<p>The first section, after the column names, is the column’s purpose. Is it a feature or a label? If it’s neither, we can select to ignore the column.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829567589-NPRWE9LYDUAOT1315QRL/post1-3.png?format=original" alt="" /></p>

<p>The second section is what data type the column is. You can choose either a string, single (or float), or boolean.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829590656-0Q8L0E21GQNYB9XPQA8R/post1-4.png?format=original" alt="" /></p>

<p>The last section is a checkbox to tell Model Builder whether the column is a categorical feature, meaning that it contains a distinct set of string values. Model Builder already determined that the “Orbiting body” column is categorical.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829604458-BCZZUAOL19PVS6R8QAMI/post1-5.png?format=original" alt="" /></p>

<p>Also, notice that we can filter the columns with the text field on the upper right. So if I want to see all the columns with “orbit” in the name, I can just type that in and it will filter the list for me. This is definitely helpful for datasets that have a lot of features.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829618488-QM11YRYCK8YNQBFD81SR/post1-6.png?format=original" alt="" /></p>

<p>Compare this to what we had in the previous version. These new options give you the same thing, but they are now simpler and show more within the dialog.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829654184-J79AFF070UDDTDNNLW16/post1-7.png?format=original" alt="" /></p>

<p>The data formatting options haven’t changed, though. That’s where you can specify if the data has a header row, what the delimiter is, or specify if the decimals in the file use a dot (.) or a comma (,).</p>

<p>Now we can train our model. I’ll set the train time to be 20 seconds and fire it off to see how it goes.</p>

<p>Our top five models actually look pretty good. The top trainer has micro and macro accuracies at around 99%!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|                                 Top 5 models explored                                   |
-------------------------------------------------------------------------------------------
|     Trainer                          MicroAccuracy  MacroAccuracy  Duration #Iteration  |
|11   FastForestOva                      0.9980         0.9988       1.0         11       |
|12   FastTreeOva                        0.9960         0.9882       0.9         12       |
|9    FastTreeOva                        0.9960         0.9882       0.8          9       |
|0    FastTreeOva                        0.9960         0.9882       2.0          0       |
|10   LinearSvmOva                       0.9177         0.8709       2.5         10       |
-------------------------------------------------------------------------------------------
</code></pre></div></div>

<p>Let’s now go straight to the consume step. There’s a bit more information here than in the previous version.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829681558-RXHB68NZRIMCF2MYEAJB/post1-8.png?format=original" alt="" /></p>

<p>Here they give you the option to add projects to your current solution for consuming the model. Keep a watch on this page, though, as I’m sure more options will be coming. They also give you some sample data that you can use to help test the consumption of your model.</p>

<p>Now, let’s take a moment and look again at our <code class="language-plaintext highlighter-rouge">mbconfig</code> file. In fact, you will notice a couple of more files here.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829697404-7KJWJQJV7STHXNJAUZJA/post1-9.png?format=original" alt="" /></p>

<p>There are now consumption and training files that we can look at. These files are similar to the training and consuming projects that would get added to your solution, but you don’t have to add them as separate projects if you don’t want to.</p>

<p>By the way, if, for any reason, we need to close the dialog and come back at another time to change the data options or increase the training time, we can double-click on the <code class="language-plaintext highlighter-rouge">mbconfig</code> file to bring it back. This not only brings back the Model Builder dialog, it also retains its state so we don’t have to do it all over again.</p>

<p>The reason for that is that the <code class="language-plaintext highlighter-rouge">mbconfig</code> file keeps track of everything, which we can see if we open it in a JSON editor.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1618829711446-6MW6EBYKI3L0U55B7EW7/post1-10.png?format=original" alt="" /></p>

<p>This keeps track of everything, even the history of all of the runs within Model Builder! And, since this is a JSON file, we can keep this in version control so teams can work on this together to get the best model they can.</p>

<hr />

<p>Hopefully, this showed how much the team has done to help improve Model Builder. Definitely give feedback on their <a href="https://github.com/dotnet/machinelearning-modelbuilder/issues">GitHub issues page</a> for any issues or feature requests.</p>]]></content><author><name></name></author><category term="mlnet" /><category term="mlnet" /><category term="model-builder" /><summary type="html"><![CDATA[The ML.NET graphical tool, Model Builder, continues to get better and better for everyone to work with and, most important, for everyone to get into machine learning. Recently, there have been some really good additions to Model Builder that we will go over in this post. We will go through the entire flow for Model Builder and will highlight each of the new items.]]></summary></entry><entry><title type="html">How to Build the ML.NET Repository</title><link href="/mlnet/2021/05/03/how-to-build-the-mlnet-repository.html" rel="alternate" type="text/html" title="How to Build the ML.NET Repository" /><published>2021-05-03T00:00:00+00:00</published><updated>2021-05-03T00:00:00+00:00</updated><id>/mlnet/2021/05/03/how-to-build-the-mlnet-repository</id><content type="html" xml:base="/mlnet/2021/05/03/how-to-build-the-mlnet-repository.html"><![CDATA[<p>Have you wanted to contribute a bug fix or a new feature to the ML.NET repository? The first step is to pull down the repository from GitHub and get it built successfully so you can start making changes.</p>

<p>The ML.NET repository has great documentation, including a doc on how to build it locally. In this post, we’ll go over the steps so you can do the same and get started making changes to the ML.NET repository.</p>

<p>For a video version of this post, check below.</p>

<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FQUyZL_Tea7A%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DQUyZL_Tea7A&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FQUyZL_Tea7A%2Fhqdefault.jpg&amp;key=61d05c9d54e8455ea7a9677c366be814&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" scrolling="no" title="YouTube embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>

<h1 id="fork-the-repository">Fork the Repository</h1>

<p>The first thing to do, if you haven’t already, is to fork the <a href="https://github.com/dotnet/machinelearning">ML.NET repository</a>.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1619593706109-DNB75LK5H7357FM3NI97/image+%281%29.png?format=original" alt="" /></p>

<p>If you haven’t forked the repository yet, you’re good to go to the next step. However, for me, since I have already forked the repository a while back, I need to make sure I have the latest.</p>

<p>There are two ways to sync up my fork with the main repository - running git commands or letting GitHub do it for you.</p>

<h2 id="syncing-the-fork">Syncing the Fork</h2>

<p>We can run a few git commands to sync up. GitHub has <a href="https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork">good documentation</a> on how to do this if you want a more detailed explanation.</p>

<p>The first thing to do is to make sure you have an <code class="language-plaintext highlighter-rouge">upstream</code> remote set up that points to the main repository.</p>

<p>To check if you have it you can run the <code class="language-plaintext highlighter-rouge">git remote -v</code> command. If there is only an <code class="language-plaintext highlighter-rouge">origin</code> remote then you would need to add an <code class="language-plaintext highlighter-rouge">upstream</code> remote that points to the original repository.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1619593858898-FDAMH7OD5VNHSLG6NBNF/image+%282%29.png?format=original" alt="" /></p>

<p>If you don’t have it set, this can be set with the following command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git remote add upstream git@github.com:dotnet/machinelearning.git
</code></pre></div></div>

<p>Note that I have SSH set up so I use the SSH clone link. If you don’t have this set up you can use the HTTPS link instead.</p>

<p>After setting the <code class="language-plaintext highlighter-rouge">upstream</code> remote, we need to fetch the latest from it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git fetch upstream
</code></pre></div></div>

<p>Once the upstream is fetched, we can merge those changes into our fork. Make sure you’re on the default branch and run this command to merge in the changes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git merge upstream/main
</code></pre></div></div>

<p>Now you can start working on the latest code base.</p>

<p>Note here that I attempted to use GitHub to sync my fork. Unfortunately, it doesn’t seem to do as good a job as the git commands.</p>

<h1 id="install-dependencies">Install Dependencies</h1>

<p>Before we can start to build the code, there is a dependency we need to install. This dependency is included with a <a href="https://git-scm.com/book/en/v2/Git-Tools-Submodules">git submodule</a>.</p>

<p>If you run the build before this step you will get errors, so it’s best to do this before running the build.</p>

<p>To install the submodule dependencies, run the below command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git submodule update --init
</code></pre></div></div>

<p>With the submodules installed we can now run the build through the command line.</p>

<h1 id="build-on-the-command-line">Build on the Command Line</h1>

<p>The build script in the <a href="http://ML.NET">ML.NET</a> repository is very well made, so there’s very little you have to do to actually run it from the command line. The script you run depends on whether you use Windows or Linux/Mac.</p>

<p>For Windows, you would run <code class="language-plaintext highlighter-rouge">build.cmd</code> and for Mac/Linux you would run <code class="language-plaintext highlighter-rouge">build.sh</code>.</p>

<p>The first time you run it, it will take a while. It needs to download several assets, such as NuGet packages and models used for testing. Once all of this is downloaded, though, subsequent builds will go much faster.</p>

<h1 id="build-in-visual-studio">Build in Visual Studio</h1>

<p>With the main build now complete we can now build within Visual Studio. Although, currently, you may get an error in the <code class="language-plaintext highlighter-rouge">Microsoft.ML.Samples.GPU</code> project.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1619593799577-1HFE8ZP3TG38ZFRZMW15/image+%283%29.png?format=original" alt="" /></p>

<p>Why do we get this error in Visual Studio and not when we ran the build on the command line? It turns out that Visual Studio is set to treat warnings as compile errors in this project. There are a couple of things you can do to fix this.</p>

<p>First, since this is a samples project, the simplest thing is to just comment out the method. Instead of doing that, though, we can update the build properties of the project. One option is to set “Treat warnings as errors” to “None”.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1619593889820-26XYJT3JN2Q10FHWMEW3/image+%284%29.png?format=original" alt="" /></p>

<p>Or, we can update the “Suppress warnings” setting to specify this specific warning. To find the warning number, we can go back and hover over the error with our cursor, which brings up a tooltip describing the error. It has a link to the <a href="https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/cs0618">CS0618</a> warning. We can put that number, 0618, in the “Suppress warnings” section and save the project.</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/521545a6e4b0734032a27076/1619593913620-MJB70DVEIZSC53BIGG8U/image+%285%29.png?format=original" alt="" /></p>

<p>Now we can fully build the solution in Visual Studio. Take note of this change when committing any other changes, though. You can either leave it out of your commits or include it and add a comment to discuss it with the <a href="http://ML.NET">ML.NET</a> team.</p>

<hr />

<p>Hopefully, this post helps you get started to contribute to the <a href="http://ML.NET">ML.NET</a> repository. If you make a contribution to the <a href="http://ML.NET">ML.NET</a> repository, please let me know and we can celebrate!</p>

<iframe src="https://giphy.com/embed/Zw3oBUuOlDJ3W" width="480" height="446" frameborder="0" class="giphy-embed" allowfullscreen=""></iframe>]]></content><author><name></name></author><category term="mlnet" /><category term="mlnet" /><summary type="html"><![CDATA[Have you wanted to contribute a bug fix or a new feature to the ML.NET repository? The first step is to pull down the repository from GitHub and get it built successfully so you can start making changes.]]></summary></entry><entry><title type="html">About</title><link href="/2021/04/18/about.html" rel="alternate" type="text/html" title="About" /><published>2021-04-18T00:00:00+00:00</published><updated>2021-04-18T00:00:00+00:00</updated><id>/2021/04/18/about</id><content type="html" xml:base="/2021/04/18/about.html"><![CDATA[<p>Just a regular developer in North Carolina trying to learn as much as I can and, in turn, to use this site to help share those things with everyone else.</p>

<p>I’ve tackled many different technologies and areas of programming. I started out as a .NET web developer and worked my way to more modern web technologies such as Angular and React. I’m currently pivoting into the world of data science, where I’m learning about Python, statistics, machine learning, and data analysis.</p>

<p>I do my best to teach what I know here and on my <a href="https://www.youtube.com/channel/UCrDke-1ToEZOAPDfrPGNdQw">YouTube channel</a>.</p>

<p>Feel free to use the <a href="https://dotnetmeditations.com/contact/">Contact form</a> for any questions you may have when going through this site.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Just a regular developer in North Carolina trying to learn as much as I can and, in turn, to use this site to help share those things with everyone else.]]></summary></entry></feed>