跳转到主要内容

热门内容

今日:


总体:


最近浏览:


Chinese, Simplified

category

Important

SharePoint Online indexer support is in public preview. It's offered "as-is", under Supplemental Terms of Use and supported on best effort only. Preview features aren't recommended for production workloads and aren't guaranteed to become generally available.

Be sure to visit the known limitations section before you start.

To use this preview, fill out this form. You won't be receiving any approval notification right after since any access request is automatically accepted after submission. After access is enabled, use a preview REST API to index your content.

This article explains how to configure a search indexer to index documents stored in SharePoint document libraries for full text search in Azure AI Search. Configuration steps are first, followed by behaviors and scenarios

Functionality

 

Azure AI Search中的索引器是一种从数据源中提取可搜索数据和元数据的爬虫。SharePoint Online索引器连接到您的SharePoint网站,并对一个或多个文档库中的文档进行索引。索引器提供以下功能:

  • 索引一个或多个文档库中的文件和元数据。
  • 增量索引,只拾取新的和更改的文件和元数据。
  • 删除检测是内置的。文档库中的删除将在下次运行索引器时被拾取,文档将从索引中删除。
  • 默认情况下,从索引的文档中提取文本和规范化图像。您可以选择添加一组技能,以更深入地丰富人工智能,如OCR或文本翻译。

Prerequisites

Supported document formats

The SharePoint Online indexer can extract text from the following document formats:

  • CSV (see Indexing CSV blobs)
  • EML
  • EPUB
  • GZ
  • HTML
  • JSON (see Indexing JSON blobs)
  • KML (XML for geographic representations)
  • Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 WORD XML)
  • Open Document formats: ODT, ODS, ODP
  • PDF
  • Plain text files (see also Indexing plain text)
  • RTF
  • XML
  • ZIP

Limitations and considerations

Here are the limitations of this feature:

  • 不支持对SharePoint列表进行索引。
  • 正在为SharePoint建立索引。ASPX网站内容不受支持。
  • 不支持OneNote笔记本文件。
  • 不支持专用终结点。
  • 重命名SharePoint文件夹不会触发增量索引。重命名的文件夹被视为新内容。
  • SharePoint支持细粒度授权模型,该模型在文档级别确定每个用户的访问权限。索引器不会将这些权限拉入索引,Azure AI Search不支持文档级授权。当文档从SharePoint索引到搜索服务时,任何具有索引读取权限的人都可以访问该内容。如果您需要文档级权限,则应考虑使用安全过滤器来修剪结果,并自动将文件级的权限复制到索引中的字段。
  • 不支持对用户加密文件、受信息权限管理(IRM)保护的文件、带密码的ZIP文件或类似的加密内容进行索引。对于要处理的加密内容,对特定文件具有适当权限的用户必须删除加密,以便在索引器运行下一次计划迭代时对项目进行相应的索引。
  • 不支持从提供的特定站点递归索引子站点。

Here are the considerations when using this feature:

  • 如果您在生产环境中需要SharePoint内容索引解决方案,请考虑使用SharePoint Webhook创建自定义连接器,调用Microsoft Graph API将数据导出到Azure Blob容器,然后使用Azure Blob索引器进行增量索引。
  • 如果您的SharePoint配置允许Microsoft 365进程更新SharePoint文件系统元数据,请注意,这些更新可能会触发SharePoint Online索引器,导致索引器多次摄取文档。由于SharePoint Online索引器是Azure的第三方连接器,因此索引器无法读取配置或更改其行为。它对新内容和更改内容的更改做出响应,无论这些更新是如何进行的。因此,在使用索引器和任何AI增强之前,请确保测试您的设置并了解文档处理计数。

Configure the SharePoint Online indexer

To set up the SharePoint Online indexer, use both the Azure portal and a preview REST API. You can use 2020-06-30-preview or later. We recommend the latest preview API.

This section provides the steps. You can also watch the following video.

Step 1 (Optional): Enable system assigned managed identity

Enable a system-assigned managed identity to automatically detect the tenant the search service is provisioned in.

Perform this step if the SharePoint site is in the same tenant as the search service. Skip this step if the SharePoint site is in a different tenant. The identity isn't used for indexing, just tenant detection. You can also skip this step if you want to put the tenant ID in the connection string.

Screenshot showing how to enable system assigned managed identity.

After selecting Save, you get an Object ID that has been assigned to your search service.

Screenshot the object identifier.

Step 2: Decide which permissions the indexer requires

The SharePoint Online indexer supports both delegated and application permissions. Choose which permissions you want to use based on your scenario.

We recommend app-based permissions. See limitations for known issues related to delegated permissions.

  • 应用程序权限(推荐),其中索引器以SharePoint租户的身份运行,可以访问所有网站和文件。索引器需要客户端机密。索引器还需要租户管理员批准才能索引任何内容。
  • 委托权限,其中索引器以发送请求的用户或应用程序的身份运行。数据访问仅限于调用者可以访问的站点和文件。为了支持委派权限,索引器需要一个设备代码提示来代表用户登录。根据用于实现此身份验证类型的最新安全库,用户委托权限每75分钟强制令牌过期一次。这不是一种可以调整的行为。过期的令牌需要使用Run Indexer(预览版)进行手动索引。因此,您可能需要基于应用程序的权限。

If your Microsoft Entra organization has conditional access enabled and your administrator isn't able to grant any device access for delegated permissions, you should consider app-based permissions instead. For more information, see Microsoft Entra Conditional Access policies.

 

Step 3: Create a Microsoft Entra application registration

The SharePoint Online indexer uses this Microsoft Entra application for authentication.

  1. Sign in to the Azure portal.

  2. Search for or navigate to Microsoft Entra ID, then select App registrations.

  3. Select + New registration:

    1. Provide a name for your app.
    2. Select Single tenant.
    3. Skip the URI designation step. No redirect URI required.
    4. Select Register.
  4. On the left, select API permissions, then Add a permission, then Microsoft Graph.

    • If the indexer is using application API permissions, then select Application permissions and add the following:

      • Application - Files.Read.All
      • Application - Sites.Read.All

      Screenshot of application API permissions.

      Using application permissions means that the indexer accesses the SharePoint site in a service context. So when you run the indexer it will have access to all content in the SharePoint tenant, which requires tenant admin approval. A client secret is also required for authentication. Setting up the client secret is described later in this article.

    • If the indexer is using delegated API permissions, select Delegated permissions and add the following:

      • Delegated - Files.Read.All
      • Delegated - Sites.Read.All
      • Delegated - User.Read

      Screenshot showing delegated API permissions.

      Delegated permissions allow the search client to connect to SharePoint under the security identity of the current user.

  5. Give admin consent.

    Tenant admin consent is required when using application API permissions. Some tenants are locked down in such a way that tenant admin consent is required for delegated API permissions as well. If either of these conditions apply, you’ll need to have a tenant admin grant consent for this Microsoft Entra application before creating the indexer.

    Screenshot showing Microsoft Entra app grant admin consent.

  6. Select the Authentication tab.

  7. Set Allow public client flows to Yes then select Save.

  8. Select + Add a platform, then Mobile and desktop applications, then check 

    https://login.microsoftonline.com/common/oauth2/nativeclient, then Configure.

    Screenshot showing Microsoft Entra app authentication configuration.

  9. (Application API Permissions only) To authenticate to the Microsoft Entra application using application permissions, the indexer requires a client secret.

    • Select Certificates & Secrets from the menu on the left, then Client secrets, then New client secret.

      Screenshot showing new client secret.

    • In the menu that pops up, enter a description for the new client secret. Adjust the expiration date if necessary. If the secret expires, it needs to be recreated and the indexer needs to be updated with the new secret.

      Screenshot showing how to set up a client secret.

    • The new client secret appears in the secret list. Once you navigate away from the page, the secret is no longer be visible, so copy it using the copy button and save it in a secure location.

      Screenshot showing where to copy a client secret.

 

Step 4: Create data source

Starting in this section, use a preview REST API for the remaining steps. We recommend the latest preview API.

A data source specifies which data to index, credentials, and policies to efficiently identify changes in the data (new, modified, or deleted rows). A data source can be used by multiple indexers in the same search service.

For SharePoint indexing, the data source must have the following required properties:

  • name is the unique name of the data source within your search service.
  • type must be "sharepoint". This value is case-sensitive.
  • credentials provide the SharePoint endpoint and the Microsoft Entra application (client) ID. An example SharePoint endpoint is 
  • https://microsoft.sharepoint.com/teams/MySharePointSite. You can get the endpoint by navigating to the home page of your SharePoint site and copying the URL from the browser.
  • container specifies which document library to index. Properties control which documents are indexed.

To create a data source, call Create Data Source (preview).

HTTP
POST https://[service name].search.windows.net/datasources?api-version=2024-05-01-preview
Content-Type: application/json
api-key: [admin key]

{
    "name" : "sharepoint-datasource",
    "type" : "sharepoint",
    "credentials" : { "connectionString" : "[connection-string]" },
    "container" : { "name" : "defaultSiteLibrary", "query" : null }
}

Connection string format

The format of the connection string changes based on whether the indexer is using delegated API permissions or application API permissions

  • Delegated API permissions connection string format

    SharePointOnlineEndpoint=[SharePoint site url];ApplicationId=[Azure AD App ID];TenantId=[SharePoint site tenant id]

  • Application API permissions connection string format

    SharePointOnlineEndpoint=[SharePoint site url];ApplicationId=[Azure AD App ID];ApplicationSecret=[Azure AD App client secret];TenantId=[SharePoint site tenant id]

Note

If the SharePoint site is in the same tenant as the search service and system-assigned managed identity is enabled, TenantId doesn't have to be included in the connection string. If the SharePoint site is in a different tenant from the search service, TenantId must be included.

Step 5: Create an index

The index specifies the fields in a document, attributes, and other constructs that shape the search experience.

To create an index, call Create Index (preview):

HTTP
POST https://[service name].search.windows.net/indexes?api-version=2024-05-01-preview
Content-Type: application/json
api-key: [admin key]

{
    "name" : "sharepoint-index",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "metadata_spo_item_name", "type": "Edm.String", "key": false, "searchable": true,
         "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_path", "type": "Edm.String", "key": false, "searchable": false, 
        "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_content_type", "type": "Edm.String", "key": false, "searchable": false, 
        "filterable": true, "sortable": false, "facetable": true },
        { "name": "metadata_spo_item_last_modified", "type": "Edm.DateTimeOffset", "key": false, "searchable": false, 
        "filterable": false, "sortable": true, "facetable": false },
        { "name": "metadata_spo_item_size", "type": "Edm.Int64", "key": false, "searchable": false, "filterable": false, 
        "sortable": false, "facetable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
    ]
}

Important

Only metadata_spo_site_library_item_id may be used as the key field in an index populated by the SharePoint Online indexer. If a key field doesn't exist in the data source, metadata_spo_site_library_item_id is automatically mapped to the key field.

Step 6: Create an indexer

An indexer connects a data source with a target search index and provides a schedule to automate the data refresh. Once the index and data source are created, you can create the indexer.

If you are using delegated permissions, during this step, you’re asked to sign in with organization credentials that have access to the SharePoint site. If possible, we recommend creating a new organizational user account and giving that new user the exact permissions that you want the indexer to have.

There are a few steps to creating the indexer:

  1. Send a Create Indexer (preview) request:

    HTTP
    POST https://[service name].search.windows.net/indexers?api-version=2024-05-01-preview
    Content-Type: application/json
    api-key: [admin key]
    
    {
        "name" : "sharepoint-indexer",
        "dataSourceName" : "sharepoint-datasource",
        "targetIndexName" : "sharepoint-index",
        "parameters": {
        "batchSize": null,
        "maxFailedItems": null,
        "maxFailedItemsPerBatch": null,
        "base64EncodeKeys": null,
        "configuration": {
            "indexedFileNameExtensions" : ".pdf, .docx",
            "excludedFileNameExtensions" : ".png, .jpg",
            "dataToExtract": "contentAndMetadata"
          }
        },
        "schedule" : { },
        "fieldMappings" : [
            { 
              "sourceFieldName" : "metadata_spo_site_library_item_id", 
              "targetFieldName" : "id", 
              "mappingFunction" : { 
                "name" : "base64Encode" 
              } 
             }
        ]
    }
    

    If you're using application permissions, it's necessary to wait until the initial run is complete before starting to query your index. The following instructions provided in this step pertain specifically to delegated permissions, and are not applicable to application permissions.

  2. When you create the indexer for the first time, the Create Indexer (preview) request waits until you complete the next step. You must call Get Indexer Status to get the link and enter your new device code.

    HTTP
    GET https://[service name].search.windows.net/indexers/sharepoint-indexer/status?api-version=2024-05-01-preview
    Content-Type: application/json
    api-key: [admin key]
    

    If you don’t run the Get Indexer Status within 10 minutes, the code expires and you’ll need to recreate the data source.

  3. Copy the device login code from the Get Indexer Status response. The device login can be found in the "errorMessage".

    HTTP
    {
        "lastResult": {
            "status": "transientFailure",
            "errorMessage": "To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code 
            <CODE> to authenticate."
        }
    }
    
  4. Provide the code that was included in the error message.

    Screenshot showing how to enter a device code.

  5. The SharePoint Online indexer will access the SharePoint content as the signed-in user. The user that logs in during this step will be that signed-in user. So, 

    if you sign in with a user account that doesn’t have access to a document in the Document Library that you want to index, the indexer won’t have access to that document.

    If possible, we recommend creating a new user account and giving that new user the exact permissions that you want the indexer to have.

  6. Approve the permissions that are being requested.

    Screenshot showing how to approve API permissions.

  7. The Create Indexer (preview) initial request completes if all the permissions provided above are correct and within the 10 minute timeframe.

Note

If the Microsoft Entra application requires admin approval and was not approved before logging in, you may see the following screen. Admin approval is required to continue. Screenshot showing admin approval required.

Step 7: Check the indexer status

After the indexer has been created, you can call Get Indexer Status:

HTTP
GET https://[service name].search.windows.net/indexers/sharepoint-indexer/status?api-version=2024-05-01-preview
Content-Type: application/json
api-key: [admin key]

Updating the data source

If there are no updates to the data source object, the indexer runs on a schedule without any user interaction.

However, if you modify the data source object while the device code is expired, you must sign in again in order for the indexer to run. For example, if you change the 

data source query, sign in again using the https://microsoft.com/devicelogin and get the new device code.

Here are the steps for updating a data source, assuming an expired device code:

  1. Call Run Indexer (preview) to manually start indexer execution.

    HTTP
    POST https://[service name].search.windows.net/indexers/sharepoint-indexer/run?api-version=2024-05-01-preview  
    Content-Type: application/json
    api-key: [admin key]
    
  2. Check the indexer status.

    HTTP
    GET https://[service name].search.windows.net/indexers/sharepoint-indexer/status?api-version=2024-05-01-preview
    Content-Type: application/json
    api-key: [admin key]
    
  3. If you get an error asking you to visit https://microsoft.com/devicelogin, open the page and copy the new code.

  4. Paste the code into the dialog box.

  5. Manually run the indexer again and check the indexer status. This time, the indexer run should successfully start.

 

Indexing document metadata

If you're indexing document metadata ("dataToExtract": "contentAndMetadata"), the following metadata will be available to index.

Identifier Type Description
metadata_spo_site_library_item_id Edm.String The combination key of site ID, library ID, and item ID, which uniquely identifies an item in a document library for a site.
metadata_spo_site_id Edm.String The ID of the SharePoint site.
metadata_spo_library_id Edm.String The ID of document library.
metadata_spo_item_id Edm.String The ID of the (document) item in the library.
metadata_spo_item_last_modified Edm.DateTimeOffset The last modified date/time (UTC) of the item.
metadata_spo_item_name Edm.String The name of the item.
metadata_spo_item_size Edm.Int64 The size (in bytes) of the item.
metadata_spo_item_content_type Edm.String The content type of the item.
metadata_spo_item_extension Edm.String The extension of the item.
metadata_spo_item_weburi Edm.String The URI of the item.
metadata_spo_item_path Edm.String The combination of the parent path and item name.

The SharePoint Online indexer also supports metadata specific to each document type. More information can be found in Content metadata properties used in Azure AI Search.

Note

To index custom metadata, "additionalColumns" must be specified in the query parameter of the data source.

Include or exclude by file type

You can control which files are indexed by setting inclusion and exclusion criteria in the "parameters" section of the indexer definition.

Include specific file extensions by setting "indexedFileNameExtensions" to a comma-separated list of file extensions (with a leading dot). Exclude specific file extensions by setting "excludedFileNameExtensions" to the extensions that should be skipped. If the same extension is in both lists, it's excluded from indexing.

HTTP
PUT /indexers/[indexer name]?api-version=2024-05-01-preview
{
    "parameters" : { 
        "configuration" : { 
            "indexedFileNameExtensions" : ".pdf, .docx",
            "excludedFileNameExtensions" : ".png, .jpeg" 
        } 
    }
}

 

Controlling which documents are indexed

A single SharePoint Online indexer can index content from one or more document libraries. Use the "container" parameter on the data source definition to indicate 

which sites and document libraries to index from.

The data source "container" section has two properties for this task: "name" and "query".

Name

The "name" property is required and must be one of three values:

Value Description
defaultSiteLibrary Index all content from the site's default document library.
allSiteLibraries Index all content from all document libraries in a site. Document libraries from a subsite are out of scope/ If you need content from subsites, choose "useQuery" and specify "includeLibrariesInSite".
useQuery Only index the content defined in the "query".

 

Query

The "query" parameter of the data source is made up of keyword/value pairs. The below are the keywords that can be used. The values are either site URLs or

 document library URLs.

Note

To get the value for a particular keyword, we recommend navigating to the document library that you’re trying to include/exclude and copying the URI from the 

browser. This is the easiest way to get the value to use with a keyword in the query.

Keyword Value description and examples
null If null or empty, index either the default document library or all document libraries depending on the container name.

Example:

"container" : { "name" : "defaultSiteLibrary", "query" : null }
includeLibrariesInSite

Index content from all libraries under the specified site in the connection string. The value should be the URI of the site or subsite.

Example 1:

"container" : { "name" : "useQuery", "query" : "includeLibrariesInSite=https://mycompany.sharepoint.com/mysite" }

Example 2 (include a few subsites only):

"container" : { "name" : "useQuery", "query" : "includeLibrariesInSite=https://mycompany.sharepoint.com/sites/TopSite/SubSite1;

includeLibrariesInSite=https://mycompany.sharepoint.com/sites/TopSite/SubSite2" }

includeLibrary

Index all content from this library. The value is the fully qualified path to the library, which can be copied from your browser:

Example 1 (fully qualified path):

"container" : { "name" : "useQuery", "query" : "includeLibrary=https://mycompany.sharepoint.com/mysite/MyDocumentLibrary" }

Example 2 (URI copied from your browser):

"container" : { "name" : "useQuery", "query" : "includeLibrary=

https://mycompany.sharepoint.com/teams/mysite/MyDocumentLibrary/Forms/AllItems.aspx" }

excludeLibrary Don't index content from this library. The value is the fully qualified path to the library, which can be copied from your browser:

Example 1 (fully qualified path):

"container" : { "name" : "useQuery", "query" : "includeLibrariesInSite=https://mysite.sharepoint.com/subsite1; excludeLibrary=https://mysite.sharepoint.com/subsite1/MyDocumentLibrary" }

Example 2 (URI copied from your browser):

"container" : { "name" : "useQuery", "query" : "includeLibrariesInSite=https://mycompany.sharepoint.com/teams/mysite; excludeLibrary=https://mycompany.sharepoint.com/teams/mysite/MyDocumentLibrary/Forms/AllItems.aspx" }
additionalColumns

Index columns from the document library. The value is a comma-separated list of column names you want to index. Use a double 

backslash to escape semicolons and commas in column names:

Example 1 (additionalColumns=MyCustomColumn,MyCustomColumn2):

"container" : { "name" : "useQuery", "query" : "includeLibrary=https://mycompany.sharepoint.com/mysite/MyDocumentLibrary;additionalColumns=MyCustomColumn,MyCustomColumn2" }

Example 2 (escape characters using double backslash):

"container" : { "name" : "useQuery", "query" : 

"includeLibrary=https://mycompany.sharepoint.com/teams/mysite/MyDocumentLibrary/Forms/AllItems.aspx;

additionalColumns=MyCustomColumnWith\\,,MyCustomColumnWith\\;" }

Handling errors

By default, the SharePoint Online indexer stops as soon as it encounters a document with an unsupported content type (for example, an image). You can use

 the excludedFileNameExtensions parameter to skip certain content types. However, you might need to index documents without knowing all the possible 

content types in advance. To continue indexing when an unsupported content type is encountered, set the failOnUnsupportedContentType configuration parameter to false:

HTTP
PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2024-05-01-preview
Content-Type: application/json
api-key: [admin key]

{
    ... other parts of indexer definition
    "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
}

For some documents, Azure AI Search is unable to determine the content type, or unable to process a document of otherwise supported content type. To 

ignore this failure mode, set the failOnUnprocessableDocument configuration parameter to false:

HTTP
"parameters" : { "configuration" : { "failOnUnprocessableDocument" : false } }

Azure AI Search limits the size of documents that are indexed. These limits are documented in Service Limits in Azure AI Search. Oversized documents are 

treated as errors by default. However, you can still index storage metadata of oversized documents if you set indexStorageMetadataOnlyForOversizedDocuments 

configuration parameter to true:

HTTP
"parameters" : { "configuration" : { "indexStorageMetadataOnlyForOversizedDocuments" : true } }

You can also continue indexing if errors happen at any point of processing, either while parsing documents or while adding documents to an index. To ignore a 

specific number of errors, set the maxFailedItems and maxFailedItemsPerBatch configuration parameters to the desired values. For example:

HTTP
{
    ... other parts of indexer definition
    "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 10 }
}

If a file on the SharePoint site has encryption enabled, an error message similar to the following may be encountered:

 
Code: resourceModified Message: The resource has changed since the caller last read it; usually an eTag mismatch Inner error: 
Code: irmEncryptFailedToFindProtector

The error message will also include the SharePoint site ID, drive ID, and drive item ID in the following pattern: <sharepoint site id> :: <drive id> :: <drive item id>. This information can be used to identify which item is failing on the SharePoint end. The user can then remove the encryption from the item to resolve the issue.

See also

本文地址
最后修改
星期五, 八月 9, 2024 - 22:30
Article