How to extract text within a string of text

2019-01-11 13:52发布

I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.

I have a file with multiple rows (all one column). Each row has data that looks something like this:

1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom

2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate

etc...

What I want is some way to extract the string of numbers that begin with "gi|" and end with a "|". For some rows this might mean as many as 5 gi numbers, for others it'll just be one.

What I would hope the output would look like would be something like:

297848936,297338191

168010496

etc...

4条回答
We Are One
2楼-- · 2019-01-11 14:15

As the other guys presented the solution without VBA... I'll present the one that does use. Now, is your call to use it or no.

Just saw that @Issun presented the solution with regex, very nice! Either way, will present a 'modest' solution for the question, using only 'plain' VBA.

Option Explicit
Option Base 0

Sub findGi()

    Dim oCell As Excel.Range
    Set oCell = Sheets(1).Range("A1")

    'Loops through every row until empty cell
    While Not oCell.Value = ""

        oCell.Offset(0, 1).Value2 = GetGi(oCell.Value)
        Set oCell = oCell.Offset(1, 0)

    Wend

End Sub

Private Function GetGi(ByVal sValue As String) As String

    Dim sResult As String
    Dim vArray As Variant
    Dim vItem As Variant
    Dim iCount As Integer

    vArray = Split(sValue, "|")
    iCount = 0

    'Loops through the array...
    For Each vItem In vArray

        'Searches for the 'Gi' factor...
        If vItem Like "*gi" And UBound(vArray) > iCount + 1 Then

            'Concatenates the results...
            sResult = sResult & vArray(iCount + 1) & ","

        End If

        iCount = iCount + 1

    Next vItem

    'And removes trail comma
    If Len(sResult) > 0 Then

        sResult = Left(sResult, Len(sResult) - 1)

    End If

    GetGi = sResult

End Function
查看更多
Summer. ? 凉城
3楼-- · 2019-01-11 14:22

Here is a very flexible VBA answer using the regex object. What the function does is extract every single sub-group match it finds (stuff inside the parenthesis), separated by whatever string you want (default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/

You would call it like this, assuming that first string is in A1:

=RegexExtract(A1,"gi[|](\d+)[|]")

Since this looks for all occurance of "gi|" followed by a series of numbers and then another "|", for the first line in your question, this would give you this result:

297848936, 297338191

Just run this down the column and you're all done!

Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String

Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
Dim i As Long, j As Long
Dim result As String

RE.pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)

For i = 0 To allMatches.count - 1
    For j = 0 To allMatches.Item(i).submatches.count - 1
        result = result & (separator & allMatches.Item(i).submatches.Item(j))
    Next
Next

If Len(result) <> 0 Then
    result = Right$(result, Len(result) - Len(separator))
End If

RegexExtract = result

End Function
查看更多
小情绪 Triste *
4楼-- · 2019-01-11 14:32

I'd probably split the data first on the | delimiter using the convert text to columns wizard. In Excel 2007 that is on the Data tab, Data Tools group and then choose Text to Columns. Specify Other: and | as the delimiter.

From the sample data you posted it looks like after you do this the numbers will all be in the same columns so you could then just delete the columns you don't want.

查看更多
不美不萌又怎样
5楼-- · 2019-01-11 14:37

Here it is (assuming data is in column A)

=VALUE(LEFT(RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2),
FIND("|",RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2)) -1 ))

Not the nicest formula, but it will work to extract the number.

I just noticed since you have two values per row with output separated by commas. You will need to check if there is a second match, third match etc. to make it work for multiple numbers per cell.

In reference to your exact sample (assuming 2 values maximum per cell) the following code will work:

=IF(ISNUMBER(FIND("gi|",$A1,FIND("gi|", $A1)+1)),CONCATENATE(LEFT(RIGHT($A1,LEN($A1)
- FIND("gi|",$A1) - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ), 
", ",LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) 
- 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) - 2)) 
-1 )),LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2),
FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ))

How's that for ugly? A VBA solution may be better for you, but I'll leave this here for you.

To go up to 5 numbers, well, study the pattern and recurse manually in the formula. IT will get long!

查看更多
登录 后发表回答